Hello Beastie and Google Summer of Code!
01-06-2024
I have been admitted to be a part of Google Summer of Code (GSoC) 2024! [1]
Participating in GSoC has been on my radar for many years, but I always thought "oh, I’ll do it next summer". Well, that summer is now! :D
Background
The admission process to GSoC is as follows: organizations publish suggested projects with possible mentors, and students either get in touch with those mentors or come up with their own project and convince someone to mentor them. Then voting takes place and the top N applicants for each organization are accepted, where N is the number of slots Google has allocated to that organization. I found a project that sounded interesting on the FreeBSD GSoC ideas page [3], got in touch with the mentor and sent in a proposal [2] through Google’s portal. The project is to port the SIMD-enhanced string functions in libc from amd64 (x86_64) to arm64 (Aarch64).
A great incentive of the program is that you get the chance to be mentored by member(s) of the community. I have the pleasure of being mentored by two wonderful people, Robert Clausecker <fuz> and Ed Maste <emaste>. Another GSoC contributor is porting the same algorithms to RISC-V. He is basing his implementation on the base RISC-V ISA using SIMD Within A Register (SWAR) techniques, and thus avoids any dependency on processor extensions such as the recently ratified RISC-V Vector Extension 1.0, for which almost no hardware is currently available (and none running FreeBSD). His blog documenting his adventures is available at [4].
As for me, I’ve been using FreeBSD for a few years now. Before that I used Linux, but after growing frustrated with the lack of good documentation and a fragmented system I switched and haven’t looked back. I still enjoy the Linux kernel, but userland and distros are something I just want out of my way.
But this is not an article about why FreeBSD is superior to Linux, if it even is (yes, in some regards, but ymmv), but rather about my project for the summer.
Why write assembly
Most libc functions on other platforms already benefit from being handwritten in assembly, in both scalar and SIMD variants. String functions are a particularly poor fit for an autovectorizing compiler, as they make atypical use of SIMD instructions. For the scalar implementations we may use some Bit Twiddling Hacks such as those on Sean Eron Anderson’s site [5].
Compilers also struggle to decide which operations to do in GPRs and which to do in vector registers. Register allocation is a further problem on amd64: the compiler may spill to the stack where handwritten assembly would be left with registers to spare. This is less of an issue on arm64, as we have far more registers to play with than on amd64.
Another compelling reason to have performance-critical libc functions written in assembly is that all programs that link against libc will benefit from these improvements. This does put some additional pressure on me as an implementer, as the code can’t break other people’s programs just because they abused libc in interesting ways. An example of that is how memcmp on FreeBSD goes beyond what ISO/IEC 9899:1999 requires. In particular, FreeBSD documents that memcmp returns the difference between the first two mismatching bytes, rather than merely a negative integer, a positive integer or zero.
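To give a flavour of the scalar bit-twiddling approach, here is a minimal C sketch of the classic "has a zero byte" trick applied to a 64-bit word. The function name first_zero_byte is made up for this illustration, and __builtin_ctzll is assumed to be available (it is a GCC/Clang builtin); this is not the code used in libc.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Index (0-7, counted from the least significant byte) of the first zero
 * byte in w, or 8 if there is none. Hypothetical helper for illustration.
 */
static unsigned
first_zero_byte(uint64_t w)
{
	const uint64_t ones = 0x0101010101010101ULL;
	const uint64_t highs = 0x8080808080808080ULL;
	/* Bit 7 of the lowest zero byte is set; all bits below it stay clear. */
	uint64_t zeros = (w - ones) & ~w & highs;

	if (zeros == 0)
		return (8);
	/* Count trailing zeros (GCC/Clang builtin), 8 bits per byte. */
	unsigned idx = (unsigned)__builtin_ctzll(zeros) / 8;
	assert(((w >> (8 * idx)) & 0xff) == 0);
	return (idx);
}
```

A scalar strlen built on this idea would load the string one aligned word at a time and stop at the first word for which the helper returns something other than 8.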
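The stricter memcmp behaviour described above can be written down as a byte-wise reference in C. This is only a sketch under the assumption that "difference" means the difference of the two bytes taken as unsigned char values; memcmp_diff is a hypothetical name, not a libc function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical reference: returns the difference between the first two
 * mismatching bytes (as unsigned char values), or 0 if the buffers match.
 */
static int
memcmp_diff(const void *a, const void *b, size_t n)
{
	const unsigned char *p = a, *q = b;

	assert(n == 0 || (a != NULL && b != NULL));
	for (size_t i = 0; i < n; i++)
		if (p[i] != q[i])
			return ((int)p[i] - (int)q[i]);
	return (0);
}
```

A program that relies on the exact difference keeps working only if an optimized memcmp returns the same values as such a reference, which is what makes this kind of documented behaviour a constraint for the port.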
But my project will solely deal with using Arm NEON instructions to simd-ify the string functions, although bit-twiddling GPRs does sound enticing.
Architectural levels
The AMD64 SysV ABI supplement defines the following architecture levels. On
FreeBSD, most of the string functions in libc have scalar and baseline
implementations. Users can choose which level of enhancements to use by setting
the ARCHLEVEL environment variable. A complete list of enhanced functions is
available in the simd(7) manpage [6].
scalar      scalar enhancements only (no SIMD)
baseline    cmov, cx8, x87 FPU, fxsr, MMX, osfxsr, SSE, SSE2
x86-64-v2   cx16, lahf/sahf, popcnt, SSE3, SSSE3, SSE4.1, SSE4.2
x86-64-v3   AVX, AVX2, BMI1, BMI2, F16C, FMA, lzcnt, movbe, osxsave
x86-64-v4   AVX-512F/BW/CD/DQ/VL
amd64 strlen implementation
The amd64 strlen implementation consists of two parts: a scalar one (using bit-twiddling) and a vectorized one (baseline). I’ll focus on the SIMD one here; the interested reader can navigate to /usr/src/lib/libc/amd64/string or point their browser to [7] for the scalar implementation.
ARCHENTRY(strlen, baseline)
mov %rdi, %rcx
pxor %xmm1, %xmm1
and $~0xf, %rdi # align string
pcmpeqb (%rdi), %xmm1 # compare head (with junk before string)
mov %rcx, %rsi # string pointer copy for later
and $0xf, %ecx # amount of bytes rdi is past 16 byte alignment
pmovmskb %xmm1, %eax
add $32, %rdi # advance to next iteration
shr %cl, %eax # clear out matches in junk bytes
test %eax, %eax # any match? (can't use ZF from SHR as CL=0 is possible)
jnz 2f
ALIGN_TEXT
1: pxor %xmm1, %xmm1
pcmpeqb -16(%rdi), %xmm1 # find NUL bytes
pmovmskb %xmm1, %eax
test %eax, %eax # were any NUL bytes present?
jnz 3f
/* the same unrolled once more */
pxor %xmm1, %xmm1
pcmpeqb (%rdi), %xmm1
pmovmskb %xmm1, %eax
add $32, %rdi # advance to next iteration
test %eax, %eax
jz 1b
/* match found in loop body */
sub $16, %rdi # undo half the advancement
3: tzcnt %eax, %eax # find the first NUL byte
sub %rsi, %rdi # string length until beginning of (%rdi)
lea -16(%rdi, %rax, 1), %rax # that plus loc. of NUL byte: full string length
ret
/* match found in head */
2: tzcnt %eax, %eax # compute string length
ret
ARCHEND(strlen, baseline)
Most of these instructions aren’t anything odd (MOV, PXOR, AND, SHR), but
what stands out is PCMPEQB and PMOVMSKB.
PMOVMSKB [8] is one of the most useful instructions for finding the index of
our NUL character. The string functions in libc operate on C-style strings,
which are NUL-terminated. So what we do here is a compare (PCMPEQB against
zero) followed by PMOVMSKB, which collects the top bit of every byte into a
16-bit mask in a GPR. The position of the lowest set bit in that mask tells us
where the match was.
But here’s the kicker: there is no PMOVMSKB instruction on Aarch64, which has
caused a whole lot of headaches. I’m basing this on the number of posts online
about substitutes for PMOVMSKB on Aarch64, whereas individual assembly
instructions are otherwise rarely complained about. [9][10][11]
Substituting missing instructions
The most promising substitute for PMOVMSKB appears to be SHRN (shift right
narrow). It takes a 128-bit vector of eight 16-bit elements, shifts each element
right by #imm and truncates it to 8 bits, producing a 64-bit result. After the
compare, every byte is either all 0’s or all 1’s, so shifting by 4 and narrowing
leaves a single nibble (half-byte) per input byte which tells us whether we had
a match or not. Hopefully that comes across clearly; otherwise there is an
excellent video courtesy of [10] which shows it in action.
strlen PoC
Here is the evolution of my attempt at porting strlen to Aarch64.
The code is also available as a git repository [12], where I keep my experiments
before they’re ready to be integrated into libc in my fork of freebsd-src on
GitHub. [13]
It’s also good to know while reading this code that registers x9-x15 are
“corruptible”, meaning that a function can change them to whatever it likes
without having to restore them afterwards. Registers x0-x7 and d0-d7 are
“parameter and result registers”. So according to the ARM procedure call
standard, x0 etc. is where the input to a function is stored, and you can do
whatever you want with x9-x15 without breaking anything.
Simple strlen
The simplest variant checks only the first chunk, without any loop. This example only gives valid results for short strings which are 16-byte aligned, that is, strings whose data begins at a memory address that is a multiple of 16 and whose terminating NUL lies within the first 16 bytes. You can create such a string to test it out like this:
#include <stdalign.h>
#include <stdio.h>
#include <string.h>
extern size_t _strlen(const char * ptr);
int
main() {
alignas(16) char string[] = "str";
printf("strlen: %zu\n", _strlen(string));
}
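The alignment arithmetic itself is easy to check from C. The following sketch mirrors what the AND and BIC instructions in the listings of this post compute from the string pointer; offset_in_chunk and chunk_base are hypothetical helper names used only here.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of p past the previous 16-byte boundary: AND x9,x0,#0xf. */
static unsigned
offset_in_chunk(const void *p)
{
	return ((unsigned)((uintptr_t)p & 0xf));
}

/* Address of the 16-byte chunk that contains p: BIC x10,x0,#0xf. */
static const void *
chunk_base(const void *p)
{
	const void *base = (const void *)((uintptr_t)p & ~(uintptr_t)0xf);

	/* The two values always add up to the original pointer. */
	assert((uintptr_t)p - (uintptr_t)base == offset_in_chunk(p));
	return (base);
}
```

A string is 16-byte aligned exactly when offset_in_chunk returns 0, in which case chunk_base returns the pointer unchanged.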
Now for the simple strlen implementation.
Note that <machine/asm.h> has nice little macros ENTRY() and END() which are
used throughout.
ENTRY(_strlen)
BIC x10,x0,#0xf // alignment
LDR q0,[x10] // load input to Vector register
CMEQ v0.16b,v0.16b,#0 // look for 0's
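SHRN v0.8b,v0.8h,#4 // narrow to a 64-bit mask, 4 bits per byte
FMOV x0,d0 // move to GPR
RBIT x0,x0 // reverse bits, as there is no count-trailing-zeros instruction
CLZ x0,x0 // count leading zeros
LSR x0,x0,#2 // 4 mask bits per byte, so divide by 4 to get the index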
RET
END(_strlen)
Naïve implementation
Now for a simple but naïve way of adding a loop and handling strings which are not already 16-byte aligned. We simply calculate the offset from the 16-byte boundary, traverse to the next boundary with a scalar implementation and then switch to the SIMD variant.
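ENTRY(_strlen)
MOV x10,x0 // x10 = cursor, x0 keeps the start of the string
AND x9,x0,#0xf // offset past the previous 16 byte boundary
CBZ x9,.Laligned_loop // already aligned, go straight to SIMD
.Lunaligned_start:
LDRB w5,[x10] // scalar head: load a single byte
CBZ w5,.Lfound_null
ADD x10,x10,#1
TST x10,#0xf // reached the 16 byte boundary?
B.NE .Lunaligned_start
B .Laligned_loop
.Lnext_it:
ADD x10,x10,#16
.Laligned_loop:
LDR q0,[x10]
CMEQ v0.16b,v0.16b,#0
SHRN v0.8b,v0.8h,#4
FMOV x1,d0
CBZ x1,.Lnext_it
RBIT x1,x1
CLZ x1,x1
LSR x1,x1,#2
ADD x10,x10,x1 // x10 = address of the NUL byte
.Lfound_null:
SUB x0,x10,x0 // length = address of NUL - start of string
RET
END(_strlen)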
This doesn’t make use of handy Aarch64 addressing modes such as the pre-indexed
LDR qreg,[xreg,#imm]! which adds the immediate value to the base register
before the load and writes the new address back.
Improved implementation
Now we improve our previous implementation by using SIMD instructions for
the first chunk as well. We move the mask to a GPR right after the first check,
which helps short strings: a GPR move is required to produce the result anyway,
and statistically short strings are the most common in real-world scenarios.
This statement is backed by a survey conducted by the LLVM project. I have
not been able to find a direct link to those results, but I will update this
post when I find them. See: https://code.ornl.gov/llvm-doe/llvm-project/-/tree/doe/libc/benchmarks/distributions
ENTRY(_strlen)
BIC x10,x0,#0xf
LDR q0,[x10]
CMEQ v0.16b,v0.16b,#0
SHRN v0.8b,v0.8h,#4
FMOV x1,d0 // move to GPR
LSL x2,x0,#2 // 4 mask bits per byte: shift amount is 4 * offset (LSR only uses the low 6 bits)
LSR x1,x1,x2 // shift out matches in the junk bytes before the string
CBZ x1,.Lloop // jump if no hit
RBIT x1,x1
CLZ x0,x1
LSR x0,x0,#2
RET
.Lloop:
LDR q0,[x10,#16]! // increment by 16 and load
CMEQ v0.16b,v0.16b,#0
SHRN v0.8b,v0.8h,#4
FMOV x1,d0 // move mask to GPR
CBZ x1,.Lloop // x1 is zero if no hit in segment
.Ldone:
SUB x0,x10,x0
RBIT x1,x1
CLZ x3,x1
LSR x3,x3,#2
ADD x0,x0,x3
RET
END(_strlen)
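The head handling in this version is the part that is easiest to get wrong, so here is a small C model of it. It takes the nibble mask of the aligned chunk that contains the start of the string (four bits per byte, as produced by CMEQ and SHRN) and discards the matches that belong to junk bytes in front of the string. head_match is a hypothetical helper for illustration only, and __builtin_ctzll is assumed to be available (GCC/Clang).

```c
#include <assert.h>
#include <stdint.h>

/*
 * nibble_mask: 4 bits per byte of the aligned chunk, set where a byte is NUL.
 * off:         offset of the string inside that chunk (0-15).
 * Returns the string length if the NUL is in this chunk, or -1 if the
 * main loop has to continue with the next chunk.
 */
static int
head_match(uint64_t nibble_mask, unsigned off)
{
	assert(off < 16);
	/* LSL x2,x0,#2 ; LSR x1,x1,x2: drop 4 mask bits per junk byte. */
	uint64_t m = nibble_mask >> (4 * off);

	if (m == 0)
		return (-1);			/* CBZ x1,.Lloop */
	return (__builtin_ctzll(m) / 4);	/* RBIT, CLZ, LSR #2 */
}
```

For example, with a stray NUL in junk byte 2 and the real terminator in byte 9 of the chunk, a string starting at offset 5 gets length 4; without the shift the stray NUL would have been reported instead.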
FCMP to avoid GPR move
After improving on the naïve implementation I realized that we can avoid a move
from a SIMD register to a GPR by using FCMP in the loop. I also realized that we
can avoid a few instructions if the input is already 16-byte aligned (see
.Laligned), although it does introduce a new branch to be resolved.
The later benchmarks will indicate whether or not this is a worthwhile improvement.
ENTRY(_strlen)
BIC x10,x0,#0xf
AND x9,x0,#0xf
LDR q0,[x10]
CMEQ v0.16b,v0.16b,#0
SHRN v0.8b,v0.8h,#4
CBZ x9,.Laligned
FMOV x1,d0
LSL x2,x0,#2
LSR x1,x1,x2
CBZ x1,.Lloop
RBIT x1,x1
CLZ x0,x1
LSR x0,x0,#2
RET
.Laligned:
FMOV x1,d0
CBNZ x1,.Ldone
.Lloop:
LDR q0,[x10,#16]!
CMEQ v0.16b,v0.16b,#0
SHRN v0.8b,v0.8h,#4
FCMP d0,#0.0
B.EQ .Lloop
FMOV x1,d0
.Ldone:
SUB x0,x10,x0
RBIT x1,x1
CLZ x3,x1
LSR x3,x3,#2
ADD x0,x0,x3
RET
END(_strlen)
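Why is it safe to test an integer mask with a floating-point compare? A small C model makes the argument concrete: reinterpret the 64-bit mask as a double and compare it with 0.0. The only non-zero bit pattern that compares equal to zero is negative zero (just the top bit set), and that cannot occur here because every nibble of the mask is either 0x0 or 0xF. The sketch assumes IEEE 754 doubles and that denormal inputs are not flushed to zero; on real hardware that depends on the floating-point control settings, which I have not looked into for this example. mask_is_zero_fp is a made-up name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

/* Model of FCMP d0,#0.0 followed by B.EQ: true when the branch is taken. */
static int
mask_is_zero_fp(uint64_t nibble_mask)
{
	double d;

	/* Reinterpret the bits, like reading the mask through d0. */
	memcpy(&d, &nibble_mask, sizeof(d));
	/* NaN patterns compare unequal, so they count as "match found". */
	return (d == 0.0);
}
```

Masks with only low nibbles set are denormal numbers and masks with the high nibbles set are ordinary numbers, infinities or NaNs; under the assumptions above all of them compare unequal to zero, so the loop exits exactly when the integer test would.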
Now we can look at what further improvements can be made. We could avoid
loop-carried dependencies: currently a pre-indexed load advances the pointer on
every iteration, so we could unroll the loop twice and increment x10 only every
two iterations to make it easier for the CPU to run two iterations at once.
Aarch64 also has instructions for loading several SIMD registers at once, the
LD1, LD2, … family of instructions. [16]
libc integration
Getting a string function written in assembly integrated into libc on FreeBSD
isn’t as big of an ordeal as it may sound. We simply use the ENTRY() and END()
macros in the code and add the filenames to the associated Makefile.inc,
located at lib/libc/aarch64/string/Makefile.inc, like the following:
@@ -15,11 +15,12 @@ AARCH64_STRING_FUNCS= \
strchrnul \
strcmp \
strcpy \
- memcmp \
strncmp \
strnlen \
strrchr
+MDSRCS+= \
+ memcmp.S
#
# Add the above functions. Generate an asm file that includes the needed
# Arm Optimized Routines file defining the function name to the libc name.
We can then build libc as a shared library and load it using LD_PRELOAD for
running regression tests, as running FreeBSD with a broken libc makes FreeBSD
very sad and prone to severe errors. It’s always nice to avoid a broken install
while debugging.
Building libc is as simple as
cd /usr/src/lib/libnetbsd && make
cd /usr/src/lib/libc && make
# OR
make -C /usr/src/lib/libc MAKEOBJDIRPREFIX=/tmp/objdir WITHOUT_TESTS=yes
As for debugging, it’s as simple as loading up a test binary with lldb and
setting a breakpoint on the string function being developed, strlen in our
case.
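A test binary for this purpose does not need to be fancy. Below is a sketch of a checker that compares a strlen candidate against a byte-at-a-time reference for every start offset within a 16-byte chunk and a range of lengths. It is not part of the FreeBSD test suite; ref_strlen and check_strlen are names invented for this example. The bytes in front of the string are deliberately set to NUL so that an implementation which forgets to mask out junk bytes in the first chunk is caught.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <string.h>

/* Byte-at-a-time reference implementation. */
static size_t
ref_strlen(const char *s)
{
	size_t n = 0;

	while (s[n] != '\0')
		n++;
	return (n);
}

/* Number of (offset, length) pairs for which fn disagrees with the reference. */
static unsigned
check_strlen(size_t (*fn)(const char *))
{
	alignas(16) char buf[16 + 64 + 16];
	unsigned bad = 0;

	assert(fn != NULL);
	for (size_t off = 0; off < 16; off++) {
		for (size_t len = 0; len < 64; len++) {
			memset(buf, 'x', sizeof(buf));
			memset(buf, '\0', off);		/* NUL junk before the string */
			buf[off + len] = '\0';		/* the real terminator */
			if (fn(buf + off) != ref_strlen(buf + off))
				bad++;
		}
	}
	return (bad);
}
```

Calling check_strlen(strlen) from main and printing the result is enough for a quick check; run under LD_PRELOAD, the strlen that gets exercised is the freshly built one, provided the compiler has not inlined the call.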
Tests
FreeBSD comes bundled with an excellent test suite [14]. The tests are written
using the Kyua test framework with the ATF library. The FreeBSD wiki page has
all the information necessary for this and there’s no need for me to repeat it
here. :-)
Running the tests is as simple as going to /usr/src/lib/libc/tests/string and
running make check. If you haven’t already run buildworld then you will need
to build lib/libnetbsd first. This is because FreeBSD also borrows some tests
from upstream NetBSD, located at /usr/src/contrib/netbsd-tests/lib/libc/string.
Benchmarks
Benchmarks are executed using fuz’ strperf program [17]; its output is compatible with benchstat from devel/go-perf. I benchmark all my implementations against the implementations in libc. I also tried running the code on a Raspberry Pi 5 running Debian to see how strlen holds up against glibc’s implementation.
I have benchmarked the previously described implementations to see the
performance impact of replacing a GPR move followed by a CBZ with an FCMP
followed by a B.EQ, and the impact of branching immediately in the case of an
already aligned string.
It’s also important to note that these implementations are hardware dependent as different cores may utilize more or fewer pipelines for specific instructions.
To test against glibc I borrowed a Raspberry Pi 5 running Debian and installed
BSD make (apt install bmake) to compile strperf.
Then to generate benchmark results it’s as simple as:
for i in {1..20}; do ./strlen >> results/${TEST}; done
This produced the following results when evaluated with benchstat.
You might need to scroll horizontally to view all the results.
os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │  libc_Scalar  │               libc_ARM                │                  GPR                  │              GPR_aligned              │                 FCMP                  │             FCMP_aligned              │
        │    sec/op     │    sec/op      vs base                │    sec/op      vs base                │    sec/op      vs base                │    sec/op      vs base                │    sec/op      vs base                │
Short       186.9µ ± 1%     134.6µ ± 0%   -28.01% (p=0.000 n=20)    121.0µ ± 0%   -35.26% (p=0.000 n=20)    118.5µ ± 0%   -36.62% (p=0.000 n=20)    121.8µ ± 0%   -34.85% (p=0.000 n=20)    119.8µ ± 0%   -35.91% (p=0.000 n=20)
Mid         45.05µ ± 1%     37.07µ ± 0%   -17.73% (p=0.000 n=20)    33.36µ ± 0%   -25.96% (p=0.000 n=20)    30.43µ ± 1%   -32.45% (p=0.000 n=20)    33.37µ ± 0%   -25.93% (p=0.000 n=20)    29.98µ ± 1%   -33.45% (p=0.000 n=20)
Long       13.894µ ± 0%     4.442µ ± 0%   -68.03% (p=0.000 n=20)    6.978µ ± 0%   -49.78% (p=0.000 n=20)    6.977µ ± 0%   -49.79% (p=0.000 n=20)    6.852µ ± 0%   -50.68% (p=0.000 n=20)    5.627µ ± 0%   -59.50% (p=0.000 n=20)
geomean     48.91µ          28.09µ        -42.58%                   30.43µ        -37.79%                   29.30µ        -40.09%                   30.31µ        -38.03%                   27.24µ        -44.31%

        │  libc_Scalar  │               libc_ARM                │                  GPR                  │              GPR_aligned              │                 FCMP                  │             FCMP_aligned              │
        │      B/s      │     B/s        vs base                │     B/s        vs base                │     B/s        vs base                │     B/s        vs base                │     B/s        vs base                │
Short      637.7Mi ± 1%    885.9Mi ± 0%   +38.91% (p=0.000 n=20)   985.1Mi ± 0%   +54.47% (p=0.000 n=20)  1006.1Mi ± 0%   +57.77% (p=0.000 n=20)   978.9Mi ± 0%   +53.49% (p=0.000 n=20)   995.0Mi ± 0%   +56.02% (p=0.000 n=20)
Mid        2.584Gi ± 1%    3.141Gi ± 0%   +21.55% (p=0.000 n=20)   3.490Gi ± 0%   +35.07% (p=0.000 n=20)   3.825Gi ± 1%   +48.03% (p=0.000 n=20)   3.488Gi ± 0%   +35.01% (p=0.000 n=20)   3.883Gi ± 1%   +50.26% (p=0.000 n=20)
Long       8.379Gi ± 0%   26.210Gi ± 0%  +212.81% (p=0.000 n=20)  16.684Gi ± 0%   +99.12% (p=0.000 n=20)  16.686Gi ± 0%   +99.15% (p=0.000 n=20)  16.990Gi ± 0%  +102.77% (p=0.000 n=20)  20.690Gi ± 0%  +146.93% (p=0.000 n=20)
geomean    2.380Gi         4.145Gi        +74.15%                  3.826Gi        +60.76%                  3.973Gi        +66.92%                  3.841Gi        +61.37%                  4.274Gi        +79.56%
os: Linux
arch: aarch64
│ strlen_glibc │
│ sec/op │
Short 132.1µ ± 0%
Mid 36.29µ ± 1%
Long 4.365µ ± 4%
geomean 27.55µ
│ strlen_glibc │
│ B/s │
Short 902.7Mi ± 0%
Mid 3.208Gi ± 1%
Long 26.67Gi ± 4%
geomean 4.225Gi
What’s next
Despite the worse performance for longer strings compared to the current libc
implementation, I will now continue the porting effort and translate memcmp to
Aarch64 NEON. I will also continue optimizing strlen by unrolling the main loop
twice as previously mentioned.
So instead of the current LDR q0,[x10,#16]! I’ll do a LDP q1, q2, [x10, #32]!
Hopefully I can get that strlen done by the end of the week and submit it for review to the FreeBSD Phabricator instance.
I also attended LundLinuxCon [18] last week and saw a really interesting talk on safer flexible arrays in the Linux kernel [19]. It involved a new warning flag and some niceties that were recently merged into GCC 15 [20] and LLVM 18 [21]. I will try building FreeBSD with those flags and see how many warnings are produced, but first I need to read up a little bit on whether or not FreeBSD even permits the use of flexible arrays in the kernel. :-)
References
[1] https://summerofcode.withgoogle.com/
[2] https://dflund.se/~getz/GSOC/FreeBSDproposal.txt
[3] https://wiki.freebsd.org/SummerOfCodeIdeas#Port_of_libc_SIMD_enhancements_to_other_architectures
[4] https://strajabot.com/
[5] https://graphics.stanford.edu/~seander/bithacks.html
[6] https://man.freebsd.org/cgi/man.cgi?query=simd&manpath=FreeBSD+15.0-CURRENT
[7] https://github.com/freebsd/freebsd-src/blob/main/lib/libc/amd64/string/strlen.S
[8] https://www.felixcloutier.com/x86/pmovmskb
[9] https://branchfree.org/2019/04/01/fitting-my-head-through-the-arm-holes-or-two-sequences-to-substitute-for-the-missing-pmovmskb-instruction-on-arm-neon/
[10] https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon
[11] https://www.corsix.org/content/whirlwind-tour-aarch64-vector-instructions
[12] https://git.sr.ht/~getz/aarch64_string.h/
[13] https://github.com/soppelmann/freebsd-src/
[14] https://wiki.freebsd.org/TestSuite/
[15] https://developer.arm.com/documentation/PJDOC-466751330-593177/latest/
[16] https://www.scs.stanford.edu/~zyedidia/arm64/ld2_advsimd_mult.html
[17] https://github.com/clausecker/strperf
[18] https://lundlinuxcon.org
[19] https://embeddedor.com/slides/2024/llc/llc2024.pdf
[20] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108896
[21] https://github.com/llvm/llvm-project/pull/76348
Bonus: Hello World in Aarch64 Assembly
The FreeBSD Developers’ Handbook has a section on writing assembly, but it’s targeted towards x86 and is rather outdated. Using its suggestions, Hello World would look like this:
.text
.global _start
kernel:
int $0x80
ret
_start:
mov $4, %rax
mov $1, %rdi
mov $message, %rsi
mov $13, %rdx
call kernel
mov $1, %rax
mov $69, %rdi
syscall
message:
.ascii "Hello, world\n"
But INT 80h is much slower than syscall (or simply svc #0 on arm64). This is
because a lot of microcode is run when INT is executed, whereas the microcode
for syscall is much simpler. INT 80h does still work on amd64, though!
While 0x80 is the i386 syscall interface, it incidentally works for amd64 tasks
because we don’t check whether the process doing a syscall is a 32-bit or a
64-bit one. But arguments are truncated to 32 bits, so you can’t e.g. pass
pointers to the stack! And arm64 of course doesn’t have the INT instruction.
In general, the method of doing syscalls is different on each architecture.
FreeBSD is moving towards what Win32 and Solaris already pioneered: syscalls
should be done by calling library functions so the kernel ABI and API can be
adapted in the future. For this reason syscalls will be split into a new
library, libsys, in FreeBSD 15.
If you’re wondering how to figure out what number each syscall corresponds to then you can check https://cgit.freebsd.org/src/tree/sys/kern/syscalls.master
/*
Compile with the following for non Aarch64 host:
aarch64-unknown-freebsd14.0-gcc13 --sysroot /usr/local/freebsd-sysroot/aarch64 hello.S -nostdlib
Run with:
qemu-aarch64-static ./a.out
*/
.text
/* Our application's entry point. */
.global _start
_start:
/* syscall write(int fd, const void *buf, size_t count) */
mov x0, #1 /* fd := STDOUT_FILENO */
ldr x1, = msg /* buf := msg */
ldr x2, = len /* count := len */
mov w8, #4 /* write is syscall #4 */
svc #0 /* invoke syscall */
/* syscall exit(int status) */
mov x0, #69 /* status := 69 */
mov w8, #1 /* exit is syscall #1 */
svc #0 /* invoke syscall */
/* Data segment: define our message string and calculate its length. */
.data
msg:
.ascii "Hello, world\n"
len = . - msg
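For comparison, this is roughly what "doing syscalls by calling library functions" looks like from C: the same write goes through the write(2) wrapper from <unistd.h>, so no syscall number or trap instruction appears in the program. It is only a sketch; hello_via_libc is a made-up name.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Writes the greeting to stdout through the libc write(2) wrapper. */
static ssize_t
hello_via_libc(void)
{
	static const char msg[] = "Hello, world\n";
	size_t len = strlen(msg);

	assert(len == 13);	/* the same count the assembly passes in x2 */
	return (write(STDOUT_FILENO, msg, len));
}
```

The program no longer needs to know that write is syscall #4 or how arguments are passed to the kernel, which is exactly what allows that interface to change underneath it.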