Getz Mikalsen

Table of Contents
_________________

1. GSoC status update #2 (memcmp)
.. 1. What is memcmp even?
..... 1. Here be dragons
.. 2. Scalar compiler output
.. 3. amd64 SIMD implementation
2. Translating to Aarch64
.. 1. Stay within your page
.. 2. wide shifts
3. Benchmarks
4. References


1 GSoC status update #2 (memcmp)
================================

14-06-2024


1.1 What is memcmp even?
~~~~~~~~~~~~~~~~~~~~~~~~

The man page on FreeBSD has the following to say:

,----
| MEMCMP(3)    FreeBSD Library Functions Manual    MEMCMP(3)
|
| NAME
| memcmp – compare byte string
|
| LIBRARY
| Standard C Library (libc, -lc)
|
| SYNOPSIS
| #include <string.h>
|
| int
| memcmp(const void *b1, const void *b2, size_t len);
|
| DESCRIPTION
| The memcmp() function compares byte string b1 against byte string b2.
| Both strings are assumed to be len bytes long.
|
| RETURN VALUES
| The memcmp() function returns zero if the two strings are identical,
| otherwise returns the difference between the first two differing bytes
| (treated as unsigned char values, so that ‘\200’ is greater than ‘\0’,
| for example). Zero-length strings are always identical.
|
| SEE ALSO
| bcmp(3), strcasecmp(3), strcmp(3), strcoll(3), strxfrm(3),
| timingsafe_memcmp(3), wmemcmp(3)
|
| STANDARDS
| The memcmp() function conforms to ISO/IEC 9899:1990 (“ISO C90”).
|
| FreeBSD 15.0-CURRENT    August 15, 2016    FreeBSD 15.0-CURRENT
`----

Now when starting off coding anything I like to begin with a minimal example, a
PoC (Proof of Concept). So for these string functions I first write a small
wrapper program that calls either the libc function or my assembly function,
then compare the outputs to get some quick turnaround before integrating the
routine into libc and running the proper unit tests.

So I did just that, and to my surprise libc wasn't returning the expected
values.


1.1.1 Here be dragons
---------------------

So when I run my little wrapper program with the following:

,----
| #include <stdio.h>
| #include <string.h>
|
| int main() {
|     printf("%i\n", memcmp("\200", "\0", 1));
| }
`----

It returns 1, that's odd. However, ISO/IEC 9899:1990 only specifies that a
number greater than, equal to, or less than zero shall be returned, so going by
the standard it's a valid return value. But it doesn't match up with what the
FreeBSD man page states. And if we edit our program a bit, like the following,
we notice something interesting.

,----
| #include <stdio.h>
| #include <string.h>
|
| int main() {
|     printf("%i\n", memcmp((char[]){"\200"}, "\0", 1));
| }
`----

It returns `128', as expected! The reason for this is that on `Aarch64' we have
inherited string functions from the `Arm Optimized Routines'. These functions
weren't originally written with FreeBSD in mind, and this has simply gone
unnoticed.
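To make the wrapper idea above a bit more concrete, a minimal cross-checking
harness can look roughly like the sketch below. The `memcmp_asm' symbol is a
hypothetical name for the routine under test, not something that exists in
libc; the point is simply to diff the two implementations' outputs before going
through a full libc build.

,----
| #include <stdio.h>
| #include <string.h>
|
| /* hypothetical symbol name for the assembly routine under test */
| int memcmp_asm(const void *b1, const void *b2, size_t len);
|
| int
| main(void)
| {
|     const char a[] = "\200foo";
|     const char b[] = "\0foo";
|
|     for (size_t len = 0; len <= sizeof(a); len++) {
|         int want = memcmp(a, b, len);       /* libc */
|         int got  = memcmp_asm(a, b, len);   /* our assembly */
|         printf("len=%zu libc=%d asm=%d%s\n", len, want, got,
|             want == got ? "" : "  <- mismatch");
|     }
|     return (0);
| }
`----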
If we instead glance over at the manpage for memcmp on `glibc' we see the
following regarding its return value:

,----
| RETURN VALUE
|
| The memcmp() function returns an integer less than, equal to, or greater than
| zero if the first n bytes of s1 is found, respectively, to be less than, to
| match, or be greater than the first n bytes of s2.
|
| For a nonzero return value, the sign is determined by the sign of the difference
| between the first pair of bytes (interpreted as unsigned char) that differ in s1
| and s2.
|
| If n is zero, the return value is zero.
`----

I strongly suspect that nobody is checking the return value beyond <, =, or >,
except for the unit tests... which do fail on FreeBSD! I generated the result
page with the `kyua' command `kyua report-html -r ${TEST}.db'; everything you
need to know about the FreeBSD test suite is available on the FreeBSD wiki.

Finally, these inherited `Arm Optimized Routines' are located in
`/usr/src/contrib/arm-optimized-routines' and plugged into the string
functions' `Makefile.inc' like the following:

,----
| AARCH64_STRING_FUNCS= \
| 	memchr \
| 	memcmp \
| 	memcpy \
| 	memmove \
| 	memrchr \
| 	memset \
| 	stpcpy \
| 	strchr \
| 	strchrnul \
| 	strcmp \
| 	strcpy \
| 	strlen \
| 	strncmp \
| 	strnlen \
| 	strrchr
|
| #
| # Add the above functions. Generate an asm file that includes the needed
| # Arm Optimized Routines file defining the function name to the libc name.
| # Some file need multiple macros defined or a weak symbol added we can
| # override the generated file in these cases.
| #
| .for FUNC in ${AARCH64_STRING_FUNCS}
| .if !exists(${FUNC}.S)
| ${FUNC}.S:
| 	printf '/* %sgenerated by libc/aarch64/string/Makefile.inc */\n' @ > ${.TARGET}
| 	printf '#define __%s_aarch64 %s\n' ${FUNC} ${FUNC} >> ${.TARGET}
| 	printf '#include "aarch64/%s.S"\n' ${FUNC} >> ${.TARGET}
| CLEANFILES+=	${FUNC}.S
| .endif
|
| MDSRCS+=	${FUNC}.S
| CFLAGS.${FUNC}.S+=-I${SRCTOP}/contrib/arm-optimized-routines/string
| .endfor
`----

There we can swap out the faulty one for a compliant, compiler-generated
implementation, but let's get going with porting ours!

,----
| @@ -15,11 +15,12 @@ AARCH64_STRING_FUNCS= \
| 	strchrnul \
| 	strcmp \
| 	strcpy \
| -	memcmp \
| 	strncmp \
| 	strnlen \
| 	strrchr
|
| +MDSRCS+= \
| +	memcmp.S
| #
| # Add the above functions. Generate an asm file that includes the needed
| # Arm Optimized Routines file defining the function name to the libc name.
`----

These functions already have ARM NEON enhancements from the Arm Optimized
Routines:

,----
| memchr    | x
| memcmp    | x
| memcpy    | SWAR
| memmove   | SWAR
| memrchr   | x
| memset    | x
| stpcpy    | x
| strchr    | x
| strchrnul | x
| strcmp    | SWAR
| strcpy    | x
| strlen    | x
| strncmp   | SWAR
| strnlen   | x
| strrchr   | x
`----

Now I won't be touching memcpy, as its implementation could benefit from
another design [1]. If we ignore the functions with pre-existing
implementations by Arm, we are left with the following list. But let's finish
off `memcmp' first, because we already got started on it, and maybe we can beat
Arm at their own game.

,----
| FUNCTION         AMD64
| bcmp             S1
| index            S1
| memccpy          S1
| rindex           S1
| stpncpy          S1
| strcat           S1
| strcmp           S1
| strcspn          S2
| strlcat          S1
| strlcpy          S1
| strncat          S1
| strncmp          S1
| strncpy          S1
| strpbrk          S2
| strsep           S2
| strspn           S2
| timingsafe_bcmp  S1
`----


1.2 Scalar compiler output
~~~~~~~~~~~~~~~~~~~~~~~~~~

After the above small hiccup I took a look at the `.c' code for memcmp and its
generated assembly. Nothing particularly notable here.
memcmp assembly listing
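For reference, the portable C version is essentially the classic
byte-at-a-time loop below (a simplified sketch rather than the exact
lib/libc/string/memcmp.c source). Note that it naturally returns the
difference of the first two differing bytes, exactly as the FreeBSD man page
documents:

,----
| #include <stddef.h>
|
| int
| memcmp(const void *b1, const void *b2, size_t len)
| {
|     const unsigned char *p1 = b1, *p2 = b2;
|
|     for (; len != 0; len--, p1++, p2++) {
|         if (*p1 != *p2)
|             return (*p1 - *p2);  /* difference of the first differing bytes */
|     }
|     return (0);
| }
`----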
1.3 amd64 SIMD implementation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Now this is a long one ,---- | ARCHENTRY(memcmp, baseline) | cmp $32, %rdx # enough to permit use of the long kernel? | ja .Llong | | test %rdx, %rdx # zero bytes buffer? | je .L0 | | /* | * Compare strings of 1--32 bytes. We want to do this by | * loading into two xmm registers and then comparing. To avoid | * crossing into unmapped pages, we either load 32 bytes from | * the start of the buffer or 32 bytes before its end, depending | * on whether there is a page boundary between the overread area | * or not. | */ | | /* check for page boundaries overreads */ | lea 31(%rdi), %eax # end of overread | lea 31(%rsi), %r8d | lea -1(%rdi, %rdx, 1), %ecx # last character in buffer | lea -1(%rsi, %rdx, 1), %r9d | xor %ecx, %eax | xor %r9d, %r8d | test $PAGE_SIZE, %eax # are they on different pages? | jz 0f | | /* fix up rdi */ | movdqu -32(%rdi, %rdx, 1), %xmm0 | movdqu -16(%rdi, %rdx, 1), %xmm1 | lea -8(%rsp), %rdi # end of replacement buffer | sub %rdx, %rdi # start of replacement buffer | movdqa %xmm0, -40(%rsp) # copy to replacement buffer | movdqa %xmm1, -24(%rsp) | | 0: test $PAGE_SIZE, %r8d | jz 0f | | /* fix up rsi */ | movdqu -32(%rsi, %rdx, 1), %xmm0 | movdqu -16(%rsi, %rdx, 1), %xmm1 | lea -40(%rsp), %rsi # end of replacement buffer | sub %rdx, %rsi # start of replacement buffer | movdqa %xmm0, -72(%rsp) # copy to replacement buffer | movdqa %xmm1, -56(%rsp) | | /* load data and compare properly */ | 0: movdqu 16(%rdi), %xmm1 | movdqu 16(%rsi), %xmm3 | movdqu (%rdi), %xmm0 | movdqu (%rsi), %xmm2 | mov %edx, %ecx | mov $-1, %edx | shl %cl, %rdx # ones where the buffer is not | pcmpeqb %xmm3, %xmm1 | pcmpeqb %xmm2, %xmm0 | pmovmskb %xmm1, %ecx | pmovmskb %xmm0, %eax | shl $16, %ecx | or %ecx, %eax # ones where the buffers match | or %edx, %eax # including where the buffer is not | not %eax # ones where there is a mismatch | #ifndef BCMP | bsf %eax, %edx # location of the first mismatch | cmovz %eax, %edx # including if there is no mismatch | movzbl (%rdi, %rdx, 1), %eax # mismatching bytes | movzbl (%rsi, %rdx, 1), %edx | sub %edx, %eax | #endif | ret | | /* empty input */ | .L0: xor %eax, %eax | ret | | /* compare 33+ bytes */ | ALIGN_TEXT | .Llong: movdqu (%rdi), %xmm0 # load head | movdqu (%rsi), %xmm2 | mov %rdi, %rcx | sub %rdi, %rsi # express rsi as distance from rdi | and $~0xf, %rdi # align rdi to 16 bytes | movdqu 16(%rsi, %rdi, 1), %xmm1 | pcmpeqb 16(%rdi), %xmm1 # compare second half of this iteration | add %rcx, %rdx # pointer to last byte in buffer | jc .Loverflow # did this overflow? | 0: pcmpeqb %xmm2, %xmm0 | pmovmskb %xmm0, %eax | xor $0xffff, %eax # any mismatch? | jne .Lmismatch_head | add $64, %rdi # advance to next iteration | jmp 1f # and get going with the loop | | /* | * If we got here, a buffer length was passed to memcmp(a, b, len) | * such that a + len < a. While this sort of usage is illegal, | * it is plausible that a caller tries to do something like | * memcmp(a, b, SIZE_MAX) if a and b are known to differ, intending | * for memcmp() to stop comparing at the first mismatch. This | * behaviour is not guaranteed by any version of ISO/IEC 9899, | * but usually works out in practice. Let's try to make this | * case work by comparing until the end of the address space. 
| */ | .Loverflow: | mov $-1, %rdx # compare until the end of memory | jmp 0b | | /* process buffer 32 bytes at a time */ | ALIGN_TEXT | 0: movdqu -32(%rsi, %rdi, 1), %xmm0 | movdqu -16(%rsi, %rdi, 1), %xmm1 | pcmpeqb -32(%rdi), %xmm0 | pcmpeqb -16(%rdi), %xmm1 | add $32, %rdi # advance to next iteration | 1: pand %xmm0, %xmm1 # 0xff where both halves matched | pmovmskb %xmm1, %eax | cmp $0xffff, %eax # all bytes matched? | jne .Lmismatch | cmp %rdx, %rdi # end of buffer reached? | jb 0b | | /* less than 32 bytes left to compare */ | movdqu -16(%rdx), %xmm1 # load 32 byte tail through end pointer | movdqu -16(%rdx, %rsi, 1), %xmm3 | movdqu -32(%rdx), %xmm0 | movdqu -32(%rdx, %rsi, 1), %xmm2 | pcmpeqb %xmm3, %xmm1 | pcmpeqb %xmm2, %xmm0 | pmovmskb %xmm1, %ecx | pmovmskb %xmm0, %eax | shl $16, %ecx | or %ecx, %eax # ones where the buffers match | not %eax # ones where there is a mismatch | #ifndef BCMP | bsf %eax, %ecx # location of the first mismatch | cmovz %eax, %ecx # including if there is no mismatch | add %rcx, %rdx # pointer to potential mismatch | movzbl -32(%rdx), %eax # mismatching bytes | movzbl -32(%rdx, %rsi, 1), %edx | sub %edx, %eax | #endif | ret | | #ifdef BCMP | .Lmismatch: | mov $1, %eax | .Lmismatch_head: | ret | #else /* memcmp */ | .Lmismatch_head: | tzcnt %eax, %eax # location of mismatch | add %rax, %rcx # pointer to mismatch | movzbl (%rcx), %eax # mismatching bytes | movzbl (%rcx, %rsi, 1), %ecx | sub %ecx, %eax | ret | | .Lmismatch: | movdqu -48(%rsi, %rdi, 1), %xmm1 | pcmpeqb -48(%rdi), %xmm1 # reconstruct xmm1 before PAND | pmovmskb %xmm0, %eax # mismatches in first 16 bytes | pmovmskb %xmm1, %edx # mismatches in second 16 bytes | shl $16, %edx | or %edx, %eax # mismatches in both | not %eax # matches in both | tzcnt %eax, %eax # location of mismatch | add %rax, %rdi # pointer to mismatch | movzbl -64(%rdi), %eax # mismatching bytes | movzbl -64(%rdi, %rsi, 1), %ecx | sub %ecx, %eax | ret | #endif | ARCHEND(memcmp, baseline) `---- The following table might be helpful when reading unless the x86 registers are already ingrained in your head. :-)
| Name | Notes                                                                                          | Type      | 64-bit (long) | 32-bit (int) | 16-bit (short) | 8-bit (char) |
|------+------------------------------------------------------------------------------------------------+-----------+---------------+--------------+----------------+--------------|
| rax  | Values are returned from functions in this register.                                          | scratch   | rax           | eax          | ax             | ah and al    |
| rcx  | Typical scratch register. Some instructions also use it as a counter.                         | scratch   | rcx           | ecx          | cx             | ch and cl    |
| rdx  | Scratch register. Function argument #3 in 64-bit Linux.                                       | scratch   | rdx           | edx          | dx             | dh and dl    |
| rbx  | Preserved register: don't use it without saving it!                                           | preserved | rbx           | ebx          | bx             | bh and bl    |
| rsp  | The stack pointer. Points to the top of the stack.                                            | preserved | rsp           | esp          | sp             | spl          |
| rbp  | Preserved register. Sometimes used to store the old value of the stack pointer, or the "base". | preserved | rbp           | ebp          | bp             | bpl          |
| rsi  | Scratch register. Also used to pass function argument #2 in 64-bit Linux.                     | scratch   | rsi           | esi          | si             | sil          |
| rdi  | Scratch register. Function argument #1 in 64-bit Linux.                                       | scratch   | rdi           | edi          | di             | dil          |
| r8   | Scratch register. These were added in 64-bit mode, so they have numbers, not names.           | scratch   | r8            | r8d          | r8w            | r8b          |
| r9   | Scratch register.                                                                              | scratch   | r9            | r9d          | r9w            | r9b          |
| r10  | Scratch register.                                                                              | scratch   | r10           | r10d         | r10w           | r10b         |
| r11  | Scratch register.                                                                              | scratch   | r11           | r11d         | r11w           | r11b         |
| r12  | Preserved register. You can use it, but you need to save and restore it.                      | preserved | r12           | r12d         | r12w           | r12b         |
| r13  | Preserved register.                                                                            | preserved | r13           | r13d         | r13w           | r13b         |
| r14  | Preserved register.                                                                            | preserved | r14           | r14d         | r14w           | r14b         |
| r15  | Preserved register.                                                                            | preserved | r15           | r15d         | r15w           | r15b         |

X86 Registers, courtesy of [2]
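Before walking through the flow, the page-boundary test at the top of the
routine is worth spelling out. In C it amounts to roughly the following sketch
(PAGE_SIZE must be a power of two; since the two addresses are at most 31
bytes apart, they sit on different pages exactly when the page-size bit of
their XOR is set):

,----
| #include <stddef.h>
| #include <stdint.h>
|
| #ifndef PAGE_SIZE
| #define PAGE_SIZE 4096  /* assumption for this sketch; the real value comes from the system headers */
| #endif
|
| /*
|  * Would the 32-byte overread starting at buf touch a different page
|  * than the len bytes we are actually allowed to read?
|  */
| static int
| overread_crosses_page(const void *buf, size_t len)
| {
|     uintptr_t end_of_overread = (uintptr_t)buf + 31;   /* last byte the 32-byte load touches */
|     uintptr_t last_byte = (uintptr_t)buf + len - 1;    /* last byte we may legitimately read */
|
|     return (((end_of_overread ^ last_byte) & PAGE_SIZE) != 0);
| }
`----

This mirrors the lea/xor/test sequence at the top of the listing, and it is
the same trick the Aarch64 port will need later on.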
Luckily this code is well commented, but in short it does the following:
- Checks whether we are comparing more than 32 bytes.
- If not, it checks whether the bytes we are about to load would cross a page
  boundary, since we read 32 bytes at a time. For example, we may only need to
  compare 20 bytes while the page boundary lies 25 bytes from the beginning of
  one of our inputs.
- If that is the case, we copy the data to a replacement/bounce buffer and
  override rdi/rsi.
- If no funny business is required, we simply start comparing data in a similar
  way to `strlen', except now we have a mask telling us whether everything
  matched or not. So we look for the occurrence of the first 00 rather than FF
  as we did in `strlen'.
- Then once we do hit, we load the byte at that offset from each base address
  and subtract them before we return.


2 Translating to Aarch64
========================

To begin with we ignore the special case regarding page crossing and save it
for last, like a dessert. We start by loading registers pairwise, so 32 bytes
are loaded immediately. Then we start comparing using cmeq like we did with
`strlen', except that there we compared with `#0'. Then we need to reduce the
width so the result can fit in a GPR; we use `shrn' for that, just as we did
before. But for matches the result is `FF', so the register is all `FF''s if
the chunks were identical, and if they were identical we want to keep looping.
The problem is that we can't do a floating point compare with all `FF''s as
it's a `NaN', therefore we move it to a GPR and invert it. Inverting it also
allows us to count the zeroes: if the count comes out to 32 (`FF''s before
inverting) there was no mismatch and we can keep looking! To make sure we don't
look for too long we do a comparison with the limit first; if we are past our
limit and no mismatch was found, then we simply return 0.

,----
| .Lbegin:
| mov x8,x0 // store base address for later
| mov x9,x1
|
| ldp q0,q1,[x0] //load 32 bytes of b1 into vector registers
| ldp q2,q3,[x1]
|
| cmeq v0.16b,v0.16b,v2.16b // do compare between 0-15 b1 vs b2
| shrn v0.8b,v0.8h,#4 // shift them right to fit in x1
| cmeq v1.16b,v1.16b,v3.16b // do compare between 16-31 b1 vs b2
| shrn v1.8b,v1.8h,#4
|
| fmov x1,d0
| fmov x3,d1
|
| mvn x0,x1 // invert to use clz
| cbz x0,0f
| rbit x0,x0
| clz x0,x0 // if this is zero check bytes 16..32
| b 1f
|
| 0:
| rbit x1,x3
| mvn x1,x1
| clz x0,x1
| add x0,x0,#64
| 1:
| lsr x0,x0,#2
| cmp x0,limit // if we have more matches than the limit return 0
| b.ge .Lnone
| cmp x0,#32 // x0 == 32 if no hit (32 0's)
| b.eq .Lloop
| 2:
| ldrb w4,[x8,x0] // gonna need to edit this based on align offset
| ldrb w5,[x9,x0]
| subs w0,w4,w5 // get the byte difference
| ret
`----

Now this approach works for comparisons of up to 32 bytes, but we want to turn
it into a loop. That's quite straightforward with the way the code was written:
simply make the loads increment the pointers and subtract from the limit.

,----
| subs limit,limit,#32
| b.lt .Lnone //we could do special handling here
| ldp q0,q1,[x8,32]! //load 32 bytes of b1 into vector registers
| ldp q2,q3,[x9,32]!
`----

The problem with this approach is that the loop condition is buried deep within
a code path, so even with speculative execution and the like it will struggle.
We could instead try doing something like this:

,----
| .Lloop:
| subs limit,limit,#32
| b.lt .Lnone //we could do special handling here
| ldp q0,q1,[x8,32]! //load 32 bytes of b1 into vector registers
| ldp q2,q3,[x9,32]!
|
| cmeq v0.16b,v0.16b,v2.16b // do compare between 0-15 b1 vs b2
| cmeq v1.16b,v1.16b,v3.16b // do compare between 16-31 b1 vs b2
|
| umin v2.16b,v0.16b,v1.16b
| umin v2.16b,v2.16b,v2.16b // separate lowest value
| NOT v2.16b,v2.16b // invert it (FF -> we want to loop)
| shrn v2.8b,v2.8h,#4
| fcmp d2,#0.0
| b.eq .Lloop
`----

Benchmarks showed that this already resulted in better performance, but I tried
a few other ways to tighten the loop that turned out to be even faster.

,----
| .Lloop:
| subs limit,limit,#32
| b.lt .Lnone //we could do special handling here
| ldp q0,q1,[x8,32]! //load 32 bytes of b1 into vector registers
| ldp q2,q3,[x9,32]!
|
| eor v4.16b,v0.16b,v2.16b // nonzero where the first 16 bytes differ
| eor v5.16b,v1.16b,v3.16b // nonzero where the last 16 bytes differ
| umaxp v4.16b,v4.16b,v5.16b // fold the two halves together
| umaxp v4.16b,v4.16b,v4.16b // fold again so d4 covers all 32 bytes
| fmov tmp, d4
| cbz tmp, .Lloop
|
| cmeq v0.16b,v0.16b,v2.16b // do compare between 0-15 b1 vs b2
| cmeq v1.16b,v1.16b,v3.16b // do compare between 16-31 b1 vs b2
`----

Despite not reusing the result from cmeq, this gave a few percent better
performance across several Arm64 cores. This could be attributed to `cmeq' and
`shrn' utilizing fewer pipelines than `eor' and `umaxp', as shown for the
Neoverse V2 core in the optimization guide. [3] A nice thing I realized is that
this quick check is applicable to the first 32-byte iteration as well. Adding
it there resulted in almost a 50% improvement for short strings using the now
familiar suite of benchmarks.
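For readers who prefer C, the two building blocks above can be sketched with
NEON intrinsics roughly as follows (an illustration of the same idea, not the
code that ends up in libc): the first helper mirrors the cmeq/shrn/rbit/clz
sequence that locates the first differing byte in a 16-byte chunk, and the
second mirrors the cheaper eor/umaxp check used as the loop condition on a
32-byte chunk.

,----
| #include <arm_neon.h>
| #include <stddef.h>
| #include <stdint.h>
|
| /* Index of the first differing byte in a 16-byte chunk, or 16 if they match. */
| static size_t
| first_mismatch16(const uint8_t *a, const uint8_t *b)
| {
|     uint8x16_t eq = vceqq_u8(vld1q_u8(a), vld1q_u8(b));  /* 0xff where bytes are equal */
|     /* shrn #4 narrows the result to 4 bits per byte so it fits in a 64-bit GPR */
|     uint8x8_t nib = vshrn_n_u16(vreinterpretq_u16_u8(eq), 4);
|     uint64_t mask = ~vget_lane_u64(vreinterpret_u64_u8(nib), 0);  /* nibbles set where bytes differ */
|
|     if (mask == 0)
|         return (16);
|     return ((size_t)(__builtin_ctzll(mask) >> 2));  /* rbit+clz in the assembly; 4 mask bits per byte */
| }
|
| /* Nonzero if any of the 32 bytes differ; this is the whole loop condition. */
| static int
| any_mismatch32(const uint8_t *a, const uint8_t *b)
| {
|     uint8x16_t d0 = veorq_u8(vld1q_u8(a), vld1q_u8(b));
|     uint8x16_t d1 = veorq_u8(vld1q_u8(a + 16), vld1q_u8(b + 16));
|     uint8x16_t m = vpmaxq_u8(d0, d1);  /* umaxp: fold 32 bytes down to 16 */
|
|     m = vpmaxq_u8(m, m);               /* fold again so the low 64 bits carry the result */
|     return (vgetq_lane_u64(vreinterpretq_u64_u8(m), 0) != 0);
| }
`----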
Aarch64 registers.
2.1 Stay within your page
~~~~~~~~~~~~~~~~~~~~~~~~~

Now to the dessert: the crossing into unmapped pages. We put our memcmp through
the unit tests as previously described and we get the following: one test
failed! But why? And it's not even a test of memcmp! If we look into the log we
notice the following:

,----
| Process with PID 9597 exited with signal 11 and dumped core; attempting to gather stack trace
| [New LWP 120611]
| Core was generated by `/usr/obj/home/getz/gsoc/freebsd-src/arm64.aarch64/lib/libc/tests/string/checkdir'.
| Program terminated with signal SIGSEGV, Segmentation fault.
| Address not mapped to object.
| #0  memcmp () at /home/getz/gsoc/freebsd-src/lib/libc/aarch64/string/memcmp.S:78
| 78      ldp q2,q3,[x1] //load 32 bytes of b2 into vector registers
`----

So it's a `SIGSEGV' due to an address not being mapped, i.e. an unmapped page.
If we now run it through gdb (or lldb) we see the following values for `x0' and
`x1' right before the attempted load:

,----
| >>> p/tz $x0
| $2 = 0x0000000041035000
| >>> p/tz $x1
| $3 = 0x0000000041036fff
`----

To handle this we can first check whether we are allowed to compare more than
32 bytes (limit > 32). If not, we check how many bytes are left until the next
page crossing. If it's less than 32 bytes we load our 32 bytes from the
buffer's pointer minus that offset. Then, after the load, we can shift the
entire vector register to get rid of the unwanted bytes. On SSE there is a
`PSHUFB' instruction that does this, and luckily there is a more or less
equivalent instruction on Aarch64: `TBL'. We need to use TBL because the other
shift instructions, like `EXT' and `USHL', either don't work with variable
shifts or across entire vector registers. `USHL' would shift each element a
certain amount depending on the arrangement specifier (the .16b piece you often
see).

----------------------------------------------------------------------
The number followed by a letter on the vector registers is the arrangement
specifier: the number and size of the elements in the vector. A simple way to
keep track of them is by thinking:
- B = Bytes (8bit)
- H = Halfwords (16bit)
- S = Single words (32bit)
- D = Double words (64bit)
courtesy of [4]

So to check whether we are about to cross a page we can do the following:

,----
| cmp x2,#32
| b.hi .Lbegin // Jump to regular loop if we are allowed to check more than 32 bytes
| add x3,x8,#32
| add x4,x9,#32
| eor x3,x3,x8
| eor x4,x4,x9
| tst w3,#PAGE_SIZE
| b.ne .Lspecialhandling
| tst w4,#PAGE_SIZE
| b.ne .Lspecialhandling
`----

Then there's also a piece that was overlooked in the main loop. As neither buf1
nor buf2 is aligned, there is a chance that we may overread when there are
fewer than 32 bytes left to compare. This is easily solved by adding a new
branch where we previously checked whether the limit had gone down to 0.

,----
| .Lloop:
| subs x2,x2,#32
| b.le .Lnone
| cmp x2,#32
| b.le .Llast32
| ldp q0,q1,[x8,32]! // load 32 bytes to vector registers
| ldp q2,q3,[x9,32]!
|
| eor v4.16b,v0.16b,v2.16b
| eor v5.16b,v1.16b,v3.16b
| umaxp v4.16b,v4.16b,v5.16b
| umaxp v4.16b,v4.16b,v4.16b
| fmov x6,d4
| cbz x6,.Lloop
| b .Lmatch
|
| /* If 32 bytes left to compare only load 32 bytes from x8 - limit to
|  * avoid overread */
| .Llast32:
| mov x3,#32
| sub x3,x3,x2 // x3 = 32 - x2
| add x2,x2,x3 // add the amount we shifted to limit
| sub x8,x8,x3
| sub x9,x9,x3
|
| ldp q0,q1,[x8,#32]! // load 32 bytes to vector registers
| ldp q2,q3,[x9,#32]!
|
| .Lmatch:
| cmeq v0.16b,v0.16b,v2.16b // compare between 0-15 b1 vs b2
| cmeq v1.16b,v1.16b,v3.16b // compare between 16-31 b1 vs b2
`----


2.2 wide shifts
~~~~~~~~~~~~~~~

So now the only piece left in the puzzle is to shift the vector register for
the case where we have 32 bytes or less to compare and they are located near
the end of a page. To utilize `TBL' here we need to generate a suitable
permutation mask. To do this we can take the identity permutation
`0, 1, 2, ..., 15, 16, 17, ..., 31' and add to each byte to rotate it, or load
from it with a desired offset. This table can easily be prepared in a data
segment.

,----
| .section .rodata
| .p2align 4
| shift_table:
| .byte 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
| .byte 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
`----

`USHR/USHL' could also work, but as they can only shift 64-bit words they would
only be useful for very short strings, unless there is some nice way to utilize
them here. The `TBL/TBX' instructions are rather tricky to figure out from the
reference manual, but ARM provides a graphical representation of how they
operate which cleared some things up for me.
TBL lookup table instruction, courtesy of [5].
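As a sanity check of the trick, here is roughly the same realignment written
with NEON intrinsics (a sketch under the stated assumptions: len <= 32 and the
32-byte window ending at p + len stays within the page); the actual check and
fix in the .S file follows below.

,----
| #include <arm_neon.h>
| #include <stddef.h>
| #include <stdint.h>
|
| /* Identity permutation; adding an offset to every index slides the window. */
| static const uint8_t shift_table[32] = {
|      0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
|     16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
| };
|
| /*
|  * Load the len (<= 32) bytes starting at p without reading past p + len:
|  * read the 32 bytes that end at p + len, then use TBL to move byte i of
|  * the buffer into element i of the result.  Out-of-range indices yield 0,
|  * so the tail of the result is zero padded.
|  */
| static void
| load_last32(const uint8_t *p, size_t len, uint8x16_t out[2])
| {
|     size_t off = 32 - len;              /* how far the load was pulled back */
|     uint8x16x2_t data;
|     uint8x16_t idx0, idx1, voff;
|
|     data.val[0] = vld1q_u8(p - off);
|     data.val[1] = vld1q_u8(p - off + 16);
|
|     voff = vdupq_n_u8((uint8_t)off);
|     idx0 = vaddq_u8(vld1q_u8(shift_table), voff);
|     idx1 = vaddq_u8(vld1q_u8(shift_table + 16), voff);
|
|     out[0] = vqtbl2q_u8(data, idx0);    /* result[i] = p[i] for i < len */
|     out[1] = vqtbl2q_u8(data, idx1);    /* ...and 0 for i >= len */
| }
`----

Bytes past `len' come back as zero, so the comparison still has to be limited
to the first `len' bytes elsewhere.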
We simply check whether the 32nd byte from our buffer's starting address is
located on another page, as defined by the `PAGE_SIZE' macro from
`sys/arm64/include/param.h'. Then we shift and load by the appropriate amount.
So the check and fix can be implemented like the following:

,----
| ENTRY(_memcmp)
|
| mov x8, x0
| mov x9, x1
| cbz x2, .Lnone // 0 length
|
| /*
|  * Check if buffer is located at end of page to avoid crossing
|  * into unmapped page. If so, we load 32 bytes from the end of the
|  * limit and check the other buffer.
|  */
|
| cmp x2, #32
| b.hi .Lbegin
| add x3, x8, #32
| add x4, x9, #32
| eor x3, x3, x8
| eor x4, x4, x9
|
| tst w3, #PAGE_SIZE
| b.eq 0f
|
| mov x3, #32
| sub x3, x3, x2
| sub x8, x8, x3
|
| /*
|  * We perform a variable shift in the vector register using TBL,
|  * a suitable permutation is generated by loading a table of bytes
|  * with a desired offset.
|  */
|
| adrp x0, shift_table
| add x0, x0, :lo12:shift_table
| add x0, x0, x3
| ldp q0, q1, [x8]
| ldp q4, q5, [x0] // load permutation table
| tbl v0.16b, {v0.16b, v1.16b}, v4.16b
| tbl v1.16b, {v0.16b, v1.16b}, v5.16b
| add x8, x8, x3 // reset pointer to beginning of src
| b 1f
|
| 0:
| ldp q0, q1, [x8]
|
| 1:
| tst w4, #PAGE_SIZE
| b.eq 0f
|
| mov x3, #32
| sub x3, x3, x2
| sub x9, x9, x3
|
| ldp q2, q3, [x9]
| adrp x0, shift_table
| add x0, x0, :lo12:shift_table
| add x0, x0, x3
| ldp q4, q5, [x0]
| tbl v2.16b, {v2.16b, v3.16b}, v4.16b
| tbl v3.16b, {v2.16b, v3.16b}, v5.16b
| add x9, x9, x3
| b 1f
|
| /*
|  * Compare strings of 1--32 bytes. We do this by loading into two
|  * vector registers and then doing a quick compare with XOR, UMAXP
|  * to determine if the first 32 bytes all match.
|  */
| .Lbegin:
| ldp q0, q1, [x8]
| 0:
| ldp q2, q3, [x9]
| 1:
|
| /*
|  * REST OF CODE HERE
|  */
|
| END(_memcmp)
|
| .section .rodata
| .p2align 4
| shift_table:
| .byte 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
| .byte 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
`----


3 Benchmarks
============

Colin Percival was kind enough to give me a voucher for AWS credits, so now I
can also benchmark on a more powerful ARM core of the kind more commonly found
running a FreeBSD system.
,---- | os: FreeBSD | arch: arm64 | cpu: ARM Neoverse-V1 r1p1 | │ memcmpScalar │ memcmpARM │ memcmpSIMD │ | │ sec/op │ sec/op vs base │ sec/op vs base │ | MemcmpShort 139.41µ ± 2% 63.96µ ± 1% -54.12% (p=0.000 n=20) 25.14µ ± 1% -81.97% (p=0.000 n=20) | MemcmpMid 93.38µ ± 3% 12.09µ ± 1% -87.05% (p=0.000 n=20) 10.91µ ± 1% -88.31% (p=0.000 n=20) | MemcmpLong 86.722µ ± 7% 4.648µ ± 1% -94.64% (p=0.000 n=20) 4.931µ ± 0% -94.31% (p=0.000 n=20) | geomean 104.1µ 15.32µ -85.29% 11.06µ -89.38% | | │ memcmpScalar │ memcmpARM │ memcmpSIMD │ | │ B/s │ B/s vs base │ B/s vs base │ | MemcmpShort 855.1Mi ± 2% 1864.0Mi ± 1% +117.99% (p=0.000 n=20) 4742.2Mi ± 1% +454.58% (p=0.000 n=20) | MemcmpMid 1.247Gi ± 3% 9.629Gi ± 1% +671.89% (p=0.000 n=20) 10.668Gi ± 1% +755.20% (p=0.000 n=20) | MemcmpLong 1.342Gi ± 6% 25.048Gi ± 1% +1765.92% (p=0.000 n=20) 23.608Gi ± 0% +1658.68% (p=0.000 n=20) | geomean 1.118Gi 7.600Gi +579.66% 10.53Gi +841.33% | | os: FreeBSD | arch: arm64 | cpu: ARM Cortex-A76 r4p1 | │ memcmpScalar │ memcmpARM │ memcmpSIMD │ | │ sec/op │ sec/op vs base │ sec/op vs base │ | MemcmpShort 183.12µ ± 0% 101.29µ ± 0% -44.68% (p=0.000 n=20) 39.96µ ± 0% -78.18% (p=0.000 n=20) | MemcmpMid 129.12µ ± 0% 24.55µ ± 1% -80.99% (p=0.000 n=20) 21.48µ ± 0% -83.37% (p=0.000 n=20) | MemcmpLong 111.374µ ± 0% 6.288µ ± 0% -94.35% (p=0.000 n=20) 7.593µ ± 0% -93.18% (p=0.000 n=20) | geomean 138.1µ 25.01µ -81.89% 18.68µ -86.47% | | │ memcmpScalar │ memcmpARM │ memcmpSIMD │ | │ B/s │ B/s vs base │ B/s vs base │ | MemcmpShort 651.0Mi ± 0% 1176.9Mi ± 0% +80.78% (p=0.000 n=20) 2983.1Mi ± 0% +358.23% (p=0.000 n=20) | MemcmpMid 923.3Mi ± 0% 4856.0Mi ± 1% +425.95% (p=0.000 n=20) 5551.0Mi ± 0% +501.23% (p=0.000 n=20) | MemcmpLong 1.045Gi ± 0% 18.514Gi ± 0% +1671.24% (p=0.000 n=20) 15.332Gi ± 0% +1366.84% (p=0.000 n=20) | geomean 863.3Mi 4.656Gi +452.24% 6.233Gi +639.33% `---- 4 References ============