Getz Mikalsen

Table of Contents
_________________

1. GSoC status update #2 (memcmp)
.. 1. What is memcmp even?
..... 1. Here be dragons
.. 2. Scalar compiler output
.. 3. amd64 SIMD implementation
2. Translating to Aarch64
.. 1. Stay within your page
.. 2. wide shifts
3. Benchmarks
4. References


1 GSoC status update #2 (memcmp)
================================

14-06-2024


1.1 What is memcmp even?
~~~~~~~~~~~~~~~~~~~~~~~~

The man page on FreeBSD has the following to say:

,----
| MEMCMP(3)    FreeBSD Library Functions Manual    MEMCMP(3)
|
| NAME
| memcmp – compare byte string
|
| LIBRARY
| Standard C Library (libc, -lc)
|
| SYNOPSIS
| #include <string.h>
|
| int
| memcmp(const void *b1, const void *b2, size_t len);
|
| DESCRIPTION
| The memcmp() function compares byte string b1 against byte string b2.
| Both strings are assumed to be len bytes long.
|
| RETURN VALUES
| The memcmp() function returns zero if the two strings are identical,
| otherwise returns the difference between the first two differing bytes
| (treated as unsigned char values, so that ‘\200’ is greater than ‘\0’,
| for example). Zero-length strings are always identical.
|
| SEE ALSO
| bcmp(3), strcasecmp(3), strcmp(3), strcoll(3), strxfrm(3),
| timingsafe_memcmp(3), wmemcmp(3)
|
| STANDARDS
| The memcmp() function conforms to ISO/IEC 9899:1990 (“ISO C90”).
|
| FreeBSD 15.0-CURRENT    August 15, 2016    FreeBSD 15.0-CURRENT
`----

Now when starting off coding anything I like to begin with a minimal example, a
PoC (Proof of Concept). So for these string functions I first write a small
wrapper program that calls either the libc function or my assembly function,
then compare the outputs to get some quick turnaround before integrating the
routine into libc and running the proper unit tests.

So I did just that, and to my surprise libc wasn't returning the expected
values.


1.1.1 Here be dragons
---------------------

So when I run my little wrapper program with the following:

,----
| #include <stdio.h>
| #include <string.h>
|
| int main() {
|     printf("%i\n", memcmp("\200", "\0", 1));
| }
`----

It returns 1, that's odd. However, ISO/IEC 9899:1990 only specifies that a
number greater than, equal to, or less than zero shall be returned, so going by
the standard it's a valid return value. But it doesn't match up with what the
FreeBSD man page states. And if we edit our program a bit, like the following,
we notice something interesting.

,----
| #include <stdio.h>
| #include <string.h>
|
| int main() {
|     printf("%i\n", memcmp((char[]){"\200"}, "\0", 1));
| }
`----

It returns `128', as expected! The reason for this is that on `Aarch64' we have
inherited string functions from the `Arm Optimized Routines'. These functions
weren't originally written with FreeBSD in mind, and this has simply gone
unnoticed.
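To make the wrapper idea above a bit more concrete, a minimal cross-checking
harness can look roughly like the sketch below. The `memcmp_asm' symbol is a
hypothetical name for the routine under test, not something that exists in
libc; the point is simply to diff the two implementations' outputs before going
through a full libc build.

,----
| #include <stdio.h>
| #include <string.h>
|
| /* hypothetical symbol name for the assembly routine under test */
| int memcmp_asm(const void *b1, const void *b2, size_t len);
|
| int
| main(void)
| {
|     const char a[] = "\200foo";
|     const char b[] = "\0foo";
|
|     for (size_t len = 0; len <= sizeof(a); len++) {
|         int want = memcmp(a, b, len);       /* libc */
|         int got  = memcmp_asm(a, b, len);   /* our assembly */
|         printf("len=%zu libc=%d asm=%d%s\n", len, want, got,
|             want == got ? "" : "  <- mismatch");
|     }
|     return (0);
| }
`----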
If we instead glance over at the manpage for memcmp on `glibc' we see the
following regarding its return value:

,----
| RETURN VALUE
|
| The memcmp() function returns an integer less than, equal to, or greater than
| zero if the first n bytes of s1 is found, respectively, to be less than, to
| match, or be greater than the first n bytes of s2.
|
| For a nonzero return value, the sign is determined by the sign of the difference
| between the first pair of bytes (interpreted as unsigned char) that differ in s1
| and s2.
|
| If n is zero, the return value is zero.
`----

I strongly suspect that nobody is checking the return value beyond <, =, or >,
except for the unit tests... which do fail on FreeBSD! I generated the result
page with the `kyua' command `kyua report-html -r ${TEST}.db'; everything you
need to know about the FreeBSD test suite is available on the FreeBSD wiki.

Finally, these inherited `Arm Optimized Routines' are located in
`/usr/src/contrib/arm-optimized-routines' and plugged into the string
functions' `Makefile.inc' like the following:

,----
| AARCH64_STRING_FUNCS= \
| 	memchr \
| 	memcmp \
| 	memcpy \
| 	memmove \
| 	memrchr \
| 	memset \
| 	stpcpy \
| 	strchr \
| 	strchrnul \
| 	strcmp \
| 	strcpy \
| 	strlen \
| 	strncmp \
| 	strnlen \
| 	strrchr
|
| #
| # Add the above functions. Generate an asm file that includes the needed
| # Arm Optimized Routines file defining the function name to the libc name.
| # Some file need multiple macros defined or a weak symbol added we can
| # override the generated file in these cases.
| #
| .for FUNC in ${AARCH64_STRING_FUNCS}
| .if !exists(${FUNC}.S)
| ${FUNC}.S:
| 	printf '/* %sgenerated by libc/aarch64/string/Makefile.inc */\n' @ > ${.TARGET}
| 	printf '#define __%s_aarch64 %s\n' ${FUNC} ${FUNC} >> ${.TARGET}
| 	printf '#include "aarch64/%s.S"\n' ${FUNC} >> ${.TARGET}
| CLEANFILES+=	${FUNC}.S
| .endif
|
| MDSRCS+=	${FUNC}.S
| CFLAGS.${FUNC}.S+=-I${SRCTOP}/contrib/arm-optimized-routines/string
| .endfor
`----

There we can swap out the faulty one for a compliant, compiler-generated
implementation, but let's get going with porting ours!

,----
| @@ -15,11 +15,12 @@ AARCH64_STRING_FUNCS= \
| 	strchrnul \
| 	strcmp \
| 	strcpy \
| -	memcmp \
| 	strncmp \
| 	strnlen \
| 	strrchr
|
| +MDSRCS+= \
| +	memcmp.S
| #
| # Add the above functions. Generate an asm file that includes the needed
| # Arm Optimized Routines file defining the function name to the libc name.
`----

These functions already have ARM NEON enhancements from the Arm Optimized
Routines:

,----
| memchr    | x
| memcmp    | x
| memcpy    | SWAR
| memmove   | SWAR
| memrchr   | x
| memset    | x
| stpcpy    | x
| strchr    | x
| strchrnul | x
| strcmp    | SWAR
| strcpy    | x
| strlen    | x
| strncmp   | SWAR
| strnlen   | x
| strrchr   | x
`----

Now I won't be touching memcpy, as its implementation could benefit from
another design [1]. If we ignore the functions with pre-existing
implementations by Arm, we are left with the following list. But let's finish
off `memcmp' first, because we already got started on it, and maybe we can beat
Arm at their own game.

,----
| FUNCTION         AMD64
| bcmp             S1
| index            S1
| memccpy          S1
| rindex           S1
| stpncpy          S1
| strcat           S1
| strcmp           S1
| strcspn          S2
| strlcat          S1
| strlcpy          S1
| strncat          S1
| strncmp          S1
| strncpy          S1
| strpbrk          S2
| strsep           S2
| strspn           S2
| timingsafe_bcmp  S1
`----


1.2 Scalar compiler output
~~~~~~~~~~~~~~~~~~~~~~~~~~

After the above small hiccup I took a look at the `.c' code for memcmp and its
generated assembly. Nothing particularly notable here.
memcmp assembly listing
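For reference, the portable C version is essentially the classic
byte-at-a-time loop below (a simplified sketch rather than the exact
lib/libc/string/memcmp.c source). Note that it naturally returns the
difference of the first two differing bytes, exactly as the FreeBSD man page
documents:

,----
| #include <stddef.h>
|
| int
| memcmp(const void *b1, const void *b2, size_t len)
| {
|     const unsigned char *p1 = b1, *p2 = b2;
|
|     for (; len != 0; len--, p1++, p2++) {
|         if (*p1 != *p2)
|             return (*p1 - *p2);  /* difference of the first differing bytes */
|     }
|     return (0);
| }
`----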
1.3 amd64 SIMD implementation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Now this is a long one ,---- | ARCHENTRY(memcmp, baseline) | cmp $32, %rdx # enough to permit use of the long kernel? | ja .Llong | | test %rdx, %rdx # zero bytes buffer? | je .L0 | | /* | * Compare strings of 1--32 bytes. We want to do this by | * loading into two xmm registers and then comparing. To avoid | * crossing into unmapped pages, we either load 32 bytes from | * the start of the buffer or 32 bytes before its end, depending | * on whether there is a page boundary between the overread area | * or not. | */ | | /* check for page boundaries overreads */ | lea 31(%rdi), %eax # end of overread | lea 31(%rsi), %r8d | lea -1(%rdi, %rdx, 1), %ecx # last character in buffer | lea -1(%rsi, %rdx, 1), %r9d | xor %ecx, %eax | xor %r9d, %r8d | test $PAGE_SIZE, %eax # are they on different pages? | jz 0f | | /* fix up rdi */ | movdqu -32(%rdi, %rdx, 1), %xmm0 | movdqu -16(%rdi, %rdx, 1), %xmm1 | lea -8(%rsp), %rdi # end of replacement buffer | sub %rdx, %rdi # start of replacement buffer | movdqa %xmm0, -40(%rsp) # copy to replacement buffer | movdqa %xmm1, -24(%rsp) | | 0: test $PAGE_SIZE, %r8d | jz 0f | | /* fix up rsi */ | movdqu -32(%rsi, %rdx, 1), %xmm0 | movdqu -16(%rsi, %rdx, 1), %xmm1 | lea -40(%rsp), %rsi # end of replacement buffer | sub %rdx, %rsi # start of replacement buffer | movdqa %xmm0, -72(%rsp) # copy to replacement buffer | movdqa %xmm1, -56(%rsp) | | /* load data and compare properly */ | 0: movdqu 16(%rdi), %xmm1 | movdqu 16(%rsi), %xmm3 | movdqu (%rdi), %xmm0 | movdqu (%rsi), %xmm2 | mov %edx, %ecx | mov $-1, %edx | shl %cl, %rdx # ones where the buffer is not | pcmpeqb %xmm3, %xmm1 | pcmpeqb %xmm2, %xmm0 | pmovmskb %xmm1, %ecx | pmovmskb %xmm0, %eax | shl $16, %ecx | or %ecx, %eax # ones where the buffers match | or %edx, %eax # including where the buffer is not | not %eax # ones where there is a mismatch | #ifndef BCMP | bsf %eax, %edx # location of the first mismatch | cmovz %eax, %edx # including if there is no mismatch | movzbl (%rdi, %rdx, 1), %eax # mismatching bytes | movzbl (%rsi, %rdx, 1), %edx | sub %edx, %eax | #endif | ret | | /* empty input */ | .L0: xor %eax, %eax | ret | | /* compare 33+ bytes */ | ALIGN_TEXT | .Llong: movdqu (%rdi), %xmm0 # load head | movdqu (%rsi), %xmm2 | mov %rdi, %rcx | sub %rdi, %rsi # express rsi as distance from rdi | and $~0xf, %rdi # align rdi to 16 bytes | movdqu 16(%rsi, %rdi, 1), %xmm1 | pcmpeqb 16(%rdi), %xmm1 # compare second half of this iteration | add %rcx, %rdx # pointer to last byte in buffer | jc .Loverflow # did this overflow? | 0: pcmpeqb %xmm2, %xmm0 | pmovmskb %xmm0, %eax | xor $0xffff, %eax # any mismatch? | jne .Lmismatch_head | add $64, %rdi # advance to next iteration | jmp 1f # and get going with the loop | | /* | * If we got here, a buffer length was passed to memcmp(a, b, len) | * such that a + len < a. While this sort of usage is illegal, | * it is plausible that a caller tries to do something like | * memcmp(a, b, SIZE_MAX) if a and b are known to differ, intending | * for memcmp() to stop comparing at the first mismatch. This | * behaviour is not guaranteed by any version of ISO/IEC 9899, | * but usually works out in practice. Let's try to make this | * case work by comparing until the end of the address space. 
| */ | .Loverflow: | mov $-1, %rdx # compare until the end of memory | jmp 0b | | /* process buffer 32 bytes at a time */ | ALIGN_TEXT | 0: movdqu -32(%rsi, %rdi, 1), %xmm0 | movdqu -16(%rsi, %rdi, 1), %xmm1 | pcmpeqb -32(%rdi), %xmm0 | pcmpeqb -16(%rdi), %xmm1 | add $32, %rdi # advance to next iteration | 1: pand %xmm0, %xmm1 # 0xff where both halves matched | pmovmskb %xmm1, %eax | cmp $0xffff, %eax # all bytes matched? | jne .Lmismatch | cmp %rdx, %rdi # end of buffer reached? | jb 0b | | /* less than 32 bytes left to compare */ | movdqu -16(%rdx), %xmm1 # load 32 byte tail through end pointer | movdqu -16(%rdx, %rsi, 1), %xmm3 | movdqu -32(%rdx), %xmm0 | movdqu -32(%rdx, %rsi, 1), %xmm2 | pcmpeqb %xmm3, %xmm1 | pcmpeqb %xmm2, %xmm0 | pmovmskb %xmm1, %ecx | pmovmskb %xmm0, %eax | shl $16, %ecx | or %ecx, %eax # ones where the buffers match | not %eax # ones where there is a mismatch | #ifndef BCMP | bsf %eax, %ecx # location of the first mismatch | cmovz %eax, %ecx # including if there is no mismatch | add %rcx, %rdx # pointer to potential mismatch | movzbl -32(%rdx), %eax # mismatching bytes | movzbl -32(%rdx, %rsi, 1), %edx | sub %edx, %eax | #endif | ret | | #ifdef BCMP | .Lmismatch: | mov $1, %eax | .Lmismatch_head: | ret | #else /* memcmp */ | .Lmismatch_head: | tzcnt %eax, %eax # location of mismatch | add %rax, %rcx # pointer to mismatch | movzbl (%rcx), %eax # mismatching bytes | movzbl (%rcx, %rsi, 1), %ecx | sub %ecx, %eax | ret | | .Lmismatch: | movdqu -48(%rsi, %rdi, 1), %xmm1 | pcmpeqb -48(%rdi), %xmm1 # reconstruct xmm1 before PAND | pmovmskb %xmm0, %eax # mismatches in first 16 bytes | pmovmskb %xmm1, %edx # mismatches in second 16 bytes | shl $16, %edx | or %edx, %eax # mismatches in both | not %eax # matches in both | tzcnt %eax, %eax # location of mismatch | add %rax, %rdi # pointer to mismatch | movzbl -64(%rdi), %eax # mismatching bytes | movzbl -64(%rdi, %rsi, 1), %ecx | sub %ecx, %eax | ret | #endif | ARCHEND(memcmp, baseline) `---- The following table might be helpful when reading unless the x86 registers are already ingrained in your head. :-)
| Name | Notes                                                                                          | Type      | 64-bit (long) | 32-bit (int) | 16-bit (short) | 8-bit (char) |
|------+------------------------------------------------------------------------------------------------+-----------+---------------+--------------+----------------+--------------|
| rax  | Values are returned from functions in this register.                                          | scratch   | rax           | eax          | ax             | ah and al    |
| rcx  | Typical scratch register. Some instructions also use it as a counter.                         | scratch   | rcx           | ecx          | cx             | ch and cl    |
| rdx  | Scratch register. Function argument #3 in 64-bit Linux.                                       | scratch   | rdx           | edx          | dx             | dh and dl    |
| rbx  | Preserved register: don't use it without saving it!                                           | preserved | rbx           | ebx          | bx             | bh and bl    |
| rsp  | The stack pointer. Points to the top of the stack.                                            | preserved | rsp           | esp          | sp             | spl          |
| rbp  | Preserved register. Sometimes used to store the old value of the stack pointer, or the "base". | preserved | rbp           | ebp          | bp             | bpl          |
| rsi  | Scratch register. Also used to pass function argument #2 in 64-bit Linux.                     | scratch   | rsi           | esi          | si             | sil          |
| rdi  | Scratch register. Function argument #1 in 64-bit Linux.                                       | scratch   | rdi           | edi          | di             | dil          |
| r8   | Scratch register. These were added in 64-bit mode, so they have numbers, not names.           | scratch   | r8            | r8d          | r8w            | r8b          |
| r9   | Scratch register.                                                                              | scratch   | r9            | r9d          | r9w            | r9b          |
| r10  | Scratch register.                                                                              | scratch   | r10           | r10d         | r10w           | r10b         |
| r11  | Scratch register.                                                                              | scratch   | r11           | r11d         | r11w           | r11b         |
| r12  | Preserved register. You can use it, but you need to save and restore it.                      | preserved | r12           | r12d         | r12w           | r12b         |
| r13  | Preserved register.                                                                            | preserved | r13           | r13d         | r13w           | r13b         |
| r14  | Preserved register.                                                                            | preserved | r14           | r14d         | r14w           | r14b         |
| r15  | Preserved register.                                                                            | preserved | r15           | r15d         | r15w           | r15b         |

X86 Registers, courtesy of [2]
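Before walking through the flow, the page-boundary test at the top of the
routine is worth spelling out. In C it amounts to roughly the following sketch
(PAGE_SIZE must be a power of two; since the two addresses are at most 31
bytes apart, they sit on different pages exactly when the page-size bit of
their XOR is set):

,----
| #include <stddef.h>
| #include <stdint.h>
|
| #ifndef PAGE_SIZE
| #define PAGE_SIZE 4096  /* assumption for this sketch; the real value comes from the system headers */
| #endif
|
| /*
|  * Would the 32-byte overread starting at buf touch a different page
|  * than the len bytes we are actually allowed to read?
|  */
| static int
| overread_crosses_page(const void *buf, size_t len)
| {
|     uintptr_t end_of_overread = (uintptr_t)buf + 31;   /* last byte the 32-byte load touches */
|     uintptr_t last_byte = (uintptr_t)buf + len - 1;    /* last byte we may legitimately read */
|
|     return (((end_of_overread ^ last_byte) & PAGE_SIZE) != 0);
| }
`----

This mirrors the lea/xor/test sequence at the top of the listing, and it is
the same trick the Aarch64 port will need later on.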
Luckily this code is well commented, but in short it does the following:
- Checks whether we are comparing more than 32 bytes.
- If not, it checks whether the bytes we are about to load would cross a page
  boundary, since we read 32 bytes at a time. For example, we may only need to
  compare 20 bytes while the page boundary lies 25 bytes from the beginning of
  one of our inputs.
- If that is the case, we copy the data to a replacement/bounce buffer and
  override rdi/rsi.
- If no funny business is required, we simply start comparing data in a similar
  way to `strlen', except now we have a mask telling us whether everything
  matched or not. So we look for the occurrence of the first 00 rather than FF
  as we did in `strlen'.
- Then once we do hit, we load the byte at that offset from each base address
  and subtract them before we return.


2 Translating to Aarch64
========================

To begin with we ignore the special case regarding page crossing and save it
for last, like a dessert. We start by loading registers pairwise, so 32 bytes
are loaded immediately. Then we start comparing using cmeq like we did with
`strlen', except that there we compared with `#0'. Then we need to reduce the
width so the result can fit in a GPR; we use `shrn' for that, just as we did
before. But for matches the result is `FF', so the register is all `FF''s if
the chunks were identical, and if they were identical we want to keep looping.
The problem is that we can't do a floating point compare with all `FF''s as
it's a `NaN', therefore we move it to a GPR and invert it. Inverting it also
allows us to count the zeroes: if the count comes out to 32 (`FF''s before
inverting) there was no mismatch and we can keep looking! To make sure we don't
look for too long we do a comparison with the limit first; if we are past our
limit and no mismatch was found, then we simply return 0.

,----
| .Lbegin:
| mov x8,x0 // store base address for later
| mov x9,x1
|
| ldp q0,q1,[x0] //load 32 bytes of b1 into vector registers
| ldp q2,q3,[x1]
|
| cmeq v0.16b,v0.16b,v2.16b // do compare between 0-15 b1 vs b2
| shrn v0.8b,v0.8h,#4 // shift them right to fit in x1
| cmeq v1.16b,v1.16b,v3.16b // do compare between 16-31 b1 vs b2
| shrn v1.8b,v1.8h,#4
|
| fmov x1,d0
| fmov x3,d1
|
| mvn x0,x1 // invert to use clz
| cbz x0,0f
| rbit x0,x0
| clz x0,x0 // if this is zero check bytes 16..32
| b 1f
|
| 0:
| rbit x1,x3
| mvn x1,x1
| clz x0,x1
| add x0,x0,#64
| 1:
| lsr x0,x0,#2
| cmp x0,limit // if we have more matches than the limit return 0
| b.ge .Lnone
| cmp x0,#32 // x0 == 32 if no hit (32 0's)
| b.eq .Lloop
| 2:
| ldrb w4,[x8,x0] // gonna need to edit this based on align offset
| ldrb w5,[x9,x0]
| subs w0,w4,w5 // get the byte difference
| ret
`----

Now this approach works for comparisons of up to 32 bytes, but we want to turn
it into a loop. That's quite straightforward with the way the code was written:
simply make the loads increment the pointers and subtract from the limit.

,----
| subs limit,limit,#32
| b.lt .Lnone //we could do special handling here
| ldp q0,q1,[x8,32]! //load 32 bytes of b1 into vector registers
| ldp q2,q3,[x9,32]!
`----

The problem with this approach is that the loop condition is buried deep within
a code path, so even with speculative execution and the like it will struggle.
We could instead try doing something like this:

,----
| .Lloop:
| subs limit,limit,#32
| b.lt .Lnone //we could do special handling here
| ldp q0,q1,[x8,32]! //load 32 bytes of b1 into vector registers
| ldp q2,q3,[x9,32]!
|
| cmeq v0.16b,v0.16b,v2.16b // do compare between 0-15 b1 vs b2
| cmeq v1.16b,v1.16b,v3.16b // do compare between 16-31 b1 vs b2
|
| umin v2.16b,v0.16b,v1.16b
| umin v2.16b,v2.16b,v2.16b // separate lowest value
| NOT v2.16b,v2.16b // invert it (FF -> we want to loop)
| shrn v2.8b,v2.8h,#4
| fcmp d2,#0.0
| b.eq .Lloop
`----

Benchmarks showed that this already resulted in better performance, but I tried
a few other ways to tighten the loop that turned out to be even faster.

,----
| .Lloop:
| subs limit,limit,#32
| b.lt .Lnone //we could do special handling here
| ldp q0,q1,[x8,32]! //load 32 bytes of b1 into vector registers
| ldp q2,q3,[x9,32]!
|
| eor v4.16b,v0.16b,v2.16b // nonzero where the first 16 bytes differ
| eor v5.16b,v1.16b,v3.16b // nonzero where the last 16 bytes differ
| umaxp v4.16b,v4.16b,v5.16b // fold the two halves together
| umaxp v4.16b,v4.16b,v4.16b // fold again so d4 covers all 32 bytes
| fmov tmp, d4
| cbz tmp, .Lloop
|
| cmeq v0.16b,v0.16b,v2.16b // do compare between 0-15 b1 vs b2
| cmeq v1.16b,v1.16b,v3.16b // do compare between 16-31 b1 vs b2
`----

Despite not reusing the result from cmeq, this gave a few percent better
performance across several Arm64 cores. This could be attributed to `cmeq' and
`shrn' utilizing fewer pipelines than `eor' and `umaxp', as shown for the
Neoverse V2 core in the optimization guide. [3] A nice thing I realized is that
this quick check is applicable to the first 32-byte iteration as well. Adding
it there resulted in almost a 50% improvement for short strings using the now
familiar suite of benchmarks.
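For readers who prefer C, the two building blocks above can be sketched with
NEON intrinsics roughly as follows (an illustration of the same idea, not the
code that ends up in libc): the first helper mirrors the cmeq/shrn/rbit/clz
sequence that locates the first differing byte in a 16-byte chunk, and the
second mirrors the cheaper eor/umaxp check used as the loop condition on a
32-byte chunk.

,----
| #include <arm_neon.h>
| #include <stddef.h>
| #include <stdint.h>
|
| /* Index of the first differing byte in a 16-byte chunk, or 16 if they match. */
| static size_t
| first_mismatch16(const uint8_t *a, const uint8_t *b)
| {
|     uint8x16_t eq = vceqq_u8(vld1q_u8(a), vld1q_u8(b));  /* 0xff where bytes are equal */
|     /* shrn #4 narrows the result to 4 bits per byte so it fits in a 64-bit GPR */
|     uint8x8_t nib = vshrn_n_u16(vreinterpretq_u16_u8(eq), 4);
|     uint64_t mask = ~vget_lane_u64(vreinterpret_u64_u8(nib), 0);  /* nibbles set where bytes differ */
|
|     if (mask == 0)
|         return (16);
|     return ((size_t)(__builtin_ctzll(mask) >> 2));  /* rbit+clz in the assembly; 4 mask bits per byte */
| }
|
| /* Nonzero if any of the 32 bytes differ; this is the whole loop condition. */
| static int
| any_mismatch32(const uint8_t *a, const uint8_t *b)
| {
|     uint8x16_t d0 = veorq_u8(vld1q_u8(a), vld1q_u8(b));
|     uint8x16_t d1 = veorq_u8(vld1q_u8(a + 16), vld1q_u8(b + 16));
|     uint8x16_t m = vpmaxq_u8(d0, d1);  /* umaxp: fold 32 bytes down to 16 */
|
|     m = vpmaxq_u8(m, m);               /* fold again so the low 64 bits carry the result */
|     return (vgetq_lane_u64(vreinterpretq_u64_u8(m), 0) != 0);
| }
`----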
Aarch64 registers.
2.1 Stay within your page
~~~~~~~~~~~~~~~~~~~~~~~~~

Now to the dessert: the crossing into unmapped pages. We put our memcmp through
the unit tests as previously described and we get the following: one test
failed! But why? And it's not even a test of memcmp! If we look into the log we
notice the following:

,----
| Process with PID 9597 exited with signal 11 and dumped core; attempting to gather stack trace
| [New LWP 120611]
| Core was generated by `/usr/obj/home/getz/gsoc/freebsd-src/arm64.aarch64/lib/libc/tests/string/checkdir'.
| Program terminated with signal SIGSEGV, Segmentation fault.
| Address not mapped to object.
| #0  memcmp () at /home/getz/gsoc/freebsd-src/lib/libc/aarch64/string/memcmp.S:78
| 78      ldp q2,q3,[x1] //load 32 bytes of b2 into vector registers
`----

So it's a `SIGSEGV' due to an address not being mapped, i.e. an unmapped page.
If we now run it through gdb (or lldb) we see the following values for `x0' and
`x1' right before the attempted load:

,----
| >>> p/tz $x0
| $2 = 0x0000000041035000
| >>> p/tz $x1
| $3 = 0x0000000041036fff
`----

To handle this we can first check whether we are allowed to compare more than
32 bytes (limit > 32). If not, we check how many bytes are left until the next
page crossing. If it's less than 32 bytes we load our 32 bytes from the
buffer's pointer minus that offset. Then, after the load, we can shift the
entire vector register to get rid of the unwanted bytes. On SSE there is a
`PSHUFB' instruction that does this, and luckily there is a more or less
equivalent instruction on Aarch64: `TBL'. We need to use TBL because the other
shift instructions, like `EXT' and `USHL', either don't work with variable
shifts or across entire vector registers. `USHL' would shift each element a
certain amount depending on the arrangement specifier (the .16b piece you often
see).

----------------------------------------------------------------------
The number followed by a letter on the vector registers is the arrangement
specifier: the number and size of the elements in the vector. A simple way to
keep track of them is by thinking:
- B = Bytes (8bit)
- H = Halfwords (16bit)
- S = Single words (32bit)
- D = Double words (64bit)
courtesy of [4]

So to check whether we are about to cross a page we can do the following:

,----
| cmp x2,#32
| b.hi .Lbegin // Jump to regular loop if we are allowed to check more than 32 bytes
| add x3,x8,#32
| add x4,x9,#32
| eor x3,x3,x8
| eor x4,x4,x9
| tst w3,#PAGE_SIZE
| b.ne .Lspecialhandling
| tst w4,#PAGE_SIZE
| b.ne .Lspecialhandling
`----

Then there's also a piece that was overlooked in the main loop. As neither buf1
nor buf2 is aligned, there is a chance that we may overread when there are
fewer than 32 bytes left to compare. This is easily solved by adding a new
branch where we previously checked whether the limit had gone down to 0.

,----
| .Lloop:
| subs x2,x2,#32
| b.le .Lnone
| cmp x2,#32
| b.le .Llast32
| ldp q0,q1,[x8,32]! // load 32 bytes to vector registers
| ldp q2,q3,[x9,32]!
|
| eor v4.16b,v0.16b,v2.16b
| eor v5.16b,v1.16b,v3.16b
| umaxp v4.16b,v4.16b,v5.16b
| umaxp v4.16b,v4.16b,v4.16b
| fmov x6,d4
| cbz x6,.Lloop
| b .Lmatch
|
| /* If 32 bytes left to compare only load 32 bytes from x8 - limit to
|  * avoid overread */
| .Llast32:
| mov x3,#32
| sub x3,x3,x2 // x3 = 32 - x2
| add x2,x2,x3 // add the amount we shifted to limit
| sub x8,x8,x3
| sub x9,x9,x3
|
| ldp q0,q1,[x8,#32]! // load 32 bytes to vector registers
| ldp q2,q3,[x9,#32]!
|
| .Lmatch:
| cmeq v0.16b,v0.16b,v2.16b // compare between 0-15 b1 vs b2
| cmeq v1.16b,v1.16b,v3.16b // compare between 16-31 b1 vs b2
`----


2.2 wide shifts
~~~~~~~~~~~~~~~

So now the only piece left in the puzzle is to shift the vector register for
the case where we have 32 bytes or less to compare and they are located near
the end of a page. To utilize `TBL' here we need to generate a suitable
permutation mask. To do this we can take the identity permutation
`0, 1, 2, ..., 15, 16, 17, ..., 31' and add to each byte to rotate it, or load
from it with a desired offset. This table can easily be prepared in a data
segment.

,----
| .section .rodata
| .p2align 4
| shift_table:
| .byte 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
| .byte 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
`----

`USHR/USHL' could also work, but as they can only shift 64-bit words they would
only be useful for very short strings, unless there is some nice way to utilize
them here. The `TBL/TBX' instructions are rather tricky to figure out from the
reference manual, but ARM provides a graphical representation of how they
operate which cleared some things up for me.
TBL lookup table instruction, courtesy of [5].
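As a sanity check of the trick, here is roughly the same realignment written
with NEON intrinsics (a sketch under the stated assumptions: len <= 32 and the
32-byte window ending at p + len stays within the page); the actual check and
fix in the .S file follows below.

,----
| #include <arm_neon.h>
| #include <stddef.h>
| #include <stdint.h>
|
| /* Identity permutation; adding an offset to every index slides the window. */
| static const uint8_t shift_table[32] = {
|      0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
|     16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
| };
|
| /*
|  * Load the len (<= 32) bytes starting at p without reading past p + len:
|  * read the 32 bytes that end at p + len, then use TBL to move byte i of
|  * the buffer into element i of the result.  Out-of-range indices yield 0,
|  * so the tail of the result is zero padded.
|  */
| static void
| load_last32(const uint8_t *p, size_t len, uint8x16_t out[2])
| {
|     size_t off = 32 - len;              /* how far the load was pulled back */
|     uint8x16x2_t data;
|     uint8x16_t idx0, idx1, voff;
|
|     data.val[0] = vld1q_u8(p - off);
|     data.val[1] = vld1q_u8(p - off + 16);
|
|     voff = vdupq_n_u8((uint8_t)off);
|     idx0 = vaddq_u8(vld1q_u8(shift_table), voff);
|     idx1 = vaddq_u8(vld1q_u8(shift_table + 16), voff);
|
|     out[0] = vqtbl2q_u8(data, idx0);    /* result[i] = p[i] for i < len */
|     out[1] = vqtbl2q_u8(data, idx1);    /* ...and 0 for i >= len */
| }
`----

Bytes past `len' come back as zero, so the comparison still has to be limited
to the first `len' bytes elsewhere.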
We simply check whether the 32nd byte from our buffer's starting address is
located on another page, as defined by the `PAGE_SIZE' macro from
`sys/arm64/include/param.h'. Then we shift and load by the appropriate amount.
So the check and fix can be implemented like the following:

,----
| ENTRY(_memcmp)
|
| mov x8, x0
| mov x9, x1
| cbz x2, .Lnone // 0 length
|
| /*
|  * Check if buffer is located at end of page to avoid crossing
|  * into unmapped page. If so, we load 32 bytes from the end of the
|  * limit and check the other buffer.
|  */
|
| cmp x2, #32
| b.hi .Lbegin
| add x3, x8, #32
| add x4, x9, #32
| eor x3, x3, x8
| eor x4, x4, x9
|
| tst w3, #PAGE_SIZE
| b.eq 0f
|
| mov x3, #32
| sub x3, x3, x2
| sub x8, x8, x3
|
| /*
|  * We perform a variable shift in the vector register using TBL,
|  * a suitable permutation is generated by loading a table of bytes
|  * with a desired offset.
|  */
|
| adrp x0, shift_table
| add x0, x0, :lo12:shift_table
| add x0, x0, x3
| ldp q0, q1, [x8]
| ldp q4, q5, [x0] // load permutation table
| tbl v0.16b, {v0.16b, v1.16b}, v4.16b
| tbl v1.16b, {v0.16b, v1.16b}, v5.16b
| add x8, x8, x3 // reset pointer to beginning of src
| b 1f
|
| 0:
| ldp q0, q1, [x8]
|
| 1:
| tst w4, #PAGE_SIZE
| b.eq 0f
|
| mov x3, #32
| sub x3, x3, x2
| sub x9, x9, x3
|
| ldp q2, q3, [x9]
| adrp x0, shift_table
| add x0, x0, :lo12:shift_table
| add x0, x0, x3
| ldp q4, q5, [x0]
| tbl v2.16b, {v2.16b, v3.16b}, v4.16b
| tbl v3.16b, {v2.16b, v3.16b}, v5.16b
| add x9, x9, x3
| b 1f
|
| /*
|  * Compare strings of 1--32 bytes. We do this by loading into two
|  * vector registers and then doing a quick compare with XOR, UMAXP
|  * to determine if the first 32 bytes all match.
|  */
| .Lbegin:
| ldp q0, q1, [x8]
| 0:
| ldp q2, q3, [x9]
| 1:
|
| /*
|  * REST OF CODE HERE
|  */
|
| END(_memcmp)
|
| .section .rodata
| .p2align 4
| shift_table:
| .byte 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
| .byte 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
`----


3 Benchmarks
============

Colin Percival was kind enough to give me a voucher for AWS credits, so now I
can also benchmark on a more powerful ARM core of the kind more commonly found
running a FreeBSD system.
,---- | os: FreeBSD | arch: arm64 | cpu: ARM Neoverse-V1 r1p1 | │ memcmpScalar │ memcmpARM │ memcmpSIMD │ | │ sec/op │ sec/op vs base │ sec/op vs base │ | MemcmpShort 139.41µ ± 2% 63.96µ ± 1% -54.12% (p=0.000 n=20) 25.14µ ± 1% -81.97% (p=0.000 n=20) | MemcmpMid 93.38µ ± 3% 12.09µ ± 1% -87.05% (p=0.000 n=20) 10.91µ ± 1% -88.31% (p=0.000 n=20) | MemcmpLong 86.722µ ± 7% 4.648µ ± 1% -94.64% (p=0.000 n=20) 4.931µ ± 0% -94.31% (p=0.000 n=20) | geomean 104.1µ 15.32µ -85.29% 11.06µ -89.38% | | │ memcmpScalar │ memcmpARM │ memcmpSIMD │ | │ B/s │ B/s vs base │ B/s vs base │ | MemcmpShort 855.1Mi ± 2% 1864.0Mi ± 1% +117.99% (p=0.000 n=20) 4742.2Mi ± 1% +454.58% (p=0.000 n=20) | MemcmpMid 1.247Gi ± 3% 9.629Gi ± 1% +671.89% (p=0.000 n=20) 10.668Gi ± 1% +755.20% (p=0.000 n=20) | MemcmpLong 1.342Gi ± 6% 25.048Gi ± 1% +1765.92% (p=0.000 n=20) 23.608Gi ± 0% +1658.68% (p=0.000 n=20) | geomean 1.118Gi 7.600Gi +579.66% 10.53Gi +841.33% | | os: FreeBSD | arch: arm64 | cpu: ARM Cortex-A76 r4p1 | │ memcmpScalar │ memcmpARM │ memcmpSIMD │ | │ sec/op │ sec/op vs base │ sec/op vs base │ | MemcmpShort 183.12µ ± 0% 101.29µ ± 0% -44.68% (p=0.000 n=20) 39.96µ ± 0% -78.18% (p=0.000 n=20) | MemcmpMid 129.12µ ± 0% 24.55µ ± 1% -80.99% (p=0.000 n=20) 21.48µ ± 0% -83.37% (p=0.000 n=20) | MemcmpLong 111.374µ ± 0% 6.288µ ± 0% -94.35% (p=0.000 n=20) 7.593µ ± 0% -93.18% (p=0.000 n=20) | geomean 138.1µ 25.01µ -81.89% 18.68µ -86.47% | | │ memcmpScalar │ memcmpARM │ memcmpSIMD │ | │ B/s │ B/s vs base │ B/s vs base │ | MemcmpShort 651.0Mi ± 0% 1176.9Mi ± 0% +80.78% (p=0.000 n=20) 2983.1Mi ± 0% +358.23% (p=0.000 n=20) | MemcmpMid 923.3Mi ± 0% 4856.0Mi ± 1% +425.95% (p=0.000 n=20) 5551.0Mi ± 0% +501.23% (p=0.000 n=20) | MemcmpLong 1.045Gi ± 0% 18.514Gi ± 0% +1671.24% (p=0.000 n=20) 15.332Gi ± 0% +1366.84% (p=0.000 n=20) | geomean 863.3Mi 4.656Gi +452.24% 6.233Gi +639.33% `---- 4 References ============