GSoC status update #4 (A bit of everything)

14-07-2024


What’s been going on this week

This week I put the finishing touches on strncmp. It’s essentially the same as strcmp, except that it has special handling for the limit and breaks out of the main loop when there are fewer than 32 bytes of the limit left.
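To illustrate that loop structure, here is a minimal scalar sketch in C. It is only a model of the control flow described above, not the actual SIMD routine; the name model_strncmp is made up for this example, and the real code compares 16-byte vectors instead of single bytes.

```c
#include <stddef.h>

/* Hypothetical scalar model of the loop structure, not the real routine. */
static int model_strncmp(const char *s1, const char *s2, size_t n)
{
    const unsigned char *a = (const unsigned char *)s1;
    const unsigned char *b = (const unsigned char *)s2;

    /*
     * Main loop: runs only while a full 32-byte block still fits inside
     * the limit, so it only has to look for a mismatch or a NUL byte.
     */
    while (n >= 32) {
        for (size_t i = 0; i < 32; i++) {
            if (a[i] != b[i] || a[i] == '\0')
                return a[i] - b[i];
        }
        a += 32;
        b += 32;
        n -= 32;
    }

    /* Tail: fewer than 32 bytes of the limit remain, so honour the limit. */
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i] || a[i] == '\0')
            return a[i] - b[i];
    }
    return 0;
}
```

Keeping the limit check out of the main loop is the point of the split: the hot path looks just like strcmp, and only the tail pays for the extra bookkeeping.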

I also got some good code review on str(n)cmp and applied it, which resulted in a performance improvement of a few percent. [1]

strncmp pains

I thought I could get away with a clever solution for handling short strings (less than 16 bytes) near a page boundary in strncmp, but I was mistaken. After updating the test suite to place buffers right at the end of a page, with DEATH waiting across the boundary, I noticed that it failed. So I had to revert to the more complicated handling and make it even more complicated: simply checking for null bytes isn’t enough, I also need to insert a fake null byte wherever the limit is. I solved the first part like this; the latter part is a problem for tomorrow morning. :-)

@@ -103,24 +103,56 @@ ENTRY(strncmp)
    .p2align 4
 .Llt16:
-   tbz w3, #PAGE_SHIFT, 0f
+   /*
+    * Check if either string is located at end of page to avoid crossing
+    * into unmapped page. If so, we load 16 bytes from the nearest
+    * alignment boundary and shift based on the offset.
+    */
+   tbz w3, #PAGE_SHIFT, 2f

    ldr q0, [x8]            // load aligned head
    ldr q1, [x10]

+   lsl x14, x9, #2
+   lsl x15, x11, #2
+   lsl x3, x13, x14            // string head
+   lsl x4, x13, x15
+
+   cmeq    v5.16b, v0.16b, #0
+   cmeq    v6.16b, v1.16b, #0
+
+   shrn    v5.8b, v5.8h, #4
+   shrn    v6.8b, v6.8h, #4
+   fmov    x5, d5
+   fmov    x6, d6
+
    adrp    x14, shift_data
    add x14, x14, :lo12:shift_data

    /* heads may cross page boundary, avoid unmapped loads */
+   tst x5, x3
+   b.eq    0f
+
    ldr q4, [x14, x9]           // load permutation table
    tbl v0.16b, {v0.16b}, v4.16b
+
+   b   1f
+   .p2align 4
+0:
+   ldr q0, [x0]            // load true head
+1:
+   tst x6, x4
+   b.eq    0f
+
    ldr q4, [x14, x11]
    tbl v4.16b, {v1.16b}, v4.16b
+
    b 1f

    .p2align 4
-0:
+2:
    ldr q0, [x0]            // load true heads
+0:
    ldr q4, [x1]
 1:
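For readers who don't read AArch64 assembly daily, the two ideas involved can be modelled in scalar C. This is a sketch under assumptions: 4 KiB pages (PAGE_SHIFT of 12), and all function names are invented for illustration. The first helper models the add/eor/tbz page check, the others model the 4-bits-per-byte mask that cmeq, shrn #4 and fmov produce, plus the "fake null byte at the limit" trick.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/*
 * Models "add x3, x0, #16; eor x3, x3, x0; tbz w3, #PAGE_SHIFT".
 * Conservative: also reports true when the 16 bytes end exactly at
 * the page boundary, which only costs a trip through the slow path.
 */
static bool head_may_cross_page(uintptr_t addr)
{
    return (((addr + 16) ^ addr) >> PAGE_SHIFT) & 1;
}

/* Models cmeq #0 + shrn #4 + fmov: one nibble per byte, 0xf where NUL. */
static uint64_t nibble_mask_eq_zero(const unsigned char chunk[16])
{
    uint64_t mask = 0;

    for (int i = 0; i < 16; i++) {
        if (chunk[i] == 0)
            mask |= (uint64_t)0xf << (4 * i);
    }
    return mask;
}

/*
 * Pretend there is a terminator at byte index idx and at every byte
 * after it. idx must be below 16, otherwise the shift is undefined.
 */
static uint64_t with_fake_nul(uint64_t nul_mask, unsigned idx)
{
    return nul_mask | (UINT64_MAX << (4 * idx));
}

/* Models rbit + clz + lsr #2: index of the first marked byte, 16 if none. */
static unsigned first_marked_byte(uint64_t mask)
{
    unsigned i = 0;

    while (i < 16 && ((mask >> (4 * i)) & 0xf) == 0)
        i++;
    return i;
}
```

With these, first_marked_byte(with_fake_nul(nibble_mask_eq_zero(chunk), idx)) yields whichever comes first, the real NUL or the position where the limit forces the comparison to stop.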

Next function, strcspn

It’s also time to port a new function. I was choosing between memccpy and strcspn, and chose to start exploring strcspn; it requires some really interesting tricks to get it fast. On x86, when SSE4.2 is available, we can use the amazing pcmpistri instruction to compare a vector register against a set without overreading past a null byte. [2]

On AArch64 there is no equivalent. SVE2 has the MATCH instruction, but that’s of no use to us: the Graviton 3 CPU that I’ve been benchmarking on supports SVE but not SVE2. FreeBSD itself is only just about to get SVE support; it’s about to land in -CURRENT any second now, IIRC.

strcspn tomfoolery

Checking with SIMD whether a byte is present in a set is not a completely new problem. Although I haven’t seen any such algorithm implemented for AArch64, there is one for x86 developed by Wojciech Muła and Geoff Langdale. [3]

I won’t go into detail just yet; it’s quite a lot to wrap my head around, but the linked article is a great resource. A version of this algorithm is used by the Intel Hyperscan project. [4][5]
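As a rough intuition only, here is a scalar C model of a nibble-indexed bitmap lookup, which is how I currently understand the core idea. It is not the SIMD algorithm from the article: the struct, the function names and the exact table layout are assumptions made for this sketch, and a vectorised version would do the table lookups with byte shuffles rather than array indexing.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: one 16-bit row per low nibble, one bit per high nibble. */
struct nibble_set {
    uint16_t row[16];
};

/* Setup phase: record every byte of the set in the bitmap. */
static void nibble_set_init(struct nibble_set *ns, const char *set)
{
    for (int i = 0; i < 16; i++)
        ns->row[i] = 0;
    for (const unsigned char *p = (const unsigned char *)set; *p != '\0'; p++)
        ns->row[*p & 0xf] |= (uint16_t)(1u << (*p >> 4));
}

/* Membership test: pick the row by low nibble, test the bit by high nibble. */
static int nibble_set_contains(const struct nibble_set *ns, unsigned char c)
{
    return (ns->row[c & 0xf] >> (c >> 4)) & 1;
}

/* Scalar strcspn built on top of the lookup, for illustration only. */
static size_t model_strcspn(const char *s, const char *set)
{
    struct nibble_set ns;
    size_t i = 0;

    nibble_set_init(&ns, set);
    while (s[i] != '\0' && !nibble_set_contains(&ns, (unsigned char)s[i]))
        i++;
    return i;
}
```

The appeal of this shape is that the per-byte work is two small table lookups and a bit test regardless of how large the set is.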

An interesting suggestion fuz had was to duplicate a byte from the set and iterate over the string during the setup phase. If an early set member matches, this approach would be greatly beneficial.

This works because the LUT for the algorithm can be built using scalar registers, while the check with ld1r and cmeq could be done in vector registers, fully utilizing the processor’s different pipelines.
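A scalar C model of that broadcast-and-compare check might look as follows. The function names are hypothetical and the code is only a stand-in: each call to broadcast_match_mask plays the role of one ld1r (duplicate a set byte across a vector) followed by one cmeq against 16 bytes of the string.

```c
#include <stdint.h>

/* Stand-in for ld1r + cmeq: bit i is set if chunk[i] equals the set member. */
static uint16_t broadcast_match_mask(const unsigned char chunk[16],
                                     unsigned char member)
{
    uint16_t mask = 0;

    for (int i = 0; i < 16; i++) {
        if (chunk[i] == member)
            mask |= (uint16_t)(1u << i);
    }
    return mask;
}

/* Accumulate the matches of every set member against one 16-byte chunk. */
static uint16_t setup_phase_matches(const unsigned char chunk[16],
                                    const char *set)
{
    uint16_t mask = 0;

    for (const unsigned char *p = (const unsigned char *)set; *p != '\0'; p++)
        mask |= broadcast_match_mask(chunk, *p);
    return mask;
}
```

If the accumulated mask is already non-zero for the first chunk, the answer is known before the lookup table is even finished, which is where the benefit for early matches would come from.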

Another useful article is [6] by Harold Aptroot, but I will need to figure out how to cope with the lack of the extremely versatile but tricky gf2p8affineqb (Galois Field Affine Transformation) instruction on AArch64.

A list of unexpected uses for the Galois Field Affine Transformation instruction is available here. [7]

Performance analysis

My mentor also introduced me to some really cool projects that simulate different computer architectures and predict the throughput of basic blocks. The first one is uiCA (uops.info Code Analyzer) [8], developed by researchers at Saarland University. The second one is OSACA (Open Source Architecture Code Analyzer) [9], which supports both x86 and AArch64. It’s developed by RRZE-HPC at the Erlangen National High Performance Computing Center. OSACA also has integration with Compiler Explorer.

These tools are a clear step up from SimpleScalar, which I used in university for our computer architecture course. [10] SimpleScalar is legacy research software. I suspect it might soon be swapped out of our curriculum; that course has been getting serious upgrades over the last few years.

For my strncmp implementation OSACA produces the following:

OSACA results

Open Source Architecture Code Analyzer (OSACA) - 0.5.2
Analyzed file:      /app/example.asm
Architecture:       A64FX
Timestamp:          2024-07-14 17:00:03

-------------------------- WARNING: No micro-architecture was specified -------------------------
         A default uarch for this particular ISA was used. Specify the uarch with --arch.
         See --help for more information.
-------------------------------------------------------------------------------------------------
----------------- WARNING: You are analyzing a large amount of instruction forms ----------------
         Analysis across loops/block boundaries often do not make much sense.
         Specify the kernel length with --length. See --help for more information.
         If this is intentional, you can safely ignore this message.
-------------------------------------------------------------------------------------------------

 P - Throughput of LOAD operation can be hidden behind a past or future STORE instruction
 * - Instruction micro-ops not bound to a port
 X - No throughput/latency information for this instruction in data file

Combined Analysis Report
------------------------
                                      Port pressure in cycles                                       
     |  0   - 0DV  |  1   |  2   |  3   |   4   |  5   -  5D  |  6   -  6D  |  7   ||  CP  | LCD  |
---------------------------------------------------------------------------------------------------
   1 |             |      |      |      |       |             |             |      ||      |      |   strncmp:
   2 |             |      |      |      |       |             |             |      ||  0.0 |  0.0 | X bic x8, x0, #0xf   // x8 is x0 but aligned to the boundary
   3 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   and x9, x0, #0xf   // x9 is the offset
   4 |             |      |      |      |       |             |             |      ||      |      | X bic x10, x1, #0xf   // x10 is x1 but aligned to the boundary
   5 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   and x11, x1, #0xf   // x11 is the offset
   7 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   subs x2, x2, #1
   8 |             |      |      |      |       |             |             | 1.00 ||      |      |   b.mi .Lempty
  10 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   mov x13, #-1    // save constant for later
  12 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   add x3, x0, #16   // end of head
  13 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   add x4, x1, #16
  14 |             |      |      |      |       |             |             |      ||      |      | X eor x3, x3, x0
  15 |             |      |      |      |       |             |             |      ||      |      | X eor x4, x4, x1   // bits that changed
  16 |             |      |      | 0.50 | 0.500 | 0.00        | 0.00        |      ||      |      |   orr x3, x3, x4   // in either str1 or str2
  17 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cmp x2,#16
  18 |             |      |      |      |       |             |             | 1.00 ||      |      |   b.lt .Llt16
  19 |             |      |      |      |       |             |             |      ||      |      | X tbz w3, #4096, .Lbegin
  21 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q0, [x8]   // load aligned head
  22 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q1, [x10]
  24 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   lsl x14, x9, #2
  25 |             |      |      |      |       |             |             |      ||      |      | X lsl x3, x13, x14   // string head
  26 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   lsl x15, x11, #2
  27 |             |      |      |      |       |             |             |      ||      |      | X lsl x4, x13, x15
  29 |             |      |      |      |       |             |             |      ||      |      | X cmeq v5.16b, v0.16b, #0
  30 |             |      |      |      |       |             |             |      ||      |      | X cmeq v6.16b, v1.16b, #0
  32 |             |      |      |      |       |             |             |      ||      |      | X shrn v5.8b, v5.8h, #4
  33 |             |      |      |      |       |             |             |      ||      |      | X shrn v6.8b, v6.8h, #4
  34 |             |      |      |      |       |             |             |      ||      |      | X fmov x5, d5
  35 |             |      |      |      |       |             |             |      ||      |      | X fmov x6, d6
  37 |             |      |      |      |       |             |             |      ||      |      | X adrp x14, shift_data
  38 |             |      |      |      |       |             |             |      ||      |      | X add x14, x14, :lo12:shift_data
  40 |             |      |      |      |       |             |             |      ||      |      | X tst x5, x3
  41 |             |      |      |      |       |             |             | 1.00 ||      |      |   b.eq zero
  43 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q4, [x14, x9]   // load permutation table
  44 |             |      |      |      |       |             |             |      ||      |      | X tbl v0.16b, {v0.16b}, v4.16b
  46 |             |      |      |      |       |             |             | 1.00 ||      |      |   b one
  47 |             |      |      |      |       |             |             |      ||      |      |   .p2align 4
  48 |             |      |      |      |       |             |             |      ||      |      |   zero:
  49 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q0, [x0]   // load true head
  50 |             |      |      |      |       |             |             |      ||      |      |   one:
  51 |             |      |      |      |       |             |             |      ||      |      | X tst x6, x4
  52 |             |      |      |      |       |             |             | 1.00 ||      |      |   b.eq zeroo
  54 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q4, [x14, x11]
  55 |             |      |      |      |       |             |             |      ||      |      | X tbl v4.16b, {v1.16b}, v4.16b
  57 |             |      |      |      |       |             |             | 1.00 ||      |      |   b onee
  59 |             |      |      |      |       |             |             |      ||      |      |   .p2align 4
  60 |             |      |      |      |       |             |             |      ||      |      |   .Lbegin:
  61 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q0, [x0]   // load true heads
  62 |             |      |      |      |       |             |             |      ||      |      |   zeroo:
  63 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q4, [x1]
  64 |             |      |      |      |       |             |             |      ||      |      |   onee:
  65 |             |      |      |      |       |             |             |      ||      |      | X cmeq v2.16b, v0.16b, #0  // NUL byte present?
  66 |             |      |      |      |       |             |             |      ||      |      | X cmeq v4.16b, v0.16b, v4.16b  // which bytes match?
  68 |             |      |      |      |       |             |             |      ||      |      | X orn v2.16b, v2.16b, v4.16b  // mismatch or NUL byte?
  70 |             |      |      |      |       |             |             |      ||      |      | X shrn v2.8b, v2.8h, #4
  71 |             |      |      |      |       |             |             |      ||      |      | X fmov x5, d2
  73 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cbnz x5, .Lhead_mismatch
  74 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q2, [x8, #16]   // load second chunk
  75 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q3, [x10, #16]
  77 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   add x2, x2, x11
  78 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   sub x2, x2, #16   // account for length of RSI chunk?
  80 |             |      |      |      |       |             |             |      ||      |      | X subs x9, x9, x11
  81 |             |      |      |      |       |             |             | 1.00 ||      |      |   b.lt .Lswapped   // if not swap operands
  82 |             |      |      |      |       |             |             | 1.00 ||      |      |   b .Lnormal
  84 |             |      |      |      |       |             |             |      ||      |      |   .p2align 4
  85 |             |      |      |      |       |             |             |      ||      |      |   .Llt16:
  87 |             |      |      |      |       |             |             |      ||      |      | X tbz w3, #4096, two
  89 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q0, [x8]   // load aligned head
  90 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q1, [x10]
  92 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   lsl x14, x9, #2
  93 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   lsl x15, x11, #2
  94 |             |      |      |      |       |             |             |      ||      |      | X lsl x3, x13, x14   // string head
  95 |             |      |      |      |       |             |             |      ||      |      | X lsl x4, x13, x15
  97 |             |      |      |      |       |             |             |      ||      |      | X cmeq v5.16b, v0.16b, #0
  98 |             |      |      |      |       |             |             |      ||      |      | X cmeq v6.16b, v1.16b, #0
 100 |             |      |      |      |       |             |             |      ||      |      | X shrn v5.8b, v5.8h, #4
 101 |             |      |      |      |       |             |             |      ||      |      | X shrn v6.8b, v6.8h, #4
 102 |             |      |      |      |       |             |             |      ||      |      | X fmov x5, d5
 103 |             |      |      |      |       |             |             |      ||      |      | X fmov x6, d6
 105 |             |      |      |      |       |             |             |      ||      |      | X adrp x14, shift_data
 106 |             |      |      |      |       |             |             |      ||      |      | X add x14, x14, :lo12:shift_data
 108 |             |      |      |      |       |             |             |      ||      |      | X tst x5, x3
 109 |             |      |      |      |       |             |             | 1.00 ||      |      |   b.eq zerooo
 111 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q4, [x14, x9]   // load permutation table
 112 |             |      |      |      |       |             |             |      ||      |      | X tbl v0.16b, {v0.16b}, v4.16b
 114 |             |      |      |      |       |             |             | 1.00 ||      |      |   b oneee
 115 |             |      |      |      |       |             |             |      ||      |      |   .p2align 4
 116 |             |      |      |      |       |             |             |      ||      |      |   zerooo:
 117 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q0, [x0]   // load true head
 118 |             |      |      |      |       |             |             |      ||      |      |   oneee:
 119 |             |      |      |      |       |             |             |      ||      |      | X tst x6, x4
 120 |             |      |      |      |       |             |             | 1.00 ||      |      |   b.eq noll
 122 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q4, [x14, x11]
 123 |             |      |      |      |       |             |             |      ||      |      | X tbl v4.16b, {v1.16b}, v4.16b
 125 |             |      |      |      |       |             |             | 1.00 ||      |      |   b ett
 127 |             |      |      |      |       |             |             |      ||      |      |   .p2align 4
 128 |             |      |      |      |       |             |             |      ||      |      |   two:
 129 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q0, [x0]   // load true heads
 130 |             |      |      |      |       |             |             |      ||      |      |   noll:
 131 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q4, [x1]
 132 |             |      |      |      |       |             |             |      ||      |      |   ett:
 134 |             |      |      |      |       |             |             |      ||      |      | X cmeq v2.16b, v0.16b, #0  // NUL byte present?
 135 |             |      |      |      |       |             |             |      ||      |      | X cmeq v4.16b, v0.16b, v4.16b  // which bytes match?
 137 |             |      |      |      |       |             |             |      ||      |      | X bic v2.16b, v4.16b, v2.16b  // match and not NUL byte
 139 |             |      |      |      |       |             |             |      ||      |      | X shrn v2.8b, v2.8h, #4
 140 |             |      |      |      |       |             |             |      ||      |      | X fmov x5, d2
 141 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   lsl x4, x2, #2
 142 |             |      |      |      |       |             |             |      ||      |      | X lsl x4, x13, x4
 143 |             |      |      |      |       |             |             |      ||      |      | X orn x5, x4, x5   // mismatch or NUL byte?
 145 |             |      |      |      |       |             |             |      ||      |      |   .Lhead_mismatch:
 146 |             |      |      |      |       |             |             |      ||      |      | X rbit x3, x5
 147 |             |      |      |      |       |             |             |      ||      |      | X clz x3, x3    // index of mismatch
 148 |             |      |      |      |       |             |             |      ||      |      | X lsr x3, x3, #2
 149 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldrb w4, [x0, x3]
 150 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldrb w5, [x1, x3]
 151 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   sub w0, w4, w5
 152 |             |      |      |      |       | 0.50        | 0.50        |      ||      |      |   ret
 154 |             |      |      |      |       |             |             |      ||      |      |   .p2align 4
 155 |             |      |      |      |       |             |             |      ||      |      |   .Lnormal:
 156 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   sub x12, x10, x9
 157 |             |      |      | 0.75 | 0.750 | 0.25   0.50 | 0.25   0.50 |      ||      |      |   ldr q0, [x12, #16]!
 158 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   sub x10, x10, x8
 159 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   sub x11, x10, x9
 161 |             |      |      |      |       |             |             |      ||      |      | X cmeq v1.16b, v3.16b, #0  // NUL present?
 162 |             |      |      |      |       |             |             |      ||      |      | X cmeq v0.16b, v0.16b, v2.16b  // Mismatch between chunks?
 163 |             |      |      |      |       |             |             |      ||      |      | X shrn v1.8b, v1.8h, #4
 164 |             |      |      |      |       |             |             |      ||      |      | X shrn v0.8b, v0.8h, #4
 165 |             |      |      |      |       |             |             |      ||      |      | X fmov x6, d1
 166 |             |      |      |      |       |             |             |      ||      |      | X fmov x5, d0
 168 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||  1.0 |  1.0 |   add x8, x8, #32   // advance to next iteration
 170 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   lsl x4, x2, #2
 171 |             |      |      |      |       |             |             |      ||      |      | X lsl x4, x13, x4
 172 |             |      |      | 0.50 | 0.500 | 0.00        | 0.00        |      ||      |      |   orr x3, x6, x4   // introduce a null byte match
 173 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cmp x2, #16    // does the buffer end within x2
 174 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   csel x6, x3, x6, lt
 175 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cbnz x6, .Lnulfound2   // NUL or end of buffer found?
 176 |             |      |      |      |       |             |             |      ||      |      | X mvn x5, x5
 177 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cbnz x5, .Lmismatch2
 178 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   sub x2, x2, #16
 179 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cmp x2, #32    // end of buffer within first main loop iteration?
 180 |             |      |      |      |       |             |             | 1.00 ||      |      |   b.lt .Ltail
 182 |             |      |      |      |       |             |             |      ||      |      |   .p2align 4
 183 |             |      |      |      |       |             |             |      ||      |      |   nada:
 184 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q0, [x8, x11]
 185 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q1, [x8, x10]
 186 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q2, [x8]
 188 |             |      |      |      |       |             |             |      ||      |      | X cmeq v1.16b, v1.16b, #0  // end of string?
 189 |             |      |      |      |       |             |             |      ||      |      | X cmeq v0.16b, v0.16b, v2.16b  // do the chunks match?
 191 |             |      |      |      |       |             |             |      ||      |      | X shrn v1.8b, v1.8h, #4
 192 |             |      |      |      |       |             |             |      ||      |      | X shrn v0.8b, v0.8h, #4
 193 |             |      |      |      |       |             |             |      ||      |      | X fmov x6, d1
 194 |             |      |      |      |       |             |             |      ||      |      | X fmov x5, d0
 195 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cbnz x6, .Lnulfound
 196 |             |      |      |      |       |             |             |      ||      |      | X mvn x5, x5    // any mismatches?
 197 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cbnz x5, .Lmismatch
 199 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||  1.0 |  1.0 |   add x8, x8, #16
 202 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q0, [x8, x11]
 203 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q1, [x8, x10]
 204 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q2, [x8]
 206 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||  1.0 |  1.0 |   add x8, x8, #16
 207 |             |      |      |      |       |             |             |      ||      |      | X cmeq v1.16b, v1.16b, #0
 208 |             |      |      |      |       |             |             |      ||      |      | X cmeq v0.16b, v0.16b, v2.16b
 210 |             |      |      |      |       |             |             |      ||      |      | X shrn v1.8b, v1.8h, #4
 211 |             |      |      |      |       |             |             |      ||      |      | X shrn v0.8b, v0.8h, #4
 212 |             |      |      |      |       |             |             |      ||      |      | X fmov x6, d1
 213 |             |      |      |      |       |             |             |      ||      |      | X fmov x5, d0
 214 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cbnz x6, .Lnulfound2
 215 |             |      |      |      |       |             |             |      ||      |      | X mvn x5, x5
 216 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cbnz x5, .Lmismatch2
 217 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   sub x2, x2, #32
 218 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cmp x2, #32    // end of buffer within next iteration
 219 |             |      |      |      |       |             |             | 1.00 ||      |      |   b.ge nada    // if yes, process tail
 222 |             |      |      |      |       |             |             |      ||      |      |   .Ltail:
 223 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q0, [x8, x11]
 224 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q1, [x8, x10]
 225 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q2, [x8]
 227 |             |      |      |      |       |             |             |      ||      |      | X cmeq v1.16b, v1.16b, #0  // end of string?
 228 |             |      |      |      |       |             |             |      ||      |      | X cmeq v0.16b, v0.16b, v2.16b  // do the chunks match?
 230 |             |      |      |      |       |             |             |      ||      |      | X shrn v1.8b, v1.8h, #4
 231 |             |      |      |      |       |             |             |      ||      |      | X shrn v0.8b, v0.8h, #4
 232 |             |      |      |      |       |             |             |      ||      |      | X fmov x6, d1
 233 |             |      |      |      |       |             |             |      ||      |      | X fmov x5, d0
 237 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   lsl x4, x2, #2
 238 |             |      |      |      |       |             |             |      ||      |      | X lsl x4, x13, x4
 239 |             |      |      | 0.50 | 0.500 | 0.00        | 0.00        |      ||      |      |   orr x3, x6, x4   // introduce a null byte match
 240 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cmp x2, #16    // does the buffer end within x2
 241 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   csel x6, x3, x6, lt
 243 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cbnz x6, .Lnulfound   // NUL or end of string found
 244 |             |      |      |      |       |             |             |      ||      |      | X mvn x5, x5
 245 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cbnz x5, .Lmismatch
 247 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||  1.0 |  1.0 |   add x8, x8, #16
 249 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q0, [x8, x11]
 250 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q1, [x8, x10]
 251 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q2, [x8]
 253 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||  1.0 |  1.0 |   add x8, x8, #16
 254 |             |      |      |      |       |             |             |      ||      |      | X cmeq v1.16b, v1.16b, #0
 255 |             |      |      |      |       |             |             |      ||      |      | X cmeq v0.16b, v0.16b, v2.16b
 257 |             |      |      |      |       |             |             |      ||      |      | X shrn v1.8b, v1.8h, #4
 258 |             |      |      |      |       |             |             |      ||      |      | X shrn v0.8b, v0.8h, #4
 259 |             |      |      |      |       |             |             |      ||      |      | X fmov x6, d1
 260 |             |      |      |      |       |             |             |      ||      |      | X fmov x5, d0
 262 |             |      |      |      |       |             |             |      ||      |      | X ubfiz x4, x2, #2, #4 // (x2 - 16) << 2
 263 |             |      |      |      |       |             |             |      ||      |      | X lsl x4, x13, x4   // take first half into account
 264 |             |      |      | 0.50 | 0.500 | 0.00        | 0.00        |      ||      |      |   orr x6, x6, x4   // introduce a null byte match
 266 |             |      |      |      |       |             |             |      ||      |      |   .Lnulfound2:
 267 |             |      |      | 0.50 | 0.500 |             |             |      ||  1.0 |  1.0 |   sub x8, x8, #16
 269 |             |      |      |      |       |             |             |      ||      |      |   .Lnulfound:
 270 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   mov x4, x6
 272 |             |      |      |      |       |             |             |      ||      |      | X ubfiz x7, x9, #2, #4
 273 |             |      |      |      |       |             |             |      ||      |      | X lsl x6, x6, x7   // adjust NUL mask to indices
 275 |             |      |      |      |       |             |             |      ||      |      | X orn x5, x6, x5
 276 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cbnz x5, .Lmismatch
 278 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q0, [x8, x9]
 279 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q1, [x8, x10]
 281 |             |      |      |      |       |             |             |      ||      |      | X cmeq v1.16b, v0.16b, v1.16b
 282 |             |      |      |      |       |             |             |      ||      |      | X shrn v1.8b, v1.8h, #4
 283 |             |      |      |      |       |             |             |      ||      |      | X fmov x5, d1
 285 |             |      |      |      |       |             |             |      ||      |      | X orn x5, x4, x5
 287 |             |      |      |      |       |             |             |      ||      |      | X rbit x3, x5
 288 |             |      |      |      |       |             |             |      ||      |      | X clz x3, x3
 289 |             |      |      |      |       |             |             |      ||      |      | X lsr x5, x3, #2
 291 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||  1.0 |  1.0 |   add x10, x10, x8   // restore x10 pointer
 292 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   add x8, x8, x9   // point to corresponding chunk in x0
 294 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldrb w4, [x8, x5]
 295 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldrb w5, [x10, x5]
 296 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   sub w0, w4, w5
 297 |             |      |      |      |       | 0.50        | 0.50        |      ||      |      |   ret
 299 |             |      |      |      |       |             |             |      ||      |      |   .p2align 4
 300 |             |      |      |      |       |             |             |      ||      |      |   .Lmismatch2:
 301 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   sub x8, x8, #16   // roll back second increment
 302 |             |      |      |      |       |             |             |      ||      |      |   .Lmismatch:
 303 |             |      |      |      |       |             |             |      ||      |      | X rbit x3, x5
 304 |             |      |      |      |       |             |             |      ||      |      | X clz x3, x3    // index of mismatch
 305 |             |      |      |      |       |             |             |      ||      |      | X lsr x3, x3, #2
 306 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   add x11, x8, x11
 308 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldrb w4, [x8, x3]
 309 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldrb w5, [x11, x3]
 310 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   sub w0, w4, w5   // difference of the mismatching chars
 311 |             |      |      |      |       | 0.50        | 0.50        |      ||      |      |   ret
 313 |             |      |      |      |       |             |             |      ||      |      |   .p2align 4
 314 |             |      |      |      |       |             |             |      ||      |      |   .Lswapped:
 315 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   add x12, x8, x9
 316 |             |      |      | 0.75 | 0.750 | 0.25   0.50 | 0.25   0.50 |      ||      |      |   ldr q0, [x12, #16]!
 317 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   sub x8, x8, x10
 318 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   add x11, x8, x9
 319 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   add x2,x2,x9
 320 |             |      |      |      |       |             |             |      ||      |      | X neg x9, x9
 322 |             |      |      |      |       |             |             |      ||      |      | X cmeq v1.16b, v2.16b, #0
 323 |             |      |      |      |       |             |             |      ||      |      | X cmeq v0.16b, v0.16b, v3.16b
 324 |             |      |      |      |       |             |             |      ||      |      | X shrn v1.8b, v1.8h, #4
 325 |             |      |      |      |       |             |             |      ||      |      | X shrn v0.8b, v0.8h, #4
 326 |             |      |      |      |       |             |             |      ||      |      | X fmov x6, d1
 327 |             |      |      |      |       |             |             |      ||      |      | X fmov x5, d0
 329 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||  1.0 |  1.0 |   add x10, x10, #32
 331 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   lsl x4, x2, #2
 332 |             |      |      |      |       |             |             |      ||      |      | X lsl x4, x13, x4
 333 |             |      |      | 0.38 | 0.370 | 0.13        | 0.12        |      ||      |      |   orr x3,x6,x4   // introduce a null byte match
 334 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cmp x2,#16
 335 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   csel x6, x3, x6, lt
 336 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cbnz x6, .Lnulfound2s
 337 |             |      |      |      |       |             |             |      ||      |      | X mvn x5, x5
 338 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cbnz x5, .Lmismatch2s
 340 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   sub x2, x2, #16
 341 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cmp x2, #32
 342 |             |      |      |      |       |             |             | 1.00 ||      |      |   b.lt .Ltails
 344 |             |      |      |      |       |             |             |      ||      |      |   .p2align 4
 345 |             |      |      |      |       |             |             |      ||      |      |   nein:
 346 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q0, [x10, x11]
 347 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q1, [x10, x8]
 348 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q2, [x10]
 350 |             |      |      |      |       |             |             |      ||      |      | X cmeq v1.16b, v1.16b, #0
 351 |             |      |      |      |       |             |             |      ||      |      | X cmeq v0.16b, v0.16b, v2.16b
 353 |             |      |      |      |       |             |             |      ||      |      | X shrn v1.8b, v1.8h, #4
 354 |             |      |      |      |       |             |             |      ||      |      | X shrn v0.8b, v0.8h, #4
 355 |             |      |      |      |       |             |             |      ||      |      | X fmov x6, d1
 356 |             |      |      |      |       |             |             |      ||      |      | X fmov x5, d0
 357 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cbnz x6, .Lnulfounds
 358 |             |      |      |      |       |             |             |      ||      |      | X mvn x5, x5
 359 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cbnz x5, .Lmismatchs
 361 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||  1.0 |  1.0 |   add x10, x10, #16
 364 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q0, [x10, x11]
 365 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q1, [x10, x8]
 366 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q2, [x10]
 368 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||  1.0 |  1.0 |   add x10, x10, #16
 369 |             |      |      |      |       |             |             |      ||      |      | X cmeq v1.16b, v1.16b, #0
 370 |             |      |      |      |       |             |             |      ||      |      | X cmeq v0.16b, v0.16b, v2.16b
 372 |             |      |      |      |       |             |             |      ||      |      | X shrn v1.8b, v1.8h, #4
 373 |             |      |      |      |       |             |             |      ||      |      | X shrn v0.8b, v0.8h, #4
 374 |             |      |      |      |       |             |             |      ||      |      | X fmov x6, d1
 375 |             |      |      |      |       |             |             |      ||      |      | X fmov x5, d0
 376 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cbnz x6, .Lnulfound2s
 377 |             |      |      |      |       |             |             |      ||      |      | X mvn x5, x5
 378 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cbnz x5, .Lmismatch2s
 379 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   sub x2, x2, #32
 380 |             |      |      | 0.25 | 0.750 |             |             |      ||      |      |   cmp x2, #32
 381 |             |      |      |      |       |             |             | 1.00 ||      |      |   b.ge nein
 383 |             |      |      |      |       |             |             |      ||      |      |   .Ltails:
 384 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q0, [x10, x11]
 385 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q1, [x10, x8]
 386 |             |      |      |      |       | 0.51   0.50 | 0.49   0.50 |      ||      |      |   ldr q2, [x10]
 388 |             |      |      |      |       |             |             |      ||      |      | X cmeq v1.16b, v1.16b, #0
 389 |             |      |      |      |       |             |             |      ||      |      | X cmeq v0.16b, v0.16b, v2.16b
 391 |             |      |      |      |       |             |             |      ||      |      | X shrn v1.8b, v1.8h, #4
 392 |             |      |      |      |       |             |             |      ||      |      | X shrn v0.8b, v0.8h, #4
 393 |             |      |      |      |       |             |             |      ||      |      | X fmov x6, d1
 394 |             |      |      |      |       |             |             |      ||      |      | X fmov x5, d0
 397 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   lsl x4, x2, #2
 398 |             |      |      |      |       |             |             |      ||      |      | X lsl x4, x13, x4
 399 |             |      |      | 0.50 | 0.000 | 0.24        | 0.26        |      ||      |      |   orr x3, x6, x4    // introduce a null byte match
 400 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cmp x2, #16
 401 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   csel x6, x3, x6, lt
 403 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cbnz x6, .Lnulfounds
 404 |             |      |      |      |       |             |             |      ||      |      | X mvn x5, x5
 405 |             |      |      | 0.74 | 0.260 |             |             |      ||      |      |   cbnz x5, .Lmismatchs
 407 | 0.50        |      | 0.50 | 0.00 | -0.01 |             |             |      ||  1.0 |  1.0 |   add x10, x10, #16
 409 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q0, [x10, x11]
 410 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||  5.0 |      |   ldr q1, [x10, x8]
 411 |             |      |      |      |       | 0.51   0.50 | 0.49   0.50 |      ||      |      |   ldr q2, [x10]
 413 | 0.50        |      | 0.50 | 0.00 | -0.01 |             |             |      ||      |  1.0 |   add x10, x10, #16
 414 |             |      |      |      |       |             |             |      ||  0.0 |      | X cmeq v1.16b, v1.16b, #0
 415 |             |      |      |      |       |             |             |      ||      |      | X cmeq v0.16b, v0.16b, v2.16b
 417 |             |      |      |      |       |             |             |      ||  0.0 |      | X shrn v1.8b, v1.8h, #4
 418 |             |      |      |      |       |             |             |      ||      |      | X shrn v0.8b, v0.8h, #4
 419 |             |      |      |      |       |             |             |      ||  0.0 |      | X fmov x6, d1
 420 |             |      |      |      |       |             |             |      ||      |      | X fmov x5, d0
 422 |             |      |      |      |       |             |             |      ||      |      | X ubfiz x4, x2, #2, #4
 423 |             |      |      |      |       |             |             |      ||      |      | X lsl x4, x13, x4
 424 |             |      |      | 0.00 | 0.510 | 0.23        | 0.26        |      ||  1.0 |      |   orr x6, x6, x4    // introduce a null byte match
 426 |             |      |      |      |       |             |             |      ||      |      |   .Lnulfound2s:
 427 |             |      |      | 0.50 | 0.500 |             |             |      ||      |  1.0 |   sub x10, x10, #16
 428 |             |      |      |      |       |             |             |      ||      |      |   .Lnulfounds:
 429 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||  1.0 |      |   mov x4, x6
 431 |             |      |      |      |       |             |             |      ||      |      | X ubfiz x7, x9, #2, #4
 432 |             |      |      |      |       |             |             |      ||      |      | X lsl x6, x6, x7
 434 |             |      |      |      |       |             |             |      ||      |      | X orn x5, x6, x5
 436 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   cbnz x5, .Lmismatchs
 438 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldr q0, [x10, x9]
 439 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |  5.0 |   ldr q1, [x10, x8]
 441 |             |      |      |      |       |             |             |      ||      |  0.0 | X cmeq v1.16b, v0.16b, v1.16b
 442 |             |      |      |      |       |             |             |      ||      |  0.0 | X shrn v1.8b, v1.8h, #4
 443 |             |      |      |      |       |             |             |      ||      |  0.0 | X fmov x5, d1
 445 |             |      |      |      |       |             |             |      ||  0.0 |  0.0 | X orn x5, x4, x5
 447 |             |      |      |      |       |             |             |      ||  0.0 |  0.0 | X rbit x3, x5
 448 |             |      |      |      |       |             |             |      ||  0.0 |  0.0 | X clz x3, x3
 449 |             |      |      |      |       |             |             |      ||  0.0 |  0.0 | X lsr x5, x3, #2
 451 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   add x11, x10, x8
 452 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   add x10, x10, x9
 454 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |      |   ldrb w4, [x10, x5]
 455 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||  5.0 |  5.0 |   ldrb w5, [x11, x5]
 456 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   sub w0, w5, w4
 457 |             |      |      |      |       | 0.50        | 0.50        |      ||      |      |   ret
 459 |             |      |      |      |       |             |             |      ||      |      |   .p2align 4
 460 |             |      |      |      |       |             |             |      ||      |      |   .Lmismatch2s:
 461 |             |      |      | 0.50 | 0.500 |             |             |      ||      |      |   sub x10, x10, #16
 462 |             |      |      |      |       |             |             |      ||      |      |   .Lmismatchs:
 463 |             |      |      |      |       |             |             |      ||  0.0 |  0.0 | X rbit x3, x5
 464 |             |      |      |      |       |             |             |      ||  0.0 |  0.0 | X clz x3, x3
 465 |             |      |      |      |       |             |             |      ||  0.0 |  0.0 | X lsr x3, x3, #2
 466 | 0.50        |      | 0.50 | 0.00 | 0.000 |             |             |      ||      |      |   add x11, x10, x11
 468 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||  5.0 |      |   ldrb w4, [x10, x3]
 469 |             |      |      |      |       | 0.50   0.50 | 0.50   0.50 |      ||      |  5.0 |   ldrb w5, [x11, x3]
 470 |             |      |      | 0.50 | 0.500 |             |             |      ||  1.0 |  1.0 |   sub w0, w5, w4
 471 |             |      |      |      |       | 0.50        | 0.50        |      ||      |      |   ret
 473 |             |      |      |      |       |             |             |      ||      |      |   .p2align 4
 474 |             |      |      |      |       |             |             |      ||      |      |   .Lempty:
 475 |             |      |      |      |       |             |             |      ||  0.0 |  0.0 | X eor x0, x0, x0
 476 |             |      |      |      |       | 0.50        | 0.50        |      ||      |      |   ret
 478 |             |      |      |      |       |             |             |      ||      |      |   .section .rodata
 479 |             |      |      |      |       |             |             |      ||      |      |   .p2align 4
 480 |             |      |      |      |       |             |             |      ||      |      |   shift_data:
 481 |             |      |      |      |       |             |             |      ||      |      |   .byte 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
 482 |             |      |      |      |       |             |             |      ||      |      |   .fill 16, 1, -1
 483 |             |      |      |      |       |             |             |      ||      |      |   .size shift_data, .-shift_data

       18.0                 18.0   29.9   29.87   31.1   28.0   31.1   28.0   16.0    29.0   29.0  

Loop-Carried Dependencies Analysis Report
-----------------------------------------
   2 | 21.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
   2 | 21.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
   2 | 19.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
   2 | 19.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
   2 | 15.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 451, 455, 463, 464, 465, 468, 470, 475]
   2 | 15.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 451, 455, 463, 464, 465, 469, 470, 475]
   2 | 11.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 451, 466, 469, 470, 475]
   2 | 24.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
   2 | 24.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
   2 | 24.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 438, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
   2 | 24.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 438, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
   2 | 24.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
   2 | 24.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
   2 | 20.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 451, 455, 463, 464, 465, 468, 470, 475]
   2 | 20.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 451, 455, 463, 464, 465, 469, 470, 475]
   2 | 16.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 451, 466, 469, 470, 475]
   2 | 17.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 452, 461, 466, 469, 470, 475]
   2 | 16.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 452, 461, 468, 470, 475]
   2 | 26.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
   2 | 26.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
   2 | 24.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
   2 | 24.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
   2 | 20.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 451, 455, 463, 464, 465, 468, 470, 475]
   2 | 20.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 451, 455, 463, 464, 465, 469, 470, 475]
   2 | 16.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 451, 466, 469, 470, 475]
   2 | 29.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
   2 | 29.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
   2 | 29.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 438, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
   2 | 29.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 438, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
   2 | 29.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
   2 | 29.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
   2 | 25.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 451, 455, 463, 464, 465, 468, 470, 475]
   2 | 25.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 451, 455, 463, 464, 465, 469, 470, 475]
   2 | 21.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 451, 466, 469, 470, 475]
   2 | 22.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 452, 461, 466, 469, 470, 475]
   2 | 21.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 452, 461, 468, 470, 475]
   2 | 27.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
   2 | 27.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
   2 | 25.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
   2 | 25.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
   2 | 21.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 451, 455, 463, 464, 465, 468, 470, 475]
   2 | 21.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 451, 455, 463, 464, 465, 469, 470, 475]
   2 | 17.0 | bic       x8, x0, #0xf              // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 451, 466, 469, 470, 475]
   3 | 22.0 | and       x9, x0, #0xf              // x9 is the offset| [3, 80, 292, 301, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
   3 | 22.0 | and       x9, x0, #0xf              // x9 is the offset| [3, 80, 292, 301, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
   3 | 20.0 | and       x9, x0, #0xf              // x9 is the offset| [3, 80, 292, 301, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
   3 | 20.0 | and       x9, x0, #0xf              // x9 is the offset| [3, 80, 292, 301, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
   3 | 16.0 | and       x9, x0, #0xf              // x9 is the offset| [3, 80, 292, 301, 317, 451, 455, 463, 464, 465, 468, 470, 475]
   3 | 16.0 | and       x9, x0, #0xf              // x9 is the offset| [3, 80, 292, 301, 317, 451, 455, 463, 464, 465, 469, 470, 475]
   3 | 12.0 | and       x9, x0, #0xf              // x9 is the offset| [3, 80, 292, 301, 317, 451, 466, 469, 470, 475]
   3 | 17.0 | and       x9, x0, #0xf              // x9 is the offset| [3, 80, 319, 340, 379, 422, 423, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
   3 | 17.0 | and       x9, x0, #0xf              // x9 is the offset| [3, 80, 319, 340, 379, 422, 423, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
   3 | 17.0 | and       x9, x0, #0xf              // x9 is the offset| [3, 80, 320, 438, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
   3 | 17.0 | and       x9, x0, #0xf              // x9 is the offset| [3, 80, 320, 438, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
   3 | 10.0 | and       x9, x0, #0xf              // x9 is the offset| [3, 80, 320, 452, 461, 466, 469, 470, 475]
   3 |  9.0 | and       x9, x0, #0xf              // x9 is the offset| [3, 80, 320, 452, 461, 468, 470, 475]
   7 |  8.0 | subs      x2, x2, #1                     | [7, 77, 78, 178, 217, 319, 340, 379]

It’s also accessible here if you want to play around with it.
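
For anyone who finds the mask handling in the listing hard to follow, here is a rough scalar C sketch of three steps that keep recurring in it: building the 4-bits-per-byte NUL mask (`cmeq` + `shrn #4` + `fmov`), introducing a fake null byte match at the limit (`ubfiz`/`lsl` + `orr`), and turning the mask into a byte index (`rbit` + `clz` + `lsr #2`). This is not the code from the patch. The function names are made up for illustration, and it assumes that `x13` holds all ones, which the excerpt above does not show.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Scalar stand-in for cmeq #0 + shrn #4 + fmov: every byte of the
 * 16 byte chunk gets 4 bits in the result, all set if the byte is NUL.
 */
static uint64_t nul_nibble_mask(const unsigned char chunk[16])
{
    uint64_t mask = 0;

    for (size_t i = 0; i < 16; i++)
        if (chunk[i] == 0)
            mask |= (uint64_t)0xf << (4 * i);
    return mask;
}

/*
 * Stand-in for ubfiz x4, x2, #2, #4; lsl x4, x13, x4; orr x6, x6, x4.
 * Assumption: x13 is all ones. Marks byte (limit & 0xf) and every
 * byte after it as a match, i.e. a fake NUL at the limit.
 */
static uint64_t insert_fake_nul(uint64_t mask, size_t limit)
{
    uint64_t ones = ~(uint64_t)0;

    return mask | (ones << ((limit & 0xf) * 4));
}

/*
 * Stand-in for rbit + clz + lsr #2: count trailing zero bits and
 * divide by 4 to get the index of the first flagged byte.
 */
static unsigned first_match_index(uint64_t mask)
{
    unsigned bit = 0;

    assert(mask != 0);  /* the assembly only gets here after a cbnz */
    while ((mask & 1) == 0) {
        mask >>= 1;
        bit++;
    }
    return bit >> 2;
}
```

With a chunk that contains no NUL and a limit of 5, `insert_fake_nul()` makes `first_match_index()` return 5, so the comparison stops there as if the string had ended. A real NUL that comes before the limit still wins because its bits sit lower in the mask.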

References

[1]: https://reviews.freebsd.org/D45943
[2]: https://cgit.freebsd.org/src/tree/lib/libc/amd64/string/strcspn.S#n228
[3]: http://0x80.pl/articles/simd-byte-lookup.html
[4]: https://twitter.com/geofflangdale/status/1053227022795722752
[5]: https://github.com/intel/hyperscan/blob/master/src/nfa/truffle.c
[6]: https://bitmath.blogspot.com/2023/04/not-transposing-16x16-bitmatrix.html
[7]: https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1b137ca0
[8]: https://uica.uops.info/
[9]: https://github.com/RRZE-HPC/OSACA
[10]: https://pages.cs.wisc.edu/~mscalar/simplescalar.html