Getz Mikalsen

Table of Contents
_________________

1. GSoC status update #4 (A bit of everything)
2. What's been going on this week
.. 1. strncmp pains
.. 2. Next function, `strcspn'
..... 1. strcspn tomfoolery
.. 3. Performance analysis
..... 1. OSACA results
3. References


1 GSoC status update #4 (A bit of everything)
=============================================

14-07-2024


2 What's been going on this week
================================

This week I put the finishing touches on `strncmp'. It's essentially the same as `strcmp', except that it has special handling for the limit and breaks out of the main loop when there are fewer than 32 bytes left of the limit. I also got some good code review on `str(n)cmp'; applying it resulted in a performance improvement of a few percent. [[1]]


2.1 strncmp pains
~~~~~~~~~~~~~~~~~

I thought I could come up with a clever solution for handling short strings (less than 16 bytes) near a page boundary in `strncmp', but I was mistaken. After updating the test suite to place buffers right at the end of a page, with `DEATH' waiting across the boundary, I noticed that it failed. So I had to revert to the more complicated handling and make it even more complicated: simply checking for null bytes isn't enough, I also need to insert a fake null byte wherever the limit is. I solved the first part like this, and the latter part is a problem for tomorrow morning.
:-)

,----
| @@ -103,24 +103,56 @@ ENTRY(strncmp)
|       .p2align 4
|   .Llt16:
| -     tbz w3, #PAGE_SHIFT, 0f
| +     /*
| +      * Check if either string is located at end of page to avoid crossing
| +      * into unmapped page. If so, we load 16 bytes from the nearest
| +      * alignment boundary and shift based on the offset.
| +      */
| +     tbz w3, #PAGE_SHIFT, 2f
|
|       ldr q0, [x8] // load aligned head
|       ldr q1, [x10]
|
| +     lsl x14, x9, #2
| +     lsl x15, x11, #2
| +     lsl x3, x13, x14 // string head
| +     lsl x4, x13, x15
| +
| +     cmeq v5.16b, v0.16b, #0
| +     cmeq v6.16b, v1.16b, #0
| +
| +     shrn v5.8b, v5.8h, #4
| +     shrn v6.8b, v6.8h, #4
| +     fmov x5, d5
| +     fmov x6, d6
| +
|       adrp x14, shift_data
|       add x14, x14, :lo12:shift_data
|
|       /* heads may cross page boundary, avoid unmapped loads */
| +     tst x5, x3
| +     b.eq 0f
| +
|       ldr q4, [x14, x9] // load permutation table
|       tbl v0.16b, {v0.16b}, v4.16b
| +
| +     b 1f
| +     .p2align 4
| +0:
| +     ldr q0, [x0] // load true head
| +1:
| +     tst x6, x4
| +     b.eq 0f
| +
|       ldr q4, [x14, x11]
|       tbl v4.16b, {v1.16b}, v4.16b
| +
|       b 1f
|
|       .p2align 4
| -0:
| +2:
|       ldr q0, [x0] // load true heads
| +0:
|       ldr q4, [x1]
|   1:
`----


2.2 Next function, `strcspn'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It's also time to port a new function. I was choosing between `memccpy' and `strcspn', and chose to start exploring `strcspn', which requires some really interesting tricks to make fast. On x86, when `SSE4.2' is available, we can use the amazing `pcmpistri' instruction to compare a vector register against a set without overreading past a null byte. [[2]] On AArch64 there is no equivalent. SVE2 does have the `MATCH' instruction, but that's of no use to us: the Graviton 3 CPU that I've been benchmarking on supports `SVE' but not `SVE2'. And FreeBSD is just about to get SVE support; it's about to land in -CURRENT any second now, IIRC.


2.2.1 strcspn tomfoolery
------------------------

Checking with SIMD whether a byte is present in a set is not a completely new problem.
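As a scalar point of reference for what any SIMD variant has to compute, the membership test can be done with a 256-entry lookup table. The following is only a minimal sketch in C: the name `strcspn_lut' is made up for illustration, and it is not the routine being ported.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Scalar reference for strcspn using a byte-indexed lookup table.
 * strcspn_lut is a hypothetical name, used here for illustration only.
 */
size_t
strcspn_lut(const char *s, const char *set)
{
	uint8_t table[256] = { 0 };
	const unsigned char *p;

	/* Setup phase: mark every byte that occurs in the set. */
	for (p = (const unsigned char *)set; *p != '\0'; p++)
		table[*p] = 1;

	/* Mark NUL as well, so the scan stops at the end of the string. */
	table[0] = 1;

	/* Scan phase: advance until a marked byte is found. */
	for (p = (const unsigned char *)s; !table[*p]; p++)
		;

	return ((size_t)(p - (const unsigned char *)s));
}
```

Marking the NUL byte in the table folds the end-of-string check into the same lookup, so the scan loop needs only one test per byte.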
I haven't seen any such algorithm implemented for AArch64, but there is one for x86 developed by Wojciech Muła and Geoff Langdale. [[3]] I won't go into detail just yet; it's quite a lot to wrap my head around, but the linked article is a great resource. A version of this algorithm is used by the Intel Hyperscan project. [[4]]

An interesting suggestion fuz had was to duplicate a byte of the set across a vector and iterate over the string with it during the setup phase. If an early set member matches, this approach would be greatly beneficial, since the LUT for the algorithm can be built using scalar registers while the check with `ld1r' and `cmeq' runs in vector registers, fully utilizing the processor's different pipelines.

Another useful article is [[6]] by Harold Aptroot, but I will need to figure out how to cope with the lack of the extremely versatile but tricky `gf2p8affineqb' (Galois Field Affine Transformation) instruction on AArch64. A list of unexpected uses for the Galois Field Affine Transformation instruction is available here: [[7]]


2.3 Performance analysis
~~~~~~~~~~~~~~~~~~~~~~~~

My mentor also introduced me to some really cool projects that simulate different computer architectures and predict the throughput of basic blocks. The first one is uiCA (uops.info Code Analyzer) [[8]], developed by researchers at Saarland University. The second one is OSACA (Open Source Architecture Code Analyzer) [[9]], which supports both x86 and AArch64. It's developed by RRZE-HPC at the Erlangen National High Performance Computing Center. OSACA also has integration with Compiler Explorer.

These tools are a clear step up from SimpleScalar, which I used in university for our computer architecture course. [[10]] SimpleScalar is legacy research software. I suspect it might soon be swapped out of our curriculum; that course has been getting serious upgrades over the last few years.
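One idiom recurs throughout the strncmp code analysed in this post: `cmeq' followed by `shrn ..., #4' and `fmov' turns a 16-byte compare result into a 64-bit mask with one nibble per byte, the limit is folded in by shifting an all-ones constant left by four times the limit (`lsl x4, x2, #2' then `lsl x4, x13, x4'), and `rbit', `clz', `lsr #2' recover the byte index. The following is a scalar model of that arithmetic in C, as a sketch only. The helper names are made up, and the off-by-one bookkeeping of the real routine (which decrements the limit up front) is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Scalar model of cmeq #0 + shrn #4 + fmov: one nibble per input byte,
 * set to 0xf where the byte is NUL. Hypothetical helper name.
 */
static uint64_t
nul_nibble_mask(const unsigned char chunk[16])
{
	uint64_t mask = 0;

	for (int i = 0; i < 16; i++)
		if (chunk[i] == '\0')
			mask |= (uint64_t)0xf << (4 * i);
	return (mask);
}

/*
 * Pretend every byte at index >= limit is a NUL byte. Only applied when
 * the limit falls inside this 16-byte chunk, which also keeps the shift
 * count below 64 (a larger shift would be undefined in C).
 */
static uint64_t
apply_limit(uint64_t mask, size_t limit)
{
	if (limit < 16)
		mask |= ~(uint64_t)0 << (4 * limit);
	return (mask);
}

/*
 * Index of the first marked byte, or 16 if the mask is empty. This is
 * the result rbit + clz + lsr #2 produce, computed with a plain loop.
 */
static unsigned
first_marked(uint64_t mask)
{
	unsigned idx = 0;

	if (mask == 0)
		return (16);
	while ((mask & 0xf) == 0) {
		mask >>= 4;
		idx++;
	}
	return (idx);
}
```

With these helpers, `first_marked(apply_limit(nul_nibble_mask(chunk), limit))' yields the position at which the comparison has to stop, whichever comes first: a real NUL byte or the limit.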
For my strncmp implementation OSACA produces the following:


2.3.1 OSACA results
-------------------

,----
| Open Source Architecture Code Analyzer (OSACA) - 0.5.2
| Analyzed file: /app/example.asm
| Architecture: A64FX
| Timestamp: 2024-07-14 17:00:03
|
| -------------------------- WARNING: No micro-architecture was specified -------------------------
| A default uarch for this particular ISA was used. Specify the uarch with --arch.
| See --help for more information.
| -------------------------------------------------------------------------------------------------
| ----------------- WARNING: You are analyzing a large amount of instruction forms ----------------
| Analysis across loops/block boundaries often do not make much sense.
| Specify the kernel length with --length. See --help for more information.
| If this is intentional, you can safely ignore this message.
| -------------------------------------------------------------------------------------------------
|
| P - Throughput of LOAD operation can be hidden behind a past or future STORE instruction
| * - Instruction micro-ops not bound to a port
| X - No throughput/latency information for this instruction in data file
|
| Combined Analysis Report
| ------------------------
| Port pressure in cycles
|   | 0 - 0DV | 1 | 2 | 3 | 4 | 5 - 5D | 6 - 6D | 7 || CP | LCD |
| ---------------------------------------------------------------------------------------------------
| 1 | | | | | | | | || | | strncmp:
| 2 | | | | | | | | || 0.0 | 0.0 | X bic x8, x0, #0xf // x8 is x0 but aligned to the boundary
| 3 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | and x9, x0, #0xf // x9 is the offset
| 4 | | | | | | | | || | | X bic x10, x1, #0xf // x10 is x1 but aligned to the boundary
| 5 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | and x11, x1, #0xf // x11 is the offset
| 7 | | | | 0.50 | 0.500 | | | || | | subs x2, x2, #1
| 8 | | | | | | | | 1.00 || | | b.mi .Lempty
| 10 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | |
mov x13, #-1 // save constant for later | 12 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x3, x0, #16 // end of head | 13 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x4, x1, #16 | 14 | | | | | | | | || | | X eor x3, x3, x0 | 15 | | | | | | | | || | | X eor x4, x4, x1 // bits that changed | 16 | | | | 0.50 | 0.500 | 0.00 | 0.00 | || | | orr x3, x3, x4 // in either str1 or str2 | 17 | | | | 0.50 | 0.500 | | | || | | cmp x2,#16 | 18 | | | | | | | | 1.00 || | | b.lt .Llt16 | 19 | | | | | | | | || | | X tbz w3, #4096, .Lbegin | 21 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x8] // load aligned head | 22 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q1, [x10] | 24 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | lsl x14, x9, #2 | 25 | | | | | | | | || | | X lsl x3, x13, x14 // string head | 26 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | lsl x15, x11, #2 | 27 | | | | | | | | || | | X lsl x4, x13, x15 | 29 | | | | | | | | || | | X cmeq v5.16b, v0.16b, #0 | 30 | | | | | | | | || | | X cmeq v6.16b, v1.16b, #0 | 32 | | | | | | | | || | | X shrn v5.8b, v5.8h, #4 | 33 | | | | | | | | || | | X shrn v6.8b, v6.8h, #4 | 34 | | | | | | | | || | | X fmov x5, d5 | 35 | | | | | | | | || | | X fmov x6, d6 | 37 | | | | | | | | || | | X adrp x14, shift_data | 38 | | | | | | | | || | | X add x14, x14, :lo12:shift_data | 40 | | | | | | | | || | | X tst x5, x3 | 41 | | | | | | | | 1.00 || | | b.eq zero | 43 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q4, [x14, x9] // load permutation table | 44 | | | | | | | | || | | X tbl v0.16b, {v0.16b}, v4.16b | 46 | | | | | | | | 1.00 || | | b one | 47 | | | | | | | | || | | .p2align 4 | 48 | | | | | | | | || | | zero: | 49 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x0] // load true head | 50 | | | | | | | | || | | one: | 51 | | | | | | | | || | | X tst x6, x4 | 52 | | | | | | | | 1.00 || | | b.eq zeroo | 54 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q4, [x14, x11] | 55 | | | | | | | | || | | X tbl v4.16b, {v1.16b}, v4.16b | 
57 | | | | | | | | 1.00 || | | b onee | 59 | | | | | | | | || | | .p2align 4 | 60 | | | | | | | | || | | .Lbegin: | 61 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x0] // load true heads | 62 | | | | | | | | || | | zeroo: | 63 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q4, [x1] | 64 | | | | | | | | || | | onee: | 65 | | | | | | | | || | | X cmeq v2.16b, v0.16b, #0 // NUL byte present? | 66 | | | | | | | | || | | X cmeq v4.16b, v0.16b, v4.16b // which bytes match? | 68 | | | | | | | | || | | X orn v2.16b, v2.16b, v4.16b // mismatch or NUL byte? | 70 | | | | | | | | || | | X shrn v2.8b, v2.8h, #4 | 71 | | | | | | | | || | | X fmov x5, d2 | 73 | | | | 0.50 | 0.500 | | | || | | cbnz x5, .Lhead_mismatch | 74 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q2, [x8, #16] // load second chunk | 75 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q3, [x10, #16] | 77 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x2, x2, x11 | 78 | | | | 0.50 | 0.500 | | | || | | sub x2, x2, #16 // account for length of RSI chunk? 
| 80 | | | | | | | | || | | X subs x9, x9, x11 | 81 | | | | | | | | 1.00 || | | b.lt .Lswapped // if not swap operands | 82 | | | | | | | | 1.00 || | | b .Lnormal | 84 | | | | | | | | || | | .p2align 4 | 85 | | | | | | | | || | | .Llt16: | 87 | | | | | | | | || | | X tbz w3, #4096, two | 89 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x8] // load aligned head | 90 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q1, [x10] | 92 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | lsl x14, x9, #2 | 93 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | lsl x15, x11, #2 | 94 | | | | | | | | || | | X lsl x3, x13, x14 // string head | 95 | | | | | | | | || | | X lsl x4, x13, x15 | 97 | | | | | | | | || | | X cmeq v5.16b, v0.16b, #0 | 98 | | | | | | | | || | | X cmeq v6.16b, v1.16b, #0 | 100 | | | | | | | | || | | X shrn v5.8b, v5.8h, #4 | 101 | | | | | | | | || | | X shrn v6.8b, v6.8h, #4 | 102 | | | | | | | | || | | X fmov x5, d5 | 103 | | | | | | | | || | | X fmov x6, d6 | 105 | | | | | | | | || | | X adrp x14, shift_data | 106 | | | | | | | | || | | X add x14, x14, :lo12:shift_data | 108 | | | | | | | | || | | X tst x5, x3 | 109 | | | | | | | | 1.00 || | | b.eq zerooo | 111 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q4, [x14, x9] // load permutation table | 112 | | | | | | | | || | | X tbl v0.16b, {v0.16b}, v4.16b | 114 | | | | | | | | 1.00 || | | b oneee | 115 | | | | | | | | || | | .p2align 4 | 116 | | | | | | | | || | | zerooo: | 117 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x0] // load true head | 118 | | | | | | | | || | | oneee: | 119 | | | | | | | | || | | X tst x6, x4 | 120 | | | | | | | | 1.00 || | | b.eq noll | 122 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q4, [x14, x11] | 123 | | | | | | | | || | | X tbl v4.16b, {v1.16b}, v4.16b | 125 | | | | | | | | 1.00 || | | b ett | 127 | | | | | | | | || | | .p2align 4 | 128 | | | | | | | | || | | two: | 129 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x0] // load true heads | 130 | | | | | | | | || | | 
noll: | 131 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q4, [x1] | 132 | | | | | | | | || | | ett: | 134 | | | | | | | | || | | X cmeq v2.16b, v0.16b, #0 // NUL byte present? | 135 | | | | | | | | || | | X cmeq v4.16b, v0.16b, v4.16b // which bytes match? | 137 | | | | | | | | || | | X bic v2.16b, v4.16b, v2.16b // match and not NUL byte | 139 | | | | | | | | || | | X shrn v2.8b, v2.8h, #4 | 140 | | | | | | | | || | | X fmov x5, d2 | 141 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | lsl x4, x2, #2 | 142 | | | | | | | | || | | X lsl x4, x13, x4 | 143 | | | | | | | | || | | X orn x5, x4, x5 // mismatch or NUL byte? | 145 | | | | | | | | || | | .Lhead_mismatch: | 146 | | | | | | | | || | | X rbit x3, x5 | 147 | | | | | | | | || | | X clz x3, x3 // index of mismatch | 148 | | | | | | | | || | | X lsr x3, x3, #2 | 149 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldrb w4, [x0, x3] | 150 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldrb w5, [x1, x3] | 151 | | | | 0.50 | 0.500 | | | || | | sub w0, w4, w5 | 152 | | | | | | 0.50 | 0.50 | || | | ret | 154 | | | | | | | | || | | .p2align 4 | 155 | | | | | | | | || | | .Lnormal: | 156 | | | | 0.50 | 0.500 | | | || | | sub x12, x10, x9 | 157 | | | | 0.75 | 0.750 | 0.25 0.50 | 0.25 0.50 | || | | ldr q0, [x12, #16]! | 158 | | | | 0.50 | 0.500 | | | || | | sub x10, x10, x8 | 159 | | | | 0.50 | 0.500 | | | || | | sub x11, x10, x9 | 161 | | | | | | | | || | | X cmeq v1.16b, v3.16b, #0 // NUL present? | 162 | | | | | | | | || | | X cmeq v0.16b, v0.16b, v2.16b // Mismatch between chunks? 
| 163 | | | | | | | | || | | X shrn v1.8b, v1.8h, #4 | 164 | | | | | | | | || | | X shrn v0.8b, v0.8h, #4 | 165 | | | | | | | | || | | X fmov x6, d1 | 166 | | | | | | | | || | | X fmov x5, d0 | 168 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || 1.0 | 1.0 | add x8, x8, #32 // advance to next iteration | 170 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | lsl x4, x2, #2 | 171 | | | | | | | | || | | X lsl x4, x13, x4 | 172 | | | | 0.50 | 0.500 | 0.00 | 0.00 | || | | orr x3, x6, x4 // introduce a null byte match | 173 | | | | 0.50 | 0.500 | | | || | | cmp x2, #16 // does the buffer end within x2 | 174 | | | | 0.50 | 0.500 | | | || | | csel x6, x3, x6, lt | 175 | | | | 0.50 | 0.500 | | | || | | cbnz x6, .Lnulfound2 // NUL or end of buffer found? | 176 | | | | | | | | || | | X mvn x5, x5 | 177 | | | | 0.50 | 0.500 | | | || | | cbnz x5, .Lmismatch2 | 178 | | | | 0.50 | 0.500 | | | || | | sub x2, x2, #16 | 179 | | | | 0.50 | 0.500 | | | || | | cmp x2, #32 // end of buffer within first main loop iteration? | 180 | | | | | | | | 1.00 || | | b.lt .Ltail | 182 | | | | | | | | || | | .p2align 4 | 183 | | | | | | | | || | | nada: | 184 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x8, x11] | 185 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q1, [x8, x10] | 186 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q2, [x8] | 188 | | | | | | | | || | | X cmeq v1.16b, v1.16b, #0 // end of string? | 189 | | | | | | | | || | | X cmeq v0.16b, v0.16b, v2.16b // do the chunks match? | 191 | | | | | | | | || | | X shrn v1.8b, v1.8h, #4 | 192 | | | | | | | | || | | X shrn v0.8b, v0.8h, #4 | 193 | | | | | | | | || | | X fmov x6, d1 | 194 | | | | | | | | || | | X fmov x5, d0 | 195 | | | | 0.50 | 0.500 | | | || | | cbnz x6, .Lnulfound | 196 | | | | | | | | || | | X mvn x5, x5 // any mismatches? 
| 197 | | | | 0.50 | 0.500 | | | || | | cbnz x5, .Lmismatch | 199 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || 1.0 | 1.0 | add x8, x8, #16 | 202 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x8, x11] | 203 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q1, [x8, x10] | 204 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q2, [x8] | 206 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || 1.0 | 1.0 | add x8, x8, #16 | 207 | | | | | | | | || | | X cmeq v1.16b, v1.16b, #0 | 208 | | | | | | | | || | | X cmeq v0.16b, v0.16b, v2.16b | 210 | | | | | | | | || | | X shrn v1.8b, v1.8h, #4 | 211 | | | | | | | | || | | X shrn v0.8b, v0.8h, #4 | 212 | | | | | | | | || | | X fmov x6, d1 | 213 | | | | | | | | || | | X fmov x5, d0 | 214 | | | | 0.50 | 0.500 | | | || | | cbnz x6, .Lnulfound2 | 215 | | | | | | | | || | | X mvn x5, x5 | 216 | | | | 0.50 | 0.500 | | | || | | cbnz x5, .Lmismatch2 | 217 | | | | 0.50 | 0.500 | | | || | | sub x2, x2, #32 | 218 | | | | 0.50 | 0.500 | | | || | | cmp x2, #32 // end of buffer within next iteration | 219 | | | | | | | | 1.00 || | | b.ge nada // if yes, process tail | 222 | | | | | | | | || | | .Ltail: | 223 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x8, x11] | 224 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q1, [x8, x10] | 225 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q2, [x8] | 227 | | | | | | | | || | | X cmeq v1.16b, v1.16b, #0 // end of string? | 228 | | | | | | | | || | | X cmeq v0.16b, v0.16b, v2.16b // do the chunks match? 
| 230 | | | | | | | | || | | X shrn v1.8b, v1.8h, #4 | 231 | | | | | | | | || | | X shrn v0.8b, v0.8h, #4 | 232 | | | | | | | | || | | X fmov x6, d1 | 233 | | | | | | | | || | | X fmov x5, d0 | 237 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | lsl x4, x2, #2 | 238 | | | | | | | | || | | X lsl x4, x13, x4 | 239 | | | | 0.50 | 0.500 | 0.00 | 0.00 | || | | orr x3, x6, x4 // introduce a null byte match | 240 | | | | 0.50 | 0.500 | | | || | | cmp x2, #16 // does the buffer end within x2 | 241 | | | | 0.50 | 0.500 | | | || | | csel x6, x3, x6, lt | 243 | | | | 0.50 | 0.500 | | | || | | cbnz x6, .Lnulfound // NUL or end of string found | 244 | | | | | | | | || | | X mvn x5, x5 | 245 | | | | 0.50 | 0.500 | | | || | | cbnz x5, .Lmismatch | 247 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || 1.0 | 1.0 | add x8, x8, #16 | 249 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x8, x11] | 250 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q1, [x8, x10] | 251 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q2, [x8] | 253 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || 1.0 | 1.0 | add x8, x8, #16 | 254 | | | | | | | | || | | X cmeq v1.16b, v1.16b, #0 | 255 | | | | | | | | || | | X cmeq v0.16b, v0.16b, v2.16b | 257 | | | | | | | | || | | X shrn v1.8b, v1.8h, #4 | 258 | | | | | | | | || | | X shrn v0.8b, v0.8h, #4 | 259 | | | | | | | | || | | X fmov x6, d1 | 260 | | | | | | | | || | | X fmov x5, d0 | 262 | | | | | | | | || | | X ubfiz x4, x2, #2, #4 // (x2 - 16) << 2 | 263 | | | | | | | | || | | X lsl x4, x13, x4 // take first half into account | 264 | | | | 0.50 | 0.500 | 0.00 | 0.00 | || | | orr x6, x6, x4 // introduce a null byte match | 266 | | | | | | | | || | | .Lnulfound2: | 267 | | | | 0.50 | 0.500 | | | || 1.0 | 1.0 | sub x8, x8, #16 | 269 | | | | | | | | || | | .Lnulfound: | 270 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | mov x4, x6 | 272 | | | | | | | | || | | X ubfiz x7, x9, #2, #4 | 273 | | | | | | | | || | | X lsl x6, x6, x7 // adjust NUL mask to indices | 275 | | | | | | | | || | | X 
orn x5, x6, x5 | 276 | | | | 0.50 | 0.500 | | | || | | cbnz x5, .Lmismatch | 278 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x8, x9] | 279 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q1, [x8, x10] | 281 | | | | | | | | || | | X cmeq v1.16b, v0.16b, v1.16b | 282 | | | | | | | | || | | X shrn v1.8b, v1.8h, #4 | 283 | | | | | | | | || | | X fmov x5, d1 | 285 | | | | | | | | || | | X orn x5, x4, x5 | 287 | | | | | | | | || | | X rbit x3, x5 | 288 | | | | | | | | || | | X clz x3, x3 | 289 | | | | | | | | || | | X lsr x5, x3, #2 | 291 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || 1.0 | 1.0 | add x10, x10, x8 // restore x10 pointer | 292 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x8, x8, x9 // point to corresponding chunk in x0 | 294 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldrb w4, [x8, x5] | 295 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldrb w5, [x10, x5] | 296 | | | | 0.50 | 0.500 | | | || | | sub w0, w4, w5 | 297 | | | | | | 0.50 | 0.50 | || | | ret | 299 | | | | | | | | || | | .p2align 4 | 300 | | | | | | | | || | | .Lmismatch2: | 301 | | | | 0.50 | 0.500 | | | || | | sub x8, x8, #16 // roll back second increment | 302 | | | | | | | | || | | .Lmismatch: | 303 | | | | | | | | || | | X rbit x3, x5 | 304 | | | | | | | | || | | X clz x3, x3 // index of mismatch | 305 | | | | | | | | || | | X lsr x3, x3, #2 | 306 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x11, x8, x11 | 308 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldrb w4, [x8, x3] | 309 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldrb w5, [x11, x3] | 310 | | | | 0.50 | 0.500 | | | || | | sub w0, w4, w5 // difference of the mismatching chars | 311 | | | | | | 0.50 | 0.50 | || | | ret | 313 | | | | | | | | || | | .p2align 4 | 314 | | | | | | | | || | | .Lswapped: | 315 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x12, x8, x9 | 316 | | | | 0.75 | 0.750 | 0.25 0.50 | 0.25 0.50 | || | | ldr q0, [x12, #16]! 
| 317 | | | | 0.50 | 0.500 | | | || | | sub x8, x8, x10 | 318 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x11, x8, x9 | 319 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x2,x2,x9 | 320 | | | | | | | | || | | X neg x9, x9 | 322 | | | | | | | | || | | X cmeq v1.16b, v2.16b, #0 | 323 | | | | | | | | || | | X cmeq v0.16b, v0.16b, v3.16b | 324 | | | | | | | | || | | X shrn v1.8b, v1.8h, #4 | 325 | | | | | | | | || | | X shrn v0.8b, v0.8h, #4 | 326 | | | | | | | | || | | X fmov x6, d1 | 327 | | | | | | | | || | | X fmov x5, d0 | 329 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || 1.0 | 1.0 | add x10, x10, #32 | 331 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | lsl x4, x2, #2 | 332 | | | | | | | | || | | X lsl x4, x13, x4 | 333 | | | | 0.38 | 0.370 | 0.13 | 0.12 | || | | orr x3,x6,x4 // introduce a null byte match | 334 | | | | 0.50 | 0.500 | | | || | | cmp x2,#16 | 335 | | | | 0.50 | 0.500 | | | || | | csel x6, x3, x6, lt | 336 | | | | 0.50 | 0.500 | | | || | | cbnz x6, .Lnulfound2s | 337 | | | | | | | | || | | X mvn x5, x5 | 338 | | | | 0.50 | 0.500 | | | || | | cbnz x5, .Lmismatch2s | 340 | | | | 0.50 | 0.500 | | | || | | sub x2, x2, #16 | 341 | | | | 0.50 | 0.500 | | | || | | cmp x2, #32 | 342 | | | | | | | | 1.00 || | | b.lt .Ltails | 344 | | | | | | | | || | | .p2align 4 | 345 | | | | | | | | || | | nein: | 346 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x10, x11] | 347 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q1, [x10, x8] | 348 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q2, [x10] | 350 | | | | | | | | || | | X cmeq v1.16b, v1.16b, #0 | 351 | | | | | | | | || | | X cmeq v0.16b, v0.16b, v2.16b | 353 | | | | | | | | || | | X shrn v1.8b, v1.8h, #4 | 354 | | | | | | | | || | | X shrn v0.8b, v0.8h, #4 | 355 | | | | | | | | || | | X fmov x6, d1 | 356 | | | | | | | | || | | X fmov x5, d0 | 357 | | | | 0.50 | 0.500 | | | || | | cbnz x6, .Lnulfounds | 358 | | | | | | | | || | | X mvn x5, x5 | 359 | | | | 0.50 | 0.500 | | | || | | cbnz x5, .Lmismatchs | 361 | 
0.50 | | 0.50 | 0.00 | 0.000 | | | || 1.0 | 1.0 | add x10, x10, #16 | 364 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x10, x11] | 365 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q1, [x10, x8] | 366 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q2, [x10] | 368 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || 1.0 | 1.0 | add x10, x10, #16 | 369 | | | | | | | | || | | X cmeq v1.16b, v1.16b, #0 | 370 | | | | | | | | || | | X cmeq v0.16b, v0.16b, v2.16b | 372 | | | | | | | | || | | X shrn v1.8b, v1.8h, #4 | 373 | | | | | | | | || | | X shrn v0.8b, v0.8h, #4 | 374 | | | | | | | | || | | X fmov x6, d1 | 375 | | | | | | | | || | | X fmov x5, d0 | 376 | | | | 0.50 | 0.500 | | | || | | cbnz x6, .Lnulfound2s | 377 | | | | | | | | || | | X mvn x5, x5 | 378 | | | | 0.50 | 0.500 | | | || | | cbnz x5, .Lmismatch2s | 379 | | | | 0.50 | 0.500 | | | || | | sub x2, x2, #32 | 380 | | | | 0.25 | 0.750 | | | || | | cmp x2, #32 | 381 | | | | | | | | 1.00 || | | b.ge nein | 383 | | | | | | | | || | | .Ltails: | 384 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x10, x11] | 385 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q1, [x10, x8] | 386 | | | | | | 0.51 0.50 | 0.49 0.50 | || | | ldr q2, [x10] | 388 | | | | | | | | || | | X cmeq v1.16b, v1.16b, #0 | 389 | | | | | | | | || | | X cmeq v0.16b, v0.16b, v2.16b | 391 | | | | | | | | || | | X shrn v1.8b, v1.8h, #4 | 392 | | | | | | | | || | | X shrn v0.8b, v0.8h, #4 | 393 | | | | | | | | || | | X fmov x6, d1 | 394 | | | | | | | | || | | X fmov x5, d0 | 397 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | lsl x4, x2, #2 | 398 | | | | | | | | || | | X lsl x4, x13, x4 | 399 | | | | 0.50 | 0.000 | 0.24 | 0.26 | || | | orr x3, x6, x4 // introduce a null byte match | 400 | | | | 0.50 | 0.500 | | | || | | cmp x2, #16 | 401 | | | | 0.50 | 0.500 | | | || | | csel x6, x3, x6, lt | 403 | | | | 0.50 | 0.500 | | | || | | cbnz x6, .Lnulfounds | 404 | | | | | | | | || | | X mvn x5, x5 | 405 | | | | 0.74 | 0.260 | | | || | | cbnz x5, .Lmismatchs | 407 | 
0.50 | | 0.50 | 0.00 | -0.01 | | | || 1.0 | 1.0 | add x10, x10, #16 | 409 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x10, x11] | 410 | | | | | | 0.50 0.50 | 0.50 0.50 | || 5.0 | | ldr q1, [x10, x8] | 411 | | | | | | 0.51 0.50 | 0.49 0.50 | || | | ldr q2, [x10] | 413 | 0.50 | | 0.50 | 0.00 | -0.01 | | | || | 1.0 | add x10, x10, #16 | 414 | | | | | | | | || 0.0 | | X cmeq v1.16b, v1.16b, #0 | 415 | | | | | | | | || | | X cmeq v0.16b, v0.16b, v2.16b | 417 | | | | | | | | || 0.0 | | X shrn v1.8b, v1.8h, #4 | 418 | | | | | | | | || | | X shrn v0.8b, v0.8h, #4 | 419 | | | | | | | | || 0.0 | | X fmov x6, d1 | 420 | | | | | | | | || | | X fmov x5, d0 | 422 | | | | | | | | || | | X ubfiz x4, x2, #2, #4 | 423 | | | | | | | | || | | X lsl x4, x13, x4 | 424 | | | | 0.00 | 0.510 | 0.23 | 0.26 | || 1.0 | | orr x6, x6, x4 // introduce a null byte match | 426 | | | | | | | | || | | .Lnulfound2s: | 427 | | | | 0.50 | 0.500 | | | || | 1.0 | sub x10, x10, #16 | 428 | | | | | | | | || | | .Lnulfounds: | 429 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || 1.0 | | mov x4, x6 | 431 | | | | | | | | || | | X ubfiz x7, x9, #2, #4 | 432 | | | | | | | | || | | X lsl x6, x6, x7 | 434 | | | | | | | | || | | X orn x5, x6, x5 | 436 | | | | 0.50 | 0.500 | | | || | | cbnz x5, .Lmismatchs | 438 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldr q0, [x10, x9] | 439 | | | | | | 0.50 0.50 | 0.50 0.50 | || | 5.0 | ldr q1, [x10, x8] | 441 | | | | | | | | || | 0.0 | X cmeq v1.16b, v0.16b, v1.16b | 442 | | | | | | | | || | 0.0 | X shrn v1.8b, v1.8h, #4 | 443 | | | | | | | | || | 0.0 | X fmov x5, d1 | 445 | | | | | | | | || 0.0 | 0.0 | X orn x5, x4, x5 | 447 | | | | | | | | || 0.0 | 0.0 | X rbit x3, x5 | 448 | | | | | | | | || 0.0 | 0.0 | X clz x3, x3 | 449 | | | | | | | | || 0.0 | 0.0 | X lsr x5, x3, #2 | 451 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x11, x10, x8 | 452 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x10, x10, x9 | 454 | | | | | | 0.50 0.50 | 0.50 0.50 | || | | ldrb w4, [x10, x5] | 455 | 
| | | | | 0.50 0.50 | 0.50 0.50 | || 5.0 | 5.0 | ldrb w5, [x11, x5] | 456 | | | | 0.50 | 0.500 | | | || | | sub w0, w5, w4 | 457 | | | | | | 0.50 | 0.50 | || | | ret | 459 | | | | | | | | || | | .p2align 4 | 460 | | | | | | | | || | | .Lmismatch2s: | 461 | | | | 0.50 | 0.500 | | | || | | sub x10, x10, #16 | 462 | | | | | | | | || | | .Lmismatchs: | 463 | | | | | | | | || 0.0 | 0.0 | X rbit x3, x5 | 464 | | | | | | | | || 0.0 | 0.0 | X clz x3, x3 | 465 | | | | | | | | || 0.0 | 0.0 | X lsr x3, x3, #2 | 466 | 0.50 | | 0.50 | 0.00 | 0.000 | | | || | | add x11, x10, x11 | 468 | | | | | | 0.50 0.50 | 0.50 0.50 | || 5.0 | | ldrb w4, [x10, x3] | 469 | | | | | | 0.50 0.50 | 0.50 0.50 | || | 5.0 | ldrb w5, [x11, x3] | 470 | | | | 0.50 | 0.500 | | | || 1.0 | 1.0 | sub w0, w5, w4 | 471 | | | | | | 0.50 | 0.50 | || | | ret | 473 | | | | | | | | || | | .p2align 4 | 474 | | | | | | | | || | | .Lempty: | 475 | | | | | | | | || 0.0 | 0.0 | X eor x0, x0, x0 | 476 | | | | | | 0.50 | 0.50 | || | | ret | 478 | | | | | | | | || | | .section .rodata | 479 | | | | | | | | || | | .p2align 4 | 480 | | | | | | | | || | | shift_data: | 481 | | | | | | | | || | | .byte 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 | 482 | | | | | | | | || | | .fill 16, 1, -1 | 483 | | | | | | | | || | | .size shift_data, .-shift_data | | 18.0 18.0 29.9 29.87 31.1 28.0 31.1 28.0 16.0 29.0 29.0 | | Loop-Carried Dependencies Analysis Report | ----------------------------------------- | 2 | 21.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475] | 2 | 21.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475] | 2 | 19.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475] | 2 | 19.0 | bic 
x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475] | 2 | 15.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 451, 455, 463, 464, 465, 468, 470, 475] | 2 | 15.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 451, 455, 463, 464, 465, 469, 470, 475] | 2 | 11.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 317, 451, 466, 469, 470, 475] | 2 | 24.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475] | 2 | 24.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475] | 2 | 24.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 438, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475] | 2 | 24.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 438, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475] | 2 | 24.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475] | 2 | 24.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475] | 2 | 20.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 451, 455, 463, 464, 465, 468, 470, 475] | 2 | 20.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 451, 455, 463, 464, 465, 469, 470, 475] | 2 | 16.0 | bic x8, x0, 
#0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 451, 466, 469, 470, 475]
| 2 | 17.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 452, 461, 466, 469, 470, 475]
| 2 | 16.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 158, 291, 329, 361, 368, 407, 413, 427, 452, 461, 468, 470, 475]
| 2 | 26.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
| 2 | 26.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
| 2 | 24.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
| 2 | 24.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
| 2 | 20.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 451, 455, 463, 464, 465, 468, 470, 475]
| 2 | 20.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 451, 455, 463, 464, 465, 469, 470, 475]
| 2 | 16.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 317, 451, 466, 469, 470, 475]
| 2 | 29.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
| 2 | 29.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
| 2 | 29.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 438, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
| 2 | 29.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 438, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
| 2 | 29.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
| 2 | 29.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
| 2 | 25.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 451, 455, 463, 464, 465, 468, 470, 475]
| 2 | 25.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 451, 455, 463, 464, 465, 469, 470, 475]
| 2 | 21.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 451, 466, 469, 470, 475]
| 2 | 22.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 452, 461, 466, 469, 470, 475]
| 2 | 21.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 291, 329, 361, 368, 407, 413, 427, 452, 461, 468, 470, 475]
| 2 | 27.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
| 2 | 27.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
| 2 | 25.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
| 2 | 25.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
| 2 | 21.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 451, 455, 463, 464, 465, 468, 470, 475]
| 2 | 21.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 451, 455, 463, 464, 465, 469, 470, 475]
| 2 | 17.0 | bic x8, x0, #0xf // x8 is x0 but aligned to the boundary| [2, 168, 199, 206, 247, 253, 267, 292, 301, 317, 451, 466, 469, 470, 475]
| 3 | 22.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 292, 301, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
| 3 | 22.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 292, 301, 317, 410, 414, 417, 419, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
| 3 | 20.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 292, 301, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
| 3 | 20.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 292, 301, 317, 439, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
| 3 | 16.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 292, 301, 317, 451, 455, 463, 464, 465, 468, 470, 475]
| 3 | 16.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 292, 301, 317, 451, 455, 463, 464, 465, 469, 470, 475]
| 3 | 12.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 292, 301, 317, 451, 466, 469, 470, 475]
| 3 | 17.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 319, 340, 379, 422, 423, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
| 3 | 17.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 319, 340, 379, 422, 423, 424, 429, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
| 3 | 17.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 320, 438, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 468, 470, 475]
| 3 | 17.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 320, 438, 441, 442, 443, 445, 447, 448, 449, 455, 463, 464, 465, 469, 470, 475]
| 3 | 10.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 320, 452, 461, 466, 469, 470, 475]
| 3 | 9.0 | and x9, x0, #0xf // x9 is the offset| [3, 80, 320, 452, 461, 468, 470, 475]
| 7 | 8.0 | subs x2, x2, #1 | [7, 77, 78, 178, 217, 319, 340, 379]
`----

It's also accessible here if you want to play around with it.


3 References
============