Getz Mikalsen
Table of Contents
_________________
1. Hello Beastie and Google Summer of Code!
.. 1. Background
.. 2. Why write assembly
..... 1. Architectural levels
.. 3. amd64 strlen implementation
..... 1. Substituting missing instructions
.. 4. strlen PoC
..... 1. Simple strlen
..... 2. Naïve implementation
..... 3. Improved implementation
..... 4. FCMP to avoid GPR move
.. 5. libc integration
.. 6. Tests
.. 7. Benchmarks
.. 8. What's next
.. 9. References
2. Bonus: Hello World in Aarch64 Assembly
1 Hello Beastie and Google Summer of Code!
==========================================
01-06-2024
I have been accepted into Google Summer of Code (GSoC) 2024! [[1]]
Participating in GSoC has been on my radar for many years, but I always
thought, "oh, I'll do it next summer". Well, that summer is now! :D
[1]
1.1 Background
~~~~~~~~~~~~~~
The admission process to GSoC is as follows: organizations publish
some suggested projects with possible mentors, and students get in
touch with said mentors or come up with their own project and convince
someone to mentor them. Then voting takes place and the top N
applicants for each organization are accepted, where N is the number
of slots Google has allocated for that organization. So I found a
project that I thought sounded interesting, got in touch with the
mentor and sent in a proposal [[2]] through Google's portal. The
project was posted on the FreeBSD GSoC ideas page [[3]]: it is to port
the SIMD-enhanced string functions in libc from amd64 (x86_64) to
arm64 (Aarch64).
A great incentive of the program is that you get the chance to be
mentored by member(s) of the community. I have the pleasure of being
mentored by two wonderful people, Robert Clausecker and Ed Maste.
Another GSoC contributor is porting the same algorithms to RISC-V. He
is basing his implementation on the base RISC-V ISA using SIMD Within
A Register (SWAR) techniques, and thus does not depend on processor
extensions such as the recently ratified 1.0 RISC-V Vector Extension,
for which almost no hardware is currently available (and no hardware
running FreeBSD). His blog documenting his adventures is available at
[[4]].
As for me, I've been using FreeBSD for a few years now. Before that I
used Linux, but after frustration over the lack of good documentation
and a fragmented system I switched and haven't looked back. I still
enjoy the Linux kernel, but userland and distros are something I just
want out of my way.
But this is not an article about why FreeBSD is superior to Linux, if
it even is (yes, in some regards, but ymmv), but rather about my
project for the summer.
[2]
[3]
[4]
1.2 Why write assembly
~~~~~~~~~~~~~~~~~~~~~~
Most libc functions on other platforms already benefit from being
handwritten in assembly, in both scalar and SIMD variants. String
functions are a particularly poor fit for an autovectorizing compiler,
as we make atypical use of SIMD instructions. For the scalar
implementations we may use some Bit Twiddling Hacks such as those on
Sean Eron Anderson's site [[5]].
Compilers also struggle to decide which operations to do in GPRs and
which to do in vector registers. Register allocation is also a problem
on amd64: the compiler may spill to the stack where handwritten
assembly would be left with registers to spare. This is not as extreme
on arm64, as we have way more registers than amd64 to play around
with.
Another compelling reason to have performance-critical libc functions
written in assembly is that all programs that link against libc will
benefit from these improvements. This does put some additional
pressure on me as an implementer, as the code can't break other
people's programs just because they abused libc in interesting ways.
An example of that is how memcmp on FreeBSD differs from the ISO/IEC
9899:1999 requirements. In particular, FreeBSD documents that memcmp
returns the difference between the first two mismatching characters,
as opposed to merely returning a negative/positive integer or zero.
But my project will solely deal with using Arm NEON instructions to
SIMD-ify the string functions, although bit-twiddling GPRs does sound
enticing.
[5]
1.2.1 Architectural levels
--------------------------
The AMD64 SysV ABI supplement defines the following architecture
levels. On FreeBSD we have implementations of most of the string
functions in libc for the scalar and baseline levels. Users are able
to choose which level of enhancements to use via the `ARCHLEVEL'
environment variable. A complete list of enhanced functions is
available in the simd(7) manpage [[6]].
,----
| scalar      scalar enhancements only (no SIMD)
|
| baseline    cmov, cx8, x87 FPU, fxsr, MMX, osfxsr, SSE, SSE2
|
| x86-64-v2   cx16, lahf/sahf, popcnt, SSE3, SSSE3, SSE4.1, SSE4.2
|
| x86-64-v3   AVX, AVX2, BMI1, BMI2, F16C, FMA, lzcnt, movbe, osxsave
|
| x86-64-v4   AVX-512F/BW/CD/DQ/VL
`----
[6]
1.3 amd64 strlen implementation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The amd64 strlen implementation consists of two parts, a scalar one
(using bit-twiddling) and a vectorized one (baseline). I'll focus on
the SIMD one here; the interested reader can navigate to
/usr/src/lib/libc/amd64/string or point their browser to [[7]] for the
scalar implementation.
,----
| ARCHENTRY(strlen, baseline)
| mov %rdi, %rcx
| pxor %xmm1, %xmm1
| and $~0xf, %rdi # align string
| pcmpeqb (%rdi), %xmm1 # compare head (with junk before string)
| mov %rcx, %rsi # string pointer copy for later
| and $0xf, %ecx # amount of bytes rdi is past 16 byte alignment
| pmovmskb %xmm1, %eax
| add $32, %rdi # advance to next iteration
| shr %cl, %eax # clear out matches in junk bytes
| test %eax, %eax # any match? (can't use ZF from SHR as CL=0 is possible)
| jnz 2f
|
| ALIGN_TEXT
| 1: pxor %xmm1, %xmm1
| pcmpeqb -16(%rdi), %xmm1 # find NUL bytes
| pmovmskb %xmm1, %eax
| test %eax, %eax # were any NUL bytes present?
| jnz 3f
|
| /* the same unrolled once more */
| pxor %xmm1, %xmm1
| pcmpeqb (%rdi), %xmm1
| pmovmskb %xmm1, %eax
| add $32, %rdi # advance to next iteration
| test %eax, %eax
| jz 1b
|
| /* match found in loop body */
| sub $16, %rdi # undo half the advancement
| 3: tzcnt %eax, %eax # find the first NUL byte
| sub %rsi, %rdi # string length until beginning of (%rdi)
| lea -16(%rdi, %rax, 1), %rax # that plus loc. of NUL byte: full string length
| ret
|
| /* match found in head */
| 2: tzcnt %eax, %eax # compute string length
| ret
| ARCHEND(strlen, baseline)
`----
Most of these instructions aren't anything odd (`MOV', `XOR', `AND',
`SHR'), but what stands out is `PCMPEQB' and `PMOVMSKB'.
`PMOVMSKB' [[8]] is one of the most useful instructions for finding
the index of our `NUL' character. The string functions in libc
operate on C-style strings, which are `NUL'-terminated. So what we do
there is a compare followed by figuring out where the match was:
`PCMPEQB' sets every byte that matched to all ones, and `PMOVMSKB'
collects the top bit of each byte into a 16-bit mask in a GPR.
But here's the kicker: there is no `PMOVMSKB' instruction on Aarch64,
which has caused a whole lot of headache. I'm basing this on the
number of posts online regarding substitutions for `PMOVMSKB' on
Aarch64, whereas individual assembly instructions are otherwise rarely
complained about. [[9]][[10]][[11]]
[7]
[8]
[9]
[10]
[11]
1.3.1 Substituting missing instructions
---------------------------------------
The most promising substitution for `PMOVMSKB' appears to be
`SHRN' (shift right narrow). It treats a 128-bit vector as eight
16-bit elements, shifts each one right by #imm and truncates it to 8
bits, giving a 64-bit result. After the compare, every byte of the
vector is either all 0's or all 1's, so shifting by 4 and truncating
leaves one nibble (half-byte) per input byte, which tells us whether
that byte matched or not. Hopefully that comes across clearly,
otherwise there is an excellent video courtesy of [[10]] which shows
it in action.
[10]
1.4 strlen PoC
~~~~~~~~~~~~~~
Here is the evolution of my attempts at porting strlen to Aarch64.
They also come in the form of a git repository [[12]] where I keep my
experiments before they're ready to be integrated into libc in my fork
of freebsd-src that I keep on GitHub. [[13]]
It's also good to know while reading this code that registers `x9-x15'
are "corruptible", meaning that a function can change them to
whatever without having to restore them afterwards. Registers `x0-x7'
are "parameter and results registers". So according to the `ARM
procedure call standard', `x0' etc. is where the input to a function
is stored and you can do whatever you want with `x9-x15' without
breaking anything.
[12]
[13]
1.4.1 Simple strlen
-------------------
The simplest variant just checks the first chunk without any loop.
This simple example will only give valid results for short strings
which are 16-byte aligned, that is, where the data begins at a memory
address that is a multiple of 16. You can create such a string to
test it out like this:
,----
| #include <stdalign.h>
| #include <stddef.h>
| #include <stdio.h>
|
| extern size_t _strlen(const char * ptr);
|
| int
| main(void) {
| alignas(16) char string[] = "str";
| printf("strlen: %zu\n", _strlen(string));
| return 0;
| }
`----
Now for the simple strlen implementation. Note that FreeBSD provides
the nice little macros `ENTRY()' and `END()', which are used
throughout.
,----
| ENTRY(_strlen)
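| BIC x10,x0,#0xf // round the pointer down to 16-byte alignment
| LDR q0,[x10] // load 16 bytes into a vector register
| CMEQ v0.16b,v0.16b,#0 // bytes equal to 0 become 0xff, others 0x00
| SHRN v0.8b,v0.8h,#4 // narrow to a 64-bit mask, one nibble per byte
| FMOV x0,d0 // move the mask to a GPR
| RBIT x0,x0 // reverse bits, as there is no ctz instruction
| CLZ x0,x0 // count leading zeros (= trailing zeros of the mask)
| LSR x0,x0,#2 // divide by 4: nibble index = byte index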
| RET
| END(_strlen)
`----
1.4.2 Naïve implementation
--------------------------
Now for a simple but naïve solution that adds a loop and handles
strings which are not already 16-byte aligned. We simply calculate the
distance to the next 16-byte boundary, traverse to the boundary with a
scalar implementation and then switch to the SIMD variant.
,----
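| ENTRY(_strlen)
| MOV x10,x0 // x10 = cursor, x0 keeps the start of the string
| AND x9,x0,#0xf // offset past 16-byte alignment
| CBZ x9,.Laligned_loop // already aligned: go straight to SIMD
| MOV x11,#16
| SUB x9,x11,x9 // bytes left until the next 16-byte boundary
|
| .Lunaligned_start:
| LDRB w5,[x10] // scalar part: check one byte at a time
| CBZ w5,.Lfound_null
| ADD x10,x10,#1
| SUB x9,x9,#1
| CBNZ x9,.Lunaligned_start
| B .Laligned_loop
|
| .Lnext_it:
| ADD x10,x10,#16
|
| .Laligned_loop:
| LDR q0,[x10]
| CMEQ v0.16b,v0.16b,#0
| SHRN v0.8b,v0.8h,#4
| FMOV x1,d0
| CBZ x1,.Lnext_it
| RBIT x1,x1
| CLZ x1,x1
| LSR x1,x1,#2 // index of the NUL byte within this chunk
| ADD x10,x10,x1 // x10 = address of the NUL byte
|
| .Lfound_null:
| SUB x0,x10,x0 // length = address of NUL - start of string
| RET
| END(_strlen)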
`----
This doesn't make use of handy Aarch64 addressing modes such as the
pre-indexed `LDR qreg,[xreg,#imm]!', which increments the base
register by the immediate value before the load.
1.4.3 Improved implementation
-----------------------------
Now we improve on our previous implementation by using SIMD
instructions for the first chunk as well. We move the result to a GPR
right after the first check to improve performance for short strings,
as a GPR move is required for the result anyway and, statistically,
short strings are the most common in real-world scenarios. This
statement is backed by a survey
conducted by the LLVM project. +I have not been able to find a direct
link to those results, but I will update this post when I find them.+
See:
,----
| ENTRY(_strlen)
| BIC x10,x0,#0xf
| LDR q0,[x10]
| CMEQ v0.16b,v0.16b,#0
| SHRN v0.8b,v0.8h,#4
| FMOV x1,d0 // move the mask to a GPR
| LSL x2,x0,#2 // 4 * pointer; LSR uses it mod 64 = 4 * misalignment
| LSR x1,x1,x2 // shift out the nibbles of the junk bytes before the string
| CBZ x1,.Lloop // no NUL in the first chunk: enter the loop
| RBIT x1,x1
| CLZ x0,x1
| LSR x0,x0,#2
| RET
|
| .Lloop:
| LDR q0,[x10,#16]! // increment by 16 and load
| CMEQ v0.16b,v0.16b,#0
| SHRN v0.8b,v0.8h,#4
| FMOV x1,d0 // get offset in case of hit
| CBZ x1,.Lloop // x1 is zero if no hit in segment
| .Ldone:
| SUB x0,x10,x0
| RBIT x1,x1
| CLZ x3,x1
| LSR x3,x3,#2
| ADD x0,x0,x3
| RET
| END(_strlen)
`----
1.4.4 FCMP to avoid GPR move
----------------------------
After improving on the naïve implementation I realized that we can
avoid a move from a SIMD register to a GPR by using FCMP in the
loop. I also realized that we can avoid a few instructions if the
input is already 16 byte aligned (see `.Laligned'), although it does
introduce a new branch to be resolved.
The later benchmarks will indicate whether or not this is a worthwhile
improvement.
,----
| ENTRY(_strlen)
| BIC x10,x0,#0xf
| AND x9,x0,#0xf
| LDR q0,[x10]
| CMEQ v0.16b,v0.16b,#0
| SHRN v0.8b,v0.8h,#4
| CBZ x9,.Laligned
| FMOV x1,d0
| LSL x2,x0,#2
| LSR x1,x1,x2
| CBZ x1,.Lloop
| RBIT x1,x1
| CLZ x0,x1
| LSR x0,x0,#2
| RET
|
| .Laligned:
| FMOV x1,d0
| CBNZ x1,.Ldone
|
| .Lloop:
| LDR q0,[x10,#16]!
| CMEQ v0.16b,v0.16b,#0
| SHRN v0.8b,v0.8h,#4
| FCMP d0,#0.0
| B.EQ .Lloop
| FMOV x1,d0
| .Ldone:
| SUB x0,x10,x0
| RBIT x1,x1
| CLZ x3,x1
| LSR x3,x3,#2
| ADD x0,x0,x3
| RET
| END(_strlen)
`----
Now we can look at what further improvements can be made. We could
reduce the loop-carried dependency: currently each iteration uses a
pre-increment to advance to the next chunk. We could unroll the loop
twice and increment x10 only every two iterations to make it easier
for the CPU to run two iterations at once. Aarch64 also has
instructions for loading several SIMD registers at once using the
`LD1', `LD2', ... family of instructions. [[16]]
[16]
1.5 libc integration
~~~~~~~~~~~~~~~~~~~~
Getting a string function written in assembly integrated into libc on
FreeBSD isn't as big of an ordeal as it may sound like. We simply use
the `ENTRY()' and `END()' macros in the code and add the filenames to
the associated `Makefile.inc' located at
`lib/libc/aarch64/string/Makefile.inc' like the following:
,----
| @@ -15,11 +15,12 @@ AARCH64_STRING_FUNCS= \
| strchrnul \
| strcmp \
| strcpy \
| - memcmp \
| strncmp \
| strnlen \
| strrchr
|
| +MDSRCS+= \
| + memcmp.S
| #
| # Add the above functions. Generate an asm file that includes the needed
| # Arm Optimized Routines file defining the function name to the libc name.
`----
We can then build libc as a shared library and load it using
`LD_PRELOAD' for running regression tests, since running FreeBSD with
a broken libc makes FreeBSD very sad and prone to severe errors. It's
always nice to avoid a broken install while debugging.
Building libc is as simple as
,----
| cd /usr/src/lib/libnetbsd && make
| cd /usr/src/lib/libc && make
|
| # OR
|
| make -C /usr/src/lib/libc MAKEOBJDIRPREFIX=/tmp/objdir WITHOUT_TESTS=yes
`----
As for debugging it's as simple as loading up a test binary with
`lldb' and setting a breakpoint for the string function being
developed, `strlen' in our case.
1.6 Tests
~~~~~~~~~
FreeBSD comes bundled with an excellent test suite [[14]]. The tests
are written using the `Kyua' test framework with the `ATF' library.
The FreeBSD wiki page has all the information necessary for this and
there's no need for me to repeat it here. :-)
Running the tests is as simple as going to
`/usr/src/lib/libc/tests/string' and running `make check'. If you
haven't already run `buildworld' then you will need to build
`lib/libnetbsd' first. This is because FreeBSD also borrows some tests
from upstream NetBSD, located at
`/usr/src/contrib/netbsd-tests/lib/libc/string'.
[14]
1.7 Benchmarks
~~~~~~~~~~~~~~
Benchmarks are executed using fuz' strperf program [[17]]; its output
is compatible with benchstat from devel/go-perf. I benchmark all my
implementations against the implementations in libc. I also tried
running the code on a Raspberry Pi 5 running Debian to see how strlen
holds up against glibc's implementation.
I have benchmarked the previously described implementations to see
the performance impact of a GPR move followed by a `CBZ' compared to
an `FCMP' followed by a `B.EQ', and the impact of branching
immediately in the case of an already aligned string.
It's also important to note that these implementations are hardware
dependent as different cores may utilize more or fewer pipelines for
specific instructions.
To test against glibc I borrowed a Raspberry Pi 5 running Debian and
installed BSD make (`apt install bmake') to compile strperf.
Then to generate benchmark results it's as simple as:
`for i in {1..20}; do ./strlen >> results/${TEST}; done'
This produced the following results when evaluated with
`benchstat'. You might need to scroll horizontally to view all the
results.
,----
| os: FreeBSD
| arch: arm64
| cpu: ARM Cortex-A76 r4p1
|         │ libc_Scalar  │              libc_ARM               │                 GPR                 │             GPR_aligned             │                FCMP                 │            FCMP_aligned             │
|         │    sec/op    │   sec/op     vs base                │   sec/op     vs base                │   sec/op     vs base                │   sec/op     vs base                │   sec/op     vs base                │
| Short      186.9µ ± 1%   134.6µ ± 0%  -28.01% (p=0.000 n=20)   121.0µ ± 0%  -35.26% (p=0.000 n=20)   118.5µ ± 0%  -36.62% (p=0.000 n=20)   121.8µ ± 0%  -34.85% (p=0.000 n=20)   119.8µ ± 0%  -35.91% (p=0.000 n=20)
| Mid        45.05µ ± 1%   37.07µ ± 0%  -17.73% (p=0.000 n=20)   33.36µ ± 0%  -25.96% (p=0.000 n=20)   30.43µ ± 1%  -32.45% (p=0.000 n=20)   33.37µ ± 0%  -25.93% (p=0.000 n=20)   29.98µ ± 1%  -33.45% (p=0.000 n=20)
| Long      13.894µ ± 0%   4.442µ ± 0%  -68.03% (p=0.000 n=20)   6.978µ ± 0%  -49.78% (p=0.000 n=20)   6.977µ ± 0%  -49.79% (p=0.000 n=20)   6.852µ ± 0%  -50.68% (p=0.000 n=20)   5.627µ ± 0%  -59.50% (p=0.000 n=20)
| geomean    48.91µ        28.09µ       -42.58%                  30.43µ       -37.79%                  29.30µ       -40.09%                  30.31µ       -38.03%                  27.24µ       -44.31%
|
|         │ libc_Scalar  │                libc_ARM                │                  GPR                   │              GPR_aligned               │                  FCMP                  │              FCMP_aligned              │
|         │     B/s      │      B/s        vs base                │      B/s        vs base                │      B/s        vs base                │      B/s        vs base                │      B/s        vs base                │
| Short     637.7Mi ± 1%    885.9Mi ± 0%   +38.91% (p=0.000 n=20)    985.1Mi ± 0%   +54.47% (p=0.000 n=20)   1006.1Mi ± 0%   +57.77% (p=0.000 n=20)    978.9Mi ± 0%   +53.49% (p=0.000 n=20)    995.0Mi ± 0%   +56.02% (p=0.000 n=20)
| Mid       2.584Gi ± 1%    3.141Gi ± 0%   +21.55% (p=0.000 n=20)    3.490Gi ± 0%   +35.07% (p=0.000 n=20)    3.825Gi ± 1%   +48.03% (p=0.000 n=20)    3.488Gi ± 0%   +35.01% (p=0.000 n=20)    3.883Gi ± 1%   +50.26% (p=0.000 n=20)
| Long      8.379Gi ± 0%   26.210Gi ± 0%  +212.81% (p=0.000 n=20)   16.684Gi ± 0%   +99.12% (p=0.000 n=20)   16.686Gi ± 0%   +99.15% (p=0.000 n=20)   16.990Gi ± 0%  +102.77% (p=0.000 n=20)   20.690Gi ± 0%  +146.93% (p=0.000 n=20)
| geomean   2.380Gi         4.145Gi        +74.15%                   3.826Gi        +60.76%                   3.973Gi        +66.92%                   3.841Gi        +61.37%                   4.274Gi        +79.56%
|
| os: Linux
| arch: aarch64
|         │ strlen_glibc │
|         │    sec/op    │
| Short     132.1µ ± 0%
| Mid       36.29µ ± 1%
| Long      4.365µ ± 4%
| geomean   27.55µ
|
|         │ strlen_glibc │
|         │     B/s      │
| Short     902.7Mi ± 0%
| Mid       3.208Gi ± 1%
| Long      26.67Gi ± 4%
| geomean   4.225Gi
`----
[17]
1.8 What's next
~~~~~~~~~~~~~~~
Despite my strlen performing worse than the existing libc
implementation on longer strings, I will now continue the porting
effort and translate `memcmp' to Aarch64 NEON. I will also keep
optimizing `strlen' by unrolling the main loop twice as previously
mentioned. So instead of the current `LDR q0,[x10,#16]!' I'll do a
`LDP q1, q2, [x10, #32]!'.
Hopefully I can get strlen done by the end of the week and submit it
for review to the FreeBSD Phabricator instance.
I also attended LundLinuxCon [[18]] last week and saw a really
interesting talk regarding safer flexible arrays in the Linux kernel
[[19]]. It involved a new warning flag and some niceties that were
recently merged into GCC 15 [[20]] and LLVM 18 [[21]]. I will try
building FreeBSD with those flags and see how many warnings are
present, but first I need to read up a little bit on whether or not
FreeBSD even permits the use of flexible arrays in the kernel. :-)
[18]
[19]
[20]
[21]
1.9 References
~~~~~~~~~~~~~~
[[1]]
[[2]]
[[3]]
[[4]]
[[5]]
[[6]]
[[7]]
[[8]]
[[9]]
[[10]]
[[11]]
[[12]]
[[13]]
[[14]]
[[15]]
[[16]]
[[17]]
[[18]]
[[19]]
[[20]]
[[21]]
2 Bonus: Hello World in Aarch64 Assembly
========================================
The FreeBSD Developers' Handbook has a section on writing assembly,
but it's targeted towards x86 and is rather outdated. Using its
suggestions, Hello World would look like this.
,----
| .text
| .global _start
|
| kernel:
| int $0x80
| ret
|
| _start:
| mov $4, %rax
| mov $1, %rdi
| mov $message, %rsi
| mov $13, %rdx
| call kernel
| mov $1, %rax
| mov $69, %rdi
| syscall
|
| message:
| .ascii "Hello, world\n"
`----
But `INT 80h' is much slower than `syscall' (or `svc #0' on arm64).
This is because a lot of microcode is run when `INT' is executed,
whereas the microcode for `syscall' is much simpler. `INT 80h' does
still work on amd64, though!
While `0x80' is the i386 syscall interface, it incidentally works for
amd64 tasks because we don't check if the process doing a syscall is a
32-bit or 64-bit one. But arguments are truncated to 32 bits, so you
can't e.g. pass pointers to the stack! And arm64 of course doesn't
have the `INT' instruction.
In general, the method of doing syscalls is different on each
architecture. FreeBSD is moving towards what Win32 and Solaris already
pioneered: syscalls should be done by calling library functions so the
kernel ABI and API can be adapted in the future. For this reason
syscalls will be split into a new library, `libsys', in FreeBSD 15.
If you're wondering how to figure out what number each syscall
corresponds to then you can check
,----
| /*
| Compile with the following for non Aarch64 host:
| aarch64-unknown-freebsd14.0-gcc13 --sysroot /usr/local/freebsd-sysroot/aarch64 hello.S -nostdlib
|
| Run with:
| qemu-aarch64-static ./a.out
| */
|
| .text
|
| /* Our application's entry point. */
| .global _start
|
| _start:
| /* syscall write(int fd, const void *buf, size_t count) */
| mov x0, #1 /* fd := STDOUT_FILENO */
| ldr x1, = msg /* buf := msg */
| ldr x2, = len /* count := len */
| mov w8, #4 /* write is syscall #4 */
| svc #0 /* invoke syscall */
|
| /* syscall exit(int status) */
| mov x0, #69 /* status := 69 */
| mov w8, #1 /* exit is syscall #1 */
| svc #0 /* invoke syscall */
|
| /* Data segment: define our message string and calculate its length. */
| .data
| msg:
| .ascii "Hello, world\n"
| len = . - msg
`----