Getz Mikalsen


Table of Contents
_________________

1. GSoC status update #5 (cpy, ccpy, lcpy)
2. What's been going on the last few weeks
.. 1. memccpy
.. 2. strlcpy
.. 3. strlcat
.. 4. memcpy
.. 5. coming week
.. 6. Benchmarks
3. References


1 GSoC status update #5 (cpy, ccpy, lcpy)
=========================================

04-08-2024


2 What's been going on the last few weeks
=========================================

So the last few weeks I've been working on the `*cpy' functions. I
gave `strcspn' a try, but it's really tricky, so I decided to save it
for last and perhaps continue with it past the GSoC deadline if
needed. I started with `memccpy' and let it be the base for
`strlcpy'. I also noticed that `stpncpy' is quite similar, but I
haven't gotten around to writing it yet, although the functions that
are already done will serve as a rough base for it.


2.1 memccpy
~~~~~~~~~~~

It turned out to be rather tricky to get this one right, a classic
case of the [Ninety-ninety rule]. It went smoothly at first, but
ironing out little bugs such as being off by one here and there
really took a while. Having two debuggers open side by side and
comparing register values from the amd64 variant to my aarch64 port
was really useful.

To complicate things further, I found a bug in the existing `memccpy'
implementation, which fuz took care of. [1] At first the fix caused a
small degradation in performance, but after some rework it resulted
in an improvement over the original faulty code! :-) This did mean
that I rewrote parts of `memccpy' three times in order to improve
performance.

In the end it turned out quite nice, beating the scalar performance
by over 1000% for longer strings across all processors I was able to
benchmark on. Sadly it turned out to be a whole 20% slower for short
strings on the Neoverse N1 chip.


2.2 strlcpy
~~~~~~~~~~~

Very similar to memccpy, except that we return the length of the
`src' string instead of a pointer. This means that we essentially run
`strlen' at the same time, which allows for some nice multitasking.
One idea I had was to use the `strlen' approach that is fast for
short to medium strings for the case where the whole source string
isn't copied: my reasoning is that we generally reach the limit near
the end of a string rather than near the beginning, so the tail left
to measure is usually short.

`strlcpy' is presented in this paper [2], and recently the Linux
kernel has been moving away from `strlcpy' in favor of `strscpy'. [3]
There surely is a meme out there about the naming of each new variant
of the string functions that makes them more /"safe"/. :-)


2.3 strlcat
~~~~~~~~~~~

`strlcat' is rather simple to implement using existing string
functions. Take the one from FreeBSD's libc: [4]

,----
| size_t
| strlcat(char *restrict dst, const char *restrict src, size_t dstsize)
| {
| 	/* Find the end of dst, looking no further than dstsize bytes. */
| 	char *loc = __memchr(dst, '\0', dstsize);
| 
| 	if (loc != NULL) {
| 		size_t dstlen = (size_t)(loc - dst);
| 
| 		return (dstlen + __strlcpy(loc, src, dstsize - dstlen));
| 	} else
| 		/* No NUL within dstsize: nothing is appended. */
| 		return (dstsize + strlen(src));
| }
`----

Now the only issue is that I haven't written a SIMD `memchr', as
there is already a well optimized variant in the
arm-optimized-routines repository in /contrib. The first problem is
that I should not touch anything in /contrib, as it is just pulled
from a git repo. The second is that Arm labels their `memchr'
`__memchr_aarch64', and the Makefile then does a little rewrite:

,----
| .for FUNC in ${AARCH64_STRING_FUNCS}
| .if !exists(${FUNC}.S)
| ${FUNC}.S:
| 	printf '/* %sgenerated by libc/aarch64/string/Makefile.inc */\n' @ > ${.TARGET}
| 	printf '#define __%s_aarch64 %s\n' ${FUNC} ${FUNC} >> ${.TARGET}
| 	printf '#include "aarch64/%s.S"\n' ${FUNC} >> ${.TARGET}
| CLEANFILES+=	${FUNC}.S
| .endif
| .endfor
`----

The reason why we can't just call `memchr' is that libc functions
must be designed such that they do not appear to call other libc
functions, in case the user overrides any of them. And calling
something `__something' is a rather simple fix for that. :-) So the
solution is to create a little wrapper for `memchr' (a rough sketch
follows below), which naturally brings us to memcpy!
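To make the idea concrete, here is a minimal sketch of what such a
wrapper could look like in C. This is just my illustration of the
technique, not necessarily the exact fix that was committed; it only
uses names that appear above, `__memchr_aarch64' as exported by the
Makefile rewrite and `__memchr' as called by `strlcat':

,----
| #include <stddef.h>
| 
| /* The optimized routine from /contrib, under its internal Arm name. */
| void	*__memchr_aarch64(const void *, int, size_t);
| 
| /*
|  * Internal alias: other libc code (like strlcat above) calls
|  * __memchr, so it never appears to call a user-overridable memchr.
|  */
| void *
| __memchr(const void *b, int c, size_t len)
| {
| 	return (__memchr_aarch64(b, c, len));
| }
`----

In practice you would rather create the alias at the symbol level
(much like the `#define __%s_aarch64 %s' trick in the Makefile above)
so there is no extra function call, but the idea is the same: give
the routine an `__'-prefixed internal name that user code cannot
override.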
2.4 memcpy
~~~~~~~~~~

I noticed that the `memcpy' currently included in libc wasn't SIMD
optimized, despite there being an ASIMD `memcpy' variant in
/contrib. The reason is that, unlike for all the other string
functions, Arm has provided both a scalar `memcpy' and a SIMD
`memcpy'. This is probably due to the string functions being
grandfathered in from an older optimized-routines repository. Anyway,
ASIMD is part of the base ISA of AArch64, so as long as it's not
slower there's little incentive not to use it. The fix was simple:

,----
| diff --git a/lib/libc/aarch64/string/memcpy.S b/lib/libc/aarch64/string/memcpy.S
| --- a/lib/libc/aarch64/string/memcpy.S
| +++ b/lib/libc/aarch64/string/memcpy.S
| @@ -1,6 +1,6 @@
| -#define __memcpy_aarch64 memcpy
| -#define __memmove_aarch64 memmove
| -#include "aarch64/memcpy.S"
| +#define __memcpy_aarch64_simd memcpy
| +#define __memmove_aarch64_simd memmove
| +#include "aarch64/memcpy-advsimd.S"
`----


2.5 coming week
~~~~~~~~~~~~~~~

GSoC is slowly nearing its end; there are just over 2 weeks left of
the standard (12 week) coding period. These final weeks will consist
of me tying up some loose ends, such as the missing bcmp ifdefs for
memcmp (which bundles both bcmp and memcmp into the same file), and
finishing the implementation of the fancy algorithm for `strcspn'.


2.6 Benchmarks
~~~~~~~~~~~~~~

There are quite a few benchmarks, so it's better to visit the DR
(Differential Revision) for each of them instead:

[memcpy] [memccpy] [strlcpy]


3 References
============