Getz Mikalsen


Table of Contents
_________________

1. GSoC status update #5 (cpy, ccpy, lcpy)
2. What's been going on the last few weeks
.. 1. memccpy
.. 2. strlcpy
.. 3. strlcat
.. 4. memcpy
.. 5. coming week
.. 6. Benchmarks
3. References


1 GSoC status update #5 (cpy, ccpy, lcpy)
=========================================

04-08-2024


2 What's been going on the last few weeks
=========================================

So the last few weeks I've been working on the `*cpy' functions. I
gave `strcspn' a try, but it's really tricky, so I decided to save it
for last and perhaps continue with it past the GSoC deadline if
needed. I started with `memccpy' and let it be the base for
`strlcpy'. I also noticed that `stpncpy' is quite similar, but I
haven't gotten around to writing it yet, although the functions that
are already done will serve as a rough base for it.


2.1 memccpy
~~~~~~~~~~~

It turned out to be rather tricky to get this one right, a classic
case of the [Ninety-ninety rule]. It went smoothly at first, but
ironing out little bugs such as being off by one here and there
really took a while. Having two debuggers open side by side and
comparing register values from the amd64 variant to my aarch64 port
was really useful.

To complicate things further, I found a bug in the existing `memccpy'
implementation, which fuz took care of. [1] At first the fix caused a
small degradation in performance, but after some rework it resulted
in an improvement over the original faulty code! :-) This did mean
that I rewrote parts of `memccpy' three times in order to improve
performance.

In the end it turned out quite nice, beating the scalar performance
by over 1000% for longer strings across all processors I was able to
benchmark on. Sadly it turned out to be a whole 20% slower for short
strings on the Neoverse N1 chip.


2.2 strlcpy
~~~~~~~~~~~

Very similar to memccpy, except that we return the length of the
`src' string instead of a pointer. This means that we essentially run
`strlen' at the same time, which allows for some nice multitasking.
One idea I had was to use the `strlen' approach that is fast for
short to medium strings for the case where the whole source string
isn't copied: my reasoning is that we generally reach the limit near
the end of a string rather than near the beginning, so the tail left
to measure is usually short.

`strlcpy' is presented in this paper [2], and recently the Linux
kernel has been moving away from `strlcpy' in favor of `strscpy'. [3]
There surely is a meme out there about the naming of each new variant
of the string functions that makes them more /"safe"/. :-)


2.3 strlcat
~~~~~~~~~~~

`strlcat' is rather simple to implement using existing string
functions. Take the one from FreeBSD's libc: [4]

,----
| size_t
| strlcat(char *restrict dst, const char *restrict src, size_t dstsize)
| {
| 	/* Find the end of dst, looking no further than dstsize bytes. */
| 	char *loc = __memchr(dst, '\0', dstsize);
| 
| 	if (loc != NULL) {
| 		size_t dstlen = (size_t)(loc - dst);
| 
| 		return (dstlen + __strlcpy(loc, src, dstsize - dstlen));
| 	} else
| 		/* No NUL within dstsize: nothing is appended. */
| 		return (dstsize + strlen(src));
| }
`----

Now the only issue is that I haven't written a SIMD `memchr', as
there is already a well optimized variant in the
arm-optimized-routines repository in /contrib. The first problem is
that I should not touch anything in /contrib, as it is just pulled
from a git repo. The second is that Arm labels their `memchr'
`__memchr_aarch64', and the Makefile then does a little rewrite:

,----
| .for FUNC in ${AARCH64_STRING_FUNCS}
| .if !exists(${FUNC}.S)
| ${FUNC}.S:
| 	printf '/* %sgenerated by libc/aarch64/string/Makefile.inc */\n' @ > ${.TARGET}
| 	printf '#define __%s_aarch64 %s\n' ${FUNC} ${FUNC} >> ${.TARGET}
| 	printf '#include "aarch64/%s.S"\n' ${FUNC} >> ${.TARGET}
| CLEANFILES+=	${FUNC}.S
| .endif
| .endfor
`----

The reason why we can't just call `memchr' is that libc functions
must be designed such that they do not appear to call other libc
functions, in case the user overrides any of them. And calling
something `__something' is a rather simple fix for that. :-) So the
solution is to create a little wrapper for `memchr' (a rough sketch
follows below), which naturally brings us to memcpy!
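To make the idea concrete, here is a minimal sketch of what such a
wrapper could look like in C. This is just my illustration of the
technique, not necessarily the exact fix that was committed; it only
uses names that appear above, `__memchr_aarch64' as exported by the
Makefile rewrite and `__memchr' as called by `strlcat':

,----
| #include <stddef.h>
| 
| /* The optimized routine from /contrib, under its internal Arm name. */
| void	*__memchr_aarch64(const void *, int, size_t);
| 
| /*
|  * Internal alias: other libc code (like strlcat above) calls
|  * __memchr, so it never appears to call a user-overridable memchr.
|  */
| void *
| __memchr(const void *b, int c, size_t len)
| {
| 	return (__memchr_aarch64(b, c, len));
| }
`----

In practice you would rather create the alias at the symbol level
(much like the `#define __%s_aarch64 %s' trick in the Makefile above)
so there is no extra function call, but the idea is the same: give
the routine an `__'-prefixed internal name that user code cannot
override.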
2.4 memcpy
~~~~~~~~~~

I noticed that the `memcpy' currently included in libc wasn't SIMD
optimized, despite there being an ASIMD `memcpy' variant in
/contrib. The reason is that, unlike for all the other string
functions, Arm has provided both a scalar `memcpy' and a SIMD
`memcpy'. This is probably due to the string functions being
grandfathered in from an older optimized-routines repository. Anyway,
ASIMD is part of the base ISA of AArch64, so as long as it's not
slower there's little incentive not to use it. The fix was simple:

,----
| diff --git a/lib/libc/aarch64/string/memcpy.S b/lib/libc/aarch64/string/memcpy.S
| --- a/lib/libc/aarch64/string/memcpy.S
| +++ b/lib/libc/aarch64/string/memcpy.S
| @@ -1,6 +1,6 @@
| -#define __memcpy_aarch64 memcpy
| -#define __memmove_aarch64 memmove
| -#include "aarch64/memcpy.S"
| +#define __memcpy_aarch64_simd memcpy
| +#define __memmove_aarch64_simd memmove
| +#include "aarch64/memcpy-advsimd.S"
`----


2.5 coming week
~~~~~~~~~~~~~~~

GSoC is slowly nearing its end; there are just over 2 weeks left of
the standard (12 week) coding period. These final weeks will consist
of me tying up some loose ends, such as the missing bcmp ifdefs for
memcmp (which bundles both bcmp and memcmp into the same file), and
finishing the implementation of the fancy algorithm for `strcspn'.


2.6 Benchmarks
~~~~~~~~~~~~~~

There are quite a few benchmarks, so it's better to visit the DR
(Differential Revision) for each of them instead:

[memcpy] [memccpy] [strlcpy]


3 References
============