When I was working on a generic port of Google's hashmap to C, I wrote a function that (ignoring irrelevant parts) looked like this:
typedef struct {
uint8_t *bytes;
size_t len;
} bytebuf;
void do_things(bytebuf *b) {
uint8_t *i = b->bytes;
for (; i < b->bytes + b->len; i++) {
*i |= 0xAA;
}
}
Just iterate over the buffer, doing a trivial bitop on each byte, how complicated can it be? I soon became curious to see how compilers would vectorize this code for different architectures. The first target I tried was x86_64 with AVX2 for which recent Clang gave me the following assembly (annotated by me):
do_things:
cmp qword ptr [rdi + 8], 0 ;
jle .LBB0_3 ; if ((uint64_t)b->len < 0) goto endloop
mov rax, qword ptr [rdi] ; rax = b->bytes
.LBB0_2: ;loop:
or byte ptr [rax], -86 ; *rax |= 0xAA
inc rax ; rax++
mov rcx, qword ptr [rdi] ; rcx = b->bytes // ?!
add rcx, qword ptr [rdi + 8] ; rcx += b->len // ?!
cmp rax, rcx ;
jb .LBB0_2 ; if (rax < rcx) goto loop
.LBB0_3: ;endloop:
ret ; return
... and that got me confused for a few seconds. Why is it not vectorized? And worse, why does it fetch b->len
and b->bytes
from memory on every iteration? Why would Clang have a problem doing basic optimizations here? Well...
The thing is, that code above modifies some uint8_t
in memory, and uint8_t
is just a typedef to unsigned char
. char
is a
special type in both C and C++: it is the only type that is allowed to read and modify memory occupied by other
types. So compiler has to account for the bizarre case where *i |= 0xAA
might change the value of b->len
or b->bytes
!
This is why it can't pre-compute b->bytes + b->len
and that stops it from unrolling the loop. Be it uint16_t
instead of
uint8_t
for example, there would be nothing to stop Clang and GCC from unrolling this code however they like (at least as
long as uint16_t
and size_t
are not the same type under the hood). Ok, but how to keep using uint8_t
and allow the
compiler to optimize the code properly?
If you rewrite example code to
void do_things(bytebuf *b) {
uint8_t *i = b->bytes;
uint8_t *end = b->bytes + b->len;
for (; i < end; i++) {
*i |= 0xAA;
}
}
compiler will be able to unroll this loop as loop condition now depends on local variables that b->bytes
can't possibly alias.
Unfortunately, converting existing code to this approach can be a bit too tedious when you have a lot of such code.
restrict
is a type qualifier that applies to pointers, asserting that the memory to which this pointer points can only be accessed
by this pointer or pointers derived from it
(see restrict on cppreference for details). So just
by adding restrict
to the declaration of function parameter like this:
void do_things(bytebuf *restrict b) { // ...
you allow the compiler to assume that b->bytes
and b->len
won't be modified in the loop, and thus let it unroll the loop.
Although restrict
is not officially part of C++, it is available as an extension in Clang, GCC, MVC as __restrict
.
Unfortunately, fixing existing code with restrict
can also be a lot of boring and demanding work, as you would need to
carefully examine the code where you are adding restrict
and how that code is used from other parts of the program, because
misusing restrict
will bring you very nasty bugs (see CERT entry on that for more info).
C++20 introduced char8_t
as a new fundamental type. It does not have the special aliasing properties of char
, so replacing
uint8_t
with char8_t
in the example code will make compilers produce properly optimized code. There is a catch however:
char8_t
is unsigned, and attempts to make it signed using signed char8_t
will be rejected by compilers. So extra casts to
int8_t
will be needed once char8_t
is retrieved from memory when you need to treat its value as signed. In C,
Clang makes char8_t
available as a fundamental type like in C++ when -fchar8_t
command-line option is passed, but it does
not work with GCC, and anyway, this feature is going to conflict with the upcoming C23 where char8_t
will be added as a typedef to
unsigned char
in the <uchar.h>
header.
The next C standard, C23 will include a new feature called "bit-precise integers". It allows us to have integer types with
custom bit-widths through the syntax _BitInt(N)
where N
is a constant expression indicating the amount of bits this integer
type has. It is signed by default but can be made unsigned with the unsigned
keyword. So _BitInt(8)
would be an 8-bit integer
like char
, except that since it's not a char
, it is not allowed to alias other types, thus allowing for better code to be generated.
Just changing types from uint8_t
to _BitInt(8)
in the code from the beginning of this article makes compiler generate
vectorized code. Using this approach to fix existing codebase is simple and safe. Unfortunately, current GCC version 13 does
not support _BitInt
, Clang supports this feature in C and as an extension in C++ since version 14 released in 25 March 2022. So
if you are stuck with older compiler versions, you are out of luck.
unique_ptr<int8_t>
and vector<int8_t>
as well as their std::byte
and char
counterparts, also suffer from the char aliasing
problem. That is, assignment to the element of vector<int8_t>
makes compilers assume that (almost) all memory has been clobbered.
Rust is pretty strict on aliasing: at most one mutable reference to a part of memory can exist. This is backed up by static
checks in compiler that report as error any attempt to create more than one mutable reference to a value in memory (at least
within the "Safe" Rust). Of course, Rustc's optimizer takes advantage of uniqueness of mutable references.
D language does not impose even cross-type aliasing restrictions, except in case of
vector operations where involved slices must not overlap.
Zig allows all kinds of aliasing unless pointer is marked by noalias
undocumented keyword. Aliasing rules in Zig are
probably subject to change as the language is still in beta.
approach | fitness | safety | protability |
---|---|---|---|
keeping data local | works only when you can fit this data on stack | safe | maximum |
restrict |
made for this, but breaks easily | unsafe | C99, in C++ as extension |
char8_t |
can only be unsigned, two buffers are still alowed to alias each other | safe | since C++20, C++ only |
_BitInt(8) |
two buffers are still alowed to alias each other | safe | C23, in C++ as extension. not yet in GCC |
Does the C or C++ codebase you have worked on have performance-sensitive code that writes through int8_t
or char
pointers
or vectors somewhere in the middle?
- Thinking about it, it is weird that C/C++ code working on bytes in memory is inherently harder for compiler to optimize compared to other integer widths due to how languages are defined. For C it was so for around 34 years from the first standard C89 until C23.
restrict
was introduced in C99 and was at least partially motivated by performance advantage Fortran had over C: Fortran language standard forbids passing overlapping arrays to functions, which gave Fortran compilers more freedom in reordering and vectorizing code that operated on multiple arrays. Later GCC (since version 4.3 released in March 2008) and Clang (3.3 - June 2013) started to insert runtime checks for array overlap that would branch into vectorized code when it was safe to and to unvectorized code otherwise.
Great post, but one thing to note is that C++20's
std::span
seems to fix this for allchar
derivatives. At least based on my testing here: https://godbolt.org/z/ddaW1cs5q