Skip to content

Instantly share code, notes, and snippets.

@alexei-zaripov
Last active June 6, 2024 21:19
Show Gist options
  • Save alexei-zaripov/dcc14c78819c5f1354afe8b70932007c to your computer and use it in GitHub Desktop.
Save alexei-zaripov/dcc14c78819c5f1354afe8b70932007c to your computer and use it in GitHub Desktop.

C/C++ performance pitfall: int8_t, aliasing and the ways out

When I was working on a generic port of Google's hashmap to C, I wrote a function that (ignoring irrelevant parts) looked like this:

typedef struct {
    uint8_t *bytes;
    size_t len;
} bytebuf;

void do_things(bytebuf *b) {
    uint8_t *i = b->bytes;
    for (; i < b->bytes + b->len; i++) {
        *i |= 0xAA;
    }
}

Just iterate over the buffer, doing a trivial bitop on each byte, how complicated can it be? I soon became curious to see how compilers would vectorize this code for different architectures. The first target I tried was x86_64 with AVX2 for which recent Clang gave me the following assembly (annotated by me):

do_things:
        cmp     qword ptr [rdi + 8], 0    ;  
        jle     .LBB0_3                   ;  if ((uint64_t)b->len < 0)  goto endloop
        mov     rax, qword ptr [rdi]      ;  rax = b->bytes
.LBB0_2:                                  ;loop:
        or      byte ptr [rax], -86       ;  *rax |= 0xAA
        inc     rax                       ;  rax++
        mov     rcx, qword ptr [rdi]      ;  rcx = b->bytes  // ?!
        add     rcx, qword ptr [rdi + 8]  ;  rcx += b->len   // ?!
        cmp     rax, rcx                  ;  
        jb      .LBB0_2                   ;  if (rax < rcx)  goto loop
.LBB0_3:                                  ;endloop:
        ret                               ;  return

... and that got me confused for a few seconds. Why is it not vectorized? And worse, why does it fetch b->len and b->bytes from memory on every iteration? Why would Clang have a problem doing basic optimizations here? Well...

Problem: pointers to char have special aliasing properties

The thing is, that code above modifies some uint8_t in memory, and uint8_t is just a typedef to unsigned char. char is a special type in both C and C++: it is the only type that is allowed to read and modify memory occupied by other types. So compiler has to account for the bizarre case where *i |= 0xAA might change the value of b->len or b->bytes! This is why it can't pre-compute b->bytes + b->len and that stops it from unrolling the loop. Be it uint16_t instead of uint8_t for example, there would be nothing to stop Clang and GCC from unrolling this code however they like (at least as long as uint16_t and size_t are not the same type under the hood). Ok, but how to keep using uint8_t and allow the compiler to optimize the code properly?

Approach 1: keep the data in local variables so compiler can assume it is not aliased

If you rewrite example code to

void do_things(bytebuf *b) {
    uint8_t *i = b->bytes;
    uint8_t *end = b->bytes + b->len;
    for (; i < end; i++) {
        *i |= 0xAA;
    }
}

compiler will be able to unroll this loop as loop condition now depends on local variables that b->bytes can't possibly alias. Unfortunately, converting existing code to this approach can be a bit too tedious when you have a lot of such code.

Approach 2: use restrict keyword

restrict is a type qualifier that applies to pointers, asserting that the memory to which this pointer points can only be accessed by this pointer or pointers derived from it (see restrict on cppreference for details). So just by adding restrict to the declaration of function parameter like this:

void do_things(bytebuf *restrict b) { // ...

you allow the compiler to assume that b->bytes and b->len won't be modified in the loop, and thus let it unroll the loop. Although restrict is not officially part of C++, it is available as an extension in Clang, GCC, MVC as __restrict. Unfortunately, fixing existing code with restrict can also be a lot of boring and demanding work, as you would need to carefully examine the code where you are adding restrict and how that code is used from other parts of the program, because misusing restrict will bring you very nasty bugs (see CERT entry on that for more info).

Approach 3: use char8_t instead of int8_t and char (C++ only)

C++20 introduced char8_t as a new fundamental type. It does not have the special aliasing properties of char, so replacing uint8_t with char8_t in the example code will make compilers produce properly optimized code. There is a catch however: char8_t is unsigned, and attempts to make it signed using signed char8_t will be rejected by compilers. So extra casts to int8_t will be needed once char8_t is retrieved from memory when you need to treat its value as signed. In C, Clang makes char8_t available as a fundamental type like in C++ when -fchar8_t command-line option is passed, but it does not work with GCC, and anyway, this feature is going to conflict with the upcoming C23 where char8_t will be added as a typedef to unsigned char in the <uchar.h> header.

Approach 4: use _BitInt(8) instead of char

The next C standard, C23 will include a new feature called "bit-precise integers". It allows us to have integer types with custom bit-widths through the syntax _BitInt(N) where N is a constant expression indicating the amount of bits this integer type has. It is signed by default but can be made unsigned with the unsigned keyword. So _BitInt(8) would be an 8-bit integer like char, except that since it's not a char, it is not allowed to alias other types, thus allowing for better code to be generated. Just changing types from uint8_t to _BitInt(8) in the code from the beginning of this article makes compiler generate vectorized code. Using this approach to fix existing codebase is simple and safe. Unfortunately, current GCC version 13 does not support _BitInt, Clang supports this feature in C and as an extension in C++ since version 14 released in 25 March 2022. So if you are stuck with older compiler versions, you are out of luck.

C++ considerations

unique_ptr<int8_t> and vector<int8_t> as well as their std::byte and char counterparts, also suffer from the char aliasing problem. That is, assignment to the element of vector<int8_t> makes compilers assume that (almost) all memory has been clobbered.

In other languages

Rust is pretty strict on aliasing: at most one mutable reference to a part of memory can exist. This is backed up by static checks in compiler that report as error any attempt to create more than one mutable reference to a value in memory (at least within the "Safe" Rust). Of course, Rustc's optimizer takes advantage of uniqueness of mutable references.
D language does not impose even cross-type aliasing restrictions, except in case of vector operations where involved slices must not overlap.
Zig allows all kinds of aliasing unless pointer is marked by noalias undocumented keyword. Aliasing rules in Zig are probably subject to change as the language is still in beta.

Summary

approach fitness safety protability
keeping data local works only when you can fit this data on stack safe maximum
restrict made for this, but breaks easily unsafe C99, in C++ as extension
char8_t can only be unsigned, two buffers are still alowed to alias each other safe since C++20, C++ only
_BitInt(8) two buffers are still alowed to alias each other safe C23, in C++ as extension. not yet in GCC

Closing message

Does the C or C++ codebase you have worked on have performance-sensitive code that writes through int8_t or char pointers or vectors somewhere in the middle?

Historical notes

  1. Thinking about it, it is weird that C/C++ code working on bytes in memory is inherently harder for compiler to optimize compared to other integer widths due to how languages are defined. For C it was so for around 34 years from the first standard C89 until C23.
  2. restrict was introduced in C99 and was at least partially motivated by performance advantage Fortran had over C: Fortran language standard forbids passing overlapping arrays to functions, which gave Fortran compilers more freedom in reordering and vectorizing code that operated on multiple arrays. Later GCC (since version 4.3 released in March 2008) and Clang (3.3 - June 2013) started to insert runtime checks for array overlap that would branch into vectorized code when it was safe to and to unvectorized code otherwise.
@andijcr
Copy link

andijcr commented Aug 24, 2023

tried an experiment here https://gcc.godbolt.org/z/TfsvE7YW5
and once you take bytebuf by copy (slice in my test), the aliasing difference goes away.
the main trigger for the unrolling seems to be ptr access instead of array access

@ohunter
Copy link

ohunter commented Aug 31, 2023

Great post, but one thing to note is that C++20's std::span seems to fix this for all char derivatives. At least based on my testing here: https://godbolt.org/z/ddaW1cs5q

@calc84maniac
Copy link

Great post, but one thing to note is that C++20's std::span seems to fix this for all char derivatives. At least based on my testing here: https://godbolt.org/z/ddaW1cs5q

That actually has nothing to do with std::span, but how the loop is implemented.

for (; i < b.bytes + b.len; i++) must assume that b.bytes or b.len may be aliased by the byte pointer, and thus cannot vectorize the loop.

Ranged-for loops evaluate begin() and end() once before the loop begins, creating local variables which cannot be aliased by the input span.

Modifying the bytebuf loop to evaluate b.bytes + b.len into a local before the loop, or passing the bytebuf by value instead of reference (which also guarantees it can't be aliased), generates code practically identical to the span functions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment