📂 Minimize allocations in Go
Working with protobuf in Go puts significant load on the memory subsystem, as protobuf-generated structures often contain a significant amount of pointers.
One way to minimize the number of allocations is to allocate all the fields at the same time, and then use internal pointers to wire them up.
Let's pretend you have (the generated Go code for) four messages A, B, C, and D (note: for clarity this is not really protobuf-generated code, but the technique genralizes trivially):
type A struct {
b *B
c *C
d *D
}
type B struct {
int
}
type C struct {
string
}
type D struct {
c *C
}
Normally, assuming you need to populate all of them, you would do something like:
a := &A{
b: &B{42},
c: &C{"hello"},
d: &D{
c: &C{"world"},
},
}
This is very idiomatic and compact, but has the downside of requiring 5 allocations just for the message structure.
Can we do better? Ideally we would like to perform a single allocation, instead of having to allocate every single field. Luckily, yes, we can.
To do this we can build an auxiliary struct
that will provide the storage for all the fields
and then wire up the pointers appropriately. To allocate an appropriate storage structure, we
can do e.g.
s := &struct {
A
_b B
_c C
_d D
_d_c C
}{}
Then, to wire up all the internal pointers (in order to maintain compatibility with callers that are unaware of the change, like the protobuf ecosystem) we can do:
// none of these allocate
a := &s.A
a.b = &s._b
a.c = &s._c
a.d = &s._d
a.d.c = &s._d_c
At this point we can initialize the values as required:
*a.b = B{42}
*a.c = C{"hello"}
*a.d.c = C{"world"}
Note that a
and *a
are, repsectively, an *A
and A
indistinguishable
from any other *A
and A
allocated in the regular way.
This playground link shows the complete example, and includes a test that shows
that a single allocations is being performed: https://play.golang.org/p/R8DBVe7jaeX.
Note that all the scaffolding for allocating the full structure has been moved to
a separate function getA
that returns a plain *A
.
There are also some possible variations on the generic approach above, that can help in specific cases, e.g.:
- only pack a subset of fields and manually allocate the others when needed
- reorder the fields for better locality
- reserve storage in each field instead of in the top level (e.g. create a second
packing structure to replace
D
, containing both*C
and the storage forC
, and include this structure in the top level storage structure).
For the record, I also suggested to implement this (as an option) in gogoproto: gogo/protobuf#555.
- Better memory locality (as all fields are contiguous to each other)
- Better cache utilization (this depends on your access patterns, it is possible that in certain cases cache utilization will be worse)
- Lower number of memory allocations, and therefore lower CPU usage
- The mechanism is generic, and can be used not just in case of protobuf messages
Are there any downsides to this approach, or it can be safely used in all cases? Glad you asked!
There are some downsides (suggestions welcome!):
- You need to decide at compile-time which fields to allocate; if then at runtime you don't need all of them, you will be wasting the memory for the fields you are not using.
- Related to the previous point: because all fields are allocated in a single object, as long as any field is alive all the other fields can not be garbage collected.
- This approach only reduces the number of allocations, not the amount of memory allocated.
None of this is a deal-breaker, but depending on your workload and use cases it may cause memory consumption to increase.
If you create closures on a hot path and the closure captures more than one variable from its parent environment, each captured variable escapes to the heap and therefore will need to be heap-allocated with a dedicated allocation. This means that your closure with N captures will normally require N+1 allocations.
func foo(a *bar, b int) func() {
// This requires 3 allocations: 1 for the closure itself, and 2 for the
// captured variables a and b.
go func() {
doSomething(a, b)
}()
}
You can cut this down to 2 allocations by grouping the captured variables in a temporary struct, and capturing the struct instead:
func foo(a *bar, b int) func() {
// This requires 2 allocations: 1 for the closure itself, and 1 for the
// captured variable x (regardless of how many fields the structure contains).
x := struct{a *bar; b int}{a, b}
go func() {
doSomething(x.a, x.b)
}()
}