The following are benchmark results for the parallel speedup of rayon, tokio, and async-std on a 16-thread machine (Core i7-12700K, P-cores only, Gentoo Linux).
- The input variable is the "available parallelism", i.e. how many parallel work units are spawned at the same time.
- The measured time runs from the start of work-unit spawning to the end of the last work unit.
- Before and after each iteration, the threadpool is always empty.
- Hence, the low-parallelism results should be interpreted as "the runtime latency of a burst workload".
- "light": 10 random uncached memory reads.
- "medium": 30 random uncached memory reads.
- "heavy": 100 random uncached memory reads.
- "xheavy": 1000 random uncached memory reads. (Unless you are doing MPI-style work, I don't believe anyone has a workload this heavy.)
If your workload is computation-heavy rather than memory-heavy, a good estimate is that 1 random uncached memory read ≈ 100 FLOPs on modern desktop CPUs.
- `spawn` vs `spawn_reuse`: `spawn_reuse` always spawns one less task; instead, it directly awaits the last work unit on the current task.
- `rayon_join` is a hand-rolled implementation for 2, 3, and 4 parallel work units.
- `std_thread_scope`'s performance is so abysmal for light and medium loads that it is removed from those charts due to axis-range problems. (Yes, it blows the scale on a logarithmic chart.)
- No threadpool spin-up inside a benchmark iteration.
- Used `criterion`'s async feature to avoid the constant runtime-entry overhead (`rayon::install`, `tokio::block_on`, etc.).
- No leftover hot cache from the previous test, because we use a different random seed for each test.
- How random is the pseudorandom index? Well, there are at least 16M unique random indices before it loops back.
- What about the random-number generation overhead? Two integer instructions.