@ZhennanWu · Last active November 26, 2024
Rayon vs async threadpool benchmarks

The following are the benchmark results for the parallel speedup of rayon, tokio, and async-std on a 16-thread machine (Core i7-12700K, big cores only; Gentoo Linux).

How to interpret the results

  1. The input variable is the "available parallelism", i.e. how many parallel work units are spawned at the same time.
  2. The measured time runs from the start of work-unit spawning to the end of the last work unit (see the sketch after this list).
  3. Before and after each iteration, the threadpool is always empty.
  4. Hence the low-parallelism results should be interpreted as "the runtime latency of a burst workload".
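
A minimal sketch of this measurement scheme, for the rayon case (the `work_unit` stub and timing harness here are illustrative, not the actual benchmark code):

```rust
use std::time::Instant;

// Illustrative stand-in for the memory-read kernels described below.
fn work_unit(seed: u64) -> u64 {
    seed
}

// Measure one burst: the clock starts before the first spawn and stops
// only after the last work unit has finished. The pool is idle on both sides.
fn measure_burst(n: usize) -> std::time::Duration {
    let start = Instant::now();
    rayon::scope(|s| {
        for i in 0..n {
            s.spawn(move |_| {
                std::hint::black_box(work_unit(i as u64));
            });
        }
    }); // rayon::scope returns only once every spawned unit has completed
    start.elapsed()
}
```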

Work unit types

  1. "light": 10 random uncached memory read.
  2. "medium": 30 random uncached memory read.
  3. "heavy": 100 random uncached memory read.
  4. "xheavy": 1000 random uncached memory read. (Unless you are doing MPI-style stuff, I don't believe anyone can have workload this heavy)

If your workload is computation-heavy rather than memory-heavy, a good estimate is 1 random uncached memory read ~ 100 FLOPs on modern desktop CPUs.
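
As a rough illustration of what one such work unit might look like (the buffer size, LCG constants, and index mixing below are my assumptions, not the gist's actual code):

```rust
// Sketch of a "random uncached memory read" kernel. `buf` must be far
// larger than the last-level cache so each indexed read misses cache;
// `reads` is 10 / 30 / 100 / 1000 for light / medium / heavy / xheavy.
fn work_unit(buf: &[u64], mut state: u64, reads: usize) -> u64 {
    let mask = (buf.len() - 1) as u64; // assumes buf.len() is a power of two
    let mut acc = 0u64;
    for _ in 0..reads {
        // LCG step: two integer instructions (multiply + add)
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // take the upper bits of the state as the next index
        acc = acc.wrapping_add(buf[((state >> 32) & mask) as usize]);
    }
    acc // returned so the reads can't be optimized away
}
```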

Test Variants

  1. spawn vs spawn_reuse: spawn_reuse always spawns one fewer task; instead, it directly awaits the last work unit on the current task.
  2. rayon_join is a hand-rolled implementation for 2, 3, and 4 parallel work units (sketched below).
  3. std_thread_scope's performance is so abysmal for the light and medium loads that it is removed from those charts due to axis-range problems. (Yes, it blows the scale on a logarithmic chart.)
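
The hand-rolled rayon_join variants presumably nest rayon::join, roughly like this (the function names here are mine):

```rust
// Fixed-arity joins built from rayon::join.
fn join2(a: impl FnOnce() + Send, b: impl FnOnce() + Send) {
    rayon::join(a, b);
}

fn join3(a: impl FnOnce() + Send, b: impl FnOnce() + Send, c: impl FnOnce() + Send) {
    rayon::join(a, || {
        rayon::join(b, c);
    });
}

fn join4(
    a: impl FnOnce() + Send,
    b: impl FnOnce() + Send,
    c: impl FnOnce() + Send,
    d: impl FnOnce() + Send,
) {
    // balanced tree: two joins of two units each
    rayon::join(
        || {
            rayon::join(a, b);
        },
        || {
            rayon::join(c, d);
        },
    );
}
```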

Problems already taken care of, in case you suspect them

  1. No threadpool spin-up inside the benchmark iterations.
  2. Criterion's async feature is used to avoid the constant runtime-entry overhead (rayon::install, tokio::block_on, etc.); a sketch follows this list.
  3. No leftover hot cache from a previous test, because each test uses a different random seed.
  4. How random is the pseudorandom index? Well, at least 16M unique random indices before looping back.
  5. What about random-generation overhead? Two integer instructions per index.
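
For point 2, the setup presumably looks something like the following sketch, assuming criterion's async_tokio feature (the benchmark body is elided):

```rust
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_tokio(c: &mut Criterion) {
    // The runtime is created once, outside the timed region...
    let rt = tokio::runtime::Runtime::new().unwrap();
    c.bench_function("tokio_spawn_burst", |b| {
        // ...and criterion's async support drives each iteration on it,
        // keeping the constant runtime-entry overhead (tokio::block_on)
        // out of the per-iteration measurement.
        b.to_async(&rt).iter(|| async {
            // spawn the parallel work units and await them here
        });
    });
}

criterion_group!(benches, bench_tokio);
criterion_main!(benches);
```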
@ZhennanWu (Author):

[Four benchmark line charts; images not preserved in this text.]
tmillr commented Mar 14, 2024

Is std_thread_scope the thread scope from the Rust std lib? Or is it something from the async-std crate?

@ZhennanWu (Author) replied:

> Is std_thread_scope the thread scope from the Rust std lib?

Yes, it is from the std lib.

tmillr commented Mar 16, 2024

> Is std_thread_scope the thread scope from the Rust std lib?
>
> Yes, it is from the std lib.

Thanks. Hmmm I wonder why it would have worse performance, especially since thread creation is not included in the benchmark/results?

@ZhennanWu (Author) replied:

> why it would have worse performance, especially since thread creation is not included in the benchmark/results?

I only excluded constant-time costs like the threadpool creation cost. std::thread::scope is not a threadpool, and its thread-creation cost is not constant: feeding new work into it always means creating new threads. So std::thread::scope is expected to perform worse than any decent threadpool in this scenario.
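
For comparison, the scoped variant necessarily looks something like this sketch (not the benchmark's exact code): every burst spawns n fresh OS threads.

```rust
use std::thread;

// Each burst creates `n` brand-new OS threads; nothing is reused between
// bursts, so thread-creation cost is paid on every iteration and grows
// with the number of work units.
fn burst_with_scope(n: usize) {
    thread::scope(|s| {
        for i in 0..n {
            s.spawn(move || {
                std::hint::black_box(i); // stand-in for the real work unit
            });
        }
    }); // all spawned threads are joined before scope returns
}
```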
