@ZhennanWu · Last active November 26, 2024
Rayon vs async threadpool benchmarks

The following are the benchmark results for the parallel speedup of rayon, tokio, and async-std on a 16-thread machine (Core i7-12700K, big cores only; Gentoo Linux).

How to interpret the results

  1. The input variable is the "available parallelism", i.e. how many parallel work units are spawned at the same time.
  2. The measured time runs from the start of work-unit spawning to the end of the last work unit (see the sketch after this list).
  3. Before and after each iteration, the threadpool is always empty.
  4. Hence the low-parallelism results should be interpreted as "the runtime latency of a burst workload".
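
A minimal sketch of this measurement scheme, for the rayon case (the `work_unit` stub and timing harness here are illustrative, not the actual benchmark code):

```rust
use std::time::Instant;

// Illustrative stand-in for the memory-read kernels described below.
fn work_unit(seed: u64) -> u64 {
    seed
}

// Measure one burst: the clock starts before the first spawn and stops
// only after the last work unit has finished. The pool is idle on both sides.
fn measure_burst(n: usize) -> std::time::Duration {
    let start = Instant::now();
    rayon::scope(|s| {
        for i in 0..n {
            s.spawn(move |_| {
                std::hint::black_box(work_unit(i as u64));
            });
        }
    }); // rayon::scope returns only once every spawned unit has completed
    start.elapsed()
}
```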

Work unit types

  1. "light": 10 random uncached memory read.
  2. "medium": 30 random uncached memory read.
  3. "heavy": 100 random uncached memory read.
  4. "xheavy": 1000 random uncached memory read. (Unless you are doing MPI-style stuff, I don't believe anyone can have workload this heavy)

If your workload is computation-heavy rather than memory-heavy, a good estimate is 1 random uncached memory read ~ 100 FLOPs on modern desktop CPUs.
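
As a rough illustration of what one such work unit might look like (the buffer size, LCG constants, and index mixing below are my assumptions, not the gist's actual code):

```rust
// Sketch of a "random uncached memory read" kernel. `buf` must be far
// larger than the last-level cache so each indexed read misses cache;
// `reads` is 10 / 30 / 100 / 1000 for light / medium / heavy / xheavy.
fn work_unit(buf: &[u64], mut state: u64, reads: usize) -> u64 {
    let mask = (buf.len() - 1) as u64; // assumes buf.len() is a power of two
    let mut acc = 0u64;
    for _ in 0..reads {
        // LCG step: two integer instructions (multiply + add)
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // take the upper bits of the state as the next index
        acc = acc.wrapping_add(buf[((state >> 32) & mask) as usize]);
    }
    acc // returned so the reads can't be optimized away
}
```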

Test Variants

  1. spawn vs spawn_reuse: spawn_reuse always spawns one fewer task; instead, it directly awaits the last work unit on the current task.
  2. rayon_join is a hand-rolled implementation for 2, 3, and 4 parallel work units (sketched below).
  3. std_thread_scope's performance is so abysmal for the light and medium loads that it is removed from those charts due to axis-range problems. (Yes, it blows the scale on a logarithmic chart.)
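
The hand-rolled rayon_join variants presumably nest rayon::join, roughly like this (the function names here are mine):

```rust
// Fixed-arity joins built from rayon::join.
fn join2(a: impl FnOnce() + Send, b: impl FnOnce() + Send) {
    rayon::join(a, b);
}

fn join3(a: impl FnOnce() + Send, b: impl FnOnce() + Send, c: impl FnOnce() + Send) {
    rayon::join(a, || {
        rayon::join(b, c);
    });
}

fn join4(
    a: impl FnOnce() + Send,
    b: impl FnOnce() + Send,
    c: impl FnOnce() + Send,
    d: impl FnOnce() + Send,
) {
    // balanced tree: two joins of two units each
    rayon::join(
        || {
            rayon::join(a, b);
        },
        || {
            rayon::join(c, d);
        },
    );
}
```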

Problems already taken care of, in case you suspect them

  1. No threadpool spin-up inside the benchmark iterations.
  2. Criterion's async feature is used to avoid the constant runtime-entry overhead (rayon::install, tokio::block_on, etc.); a sketch follows this list.
  3. No leftover hot cache from a previous test, because each test uses a different random seed.
  4. How random is the pseudorandom index? Well, at least 16M unique random indices before looping back.
  5. What about random-generation overhead? Two integer instructions per index.
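
For point 2, the setup presumably looks something like the following sketch, assuming criterion's async_tokio feature (the benchmark body is elided):

```rust
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_tokio(c: &mut Criterion) {
    // The runtime is created once, outside the timed region...
    let rt = tokio::runtime::Runtime::new().unwrap();
    c.bench_function("tokio_spawn_burst", |b| {
        // ...and criterion's async support drives each iteration on it,
        // keeping the constant runtime-entry overhead (tokio::block_on)
        // out of the per-iteration measurement.
        b.to_async(&rt).iter(|| async {
            // spawn the parallel work units and await them here
        });
    });
}

criterion_group!(benches, bench_tokio);
criterion_main!(benches);
```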
@ZhennanWu (Author):

[Four benchmark line charts; images not preserved in this text.]
tmillr commented Mar 14, 2024

Is std_thread_scope the thread scope from the Rust std lib? Or is it something from the async-std crate?

@ZhennanWu (Author) replied:

> Is std_thread_scope the thread scope from the Rust std lib?

Yes, it is from the std lib.

tmillr commented Mar 16, 2024

> Is std_thread_scope the thread scope from the Rust std lib?
>
> Yes, it is from the std lib.

Thanks. Hmmm I wonder why it would have worse performance, especially since thread creation is not included in the benchmark/results?

@ZhennanWu (Author) replied:

> why it would have worse performance, especially since thread creation is not included in the benchmark/results?

I only excluded constant-time costs like the threadpool creation cost. std::thread::scope is not a threadpool, and its thread-creation cost is not constant: feeding new work into it always means creating new threads. So std::thread::scope is expected to perform worse than any decent threadpool in this scenario.
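
For comparison, the scoped variant necessarily looks something like this sketch (not the benchmark's exact code): every burst spawns n fresh OS threads.

```rust
use std::thread;

// Each burst creates `n` brand-new OS threads; nothing is reused between
// bursts, so thread-creation cost is paid on every iteration and grows
// with the number of work units.
fn burst_with_scope(n: usize) {
    thread::scope(|s| {
        for i in 0..n {
            s.spawn(move || {
                std::hint::black_box(i); // stand-in for the real work unit
            });
        }
    }); // all spawned threads are joined before scope returns
}
```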
