-
-
Save bruno-/660c5bdfcaa310467c5f88fc0b24f66c to your computer and use it in GitHub Desktop.
require "async" | |
CONCURRENCY = 1000 | |
ITERATIONS = 100 | |
def work | |
Async do |task| | |
CONCURRENCY.times do | |
task.async do | |
sleep 1 | |
end | |
end | |
end | |
end | |
def duration | |
start = Process.clock_gettime(Process::CLOCK_MONOTONIC) | |
work | |
Process.clock_gettime(Process::CLOCK_MONOTONIC) - start | |
end | |
def average | |
ITERATIONS.times.sum { | |
duration | |
}.fdiv(ITERATIONS) | |
end | |
puts average # => 1.01772911996115 |
CONCURRENCY = 1000 | |
ITERATIONS = 100 | |
def work | |
CONCURRENCY.times.map { | |
Thread.new do | |
sleep 1 | |
end | |
}.each(&:join) | |
end | |
def duration | |
start = Process.clock_gettime(Process::CLOCK_MONOTONIC) | |
work | |
Process.clock_gettime(Process::CLOCK_MONOTONIC) - start | |
end | |
def average | |
ITERATIONS.times.sum { | |
duration | |
}.fdiv(ITERATIONS) | |
end | |
puts average # => 1.045861059986055 |
bruno-
commented
Nov 5, 2021
- 1k threads: 45ms overhead
- 1k async tasks: 17ms overhead
Thread versus fiber startup time
One thought I had was that this is measuring time to create a fiber versus create a thread. I tried to make a thread pool and issue a "go" command, but it proved difficult. I could use a queue and block on it, but it seemed less than ideal.
Changing methods I decided to measure the time to spawn 1000 of each without the sleep to see if there as a significant difference:
## Time to make 1000 fibers
0.0062679999973624945
0.008174999966286123
0.005118000030051917
0.005705999967176467
0.005170000018551946
# => 0.006087399995885789 average
## Time to make 1000 threads
0.016810000000987202
0.008840999973472208
0.0077760000131092966
0.010698999976739287
0.007998000015504658
# => 0.01042479999596253 average
Takes about 1.7x longer to boot a thread than a fiber in this run. Normalizing the tasks to not include that startup time it would make threads a 35ms overhead and async a 11ms overhead which is the same original ballpark. So it doesn't look like boot time is the key factor.
Thread tuning
If you double the number of threads doing the job the overhead roughly doubles 1.1083407999831252
for threads and but only 1.0219341999851168
for fibers. I think what we're seeing is thread contention at play. We could likely get just as much done faster if we reduced the thread count. I'm not sure how to optimize that though. It would also require a different shaped job such as popping an item off of a queue.
When I reduce the workload to a single iteration, the thread is showing an iteration time of 1.003
with a time to create that thread at 6.319999229162932e-05
. I'm not sure what to say about this or what conclusions to draw.
How much faster?
I'm wondering what conclusions we could make about these numbers? At the end of the day, I'm curious about bottlenecks. These numbers have a large delta here, but how do they compare to real-world workloads and how do we think about it?
We could say that both workloads take 1 second so therefor a fiber is ~3x faster than a thread. However, the overall workload is not 3x faster. It's 1.45 / 1.17 # => 1.23x faster. Presumably, the delta holds steady regardless of the sleep so the percentage would change based on core work time.
If our workload was an API request that takes 0.2 seconds then it would be 0.65 / 0.27 # => 2.4x faster for a net gain of 18ms for switching to fibers which seems significant. However, it's 0.65 overhead divided across 1000 requests it's not that every request is 45ms longer right? The average delay per request is 0.045 / 1000 # => 0.000045 for threads and 0.017 / 1000 #=> 0.000017 for fibers meaning that the delta would be (0.2 + 0.000045) / (0.2 + 0.000017) # => 1.0001x faster per request which seems far less significant.
At the end of the day async is faster and the smaller your workload, the more significant the relative speedup would be. I'm not sure if that's the way to think about this comparison or not (via division etc.), but that seems kinda reasonable. I'm not sure if there's other prior art on how to think about/measure or compare overall expected impact to overall workload.
Thank you for participating in the discussion.
If you double the number of threads doing the job the overhead roughly doubles... I think what we're seeing is thread contention at play.
This makes sense.
However, the overall workload is not 3x faster. It's 1.45 / 1.17 # => 1.24x faster.
Hm, it's 45ms vs 17ms, so I think the math here should be 1.045 / 1.017 # => 1.027x or 2.7% faster.
If our workload was an API request that takes 0.2 seconds then it would be 0.65 / 0.27 # => 2.4x faster for a net gain of 18ms for switching to fibers which seems significant.
Same here, I think the math should be 0.245 / 0.217 #=> 1.129x or 12.9% faster.
The average delay per request is 0.045 / 1000 # => 0.000045 for threads and 0.017 / 1000 #=> 0.000017 for fibers meaning that the delta would be (0.2 + 0.000045) / (0.2 + 0.000017) # => 1.0001x faster per request which seems far less significant.
I'm not sure if this is the right way to look at it. I can't prove your math wrong, but the way I'm looking at it is: for 1k threads, some requests that take 0.2s will return in exactly 0.2s (thread scheduler schedules them first, no overhead), while some will return in 0.245s (thread scheduler schedules them last, 45ms overhead). My conclusion is that the average request duration for threads is 0.2225s or about 11% overhead on average.
For fibers, the average request duration would be 0.2085s or 4.2% average overhead per request (with 1000 concurrent requests).
My take is that threads have 2-3x more overhead than fibers, but it's still just 11% (threads) vs 4% (fibers) per-request average overhead. Frankly, I expected this difference to be bigger.
I'm not sure if there's other prior art on how to think about/measure or compare overall expected impact to overall workload.
Noah Gibbs did an interesting threads vs fibers comparison:
Unfortunately, links to his code examples on github return 404.
You should test real world case, like one event loop creating lots of fibers, rather than creating lots of event loops.
@bruno- thanks for catching my math mistakes 🙊 .
I'm not sure if this is the right way to look at it. I can't prove your math wrong,
In general, I'm nervous when there's a variable that I can arbitrarily change to get different (comparative) results. It makes me worried that I've gamed my own microbenchmark.
Not sure if you've seen but I've done a bunch of work in the space of trying to ensure a benchmark result are actually valid
- Take a look at comparative histograms and statistical significance checks https://github.com/zombocom/derailed_benchmarks#i-made-a-patch-to-to-rails-how-can-i-tell-if-it-made-my-rails-app-faster-and-test-for-statistical-significance
- Not specifically performance but about tuning a system that varies when number inputs are changed https://www.schneems.com/2020/07/08/a-fast-car-needs-good-brakes-how-we-added-client-rate-throttling-to-the-platform-api-gem/
My conclusion is that the average request duration for threads is 0.2225s or about 11% overhead on average.
That's the average across 200 runs. Each run does 1000 operations which is what I was dividing by (if that makes sense).
I've seen the noah article ages ago, but forgotten about it (thanks for the reminder). The code is here https://github.com/noahgibbs/fiber_basic_benchmarks/tree/master/benchmarks that file got renamed it looks like.