Solution to Erlang/Elixir String Performance: Can this be improved?
$ curl http://teralink.net/\~serpent/words.txt > words.txt
$ md5 words.txt
MD5 (words.txt) = 640ef9e082ef75ef07f0e9869e9d8ae2
$ du -h words.txt
217M words.txt
$ elixir ./count.ex ./words.txt --sort 1> out-sorted.txt
Started
Using 4 Workers
Processing: 4.474982s
Load Results: 0.717677s
Order Results: 1.109028s
Print Results: 2.277602s
Total Runtime: 8.583487s
$ elixir ./count.ex ./words.txt 1> out.txt
Started
Using 4 Workers
Processing: 4.586417s
Load Results: 0.776269s
Print Results: 2.488612s
Total Runtime: 7.855561s
$ elixir ./count-stream.ex < words.txt 1> out.txt
Started
Processing: 4.148846s
Reporting: 0.749119s
Total Runtime: 4.909122s
$ elixir ./count-stream.ex --sort < words.txt 1> out-sorted.txt
Started
Processing: 4.271136s
Reporting: 1.856914s
Total Runtime: 6.137186s
This solution uses multiple workers to perform reads against the same file. It does so by first using File.stat/1 to obtain the size of the file, then generating range tuples (start position and bytes to be read) that can be farmed out to each worker. Workers are managed using Task.async_stream/3 with no timeout. Furthermore, the concurrency level is set at 50% of all schedulers, which on an HT-enabled machine would, by default, give one worker per real CPU core.
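A minimal sketch of that range generation and worker pool, assuming a hypothetical process_chunk/2 that reads and counts one range (all names here are illustrative, not the actual implementation):

```elixir
defmodule Count do
  def run(path) do
    # 50% of schedulers: one worker per real core on an HT machine.
    workers = max(div(System.schedulers_online(), 2), 1)

    %File.Stat{size: size} = File.stat!(path)
    chunk = div(size, workers) + 1

    # {start_position, bytes_to_read} tuples, one per worker.
    ranges =
      for i <- 0..(workers - 1) do
        offset = i * chunk
        {offset, max(min(chunk, size - offset), 0)}
      end

    ranges
    |> Task.async_stream(&process_chunk(path, &1),
      max_concurrency: workers,
      timeout: :infinity
    )
    |> Enum.to_list()
  end

  # Placeholder; the raw-mode read is sketched further below.
  defp process_chunk(_path, {_offset, _bytes}), do: :ok
end
```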
Instead of using a Stream, which essentially processes one character at a time, this solution uses raw file handles in the workers, so the level of concurrency is greater than that of a Stream-based solution.
As per the Erlang/OTP documentation on raw mode, this mode is faster because no intermediary Erlang process is needed to handle the file. However, this means that either file:read/2 or IO.binread/2 must be used to read from the file.
Reading from the file is done in chunks, with the file itself opened with a read-ahead value. This allows the runtime system to preload data that the code will read later, and avoids spooling too much data at once within the functional layer of the code.
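A sketch of that raw-mode chunked read; the attribute values here are illustrative, not the ones actually used:

```elixir
defmodule Count.Worker do
  @bytes_read_ahead 128 * 1024   # illustrative values
  @bytes_read_chunk 64 * 1024

  # :raw skips the intermediary Erlang process; {:read_ahead, n} lets
  # the runtime preload data that later :file.read/2 calls will consume.
  def read_range(path, offset, bytes) do
    {:ok, fd} =
      :file.open(path, [:raw, :binary, :read, {:read_ahead, @bytes_read_ahead}])

    {:ok, _} = :file.position(fd, offset)
    loop(fd, bytes, [])
  end

  defp loop(_fd, remaining, acc) when remaining <= 0, do: Enum.reverse(acc)

  defp loop(fd, remaining, acc) do
    case :file.read(fd, min(@bytes_read_chunk, remaining)) do
      {:ok, data} -> loop(fd, remaining - byte_size(data), [data | acc])
      :eof -> Enum.reverse(acc)
    end
  end
end
```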
The problem at hand requires counting words, which essentially requires the program to walk the input stream. In this solution, we use a simple assumption: any character which is not a newline or a space must be part of a word, and hence encountering either means that we have hit a word boundary. This allows speedy accumulation of in-flight words, one character at a time. Words are accumulated backwards until they need to be reported, at which point the charlist is reversed, and the counter in ETS is increased.
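Under that assumption, the walk might look like the following fragment (names are illustrative; it presumes an ETS table created with :ets.new(:counts, [:named_table, :public])):

```elixir
defmodule Count.Walk do
  # Anything that is not a space or newline is part of the in-flight
  # word, which is accumulated backwards one character at a time.
  def walk(<<c, rest::binary>>, word) when c in [?\s, ?\n] do
    report(word)
    walk(rest, [])
  end

  def walk(<<c, rest::binary>>, word), do: walk(rest, [c | word])
  def walk(<<>>, word), do: report(word)

  defp report([]), do: :ok

  # Reverse the charlist only when the word is reported, then bump its
  # ETS counter, inserting a fresh {key, 0} row on first sight.
  defp report(word) do
    key = word |> Enum.reverse() |> List.to_string()
    :ets.update_counter(:counts, key, 1, {key, 0})
  end
end
```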
When reading files in multiple chunks, it is possible that words will have been split between two chunks, so there is support for capturing the residual prefix and suffix of each chunk. The prefix and suffix are not immediately reported; instead, the top-level function glues them together and reports them to ETS after the workers have finished their jobs.
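A sketch of that stitching step, assuming each worker returns a {prefix, suffix} tuple of binaries and report_word stands in for the ETS report shown above:

```elixir
defmodule Count.Stitch do
  # residuals :: [{prefix, suffix}] per chunk, in file order. Gluing
  # suffix(i) <> prefix(i + 1) rebuilds each word cut at a chunk
  # boundary; the "" padding also flushes the very first prefix and the
  # very last suffix, which were never split at all.
  def stitch(residuals, report_word) do
    ([""] ++ Enum.flat_map(residuals, fn {p, s} -> [p, s] end) ++ [""])
    |> Enum.chunk_every(2)
    |> Enum.each(fn [suffix, prefix] ->
      case suffix <> prefix do
        "" -> :ok
        word -> report_word.(word)
      end
    end)
  end
end
```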
Observations:

- The module attributes @bytes_read_ahead and @bytes_read_chunk can be tweaked.
- Changing the shared ETS table to one ETS table per worker, merged at the end, makes the code more complicated and does not improve performance.
This is a new version created to address Johanna’s comments. The fundamental differences are that:

- It uses IO.binread/2 on STDIN and emits one chunk every @bytes_read_chunk bytes (10 MB by default).
- It uses multiple workers to split and match words concurrently.
- It reuses the prefix/suffix logic to stitch together words at chunk boundaries and mop them up at the top level.
- It uses a compiled pattern (via :binary.compile_pattern/1) to split binaries, which is then shared among the workers; see the sketch after this list.
- It reports the results with a single call to IO.puts/1.
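As referenced above, a minimal sketch of the shared compiled pattern and the per-chunk split (count_words is an illustrative name, not the actual function):

```elixir
# Compiled once at the top level and closed over by every worker.
pattern = :binary.compile_pattern([" ", "\n"])

count_words = fn chunk ->
  chunk
  |> :binary.split(pattern, [:global, :trim_all])
  |> Enum.frequencies()
end
```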
Additionally, instead of splitting the data into a known number of chunks, each with an unknown size, this version splits the data into an unknown number of chunks, each with a known size. This offers several benefits:

- We can stick with a single Stream which carries on producing more chunks.
- We can leverage Task.async_stream/3 to manage the processing of each chunk.
- Many binary-splitting functions provided by the binary module can be used safely (some were slow when given a lot of data).

This greatly simplifies the supervision logic.
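Put together, the chunked pipeline could look roughly like the following (count_words is the splitter sketched above; the prefix/suffix stitching is omitted here for brevity):

```elixir
chunk_bytes = 10 * 1024 * 1024   # mirrors @bytes_read_chunk

Stream.repeatedly(fn -> IO.binread(:stdio, chunk_bytes) end)
|> Stream.take_while(&is_binary/1)   # stop at :eof or {:error, _}
|> Task.async_stream(count_words, timeout: :infinity)
|> Enum.reduce(%{}, fn {:ok, counts}, acc ->
  Map.merge(acc, counts, fn _word, a, b -> a + b end)
end)
```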
Within this version there are two flags:

- --unicode, which toggles between IO.stream/2 and IO.binstream/2.
- --sort, which toggles sorting (ordered by count, descending).
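A sketch of how those two flags might be parsed with OptionParser (the variable names are illustrative):

```elixir
{opts, _args, _invalid} =
  OptionParser.parse(System.argv(),
    strict: [unicode: :boolean, sort: :boolean]
  )

unicode? = Keyword.get(opts, :unicode, false)
sort? = Keyword.get(opts, :sort, false)
```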