1BRC in Python

This is my attempt at the 1 Billion Rows Challenge in Python.

Original Challenge
My Code

Inspiration and Attribution

Some ideas for this came from a few sources, listed below:

While I could simply follow these and have a working, fast implementation, that would not help me learn. So, while I am taking ideas from these sources, I am not following them exactly. However, I expect that my implementation will be fairly similar by the time I am done.

Measurement

I’ll use hyperfine to benchmark my scripts: each benchmark performs one warm-up run, and the reported time is the mean of three measured runs.

Data Generation

I am using a create_measurements.py script that I wrote myself, since I did not want to install a Java environment on my system. You can find it in the repo, along with a bit of speed analysis I did on it.
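
The script in the repo is the source of truth; the snippet below is only a minimal sketch of the idea, with a placeholder station list, row count, and output path, writing one `station;temperature` line per row with a single decimal place.

```python
import random

# Minimal sketch of a measurements generator; the station list, row count,
# and output path are placeholders, not the values used in the repo script.
STATIONS = ["Hamburg", "Bulawayo", "Palembang", "St. John's", "Cracow"]

def create_measurements(path: str, rows: int) -> None:
    with open(path, "w") as f:
        for _ in range(rows):
            station = random.choice(STATIONS)
            # Challenge temperatures always have exactly one decimal place.
            temp = random.uniform(-99.9, 99.9)
            f.write(f"{station};{temp:.1f}\n")

if __name__ == "__main__":
    create_measurements("measurements.csv", rows=1_000_000)
```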

Goal / Target

I have created a baseline to work from. It is quite poor, averaging around 25 minutes on my machine. I ran Doug Mercer’s “doug booty v4” version on my machine, and it completed in around 2 minutes. My goal is to get into a similar range.

Some things to note:

  • My machine is significantly slower than others in the competition.
  • I do not have enough RAM to start a RAM disk like many other competitors.

These two factors mean my script runs much slower than the times Doug reports in his video.
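
For context, the baseline is the obvious single-threaded approach: read the file line by line, split on the separator, parse floats, and track per-station statistics. A minimal sketch along those lines (not my exact file) looks like this:

```python
import sys

# Naive single-threaded baseline sketch: read line by line, split on ';',
# parse floats, and track min/max/sum/count per station.
def main(path: str) -> None:
    stats = {}  # station -> [min, max, total, count]
    with open(path, "r") as f:
        for line in f:
            station, value = line.rstrip("\n").split(";")
            temp = float(value)
            entry = stats.get(station)
            if entry is None:
                stats[station] = [temp, temp, temp, 1]
            else:
                entry[0] = min(entry[0], temp)
                entry[1] = max(entry[1], temp)
                entry[2] += temp
                entry[3] += 1
    print(", ".join(
        f"{s}={mn:.1f}/{total / n:.1f}/{mx:.1f}"
        for s, (mn, mx, total, n) in sorted(stats.items())
    ))

if __name__ == "__main__":
    main(sys.argv[1])
```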

Diagnostic Tools

Some tools I used:

  • Multiple sizes of measurement files
  • hyperfine
  • time (GNU tool)
  • viztracer (Python profiler; a short usage sketch follows this list)
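
For example, viztracer can wrap a run in code and write a trace file for later inspection. This is just an illustrative pattern; the workload below is a stand-in for the real script.

```python
from viztracer import VizTracer

def work() -> None:
    # Stand-in workload; in practice this would be the script's entry point.
    sum(i * i for i in range(1_000_000))

# Trace one run and save it; the resulting trace.json can be opened with
# viztracer's viewer.
with VizTracer(output_file="trace.json"):
    work()
```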

Optimization Ideas

Some ideas for optimizing this code:

  • Parallel Processing
  • String parsing improvements
  • Using PyPy
  • Faster file reading
  • Using a flamegraph to identify bottlenecks

PyPy Performance

PyPy3 has been shown to improve performance in many projects, so I’ll run most of my tests using PyPy moving forward. Below are the differences in baseline time using PyPy3 versus CPython3.

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `pypy3 ./calculate_average_baseline.py ./measurements/1B_Measurements.csv` | 979.168 ± 3.948 | 976.766 | 983.724 | 1.00 |
| `python3 ./calculate_average_baseline.py ./measurements/1B_Measurements.csv` | 1733.957 ± 11.034 | 1721.578 | 1742.757 | 1.77 ± 0.01 |

File Reading

I initially tested different file read modes by creating three scripts that only read the file line by line without any processing.
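
Each script did nothing but loop over the lines; a sketch of the text-mode and binary-mode variants (simplified, names illustrative) is:

```python
import sys

# ca_r.py sketch: text mode, so every line is decoded to str.
def read_text(path: str) -> int:
    lines = 0
    with open(path, "r") as f:
        for _ in f:
            lines += 1
    return lines

# ca_rb.py sketch: binary mode, lines stay as bytes and skip decoding.
def read_binary(path: str) -> int:
    lines = 0
    with open(path, "rb") as f:
        for _ in f:
            lines += 1
    return lines

if __name__ == "__main__":
    print(read_text(sys.argv[1]), read_binary(sys.argv[1]))
```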

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `pypy3 ./ca_r.py ./measurements/100M_Measurements.csv` | 16.624 ± 0.565 | 15.996 | 17.091 | 1.88 ± 0.06 |
| `pypy3 ./ca_rb.py ./measurements/100M_Measurements.csv` | 8.841 ± 0.024 | 8.817 | 8.864 | 1.00 |
| `pypy3 ./ca_r+b.py ./measurements/100M_Measurements.csv` | 15.350 ± 0.422 | 14.982 | 15.810 | 1.74 ± 0.05 |

Binary mode (rb) gave a significant speedup, but the extra work needed to parse binary data could negate the gain, so I tested it with full script modifications.

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `pypy3 ./calculate_average_r.py ./measurements/100M_Measurements.csv` | 100.081 ± 0.838 | 99.287 | 100.958 | 1.46 ± 0.01 |
| `pypy3 ./calculate_average_rb.py ./measurements/100M_Measurements.csv` | 68.539 ± 0.283 | 68.349 | 68.864 | 1.00 |

For 1 billion rows:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `pypy3 ./calculate_average_rb.py ./measurements/1B_Measurements.csv` | 666.263 ± 5.518 | 660.911 | 671.934 | 1.00 |

Parallel Processing

Parallel processing should be the biggest improvement. I created a dispatcher function that dynamically chooses the number of worker processes unless a count is specified.
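
The dispatcher splits the file into one byte range per worker (aligned to line boundaries), processes each range in a separate process, and merges the partial results. Below is a simplified sketch of that idea; the function names and details are illustrative, not my exact code.

```python
import os
import sys
from multiprocessing import Pool

def find_chunks(path: str, workers: int) -> list[tuple[int, int]]:
    """Split the file into byte ranges, each ending on a newline boundary."""
    size = os.path.getsize(path)
    step = size // workers
    chunks, start = [], 0
    with open(path, "rb") as f:
        for _ in range(workers - 1):
            f.seek(start + step)
            f.readline()  # advance to the end of the current line
            end = f.tell()
            chunks.append((start, end))
            start = end
    chunks.append((start, size))
    return chunks

def process_chunk(args: tuple[str, int, int]) -> dict:
    """Aggregate min/max/sum/count per station for one byte range."""
    path, start, end = args
    stats = {}
    with open(path, "rb") as f:
        f.seek(start)
        for line in f.read(end - start).splitlines():
            station, value = line.split(b";")
            temp = float(value)
            entry = stats.get(station)
            if entry is None:
                stats[station] = [temp, temp, temp, 1]
            else:
                entry[0] = min(entry[0], temp)
                entry[1] = max(entry[1], temp)
                entry[2] += temp
                entry[3] += 1
    return stats

def merge(parts: list[dict]) -> dict:
    """Combine the per-chunk dictionaries into one."""
    merged = {}
    for part in parts:
        for station, (mn, mx, total, n) in part.items():
            entry = merged.get(station)
            if entry is None:
                merged[station] = [mn, mx, total, n]
            else:
                entry[0] = min(entry[0], mn)
                entry[1] = max(entry[1], mx)
                entry[2] += total
                entry[3] += n
    return merged

if __name__ == "__main__":
    path = sys.argv[1]
    workers = os.cpu_count() or 1  # dynamic default, like the dispatcher
    with Pool(workers) as pool:
        parts = pool.map(process_chunk,
                         [(path, s, e) for s, e in find_chunks(path, workers)])
    print(len(merge(parts)), "stations")
```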

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `pypy3 ./calculate_average_mt.py ./measurements/1B_Measurements.csv` | 246.554 ± 0.181 | 246.352 | 246.700 | 1.00 |

Using Memory Mapping (mmap)

Many competitors use mmap for file reading. My concern was that I only have 16 GB of RAM while the file is about 14 GB, but mmap maps pages in lazily rather than loading the whole file at once, so my system did not crash and performance improved.
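
A stripped-down version of the mmap-based read (illustrative only; my real script does the per-station aggregation inside this loop):

```python
import mmap
import sys

# Sketch: map the file read-only and iterate its lines; the OS pages data in
# lazily instead of loading all 14 GB into memory up front.
def scan(path: str) -> int:
    lines = 0
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for _ in iter(mm.readline, b""):
            lines += 1
    return lines

if __name__ == "__main__":
    print(scan(sys.argv[1]))
```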

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `pypy3 ./calculate_average_mmap.py ./measurements/1B_Measurements.csv` | 189.611 ± 0.293 | 189.359 | 189.932 | 1.00 |

String Parsing and Using Integers Instead of Floats

String parsing is expensive. Since the data format is uniform, I replaced split() with direct indexing using find(). Additionally, since every value has exactly one decimal place, I parse the temperatures as integers (tenths of a degree) instead of floats.

The code is less readable but much faster.
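
The parsing change boils down to something like the sketch below (it assumes bytes input with exactly one decimal digit, as in the challenge data; not my exact code):

```python
# Sketch: find() locates the separator, and "-12.3" becomes the integer -123
# (tenths of a degree) without ever constructing a float.
def parse_line(line: bytes) -> tuple[bytes, int]:
    sep = line.find(b";")
    station = line[:sep]
    value = line[sep + 1:]
    sign = -1 if value[:1] == b"-" else 1
    if sign == -1:
        value = value[1:]
    # Every temperature has exactly one digit after the dot.
    whole, _, frac = value.partition(b".")
    return station, sign * (int(whole) * 10 + int(frac))

assert parse_line(b"Hamburg;12.3") == (b"Hamburg", 123)
assert parse_line(b"Cracow;-0.5") == (b"Cracow", -5)
```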

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `pypy3 ./calculate_average_int.py ./measurements/1B_Measurements.csv` | 144.804 ± 0.524 | 144.327 | 145.365 | 1.00 |

Final Results

With all of these optimizations, I reduced execution time from roughly 1733 seconds to 144 seconds, about a 12x speedup. This meets my goal and represents a massive improvement! 🚀