The One Billion Row Challenge (1BRC) is summarized as follows:
Write a Go program for retrieving temperature measurement values from a text file and calculating the min, mean, and max temperature per weather station. There’s just one caveat: the file has 1,000,000,000 rows!
Here is an example of what the text file looks like:
```
Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John's;15.2
Cracow;12.6
...
```
- This comes out to about 13 GB of data
In the article, the author describes 9 iterations that optimize the Go program from 1m45s to just 4 seconds.
Iteration 1: Simple Go program
```go
func r1(inputPath string, output io.Writer) error {
	type stats struct {
		min, max, sum float64
		count         int64
	}

	f, err := os.Open(inputPath)
	if err != nil {
		return err
	}
	defer f.Close()

	stationStats := make(map[string]stats)

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		station, tempStr, hasSemi := strings.Cut(line, ";")
		if !hasSemi {
			continue
		}
		temp, err := strconv.ParseFloat(tempStr, 64)
		if err != nil {
			return err
		}
		s, ok := stationStats[station]
		if !ok {
			s.min = temp
			s.max = temp
			s.sum = temp
			s.count = 1
		} else {
			s.min = min(s.min, temp)
			s.max = max(s.max, temp)
			s.sum += temp
			s.count++
		}
		stationStats[station] = s
	}

	stations := make([]string, 0, len(stationStats))
	for station := range stationStats {
		stations = append(stations, station)
	}
	sort.Strings(stations)

	fmt.Fprint(output, "{")
	for i, station := range stations {
		if i > 0 {
			fmt.Fprint(output, ", ")
		}
		s := stationStats[station]
		mean := s.sum / float64(s.count)
		fmt.Fprintf(output, "%s=%.1f/%.1f/%.1f", station, s.min, mean, s.max)
	}
	fmt.Fprint(output, "}\n")
	return nil
}
```
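For the five sample rows shown earlier, this would print the following (derived from the format string above; each station appears only once, so min, mean, and max coincide):

```
{Bulawayo=8.9/8.9/8.9, Cracow=12.6/12.6/12.6, Hamburg=12.0/12.0/12.0, Palembang=38.8/38.8/38.8, St. John's=15.2/15.2/15.2}
```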
The above code ran in 1m45s
Iteration 2: Reducing number of hash operations
It turns out that in the above, we’re doing a lot more hashing than we have to.
For each line, we’re hashing the string key (station) twice: once when we try to fetch the value from the map, and once when we update the map.
To avoid that, we can use a `map[string]*stats` (pointer values) and update the pointed-to struct, instead of using a `map[string]stats` and updating the hash table itself.
Running a CPU profiler on the Go program:

```
$ ./go-1brc -cpuprofile=cpu.prof -revision=1 measurements-10000000.txt >measurements-10000000.out
Processed 131.6MB in 965.888929ms
$ go tool pprof -http=: cpu.prof
```
Map operations are taking a full 30% of the time: 12.24% for assign and 17.35% for lookup.
```go
s := stationStats[station]
if s == nil {
	stationStats[station] = &stats{
		min:   temp,
		max:   temp,
		sum:   temp,
		count: 1,
	}
} else {
	s.min = min(s.min, temp)
	s.max = max(s.max, temp)
	s.sum += temp
	s.count++
}
```
In the above code, most lines hit the `else` branch, so the station key is hashed only once, for the lookup.
This reduced the run time by 14 seconds, to 1m31s
Iteration 3: Avoiding the standard library's ParseFloat
The standard library function handles a ton of edge cases we don't need to support for the simple temperatures our input has: 2 or 3 digits in the format `1.2` or `34.5` (and some with a minus sign in front).
Also, instead of copying and allocating a string when we use `Scanner.Text`, we can directly access a byte slice via `Scanner.Bytes`.
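A rough sketch of that change (not the author's exact code; the map-indexing detail is an assumption about how the lookup stays allocation-free):

```go
line := scanner.Bytes() // borrows the scanner's internal buffer: no copy, no allocation
station, tempBytes, hasSemi := bytes.Cut(line, []byte(";"))
if !hasSemi {
	continue
}
// Indexing the map with string(station) lets the Go compiler avoid
// allocating a string for the lookup (a copy is still made on insert).
s := stationStats[string(station)]
// ... parse tempBytes with the custom logic below ...
```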
Combined with writing custom logic for parsing temperatures, we reduce the operation time by 36 seconds, down to 55s.
```go
negative := false
index := 0
if tempBytes[index] == '-' {
	index++
	negative = true
}
temp := float64(tempBytes[index] - '0') // parse first digit
index++
if tempBytes[index] != '.' {
	temp = temp*10 + float64(tempBytes[index]-'0') // parse optional second digit
	index++
}
index++                                    // skip '.'
temp += float64(tempBytes[index]-'0') / 10 // parse decimal digit
if negative {
	temp = -temp
}
```
Iteration 4: Fixed point integers
Floating point instructions are generally slower than integer ones
Since each temperature has a single decimal digit, we can use fixed point integers to represent them
For example, `34.5` => `345`
```go
type stats struct {
	min, max, count int32
	sum             int64
}
```
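The parser from iteration 3 then produces an `int32` in tenths of a degree. A sketch of the same digit-by-digit logic (the author's exact integer version isn't reproduced in this summary):

```go
negative := false
index := 0
if tempBytes[index] == '-' {
	index++
	negative = true
}
temp := int32(tempBytes[index] - '0') // first digit
index++
if tempBytes[index] != '.' {
	temp = temp*10 + int32(tempBytes[index]-'0') // optional second digit
	index++
}
index++                                      // skip '.'
temp = temp*10 + int32(tempBytes[index]-'0') // tenths digit: "34.5" -> 345
if negative {
	temp = -temp
}
```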
The only other change is that you now divide the fixed-point integer by 10 when printing:
```go
mean := float64(s.sum) / float64(s.count) / 10
fmt.Fprintf(output, "%s=%.1f/%.1f/%.1f",
	station, float64(s.min)/10, mean, float64(s.max)/10)
```
Using integers cut the time down from 55.8 seconds to 51.0 seconds
Iteration 5: Avoid bytes.Cut
Observing the CPU profile again, the author looks for the easiest possible optimization: `bytes.Cut` takes 6% of the time. Instead of using `bytes.Cut`, we can extract the station and temperature ourselves, working backwards from the end of the line:

```go
end := len(line)
tenths := int32(line[end-1] - '0')
ones := int32(line[end-3] - '0') // line[end-2] is '.'
var temp int32
var semicolon int
if line[end-4] == ';' { // positive N.N temperature
	temp = ones*10 + tenths
	semicolon = end - 4
} else if line[end-4] == '-' { // negative -N.N temperature
	temp = -(ones*10 + tenths)
	semicolon = end - 5
} else {
	tens := int32(line[end-4] - '0')
	if line[end-5] == ';' { // positive NN.N temperature
		temp = tens*100 + ones*10 + tenths
		semicolon = end - 5
	} else { // negative -NN.N temperature
		temp = -(tens*100 + ones*10 + tenths)
		semicolon = end - 6
	}
}
station := line[:semicolon]
```
This leads to a modest improvement, from 51.0 seconds to 46.0 seconds
Iteration 6: Avoid bufio.Scanner
As the CPU profile also shows, `bufio.Scanner` takes up 15%. To find the end of each line, the scanner has to look through all the bytes for the newline character; then we process many of those bytes again to parse the temperature and find the `;`. So let's try to combine those steps and throw `bufio.Scanner` out the window.
The implementation involves:
- Allocating a 1MB buffer to read the file in chunks
- Looking for the last newline character in each chunk
- Processing the complete lines in each chunk, carrying any trailing partial line over to the next read (see the sketch after the excerpt below)
```go
buf := make([]byte, 1024*1024)
readStart := 0
for {
	n, err := f.Read(buf[readStart:])
	if err != nil && err != io.EOF {
		return err
	}
	if readStart+n == 0 {
		break
	}
	chunk := buf[:readStart+n]

	newline := bytes.LastIndexByte(chunk, '\n')
	if newline < 0 {
		break
	}
	remaining := chunk[newline+1:]
	chunk = chunk[:newline+1]

	for {
		station, after, hasSemi := bytes.Cut(chunk, []byte(";"))
		// ... process each line in the chunk as in earlier iterations ...
	}
	// ...
}
```
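The excerpt stops before the bookkeeping at the bottom of the loop; per the steps above, the leftover partial line has to be moved to the front of the buffer so the next `Read` appends after it. A one-line sketch of that step:

```go
// Carry the partial trailing line over to the next iteration: copy it to
// the start of buf, and let the next f.Read fill in after it.
readStart = copy(buf, remaining)
```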
Removing `bufio.Scanner` and doing our own scanning cut the time from 46.0 seconds to 41.3 seconds.
Iteration 7: Use custom hash table
Next, we’re left with no other area to optimize: the map operations dominate 56% of resources.
Implementing a custom hash table instead of using Go's `map` has two advantages:
- We can hash the station name as we look for the `;`, avoiding processing the bytes twice.
- We can store each key in our hash table as a byte slice, avoiding the need to convert each key to a `string` (which would allocate and copy for every line).
The implemented hash table uses the FNV-1a hash algorithm with linear probing: if there’s a collision, use the next empty slot.
The full code for this is omitted here, since there is not much more to learn from it; a rough sketch of the idea follows.
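A minimal sketch of the technique (not the author's implementation; `numBuckets`, `item`, and the function names are assumptions):

```go
const numBuckets = 1 << 17 // assumed size; a power of two lets us mask instead of mod

type item struct {
	key  []byte // station name; nil marks an empty slot
	stat *stats
}

// hashStation computes the FNV-1a hash of the station name while
// scanning for the ';', so the bytes are only walked once.
func hashStation(line []byte) (hash uint64, semicolon int) {
	hash = 14695981039346656037 // FNV-1a offset basis
	i := 0
	for line[i] != ';' {
		hash ^= uint64(line[i]) // XOR the byte first...
		hash *= 1099511628211   // ...then multiply by the FNV prime
		i++
	}
	return hash, i
}

// lookup finds (or inserts) the slot for station using linear probing:
// on a collision, try the next slot until the key or an empty slot is found.
func lookup(items []item, station []byte, hash uint64) *stats {
	index := int(hash & uint64(numBuckets-1))
	for {
		if items[index].key == nil {
			key := make([]byte, len(station)) // copy: the read buffer gets reused
			copy(key, station)
			items[index] = item{key: key, stat: &stats{}} // fields filled in by the caller
			return items[index].stat
		}
		if bytes.Equal(items[index].key, station) {
			return items[index].stat
		}
		index = (index + 1) & (numBuckets - 1)
	}
}
```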
The custom hash table cuts down the time from 41.3 seconds to 25.8s.
Iteration 8: Process chunks in parallel
The idea is to split the file into similar-sized chunks (one for each CPU core), fire up a thread (in Go, a goroutine) to process each chunk, and then merge the results at the end.
```go
// Determine non-overlapping parts for file split (each part has offset and size).
parts, err := splitFile(inputPath, maxGoroutines)
if err != nil {
	return err
}

// Start a goroutine to process each part, returning results on a channel.
resultsCh := make(chan map[string]r8Stats)
for _, part := range parts {
	go r8ProcessPart(inputPath, part.offset, part.size, resultsCh)
}

// Wait for the results to come back in and aggregate them.
totals := make(map[string]r8Stats)
for i := 0; i < len(parts); i++ {
	result := <-resultsCh
	for station, s := range result {
		ts, ok := totals[station]
		if !ok {
			totals[station] = r8Stats{
				min:   s.min,
				max:   s.max,
				sum:   s.sum,
				count: s.count,
			}
			continue
		}
		ts.min = min(ts.min, s.min)
		ts.max = max(ts.max, s.max)
		ts.sum += s.sum
		ts.count += s.count
		totals[station] = ts
	}
}
```
`splitFile` looks at the size of the file, divides it by the number of parts we want, and then seeks to each nominal boundary, reading the 100 bytes before it and looking for the last newline, to ensure each part ends with a full line.
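The article describes `splitFile` in prose; a sketch of what it might look like (the `part` type and the error handling are assumptions):

```go
type part struct {
	offset, size int64
}

func splitFile(inputPath string, numParts int) ([]part, error) {
	const maxLineLength = 100

	f, err := os.Open(inputPath)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	st, err := f.Stat()
	if err != nil {
		return nil, err
	}
	size := st.Size()
	splitSize := size / int64(numParts)

	buf := make([]byte, maxLineLength)
	parts := make([]part, 0, numParts)
	offset := int64(0)
	for offset < size {
		nominalEnd := offset + splitSize
		if nominalEnd >= size {
			parts = append(parts, part{offset, size - offset})
			break
		}
		// Read the 100 bytes before the nominal end and snap the boundary
		// back to the last newline so the part ends with a full line.
		if _, err := f.Seek(nominalEnd-maxLineLength, io.SeekStart); err != nil {
			return nil, err
		}
		n, err := io.ReadFull(f, buf)
		if err != nil {
			return nil, err
		}
		newline := bytes.LastIndexByte(buf[:n], '\n')
		if newline < 0 {
			return nil, fmt.Errorf("no newline near offset %d", nominalEnd)
		}
		end := nominalEnd - maxLineLength + int64(newline) + 1
		parts = append(parts, part{offset, end - offset})
		offset = end
	}
	return parts, nil
}
```

Because boundaries snap backwards, the part count can drift by one from `numParts`; the last part simply absorbs the remainder.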
`r8ProcessPart` is shown below:

```go
func r8ProcessPart(inputPath string, fileOffset, fileSize int64,
	resultsCh chan map[string]r8Stats,
) {
	file, err := os.Open(inputPath)
	if err != nil {
		panic(err)
	}
	defer file.Close()
	_, err = file.Seek(fileOffset, io.SeekStart)
	if err != nil {
		panic(err)
	}
	f := io.LimitedReader{R: file, N: fileSize}

	stationStats := make(map[string]r8Stats)

	scanner := bufio.NewScanner(&f)
	for scanner.Scan() {
		// ... same processing as r1 ...
	}

	resultsCh <- stationStats
}
```
Processing the input file in parallel (applied here to the simple iteration 1 code, hence the comparison against the original baseline) takes the time from 1 minute 45 seconds to 24.3 seconds.
Final iteration: Combining iterations 7 and 8
This final version cut down the time from 24.3 seconds to 3.99 seconds, a huge win.
Conclusion
Thomas Wuerthinger (with credit to others) created the fastest overall solution to the original challenge in Java
Beyond the techniques above, other fast solutions have also tried:
- memory-mapped files
- unrolled loops, non-branching parsing code, and other low-level tricks.
For those interested, there are also takes on 1BRC at the database level:
- 1BRC in SQL with DuckDB by Robin Moffatt
- 1 billion rows challenge in PostgreSQL and ClickHouse