1BRC C#: full-size performance work on Apple Silicon
This is the project page for my C# implementation of the One Billion Row Challenge. The long build log is published separately as a project blog.
The short version: I built a standalone .NET 10 solver, validated it against the official fixture suite, pushed it through many candidate optimizations, and ended up with a size-aware input strategy. On my 10-core Apple Silicon machine, bounded 100M-row runs still favor the memory-mapped parser, but the full 1B file exposed a different bottleneck. The macOS pread pipeline cut warm full-size runs from roughly 16 seconds to roughly 4 seconds by avoiding the mmap fault profile that dominated the large input.
Why this belongs in Projects
The final code is only part of the work. The reasoning trail matters too:
- start with correctness before speed
- replace line/string processing with byte parsing
- use integer tenths instead of floating point in the hot path
- keep station names as byte slices until final output
- aggregate through per-worker native-memory tables
- split ranges only on line boundaries
- reject attractive ideas when paired timings do not support them
- change the input strategy after full-size evidence showed the real bottleneck
The result is a small systems project with a research notebook attached. The blog teaches the system from the ground up; the repository keeps the implementation, validation scripts, architecture notes, and experiment history.
System shape
The production solver is organized around a few boundaries:
Program
-> OneBrcSolver
-> RuntimeOptions
-> mmap path: RangePartitioner + WorkerScheduler
-> pread path: PReadRangePartitioner + PReadWorkerScheduler + NativePRead
-> MeasurementParser
-> StationKey + StationTable + StationEntry
-> ResultFormatter
The parser reads rows as bytes:
station-name;temperature\n
Temperature values become integer tenths:
12.3 -> 123
-5.0 -> -50
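The conversion above can be sketched as a small byte-level routine. This is a minimal illustration, not the project's MeasurementParser: the class and method names are hypothetical, and it assumes the input is valid ASCII with an optional leading minus sign and exactly one fractional digit, as in the challenge data.

```csharp
using System;

public static class TemperatureSketch
{
    // Hypothetical helper: turns ASCII bytes such as "12.3" or "-5.0"
    // into integer tenths (123, -50). Assumes well-formed input.
    public static int ParseTenths(ReadOnlySpan<byte> text)
    {
        int i = 0;
        bool negative = text[0] == (byte)'-';
        if (negative) i = 1;

        int value = 0;
        for (; i < text.Length; i++)
        {
            byte b = text[i];
            if (b == (byte)'.') continue;          // skip the decimal point
            value = value * 10 + (b - (byte)'0');  // accumulate the next digit
        }
        return negative ? -value : value;
    }
}
```

Because the value never leaves integer arithmetic, min, max, and sum can be kept as plain integers, and division by ten is only needed when formatting the final output.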
Station names are not decoded during parsing. A table entry keeps a pointer, length, first word, last word, hash, and aggregate state. Only the final formatter decodes station names for sorting and output.
Main design decisions
The first working architecture used a memory-mapped file. It split the mapped bytes into line-aligned ranges, gave each worker its own table, and merged those tables after parsing. That design avoided locks in the hot loop and avoided one string allocation per row.
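Line-aligned splitting can be sketched roughly as follows. This is an illustration under assumptions, not the project's RangePartitioner: the names are hypothetical, and it uses a ReadOnlySpan<byte> with int offsets for brevity, whereas a full-size file would need 64-bit offsets over mapped memory.

```csharp
using System;
using System.Collections.Generic;

public static class RangeSketch
{
    // Hypothetical helper: cuts data into roughly equal ranges, moving each
    // cut forward to just past the next '\n' so no row straddles two ranges.
    public static List<(int Start, int End)> Split(ReadOnlySpan<byte> data, int workers)
    {
        var ranges = new List<(int Start, int End)>();
        int target = Math.Max(1, data.Length / workers);
        int start = 0;
        while (start < data.Length)
        {
            int end = Math.Min(data.Length, start + target);
            if (end < data.Length)
            {
                // Look for the newline that closes the row containing the cut.
                int nl = data.Slice(end).IndexOf((byte)'\n');
                end = nl < 0 ? data.Length : end + nl + 1;
            }
            ranges.Add((start, end));
            start = end;
        }
        return ranges;
    }
}
```

Each worker can then parse its range without coordinating with its neighbors, since every range starts at the first byte of a row and ends after a newline.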
The table is fixed-size because the challenge caps unique stations at 10,000. The production table uses 32,768 buckets, so the load factor never exceeds about 0.31 (10,000 / 32,768). That leaves enough room to keep probing cheap without resize logic.
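To make the fixed-size idea concrete, here is a minimal open-addressing sketch with 32,768 buckets and linear probing. It is not the project's StationTable: it uses managed arrays instead of native memory, FNV-1a as a stand-in hash chosen only for brevity, and hypothetical names throughout. It assumes the number of distinct keys stays well below the bucket count, so a probe always finds a free slot.

```csharp
using System;

public sealed class StationTableSketch
{
    private const int BucketCount = 32768;   // power of two, so a mask replaces modulo
    private const int Mask = BucketCount - 1;

    private readonly byte[][] _names = new byte[BucketCount][];
    private readonly int[] _min = new int[BucketCount];
    private readonly int[] _max = new int[BucketCount];
    private readonly long[] _sum = new long[BucketCount];
    private readonly int[] _count = new int[BucketCount];

    // FNV-1a, used here as an assumption; the real hash is not described in the text.
    private static uint Hash(ReadOnlySpan<byte> name)
    {
        unchecked
        {
            uint h = 2166136261;
            foreach (byte b in name) { h ^= b; h *= 16777619; }
            return h;
        }
    }

    // Linear probing: walk forward until the key or an empty slot is found.
    private int FindSlot(ReadOnlySpan<byte> name)
    {
        int slot = (int)(Hash(name) & (uint)Mask);
        while (_names[slot] != null && !name.SequenceEqual(new ReadOnlySpan<byte>(_names[slot])))
            slot = (slot + 1) & Mask;
        return slot;
    }

    public void Add(ReadOnlySpan<byte> name, int tenths)
    {
        int slot = FindSlot(name);
        if (_names[slot] == null)
        {
            _names[slot] = name.ToArray();   // copied once per unique station
            _min[slot] = tenths;
            _max[slot] = tenths;
        }
        else
        {
            if (tenths < _min[slot]) _min[slot] = tenths;
            if (tenths > _max[slot]) _max[slot] = tenths;
        }
        _sum[slot] += tenths;
        _count[slot]++;
    }

    public bool TryGet(ReadOnlySpan<byte> name, out int min, out int max, out long sum, out int count)
    {
        int slot = FindSlot(name);
        min = _min[slot]; max = _max[slot]; sum = _sum[slot]; count = _count[slot];
        return _names[slot] != null;
    }
}
```

With one such table per worker, the hot loop needs no locks; the per-worker tables are merged only after parsing finishes.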
The full-size pread path exists because the full input behaved differently from the 100M input. In the pread path, each worker reads its range through a reusable 16 MiB native buffer. Since those buffers are overwritten, the table copies station names only when a station is first inserted. That copy is per unique station per worker, not per row.
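The read loop behind a reusable buffer can be sketched like this. It is a simplified stand-in, not the project's NativePRead pipeline: it uses the managed RandomAccess.Read positional-read API and a byte[] instead of a native buffer, it only counts rows rather than parsing them, and the names are hypothetical. It assumes the range is line-aligned and that every row fits in the buffer.

```csharp
using System;
using System.IO;
using Microsoft.Win32.SafeHandles;

public static class PReadSketch
{
    // Hypothetical helper: counts rows in [start, end) with positional reads
    // into one caller-supplied buffer that is overwritten on every iteration.
    public static long CountRows(SafeFileHandle file, long start, long end, byte[] buffer)
    {
        long rows = 0;
        long offset = start;
        while (offset < end)
        {
            int want = (int)Math.Min(buffer.Length, end - offset);
            int got = RandomAccess.Read(file, buffer.AsSpan(0, want), offset);
            if (got <= 0) break;                       // unexpected end of file

            ReadOnlySpan<byte> chunk = buffer.AsSpan(0, got);
            int lastNewline = chunk.LastIndexOf((byte)'\n');
            if (lastNewline < 0) break;                // no complete row in this chunk; the sketch gives up

            // Process only complete rows; a real worker would parse them here.
            ReadOnlySpan<byte> complete = chunk.Slice(0, lastNewline + 1);
            foreach (byte b in complete)
            {
                if (b == (byte)'\n') rows++;
            }

            // The partial trailing row is re-read at the start of the next chunk.
            offset += lastNewline + 1;
        }
        return rows;
    }
}
```

Because the buffer contents disappear on the next read, anything that must outlive a chunk, such as a newly seen station name, has to be copied out before the loop continues.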
What lost
Several candidates passed correctness and still lost:
- SIMD temperature parsing lost because temperature bytes sit behind variable-length station names.
- 64-byte SIMD row framing lost because mask extraction and row bookkeeping cost more than the scalar delimiter scan saved.
- station front caches lost because the native table already averaged close to one probe on canonical data.
- known-station direct aggregation lost because exact matching and separate aggregate handling added more work than the normal table path.
- control-byte metadata helped too little on high-cardinality data and hurt canonical data.
- dynamic microshards reduced some CPU counters but did not robustly improve full-size wall time.
That rejection trail is part of the project. It is easy to write fast-looking code. It is harder to show that the code should survive.
Repro and porting notes
The repository includes:
- README.md for running and validating the solver
- ARCHITECTURE.md for component boundaries and porting notes
- RESEARCH_NOTES.md for experiment history
- BRC_THREADS, BRC_IO=mmap, and BRC_IO=pread controls for local calibration
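As a rough illustration of how such controls might be read, the sketch below parses the two environment variables. Only the variable names and the mmap and pread values come from this page; the type names, the fallback behavior when a variable is unset or invalid, and the Auto mode are assumptions made for the example, not the project's RuntimeOptions.

```csharp
#nullable enable
using System;

public enum IoModeSketch { Auto, Mmap, PRead }

public static class RuntimeOptionsSketch
{
    // Assumption: a missing or invalid BRC_THREADS falls back to a caller-supplied default.
    public static int ParseThreads(string? raw, int fallback) =>
        int.TryParse(raw, out int n) && n > 0 ? n : fallback;

    // Assumption: any value other than "mmap" or "pread" leaves the choice to the solver.
    public static IoModeSketch ParseIo(string? raw) => raw switch
    {
        "mmap" => IoModeSketch.Mmap,
        "pread" => IoModeSketch.PRead,
        _ => IoModeSketch.Auto,
    };

    public static (int Threads, IoModeSketch Io) FromEnvironment() =>
        (ParseThreads(Environment.GetEnvironmentVariable("BRC_THREADS"), Environment.ProcessorCount),
         ParseIo(Environment.GetEnvironmentVariable("BRC_IO")));
}
```

Keeping the parsing separate from the environment lookup makes it easy to sweep worker counts and input modes from a calibration script or a test.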
The current defaults are tuned for my Apple Silicon machine. A different CPU or OS should be treated as a fresh experiment. The right path is to run the official fixtures, compare output parity on generated data, sweep worker counts, then profile the full-size file before changing parser or table code.
The project blog walks through the build in detail: read the full write-up.