Benchmarks
The benchmark harness uses Google Benchmark, optionally augmented with
hardware counters via libpfm on Linux. Each push to main runs the full
matrix across several compilers; JSON results and generated SVG charts are
committed to the orphan benchmark-results branch.
The latest charts (gcc-15):
CodSpeed tracks per-benchmark CPU-cycle regressions on every PR.
Performance
The charts above measure the two regimes that matter when picking a generator: single-value latency and bulk throughput. They tell different stories, so read them together.
Generating u64
u64 is the native output of every engine (result_type is
std::uint64_t): operator() returns one value, generate(buf, n) fills
a buffer. Everything else — doubles, distributions — is built on top of it, so
this is the number to look at first.
Every SIMD engine generates a 256-value block at a time into an internal
buffer and hands it out one value per operator() call; the block is refilled
only when it drains. The benchmarks run millions of iterations, so the
single-value latency you see already includes that refill, amortised across the
256 calls between refills (well under 0.5 ns/call) — it is not dominated by
the fill cost.
What the per-call number is dominated by is the buffered read itself: load the
cursor, load buffer[cursor], write the cursor back. That memory round-trip
costs a few nanoseconds, and for an engine as cheap as Xoshiro it is more
expensive than just computing the next value in registers — which is why
scalar Xoshiro single-value (~1 ns) beats SIMD/Native (~3 ns). (Native shows
the same ~3 ns as the dispatched SIMD path, confirming the cost is the buffered
read, not runtime dispatch.) For engines whose per-value compute is expensive,
the buffered SIMD block wins even one-at-a-time: ChaCha20 is ~3.5x and Philox4x32
~1.8x faster in SIMD than scalar.
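To make the cursor-and-buffer cost concrete, here is a minimal sketch of the buffered single-value path described above. It is not the library's actual class: BufferedEngine, SplitMix64 as the stand-in core, and the refills() counter are all hypothetical names chosen for illustration, and a scalar loop stands in for the SIMD fill.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in scalar core (SplitMix64). Assumption: the real engines fill the
// block with SIMD lanes; a scalar core keeps the sketch portable.
struct SplitMix64 {
    std::uint64_t state;
    std::uint64_t next() {
        std::uint64_t z = (state += 0x9E3779B97F4A7C15ULL);
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
        return z ^ (z >> 31);
    }
};

// Hypothetical buffered wrapper illustrating the 256-value block.
class BufferedEngine {
public:
    using result_type = std::uint64_t;
    static constexpr std::size_t kBlock = 256;

    explicit BufferedEngine(std::uint64_t seed) : core_{seed} {}

    // Single-value path: load the cursor, load buffer[cursor], store the
    // cursor back. The refill branch is taken once per kBlock calls.
    result_type operator()() {
        if (cursor_ == kBlock) refill();
        return buffer_[cursor_++];
    }

    // Bulk path: write straight into the caller's buffer, no cursor traffic.
    void generate(result_type* out, std::size_t n) {
        assert(out != nullptr || n == 0);
        for (std::size_t i = 0; i < n; ++i) out[i] = core_.next();
    }

    // Instrumentation for the sketch only.
    std::size_t refills() const { return refills_; }

private:
    void refill() {
        for (auto& v : buffer_) v = core_.next();
        cursor_ = 0;
        ++refills_;
    }

    SplitMix64 core_;
    std::array<result_type, kBlock> buffer_{};
    std::size_t cursor_ = kBlock;  // start drained so the first call refills
    std::size_t refills_ = 0;
};
```

Every operator() call touches memory three times regardless of how cheap the core is, which is the overhead the scalar Xoshiro path avoids by keeping its state in registers.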
Bulk generate() is a different story. The SIMD lanes run flat out, writing
straight to the caller’s buffer with no per-element cursor, so the read overhead
vanishes: Xoshiro reaches ~2.3x the throughput of its scalar loop (sub-0.5 ns/value). Use the
bulk API whenever you need many values at once (filling a tensor, a batch of
draws) — it is the fastest path by a wide margin. Reserve single-value calls for
when results are consumed one at a time (rejection sampling, branchy
Monte-Carlo inner loops), and prefer the scalar engine there for cheap
generators.
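The two usage patterns can be sketched as follows. The helpers draw_batch and bounded are hypothetical, written against any engine type that offers operator() and generate(buf, n) as described above; CounterEngine is a toy stand-in so the example compiles on its own, not a real generator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy engine for illustration only: emits 0, 1, 2, ...
struct CounterEngine {
    std::uint64_t n = 0;
    std::uint64_t operator()() { return n++; }
    void generate(std::uint64_t* out, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) out[i] = n++;
    }
};

// Bulk pattern: one generate() call fills the whole batch.
template <class Engine>
std::vector<std::uint64_t> draw_batch(Engine& eng, std::size_t count) {
    std::vector<std::uint64_t> out(count);
    eng.generate(out.data(), out.size());
    return out;
}

// Single-value pattern: rejection sampling for an unbiased integer in
// [0, bound). Each draw is inspected before deciding whether to pull
// another, so values are naturally consumed one at a time.
template <class Engine>
std::uint64_t bounded(Engine& eng, std::uint64_t bound) {
    assert(bound != 0);
    // 2^64 mod bound: values below this would bias the modulo result.
    const std::uint64_t threshold = (std::uint64_t{0} - bound) % bound;
    for (;;) {
        const std::uint64_t x = eng();
        if (x >= threshold) return x % bound;
    }
}
```

A batch consumer pays one call for the whole buffer, while the rejection loop cannot know in advance how many draws it needs, which is why it fits the single-value path.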
Generating doubles: uniform() vs std::uniform_real_distribution
There are two ways to turn u64 into a uniform double in [0, 1):
- The built-in uniform() takes the top 53 bits and scales them: (x >> 11) * 0x1.0p-53. One shift and one multiply on a value the engine already produced, so it is essentially free on top of the u64 cost.
- std::uniform_real_distribution<double> is the portable standard-library route. It is correct and engine-agnostic, but does considerably more work per draw (range handling, potentially multiple generator pulls).
The gap is large: scalar Xoshiro produces a double in ~1.6 ns via uniform()
versus ~5.2 ns through std::uniform_real_distribution, roughly 3x. Both
yield a uniform double in [0, 1); reach for the standard distribution only
when you need its exact semantics or a non-unit range. For buffers,
fill_uniform(buf, n) applies the same shift-and-multiply over a bulk generate().
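The shift-and-multiply is small enough to show in full. The sketch below is a standalone illustration of the formula, not the library's source: to_unit_double and to_unit_doubles are hypothetical names, and the hexadecimal floating-point literal requires C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Map a 64-bit value to a double in [0, 1): keep the top 53 bits (the width
// of a double's significand) and scale by 2^-53.
inline double to_unit_double(std::uint64_t x) {
    return static_cast<double>(x >> 11) * 0x1.0p-53;
}

// Apply the same conversion across a buffer of already-generated values.
inline void to_unit_doubles(const std::uint64_t* in, double* out, std::size_t n) {
    assert((in != nullptr && out != nullptr) || n == 0);
    for (std::size_t i = 0; i < n; ++i) out[i] = to_unit_double(in[i]);
}
```

Because the shifted value is at most 2^53 - 1, the largest result is 1 - 2^-53, so 1.0 itself is never produced, and every output is an exact multiple of 2^-53.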
32-bit integer / float output paths are not currently exposed; a narrower
element type would roughly double per-second element counts and halve the buffer
footprint, since cost is dominated by bytes generated.
Memory requirements
Each generator instance is self-contained and heap-free: a small core state (the Xoshiro lanes, or the Philox counter+key; tens to ~100 bytes) plus the fixed 256-element output buffer the single-value path draws from (2 KiB for u64). State and buffer are over-aligned to the SIMD register width so bulk stores stay aligned. Because instances own no heap memory and share no global state, they are trivially copyable and safe to keep one per thread.
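As a rough picture of that footprint, the sketch below lays out a hypothetical engine and checks the stated properties at compile time. EngineLayout, its field names, the 8-word core state, and the 32-byte alignment (an assumed AVX2 register width) are illustrative assumptions, not the library's real definitions. Requires C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

inline constexpr std::size_t kAlign = 32;   // assumption: AVX2 register width in bytes
inline constexpr std::size_t kBlock = 256;  // output block size, from the text above

// Hypothetical layout: everything lives inline, nothing on the heap.
struct EngineLayout {
    alignas(kAlign) std::uint64_t state[8];        // assumed core state: 64 bytes
    alignas(kAlign) std::uint64_t buffer[kBlock];  // 256 * 8 bytes = 2 KiB
    std::size_t cursor;                            // next unread slot in buffer
};

static_assert(sizeof(EngineLayout::buffer) == 2048, "buffer is 2 KiB for u64");
static_assert(alignof(EngineLayout) == kAlign, "over-aligned to register width");
static_assert(std::is_trivially_copyable_v<EngineLayout>,
              "no heap or shared state, so a plain copy duplicates the stream");

// One instance per thread: no locking is needed because nothing is shared.
inline EngineLayout& per_thread_engine() {
    thread_local EngineLayout engine{};
    assert(reinterpret_cast<std::uintptr_t>(engine.buffer) % kAlign == 0);
    return engine;
}
```

Note that copying an instance copies its buffer and cursor too, so the copy replays the same sequence as the original; reseed one of them if independent streams are needed.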
Running benchmarks locally
cmake --preset release
cmake --build build/release --target benchmarks
./build/release/tests/benchmarks \
    --benchmark_perf_counters=CYCLES,INSTRUCTIONS,CACHE-MISSES,BRANCH-MISSES,BRANCHES \
    --benchmark_format=json --benchmark_out=bench.json
python scripts/generate_charts.py bench.json