Benchmarks

The benchmark harness uses Google Benchmark, optionally augmented with hardware counters via libpfm on Linux. Each push to main runs the full matrix across several compilers; JSON results and generated SVG charts are committed to the orphan benchmark-results branch.

The latest charts (gcc-15):

  • Single-value latency (gcc-15)

  • Buffer-fill throughput, scalar vs SIMD (gcc-15)

  • Scalar vs SIMD speedup (gcc-15)

CodSpeed tracks per-benchmark CPU-cycle regressions on every PR.

Performance

The charts above measure the two regimes that matter when picking a generator: single-value latency and bulk throughput. They tell different stories, so read them together.

Generating u64

u64 is the native output of every engine (result_type is std::uint64_t): operator() returns one value, generate(buf, n) fills a buffer. Everything else — doubles, distributions — is built on top of it, so this is the number to look at first.

Every SIMD engine generates one 256-value block at a time into an internal buffer and hands it out one value per operator() call; the block is refilled only when it drains. The benchmarks run millions of iterations, so the single-value latency you see already includes that refill, amortised across the 256 calls between refills (well under 0.5 ns/call). The fill cost does not dominate it.

Instead, the per-call number is dominated by the buffered read itself: load the cursor, load buffer[cursor], write the cursor back. That memory round-trip costs a few nanoseconds, and for an engine as cheap as Xoshiro it is more expensive than computing the next value in registers, which is why scalar Xoshiro single-value (~1 ns) beats SIMD/Native (~3 ns). (Native shows the same ~3 ns as the dispatched SIMD path, confirming the cost is the buffered read, not runtime dispatch.) For engines whose per-value compute is expensive, the buffered SIMD block wins even one-at-a-time: ChaCha20 is ~3.5x and Philox4x32 ~1.8x faster in SIMD than scalar.
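The buffered single-value path described above can be sketched roughly as follows. This is an illustration, not the library's implementation: BufferedEngine, SplitMix64, kBlock and refills() are hypothetical names, and a scalar SplitMix64 loop stands in for the vectorised block fill.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the vectorised block fill: SplitMix64, one value
// at a time. A real SIMD engine would compute several lanes per step.
struct SplitMix64 {
    std::uint64_t state;
    std::uint64_t next() {
        std::uint64_t z = (state += 0x9e3779b97f4a7c15ULL);
        z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
        z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
        return z ^ (z >> 31);
    }
};

// Buffered front end: refill a 256-value block, hand out one value per call.
class BufferedEngine {
public:
    using result_type = std::uint64_t;
    static constexpr std::size_t kBlock = 256;

    explicit BufferedEngine(std::uint64_t seed) : core_{seed} {}

    result_type operator()() {
        if (cursor_ == kBlock) {  // block drained: refill once per 256 calls
            refill();
        }
        assert(cursor_ < kBlock);
        // Steady-state cost: load the cursor, load buffer_[cursor_], then
        // store the incremented cursor.
        return buffer_[cursor_++];
    }

    std::size_t refills() const { return refills_; }

private:
    void refill() {
        for (auto& v : buffer_) v = core_.next();
        cursor_ = 0;
        ++refills_;
    }

    SplitMix64 core_;
    std::array<std::uint64_t, kBlock> buffer_{};
    std::size_t cursor_ = kBlock;  // start drained so the first call refills
    std::size_t refills_ = 0;      // instrumentation for this sketch only
};
```

In this sketch, 600 calls trigger three refills (at calls 1, 257 and 513); every other call is only the cursor load, buffer load and cursor store.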

Bulk generate() is a different story. The SIMD lanes run flat out, writing straight to the caller’s buffer with no per-element cursor, so the read overhead vanishes: Xoshiro reaches ~2.3x its scalar loop (sub-0.5 ns/value). Use the bulk API whenever you need many values at once (filling a tensor, a batch of draws) — it is the fastest path by a wide margin. Reserve single-value calls for when results are consumed one at a time (rejection sampling, branchy Monte-Carlo inner loops), and prefer the scalar engine there for cheap generators.
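As a rough illustration of the two consumption patterns, the sketch below pairs a bulk fill with a rejection-sampling loop. ScalarXoshiro, fill_batch and bounded are hypothetical names, the fixed state is for demonstration only, and the bulk path here is a plain loop over xoshiro256++ rather than the library's SIMD kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar engine for illustration: xoshiro256++ with a fixed,
// non-zero state. The library's real class names and seeding are not shown
// in this document.
class ScalarXoshiro {
public:
    using result_type = std::uint64_t;

    result_type operator()() {
        const std::uint64_t result = rotl(s_[0] + s_[3], 23) + s_[0];
        const std::uint64_t t = s_[1] << 17;
        s_[2] ^= s_[0];
        s_[3] ^= s_[1];
        s_[1] ^= s_[2];
        s_[0] ^= s_[3];
        s_[2] ^= t;
        s_[3] = rotl(s_[3], 45);
        return result;
    }

    // Bulk path: write straight into the caller's buffer, no internal cursor.
    void generate(std::uint64_t* buf, std::size_t n) {
        assert(buf != nullptr || n == 0);
        for (std::size_t i = 0; i < n; ++i) buf[i] = (*this)();
    }

private:
    static std::uint64_t rotl(std::uint64_t x, int k) {
        return (x << k) | (x >> (64 - k));
    }
    std::uint64_t s_[4] = {1, 2, 3, 4};
};

// Many values at once: one bulk call.
inline std::vector<std::uint64_t> fill_batch(ScalarXoshiro& rng, std::size_t n) {
    std::vector<std::uint64_t> out(n);
    rng.generate(out.data(), out.size());
    return out;
}

// Values consumed one at a time: rejection sampling for an unbiased integer
// in [0, bound). The number of draws is data-dependent, so a bulk fill does
// not fit naturally.
inline std::uint64_t bounded(ScalarXoshiro& rng, std::uint64_t bound) {
    assert(bound > 0);
    // Largest x that keeps the accepted range a whole multiple of bound.
    const std::uint64_t limit = UINT64_MAX - (UINT64_MAX % bound + 1) % bound;
    std::uint64_t x;
    do { x = rng(); } while (x > limit);
    return x % bound;
}
```

The design point is the shape of the consumer: fill_batch knows its count up front, while bounded cannot know how many draws it will need.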

Generating doubles: uniform() vs std::uniform_real_distribution

There are two ways to turn u64 into a uniform double in [0, 1):

  • The built-in uniform() takes the top 53 bits and scales them: (x >> 11) * 0x1.0p-53. One shift and one multiply on a value the engine already produced — essentially free on top of the u64 cost.

  • std::uniform_real_distribution<double> is the portable standard-library route. It is correct and engine-agnostic, but does considerably more work per draw (range handling, potentially multiple generator pulls).

The gap is large: scalar Xoshiro produces a double in ~1.6 ns via uniform() versus ~5.2 ns through std::uniform_real_distribution, roughly 3x. Both yield a uniform double in [0, 1); reach for the standard distribution only when you need its exact semantics or a non-unit range. For buffers, fill_uniform(buf, n) applies the shift-and-multiply over a bulk generate().
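The shift-and-multiply conversion is small enough to show in full. The helper names to_unit_double and to_unit_doubles are hypothetical, chosen for this sketch; only the expression (x >> 11) * 0x1.0p-53 comes from the text above. Hexadecimal floating-point literals require C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Map a 64-bit value to a double in [0, 1): keep the top 53 bits (the width
// of a double's significand) and scale by 2^-53.
inline double to_unit_double(std::uint64_t x) {
    return static_cast<double>(x >> 11) * 0x1.0p-53;
}

// Hypothetical buffer variant: convert already-generated u64 values in bulk.
inline void to_unit_doubles(const std::uint64_t* in, double* out, std::size_t n) {
    assert((in != nullptr && out != nullptr) || n == 0);
    for (std::size_t i = 0; i < n; ++i) out[i] = to_unit_double(in[i]);
}
```

Because only 53 bits survive the shift, the largest possible result is 1 - 2^-53, so 1.0 itself is never returned.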

32-bit integer / float output paths are not currently exposed; a narrower element type would roughly double the number of elements generated per second and halve the buffer footprint, since cost is dominated by bytes generated.

Memory requirements

Each generator instance is self-contained and heap-free: a small core state (the Xoshiro lanes, or the Philox counter+key; tens to ~100 bytes) plus the fixed 256-element output buffer the single-value path draws from (2 KiB for u64). State and buffer are over-aligned to the SIMD register width so bulk stores stay aligned. Because everything lives inline with no heap pointers, instances are trivially copyable; because there is no shared or global state, it is safe to keep one per thread.
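A layout along these lines would satisfy the properties listed above. It is a sketch under stated assumptions: EngineLayoutSketch and its member names are hypothetical, the 32-byte alignment assumes 256-bit SIMD registers, and the core state is modelled on a Philox4x32 counter and key. The library's actual members may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical layout, assuming 256-bit (32-byte) SIMD registers and a
// Philox4x32-style core state.
struct EngineLayoutSketch {
    alignas(32) std::uint32_t counter[4];   // 16 bytes
    std::uint32_t key[2];                   // 8 bytes
    alignas(32) std::uint64_t buffer[256];  // 256 x 8 bytes = 2 KiB
    std::size_t cursor;                     // next unread buffer slot
};

static_assert(sizeof(EngineLayoutSketch::buffer) == 2048, "buffer is 2 KiB");
static_assert(alignof(EngineLayoutSketch) == 32,
              "over-aligned to the assumed register width");
static_assert(std::is_trivially_copyable<EngineLayoutSketch>::value,
              "plain arrays, no heap pointers: copying is a memcpy");
```

The static_asserts make the claims checkable at compile time: the buffer size, the alignment, and the fact that a struct of plain arrays can be copied bytewise.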

Running benchmarks locally

cmake --preset release
cmake --build build/release --target benchmarks
./build/release/tests/benchmarks \
    --benchmark_perf_counters=CYCLES,INSTRUCTIONS,CACHE-MISSES,BRANCH-MISSES,BRANCHES \
    --benchmark_format=json --benchmark_out=bench.json

python scripts/generate_charts.py bench.json