methodology · last updated 2026-05-17

The equations behind Roofline.

A first-order analytical model of training and serving large transformers on modern accelerators. Useful as a mental model; not a substitute for measured runs. This appendix records the equations we use, the constants we assume, and the failure modes we know about.

1. The roofline model

The roofline model^[1] plots achievable throughput against arithmetic intensity. It separates a workload into the two regimes that a single accelerator can be in: bound by peak compute, or bound by memory bandwidth. The crossover point between the two is a function of the hardware alone, not the workload.

Arithmetic intensity (AI) is FLOPs per byte moved between HBM and SRAM. The ridge point (RP) is the AI at which the accelerator transitions from memory-bound to compute-bound.

AI = FLOPs / bytes_moved

(1)

RP = peak_FLOPS / peak_HBM_bandwidth

(2)

throughput = min( peak_FLOPS, AI × peak_HBM_bandwidth ) × MFU

(3)

MFU (model FLOPs utilization) is the empirical derate that captures kernel-level inefficiency: dispatch overhead, non-tensor ops, mixed-precision conversions, partial waves. We default to 0.45 for training, 0.35 for decode, and 0.55 for prefill on H100-class hardware^[2]. The slider exposes this for users matching a measured run.

Implementation: calculateRidgePoint, calculateArithmeticIntensity, achievedThroughput in lib/sim/roofline.ts.
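As a rough illustration of equations (1) to (3), the sketch below computes arithmetic intensity, the ridge point, and derated throughput. It is not the contents of lib/sim/roofline.ts: the function names, the Accelerator type, and the H100-like peak figures (989 TFLOP/s dense BF16, 3.35 TB/s HBM) are assumptions chosen for the example, as are the two sample arithmetic intensities.

```typescript
// Hypothetical helpers for illustration only; names and constants are assumptions.
interface Accelerator {
  peakFlops: number;        // peak compute, FLOP/s
  peakHbmBandwidth: number; // peak HBM bandwidth, bytes/s
}

// Eq. (1): FLOPs per byte moved between HBM and SRAM.
function arithmeticIntensity(flops: number, bytesMoved: number): number {
  if (bytesMoved <= 0) throw new RangeError("bytesMoved must be positive");
  return flops / bytesMoved;
}

// Eq. (2): the AI at which the roof switches from bandwidth to compute.
function ridgePoint(hw: Accelerator): number {
  return hw.peakFlops / hw.peakHbmBandwidth;
}

// Eq. (3): the lower of the two roofs, derated by MFU.
function rooflineThroughput(hw: Accelerator, ai: number, mfu: number): number {
  return Math.min(hw.peakFlops, ai * hw.peakHbmBandwidth) * mfu;
}

// Assumed H100-SXM-like peaks; swap in measured values where available.
const h100Like: Accelerator = { peakFlops: 989e12, peakHbmBandwidth: 3.35e12 };

const rp = ridgePoint(h100Like);                             // roughly 295 FLOPs/byte
const decodeAi = arithmeticIntensity(2e9, 1e9);              // 2 FLOPs/byte, far below rp
const decode = rooflineThroughput(h100Like, decodeAi, 0.35); // bandwidth roof applies
const prefill = rooflineThroughput(h100Like, 1000, 0.55);    // compute roof applies

console.log({ rp, decode, prefill });
```

With these assumed inputs, the low-intensity case is limited by the bandwidth term and the high-intensity case by peak_FLOPS, which is the split that the ridge point marks.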

Headline assumptions: steady-state operation, uniform memory access, and overlap between compute and communication. See the Assumptions list below for the full set.

Assumptions

Steady-state operation — no warmup, no restarts.
Uniform memory access — HBM at ~90–100 % when memory-bound.
Full compute–comm overlap — single overlap_fraction = 0.8.
Continuous batching efficiency = 0.85.
TP factor = 0.85 inside one NVLink domain.

Reality checks

Llama 3 70B params — dense formula within 5 %.
70B training memory — 1.26 TB vs playbook 1.40 TB.
KV cache @ b=16, s=8k — 43 GB.
GPT-3 175B time-to-train — 28 d vs paper ~34 d.
Serve $/Mtok — 3–5× above Together AI public pricing.
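The KV cache figure above can be reproduced with a standard size estimate. The sketch below is one set of assumptions that lands on about 43 GB, not a statement of how the tool computes it: it assumes a Llama-3-70B-like shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128), 2-byte elements, and s = 8192 tokens. The type and function names are hypothetical.

```typescript
// Hypothetical shape description; all values below are assumptions for illustration.
interface KvCacheShape {
  layers: number;          // transformer layers
  kvHeads: number;         // key/value heads (fewer than query heads under GQA)
  headDim: number;         // dimension per head
  bytesPerElement: number; // 2 for fp16/bf16
}

// Bytes of KV cache for a batch of sequences at a given length.
function kvCacheBytes(shape: KvCacheShape, batch: number, seqLen: number): number {
  // Factor of 2: one tensor for keys and one for values in every layer.
  const bytesPerToken =
    2 * shape.layers * shape.kvHeads * shape.headDim * shape.bytesPerElement;
  return bytesPerToken * batch * seqLen;
}

const llama70bLike: KvCacheShape = {
  layers: 80,
  kvHeads: 8,
  headDim: 128,
  bytesPerElement: 2,
};

const kvBytes = kvCacheBytes(llama70bLike, 16, 8192);
const kvGb = kvBytes / 1e9; // decimal gigabytes

console.log(kvGb.toFixed(1)); // about 42.9 under these assumptions
```

A different precision or attention layout changes the result proportionally; for example, 64 KV heads instead of 8 would multiply it by eight.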

Known limitations

Ignores interconnect topology beyond NVLink-domain.
Ignores goodput / MTBF / hot-spare overhead.
Ignores storage I/O & checkpoint bandwidth.
Ignores speculative decoding & paged-attention reuse.
Single PUE constant — no cooling variance.

Sources

Williams et al., 2009
Hoffmann et al., 2022
Narayanan et al., 2021

v0.3 · data last verified 2026-05-17 · first-order model; under-claims accuracy by design.