methodology · last updated 2026-05-17

The equations behind Roofline.

A first-order analytical model of training and serving large transformers on modern accelerators. Useful as a mental model; not a substitute for measured runs. This appendix records the equations we use, the constants we assume, and the failure modes we know about.

1. The roofline model

The roofline model[1] plots achievable throughput against arithmetic intensity. It places a workload running on a single accelerator in one of two regimes: bound by peak compute, or bound by memory bandwidth. The crossover point between the two is a function of the hardware alone, not the workload.

Arithmetic intensity (AI) is FLOPs per byte moved between HBM and SRAM. The ridge point (RP) is the AI at which the accelerator transitions from memory-bound to compute-bound.
AI = FLOPs / bytes_moved
(1)
RP = peak_FLOPS / peak_HBM_bandwidth
(2)
throughput = min( peak_FLOPS, AI × peak_HBM_bandwidth ) × MFU
(3)

MFU (model FLOPs utilization) is the empirical derate that captures kernel-level inefficiency: dispatch overhead, non-tensor ops, mixed-precision conversions, partial waves. We default to 0.45 for training, 0.35 for decode, and 0.55 for prefill on H100-class hardware[2]. The slider exposes this for users matching a measured run.
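
As an illustration of how the per-phase defaults and a user override could fit together, here is a minimal TypeScript sketch. The names `Phase`, `DEFAULT_MFU`, and `resolveMfu` are hypothetical and are not taken from the codebase; only the three default values come from the text above. The (0, 1] range check is also an assumption, on the grounds that MFU is a fraction of peak.

```typescript
// Hypothetical names for illustration; only the numeric defaults come from the text.
type Phase = "training" | "decode" | "prefill";

// Default MFU per phase on H100-class hardware, as quoted above.
const DEFAULT_MFU: Record<Phase, number> = {
  training: 0.45,
  decode: 0.35,
  prefill: 0.55,
};

// Return the measured MFU when one is supplied, otherwise the phase default.
// Assumption: a valid MFU is a fraction of peak, so it must lie in (0, 1].
function resolveMfu(phase: Phase, measured?: number): number {
  if (measured === undefined) {
    return DEFAULT_MFU[phase];
  }
  if (!(measured > 0 && measured <= 1)) {
    throw new RangeError("measured MFU must be in (0, 1]");
  }
  return measured;
}

console.log(resolveMfu("decode"));         // phase default
console.log(resolveMfu("training", 0.52)); // value matched to a measured run
```

Keeping the defaults in one table makes it easy to see which value applies when no measured run is available.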

Implementation: calculateRidgePoint, calculateArithmeticIntensity, achievedThroughput in lib/sim/roofline.ts.
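
The sketch below shows one way equations (1) to (3) could be written in TypeScript. It is not the contents of lib/sim/roofline.ts: the function names match those listed above, but the parameter shapes, the `Accelerator` interface, and the input checks are assumptions. The hardware numbers are illustrative, roughly H100-SXM-like values (dense BF16 compute, HBM3 bandwidth), and are not the constants this model ships with.

```typescript
// Hypothetical hardware description; field names are assumptions for this sketch.
interface Accelerator {
  peakFlops: number;        // peak compute, FLOP/s
  peakHbmBandwidth: number; // peak HBM bandwidth, bytes/s
}

// Eq. (1): FLOPs per byte moved between HBM and SRAM.
function calculateArithmeticIntensity(flops: number, bytesMoved: number): number {
  if (bytesMoved <= 0) throw new RangeError("bytesMoved must be positive");
  return flops / bytesMoved;
}

// Eq. (2): the arithmetic intensity at which the bandwidth roof meets the compute roof.
function calculateRidgePoint(hw: Accelerator): number {
  return hw.peakFlops / hw.peakHbmBandwidth;
}

// Eq. (3): the lower of the two roofs, derated by MFU.
function achievedThroughput(hw: Accelerator, ai: number, mfu: number): number {
  if (mfu < 0 || mfu > 1) throw new RangeError("mfu must be in [0, 1]");
  return Math.min(hw.peakFlops, ai * hw.peakHbmBandwidth) * mfu;
}

// Illustrative numbers only (assumed, roughly H100-SXM-like).
const exampleHw: Accelerator = { peakFlops: 989e12, peakHbmBandwidth: 3.35e12 };

const ridge = calculateRidgePoint(exampleHw);              // about 295 FLOPs/byte
const decodeAi = calculateArithmeticIntensity(2e9, 1e9);   // 2 FLOPs/byte, well below the ridge
const decodeTput = achievedThroughput(exampleHw, decodeAi, 0.35);

console.log({ ridge, decodeAi, decodeTput });
```

With these assumed numbers, a kernel at 2 FLOPs/byte sits far below the ridge point, so the bandwidth term in equation (3) sets the roof and peak_FLOPS never enters the result.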

Headline assumptions: steady-state operation, uniform memory access, and overlap between compute and communication. The full list of assumptions is documented separately.

v0.3 · data last verified 2026-05-17 · first-order model; under-claims accuracy by design.