1. The roofline model
The roofline model[1] plots achievable throughput against arithmetic intensity (FLOPs performed per byte moved to or from memory). It separates workloads into the two regimes a single accelerator can be in: bound by peak compute, or bound by memory bandwidth. The crossover between the two, the ridge point, is peak compute divided by memory bandwidth, so it is a function of the hardware alone, not the workload.
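The two regimes can be made concrete with a minimal sketch. The interface, function names, and hardware numbers below are illustrative assumptions; they are not the signatures used in lib/sim/roofline.ts, and the numbers are round values chosen for easy arithmetic, not a vendor specification.

```typescript
// Hypothetical hardware description; field names are illustrative only.
interface Accelerator {
  peakFlops: number;    // peak compute, FLOP/s
  memBandwidth: number; // memory bandwidth, bytes/s
}

type Regime = "memory-bound" | "compute-bound";

// Ridge point: the arithmetic intensity (FLOP/byte) at which the
// bandwidth roof meets the compute roof. Depends on hardware only.
function ridgePoint(hw: Accelerator): number {
  return hw.peakFlops / hw.memBandwidth;
}

// Arithmetic intensity of a workload: FLOPs per byte of memory traffic.
function arithmeticIntensity(flops: number, bytesMoved: number): number {
  return flops / bytesMoved;
}

// Roofline ceiling: the lower of the compute roof and the bandwidth roof.
function rooflineCeiling(hw: Accelerator, intensity: number): number {
  return Math.min(hw.peakFlops, intensity * hw.memBandwidth);
}

// Which roof limits a workload of the given intensity.
function regimeOf(hw: Accelerator, intensity: number): Regime {
  return intensity < ridgePoint(hw) ? "memory-bound" : "compute-bound";
}

// Illustrative round numbers (assumed, not a real part).
const exampleHw: Accelerator = { peakFlops: 1e15, memBandwidth: 2e12 };

// A workload doing 4e12 FLOPs over 4e10 bytes has intensity 100 FLOP/byte,
// which is below the ridge point of 500 FLOP/byte, so it is memory-bound.
const exampleIntensity = arithmeticIntensity(4e12, 4e10);
const exampleCeiling = rooflineCeiling(exampleHw, exampleIntensity);
const exampleRegime = regimeOf(exampleHw, exampleIntensity);
```

With these assumed numbers the ceiling for the example workload is set by the bandwidth roof (intensity times bandwidth); any workload at or above 500 FLOP/byte would instead hit the compute roof.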
MFU (model FLOPs utilization) is the empirical derate that captures kernel-level inefficiency: dispatch overhead, non-tensor ops, mixed-precision conversions, partial waves. We default to 0.45 for training, 0.35 for decode, and 0.55 for prefill on H100-class hardware[2]. The slider exposes this value so that users can set it to match a measured run.
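As a rough sketch of how such a derate might be applied, the snippet below multiplies a throughput ceiling by the per-phase default, with an optional override standing in for the slider. The names are hypothetical, and applying MFU as a simple multiplier on the ceiling is an assumption; the text above does not specify where in the calculation the derate enters.

```typescript
type Phase = "training" | "decode" | "prefill";

// Defaults quoted in the text for H100-class hardware.
const DEFAULT_MFU: Record<Phase, number> = {
  training: 0.45,
  decode: 0.35,
  prefill: 0.55,
};

// Assumption: the derate is a plain multiplier on a throughput ceiling
// (FLOP/s). `mfuOverride` plays the role of the slider value.
function deratedThroughput(
  ceilingFlops: number,
  phase: Phase,
  mfuOverride?: number,
): number {
  const mfu = mfuOverride ?? DEFAULT_MFU[phase];
  if (!(mfu > 0 && mfu <= 1)) {
    throw new RangeError(`MFU must be in (0, 1], got ${mfu}`);
  }
  return ceilingFlops * mfu;
}

// Example with an assumed 1e15 FLOP/s ceiling.
const trainingDefault = deratedThroughput(1e15, "training");
const decodeMeasured = deratedThroughput(1e15, "decode", 0.5);
```

Rejecting values outside (0, 1] keeps an override from silently producing a throughput above the ceiling it is meant to derate.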
Implementation: calculateRidgePoint, calculateArithmeticIntensity, achievedThroughput in lib/sim/roofline.ts.
Headline assumptions: steady-state operation, uniform memory access, and overlap between compute and communication. See the for the full list.