Interactive zoom through the scales of LLM training, from the internet-scale data corpus down to individual transistors switching in silicon
the corpus
Every model starts with text. Roughly 15 trillion tokens, scraped, deduped, and filtered: about 60 TB of compressed text. A single human reading 24/7 at 200 words per minute would need ~140,000 years to finish, counting each token as roughly one word.
Most of the compute cost of data prep is filtering, not scraping. Boilerplate, spam, near-duplicates, and low-quality content get stripped before a single GPU sees a token.
into tokens
Text is useless to a GPU. Every character sequence gets mapped to integers via byte-pair encoding — a lossless compression that learns which character clusters co-occur most often. Common words become one token; rare strings get chopped.
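A toy version of the merge loop makes the idea concrete. This is a minimal sketch, not a production tokenizer: it starts from raw UTF-8 bytes, skips the word-level pre-splitting that real tokenizers usually apply, and the function names (`train_bpe`, `merge_pair`, `decode`) are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair of ids, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to num_merges merges; ids 0-255 are raw bytes, 256+ are merges."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

def decode(ids, merges):
    """Invert the encoding; lossless because every merge is remembered."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dict order is training order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")

text = "the cat and the hat and the bat"
ids, merges = train_bpe(text, num_merges=10)
print(len(text.encode("utf-8")), "bytes ->", len(ids), "tokens")
```

Frequent clusters such as "the " collapse into single ids after a few merges, while rare strings stay split into smaller pieces.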
The full corpus, post-tokenization, is roughly 15 trillion integers — about 60 TB stored as int32. This lives on NVMe SSDs across hundreds of storage nodes.
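The corpus figures quoted so far are easy to check by hand. The sketch below redoes the arithmetic; treating one token as one word for the reading-time estimate is a simplifying assumption, and the function names are illustrative.

```python
# Back-of-envelope check of the corpus figures quoted in the text.
TOKENS = 15e12            # ~15 trillion tokens
BYTES_PER_INT32 = 4
WORDS_PER_MINUTE = 200

def storage_terabytes(tokens, bytes_per_token=BYTES_PER_INT32):
    """Size of the token stream in decimal terabytes."""
    return tokens * bytes_per_token / 1e12

def reading_years(tokens, wpm=WORDS_PER_MINUTE, words_per_token=1.0):
    """Years of nonstop reading; words_per_token=1.0 is an assumption."""
    minutes = tokens * words_per_token / wpm
    return minutes / (60 * 24 * 365)

print(f"{storage_terabytes(TOKENS):.0f} TB as int32")
print(f"{reading_years(TOKENS):,.0f} years of nonstop reading")
```

With a more typical ratio of about 0.75 words per token, the reading estimate drops to a little over 100,000 years. Same order of magnitude either way.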
stacked into batches
The 15T token stream gets shuffled and packed into fixed-shape rectangles. Each training step feeds the cluster a single batch — typically ~16 million tokens arranged as ~2,000 sequences of 8,192 tokens each.
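Packing can be sketched in a few lines. This is one simple scheme under stated assumptions: documents are shuffled whole, joined with a separator token, and cut into fixed-length sequences. The `EOS` id and the `pack_batches` name are hypothetical, and real pipelines shuffle at far larger scale.

```python
import random

EOS = 0  # hypothetical end-of-document token id, chosen for this sketch

def pack_batches(docs, seq_len, seqs_per_batch, seed=0):
    """Shuffle documents, join them into one stream, cut into fixed-shape batches."""
    docs = [list(d) for d in docs]
    random.Random(seed).shuffle(docs)
    batch, seq = [], []
    for doc in docs:
        for tok in doc + [EOS]:
            seq.append(tok)
            if len(seq) == seq_len:
                batch.append(seq)
                seq = []
                if len(batch) == seqs_per_batch:
                    yield batch
                    batch = []
    # Leftover tokens that do not fill a whole batch are dropped.

# Tokens per batch at the sizes quoted in the text.
print(2_000 * 8_192)  # 16384000
```

Every batch has the same rectangle shape, so the cluster never has to handle a ragged input.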
The pipeline runs async — while GPU N is computing on batch K, the CPU is already prefetching batch K+1 from disk. If the data loader stalls, thousands of GPUs sit idle burning electricity.
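The overlap between loading and computing can be sketched with a background thread and a bounded queue. This is a minimal illustration, not how any particular framework implements its loader; `prefetching_loader` and `load_batch` are hypothetical names.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream

def prefetching_loader(load_batch, num_batches, depth=2):
    """Yield load_batch(0), load_batch(1), ... while a thread loads ahead."""
    q = queue.Queue(maxsize=depth)  # at most `depth` batches wait in memory

    def worker():
        try:
            for k in range(num_batches):
                q.put(load_batch(k))  # blocks while the queue is full
        finally:
            # Always signal the end, even if load_batch raised.
            # The error itself is not forwarded to the consumer in this sketch.
            q.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()  # blocks only if the loader has fallen behind
        if item is _DONE:
            return
        yield item

# While the consumer works on batch K, the worker is already loading K+1.
for batch in prefetching_loader(lambda k: [k] * 4, num_batches=3):
    print(batch)
```

The consumer only waits when `q.get()` finds the queue empty. That is exactly the stall the text warns about.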
across a cluster
One model is too big for one GPU. Training a frontier LLM means 25,000–100,000 H100s wired together with three networks: NVLink inside a node, InfiniBand between nodes, Ethernet for control. Weights, activations, and gradients shard across the whole mesh.
The communication layer often dominates the compute layer. After every step, gradients from every GPU have to be summed across the entire cluster — an all-reduce that moves terabytes in milliseconds.
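One common way to do this sum is a ring all-reduce, simulated below in a single process. Whether a given cluster uses a ring, a tree, or something else is not stated in the text. The sketch assumes the gradient length divides evenly by the number of workers, and `ring_all_reduce` is an illustrative name.

```python
def ring_all_reduce(grads):
    """Sum equal-length gradient lists across workers over a simulated ring."""
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "sketch assumes the length divides evenly"
    chunk = size // n
    bufs = [list(g) for g in grads]  # one private buffer per worker

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: each worker passes one chunk to its right-hand
    # neighbour, which adds it in. After n-1 steps, worker r holds the full
    # sum for chunk (r + 1) % n.
    for step in range(n - 1):
        msgs = [(r, (r - step) % n) for r in range(n)]
        payloads = [bufs[r][span(c)] for r, c in msgs]  # snapshot before writes
        for (src, c), data in zip(msgs, payloads):
            dst = (src + 1) % n
            bufs[dst][span(c)] = [a + b for a, b in zip(bufs[dst][span(c)], data)]

    # Phase 2, all-gather: finished chunks travel once more around the ring,
    # overwriting stale partial sums.
    for step in range(n - 1):
        msgs = [(r, (r + 1 - step) % n) for r in range(n)]
        payloads = [bufs[r][span(c)] for r, c in msgs]
        for (src, c), data in zip(msgs, payloads):
            bufs[(src + 1) % n][span(c)] = data
    return bufs

grads = [[1, 2, 3, 4, 5, 6],
         [10, 20, 30, 40, 50, 60],
         [100, 200, 300, 400, 500, 600]]
for buf in ring_all_reduce(grads):
    print(buf)
```

Each worker sends and receives only one chunk per step, so the traffic per link stays roughly constant as the ring grows.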
one GPU
Zoom in on a single GPU. 80 GB of HBM3 on the edges, feeding the compute die at 3.35 TB/s. The die: 132 streaming multiprocessors, each with 4 tensor cores. Data flows HBM → L2 → SM registers → tensor cores → back out, millions of times per second.
Training a frontier LLM is fundamentally a bandwidth problem, not a compute problem. Tensor cores can do ~1 PFLOP/s but HBM only feeds them ~3 TB/s. Every architectural decision — flash attention, tensor parallelism, FP8 — exists to get more math done per byte moved.
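The ratio behind this claim can be worked out directly. The sketch below uses the ~1 PFLOP/s and 3.35 TB/s figures from the text. It assumes 2-byte elements and an idealized matmul that touches each matrix exactly once in HBM. The function names are illustrative.

```python
PEAK_FLOPS = 1e15          # ~1 PFLOP/s tensor-core peak, from the text
HBM_BYTES_PER_S = 3.35e12  # 3.35 TB/s, from the text

def ridge_point(peak_flops=PEAK_FLOPS, bandwidth=HBM_BYTES_PER_S):
    """FLOPs needed per byte moved before compute becomes the limit."""
    return peak_flops / bandwidth

def matmul_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul, each matrix touched once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def attainable_flops(intensity, peak_flops=PEAK_FLOPS, bandwidth=HBM_BYTES_PER_S):
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, intensity * bandwidth)

print(f"ridge point: {ridge_point():.0f} FLOPs per byte")
for shape in [(1, 8192, 8192), (8192, 8192, 8192)]:
    ai = matmul_intensity(*shape)
    share = attainable_flops(ai) / PEAK_FLOPS
    print(shape, f"{ai:.1f} FLOPs/byte, {share:.1%} of peak")
```

A large square matmul clears the ridge point comfortably. A single-row matmul does about one FLOP per byte and is capped at well under 1% of peak. That gap is what the "more math per byte" techniques try to close.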
down to matmul
Almost everything the GPU does during training is matrix multiplication — ~99% of the FLOPs. Activations × weights in the forward pass, gradients × activations in the backward pass. Everything else (softmax, layer norm) is rounding error.
Training is forward pass → loss → backward pass → weight update, repeated ~1 million times. The entire intelligence of an LLM reduces to a sequence of GEMM operations at astronomical scale.
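The loop is small enough to write out in full for a toy case. The sketch below fits a single weight matrix with plain gradient descent, and both the forward and backward pass are one matmul each. It is a stand-in for the shape of the loop, not for how an LLM is trained: the model, the squared-error loss, and the `train` function are all assumptions made for illustration.

```python
def matmul(a, b):
    """Plain-Python GEMM: (n x k) @ (k x m) -> (n x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def train(x, y, steps=200, lr=0.1):
    """Fit w so that x @ w approximates y; returns w and the loss history."""
    n, d_in, d_out = len(x), len(x[0]), len(y[0])
    w = [[0.0] * d_out for _ in range(d_in)]
    losses = []
    for _ in range(steps):
        pred = matmul(x, w)                                  # forward pass
        err = [[p - t for p, t in zip(pr, tr)] for pr, tr in zip(pred, y)]
        losses.append(sum(e * e for row in err for e in row) / n)  # loss
        grad = matmul(transpose(x), err)                     # backward pass
        w = [[wi - lr * 2 * g / n for wi, g in zip(wr, gr)]  # weight update
             for wr, gr in zip(w, grad)]
    return w, losses

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [[2.0], [-1.0], [1.0]]  # generated by the weights [[2], [-1]]
w, losses = train(x, y)
print(w, losses[0], losses[-1])
```

The step count in the text follows from the earlier figures: 15 trillion tokens at about 16.4 million tokens per batch is roughly 915,000 steps.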
the wall, in silicon
At the bottom of the stack: a single fused-multiply-add circuit. Two numbers arrive as voltages on thousands of wires. Transistors built on a 4 nm-class process switch from 0 V to ~0.7 V in tens of picoseconds, implementing multiply-then-add. The die holds 80 billion of these switches, all coordinated by a shared clock.
Zoom all the way back out: 10,000 words of your question become thousands of tokens, which become trillions of matrix elements, which become quadrillions of transistor switches, and the result is one more token of response.
3% of peak FP8
1,979 TFLOPS available · the other 97% is tensor cores waiting for memory