Interactive zoom through the scales of LLM training, from the internet-scale data corpus down to individual transistors switching in silicon

STAGE 1 / 7

the corpus

petabyte · text · web

Every model starts with text. Roughly 15 trillion tokens scraped, deduped, filtered — about 60 TB of compressed text. A single human reading 24/7 at 200 words per minute would need ~140,000 years to finish.
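The reading-time figure can be checked with a few lines of arithmetic. This is a rough sketch: the words-per-token ratios are assumptions for illustration, not values from the text, and the result shifts noticeably depending on which one is used.

```python
# Back-of-envelope check of the reading-time claim.
TOKENS = 15e12                     # corpus size from the text
WORDS_PER_MINUTE = 200             # reading speed from the text
MINUTES_PER_YEAR = 60 * 24 * 365   # reading 24/7


def reading_years(tokens, words_per_token):
    """Years of nonstop reading needed to get through `tokens` tokens."""
    words = tokens * words_per_token
    return words / WORDS_PER_MINUTE / MINUTES_PER_YEAR


years_one_word = reading_years(TOKENS, 1.0)    # assumption: 1 token = 1 word
years_subword = reading_years(TOKENS, 0.75)    # assumption: ~0.75 words per token
print(f"{years_one_word:,.0f} years at 1.00 words/token")
print(f"{years_subword:,.0f} years at 0.75 words/token")
```

Treating each token as one word gives a figure close to the ~140,000 years quoted above; a subword ratio below 1 gives a smaller number.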

raw internet HTML → deduplicated, quality-filtered text

Most of the compute cost of data prep is filtering, not scraping. Boilerplate, spam, near-duplicates, and low-quality content get stripped before a single GPU sees a token.
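A minimal sketch of the filtering idea, assuming a toy pipeline with three stages: a crude length-based quality filter, exact deduplication by hash, and near-duplicate removal by word-shingle overlap. The function names, thresholds, and sample documents are hypothetical. Comparing every pair of documents, as done here, would not scale to a real corpus; large pipelines typically rely on approximate methods such as MinHash.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text, n=3):
    """Set of word n-grams, used to compare documents for near-duplication."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a, b):
    """Overlap of two shingle sets: 1.0 means identical, 0.0 means disjoint."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def filter_corpus(docs, min_words=5, near_dup_threshold=0.8):
    seen_hashes = set()
    kept, kept_shingles = [], []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:        # crude quality filter
            continue
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:                # exact duplicate
            continue
        sh = shingles(doc)
        if any(jaccard(sh, other) >= near_dup_threshold for other in kept_shingles):
            continue                             # near duplicate
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(sh)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog near the river bank today.",
    "the quick  brown fox jumps over the lazy dog near the river bank today.",
    "The quick brown fox jumps over the lazy dog near the river bank today!",
    "Click here",
    "Gradient descent updates weights in the direction that reduces the loss.",
]
kept = filter_corpus(docs)
print(len(kept), "of", len(docs), "documents kept")
```

In this toy run the second document is an exact duplicate after normalization, the third is a near-duplicate, and the fourth is too short, so only two documents survive.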

Figure: Relative sizes of text corpora on a log scale, from a single book to the full 15 trillion token training set. Nested squares show that Wikipedia fits inside a library, which is a speck inside the full corpus.
Full training corpus: ~15T tokens · ~60 TB compressed
Curated web: ~9T tokens · 60%
Code: ~2T tokens · 15%
Books: ~1.5T · 10%
Papers: ~1T · 7%
Other: ~1.5T · 8%
Scale marker = all of Wikipedia (~6B tokens)
~15T tokens · post-dedup
~60 TB compressed · the readable internet
~85% English · code · web · papers
STAGE 2 / 7

into tokens

kilobyte · integers

Text is useless to a GPU. Every character sequence gets mapped to integers via byte-pair encoding — a lossless compression that learns which character clusters co-occur most often. Common words become one token; rare strings get chopped.
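To make the merge idea concrete, here is a toy byte-pair-encoding sketch. It is an illustration under simplifying assumptions: it works on characters rather than UTF-8 bytes, trains on a tiny hypothetical word list, and returns string pieces instead of integer IDs. The helper names are made up for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    corpus = Counter(tuple(w) for w in words)   # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent adjacent pair wins; ties are broken deterministically.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges


def encode(word, merges):
    """Split a word into subword pieces by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["the"] * 10 + ["then"] * 3 + ["zebra"], num_merges=2)
print(encode("the", merges))     # common word: few pieces
print(encode("zebra", merges))   # rare word: many pieces
```

Joining the pieces always reproduces the original word, which is the sense in which the encoding is lossless.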

UTF-8 bytes → int32 array of token IDs

The full corpus, post-tokenization, is roughly 15 trillion integers — about 60 TB stored as int32. This lives on NVMe SSDs across hundreds of storage nodes.
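The storage figure follows directly from the token count and the integer width. A small check, assuming a vocabulary of about 100K entries, also shows why a 16-bit integer would be too narrow for the token IDs.

```python
TOKENS = 15e12          # corpus size from the text
VOCAB_SIZE = 100_000    # assumption: roughly 100K vocabulary entries

bits_needed = (VOCAB_SIZE - 1).bit_length()   # bits to represent the largest token ID
fits_in_uint16 = VOCAB_SIZE <= 2 ** 16        # 65,536 distinct values
terabytes_int32 = TOKENS * 4 / 1e12           # 4 bytes per int32 token

print(f"{bits_needed} bits per ID, fits in uint16: {fits_in_uint16}")
print(f"{terabytes_int32:.0f} TB as int32")
```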

Figure: A sentence being transformed into tokens, then integer IDs, then binary. Three-layer diagram showing text, token splits, and integer encodings.
1. Raw text (UTF-8 bytes): The quick brown fox jumps over the lazy dog.
2. BPE tokenization (vocab = 128k): The | quick | brown | fox | jumps | over | the | lazy | dog | .
3. Integer IDs (int32 array · 4 bytes each): [791, 4062, 14198, 39935, 35308, 927, 279, 16053, 5679, 13]
~100K vocab · BPE subword
12,288 embed dim · GPT-class
~4 bytes per token · stored as int32
STAGE 3 / 7

stacked into batches

megabyte · tensors

The 15T token stream gets shuffled and packed into fixed-shape rectangles. Each training step feeds the cluster a single batch — typically ~16 million tokens arranged as ~2,000 sequences of 8,192 tokens each.
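The packing step can be sketched as a generator that cuts a flat token stream into fixed-shape batches. This is a simplified illustration using nested Python lists rather than GPU tensors; it skips shuffling, and dropping leftover tokens at the end is an assumption of this sketch.

```python
def pack_batches(token_stream, batch_size, seq_len):
    """Yield [batch_size][seq_len] nested lists from a flat iterable of token IDs.

    Leftover tokens that do not fill a whole batch are dropped.
    """
    tokens_per_batch = batch_size * seq_len
    buffer = []
    for tok in token_stream:
        buffer.append(tok)
        if len(buffer) == tokens_per_batch:
            yield [buffer[r * seq_len:(r + 1) * seq_len] for r in range(batch_size)]
            buffer = []


# Tiny demo: 26 tokens packed into batches of 2 rows x 4 tokens.
batches = list(pack_batches(range(26), batch_size=2, seq_len=4))
print(len(batches), "batches; first:", batches[0])

# Size of one real batch using the figures from the text.
tokens_per_step = 2048 * 8192
print(f"{tokens_per_step:,} tokens per training step")
```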

int32 token stream on NVMe → [2048, 8192] tensor in GPU HBM

The pipeline runs async — while GPU N is computing on batch K, the CPU is already prefetching batch K+1 from disk. If the data loader stalls, thousands of GPUs sit idle burning electricity.
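The overlap between loading and computing can be sketched with a background thread and a bounded queue. This is a minimal illustration, not how any particular framework implements its data loader: `load_batch` is a hypothetical callable standing in for the disk read, and error handling in the producer is omitted.

```python
import queue
import threading


def prefetching_loader(load_batch, num_batches, depth=2):
    """Yield batches while a background thread loads up to `depth` batches ahead."""
    q = queue.Queue(maxsize=depth)
    sentinel = object()

    def producer():
        for k in range(num_batches):
            q.put(load_batch(k))    # blocks when the queue is already full
        q.put(sentinel)             # signals that no more batches are coming

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()              # blocks only if the loader has fallen behind
        if item is sentinel:
            return
        yield item


# Demo with a stand-in loader that fabricates batch k as four copies of k.
loaded = list(prefetching_loader(lambda k: [k] * 4, num_batches=5))
print(loaded)
```

The consumer only waits when the queue is empty, which corresponds to the stall described above.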

Figure: Linear token stream being chunked into a 2D tensor of batch by sequence length. A flowing stream of tokens gets packed into rows of a rectangular grid, which is the batch tensor.
Linear token stream (15 trillion, shuffled): 791 4062 14198 39935 35308 927 279 16053 5679 13 …
pack into [batch, seq_len]
Batch tensor (one training step): batch = 2,048 rows · seq_len = 8,192 tokens per row
B × S × D tensor shape · batch · sequence · dim
~100 MB activations · per layer
80–120 layers · transformer stack
STAGE 4 / 7

across a cluster

exabyte · cluster · fabric

One model is too big for one GPU. Training a frontier LLM means 25,000–100,000 H100s wired together with three networks: NVLink inside a node, InfiniBand between nodes, Ethernet for control. Weights, activations, and gradients shard across the whole mesh.

1 logical model (~500 GB weights) → sharded across 25–100k GPUs

The communication layer often dominates the compute layer. After every step, gradients from every GPU have to be summed across the entire cluster — an all-reduce that moves terabytes in milliseconds.
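One classic way to perform this sum is a ring all-reduce, simulated below in plain Python. It is a sketch of the algorithm's data movement only: the workers are lists in one process, there is no real network, and production collective libraries choose among several algorithms rather than always using a ring.

```python
def ring_all_reduce(worker_grads):
    """Sum equal-length gradient vectors across workers with a simulated ring.

    worker_grads[i] is worker i's local gradient; its length must be divisible
    by the number of workers. Returns one result per worker (all identical).
    """
    n = len(worker_grads)
    size = len(worker_grads[0])
    assert size % n == 0, "vector length must be divisible by worker count"
    c = size // n
    # Split each worker's vector into n chunks.
    data = [[list(g[j * c:(j + 1) * c]) for j in range(n)] for g in worker_grads]

    # Phase 1, reduce-scatter: after n-1 steps, worker i holds the full sum
    # of chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][(i - step) % n][:]) for i in range(n)]
        for src, idx, payload in sends:
            dst = (src + 1) % n
            data[dst][idx] = [a + b for a, b in zip(data[dst][idx], payload)]

    # Phase 2, all-gather: pass the finished chunks around the ring so every
    # worker ends up with all of them.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n][:]) for i in range(n)]
        for src, idx, payload in sends:
            data[(src + 1) % n][idx] = payload

    return [[x for chunk in worker for x in chunk] for worker in data]


grads = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
reduced = ring_all_reduce(grads)
print(reduced[0])
```

Each worker sends one chunk per step over 2 × (n − 1) steps, so the data sent per worker stays close to twice the vector size no matter how many workers join the ring.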

Figure: Hierarchical diagram of an LLM training cluster, from cluster down to node down to single GPU. Three nested levels show pods containing nodes, and nodes containing eight GPUs connected by NVLink.
Cluster: ~25,000 GPUs · InfiniBand fabric · 400 Gb/s per link
Pod: 512 GPUs · 64 nodes × 8 GPUs
Node: 8 GPUs (GPU 0 through GPU 7) · NVLink 4 · 900 GB/s · all-to-all NVLink mesh (via NVSwitch)
1× H100: 80 GB · ~1 PFLOPS · 80B transistors · 4nm · 814 mm² · 700W TDP · 1.83 GHz boost
100K GPUs · GPT-5 class fleet
~200 EFLOPS aggregate · FP8 dense peak (100K × 1,979 TFLOPS)
~100 ms all-reduce · per step · the bottleneck
STAGE 5 / 7

one GPU

gigabyte · hbm · silicon

Zoom in on a single GPU. 80 GB of HBM3 on the edges, feeding the compute die at 3.35 TB/s. The die: 132 streaming multiprocessors, each with 4 tensor cores. Data flows HBM → L2 → SM registers → tensor cores → back out, millions of times per second.

weight + activation tiles in HBM → matmuls on tensor cores

Training a frontier LLM is fundamentally a bandwidth problem, not a compute problem. Tensor cores can do ~1 PFLOP/s but HBM only feeds them ~3 TB/s. Every architectural decision — flash attention, tensor parallelism, FP8 — exists to get more math done per byte moved.
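The "math per byte" framing is usually expressed as a roofline model, sketched below with the two figures quoted above. The 2 FLOPs-per-byte example is a hypothetical kernel chosen for illustration (each 1-byte value read once and used in a single multiply-add), not a measurement.

```python
PEAK_FLOPS = 1e15           # ~1 PFLOP/s tensor-core throughput, from the text
HBM_BYTES_PER_S = 3.35e12   # 3.35 TB/s HBM bandwidth, from the text


def attainable_flops(flops_per_byte):
    """Roofline model: throughput is capped by compute or by memory traffic,
    whichever limit is lower for the given arithmetic intensity."""
    return min(PEAK_FLOPS, HBM_BYTES_PER_S * flops_per_byte)


# Arithmetic intensity needed before the tensor cores stop waiting on memory.
ridge = PEAK_FLOPS / HBM_BYTES_PER_S
print(f"ridge point: ~{ridge:.0f} FLOPs per byte moved")

# Hypothetical low-intensity kernel: 2 FLOPs per byte read from HBM.
low = attainable_flops(2)
print(f"at 2 FLOPs/byte: {low / 1e12:.1f} TFLOP/s ({low / PEAK_FLOPS:.2%} of peak)")
```

Under these numbers a kernel needs roughly 300 FLOPs per byte to become compute-bound, which is why techniques that reuse data already on chip matter so much.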

Figure: Cross-section of an H100 GPU showing HBM memory stacks flanking the compute die with streaming multiprocessors. Schematic with memory on the sides and a grid of SM tiles in the center.
H100 SXM5 package
HBM3 stacks: 16 GB each · 16-high DRAM
Compute die: 814 mm² · 80B transistors
L2 cache: 50 MB · shared across all SMs
132 SMs · 4 tensor cores each = 528 total
Highlighted: one SM executing a warp of 32 threads
80 GB HBM · @ 3.35 TB/s
3.35 TB/s · HBM bandwidth
132 SMs · 528 tensor cores
1,979 TFLOPS FP8 · peak compute
STAGE 6 / 7

down to matmul

megaflop · matmul

Almost everything the GPU does during training is matrix multiplication — ~99% of the FLOPs. Activations × weights in the forward pass, gradients × activations in the backward pass. Everything else (softmax, layer norm) is rounding error.

A [M, K] × B [K, N] → C [M, N], via M×N×K fused multiply-adds
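The M×N×K count can be seen directly in a plain triple loop. This is a reference sketch for counting operations, not how a GPU executes a matmul; Python performs the multiply and the add as separate operations, whereas a hardware FMA does both in one instruction.

```python
def matmul_count_fmas(A, B):
    """Multiply A [M, K] by B [K, N] with plain loops; return (C, number of multiply-adds)."""
    M, K, N = len(A), len(B), len(B[0])
    assert all(len(row) == K for row in A), "inner dimensions must match"
    C = [[0.0] * N for _ in range(M)]
    fmas = 0
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc = A[i][k] * B[k][j] + acc   # one multiply-add per (i, j, k)
                fmas += 1
            C[i][j] = acc                        # dot product of row i and column j
    return C, fmas


A = [[1, 2, 3], [4, 5, 6]]          # [2, 3]
B = [[7, 8], [9, 10], [11, 12]]     # [3, 2]
C, fmas = matmul_count_fmas(A, B)
print(C, fmas)

# The same count for a [4096, 16384] x [16384, 4096] multiply.
big_fmas = 4096 * 4096 * 16384
print(f"{big_fmas:,} multiply-adds")
```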

Training is forward pass → loss → backward pass → weight update, repeated ~1 million times. The entire intelligence of an LLM reduces to a sequence of GEMM operations at astronomical scale.
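The four-phase loop can be shown on the smallest possible model: one weight, fit by gradient descent. This is a toy sketch with hand-derived gradients and made-up data, meant only to show the shape of the loop; the step-count estimate at the end assumes a single pass over the corpus with the batch shape quoted earlier.

```python
def train(xs, ys, steps=200, lr=0.05):
    """Fit y = w * x by repeating forward, loss, backward, update."""
    w = 0.0
    loss = float("inf")
    for _ in range(steps):
        preds = [w * x for x in xs]                                      # forward pass
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)    # mean squared error
        grad = sum(2 * (p - y) * x
                   for p, y, x in zip(preds, ys, xs)) / len(xs)          # backward pass: dloss/dw
        w -= lr * grad                                                   # weight update
    return w, loss


w, loss = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(f"learned w = {w:.4f}, loss = {loss:.2e}")

# Rough step count: one pass over 15T tokens at 2,048 x 8,192 tokens per step.
total_steps = 15e12 / (2048 * 8192)
print(f"~{total_steps:,.0f} steps")
```

The estimate lands a little under one million steps, in line with the figure above.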

Figure: Two matrices multiplying to produce a third, with one output element highlighted to show how it comes from a row and column dot product.
A (activations) [4096 × 16384] × B (weights) [16384 × 4096] = C (output) [4096 × 4096]
Each output element = Σ(row · col) · 16,384 FMAs
Total FMAs per multiply: ~275 billion
On one H100 at ~1 PFLOPS: ~0.55 ms · tensor cores do 256+ FMAs/cycle/core
4 × 4 × 4 mma · tensor core instruction
~1 GFLOP per token · per layer
528 tensor cores · per H100
STAGE 7 / 7

the wall, in silicon

picosecond · gate

At the bottom of the stack: a single fused-multiply-add circuit. Three numbers arrive as voltages on about two dozen wires. Transistors built on a 4 nm-class process switch from 0V to ~0.7V in tens of picoseconds, implementing multiply-then-add. 80 billion of these switches coordinate per clock tick.

3 numbers as voltage on ~24 wires → (a × b + c) as a new voltage pattern
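To show how multiply-then-add can emerge from nothing but logic gates, here is a sketch that builds an unsigned integer multiply-add from a one-bit full adder. It is a deliberate simplification: real FP8 hardware also handles sign, exponent, and rounding, and uses much faster circuit designs than this ripple-carry, shift-and-add approach. All function names are hypothetical.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder built from XOR, AND, and OR gates."""
    partial = a ^ b
    total = partial ^ carry_in
    carry_out = (a & b) | (partial & carry_in)
    return total, carry_out


def to_bits(value, width):
    """Little-endian bit list: index 0 is the least significant bit."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_add(x_bits, y_bits):
    """Add two equal-width bit lists; overflow beyond the width is dropped."""
    out, carry = [], 0
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out


def fma_unsigned(a, b, c, in_width=8, acc_width=32):
    """Compute a * b + c for unsigned integers using only gate-level helpers."""
    acc = to_bits(c, acc_width)                  # start the accumulator at c
    a_bits = to_bits(a, in_width)
    for shift, b_bit in enumerate(to_bits(b, in_width)):
        # Partial product: a shifted left by `shift`, gated by one bit of b.
        partial = [0] * shift + [bit & b_bit for bit in a_bits]
        partial += [0] * (acc_width - len(partial))
        acc = ripple_add(acc, partial)
    return from_bits(acc)


# 0.73 x 0.91 + 0.12 with the inputs scaled to integers (x100, x100, x10000).
result = fma_unsigned(73, 91, 1200)
print(result / 10_000)
```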

Zoom all the way back out: 10,000 words of your question become thousands of tokens become trillions of matrix elements become quadrillions of transistor switches, and the result is one more token of response.

Figure: Diagram of a fused multiply-add circuit showing inputs flowing through multiplier and adder blocks built from transistors, with transistor switching indicators.
Inputs (FP8, 8 bits each): A = 0.73 · B = 0.91 · C = 0.12
Multiplier: A × B = 0.664 · ~200 transistors · ~20 ps latency
Adder: (A×B) + C = 0.784 · ~300 transistors · ~15 ps latency
Output (FP32 accum): out = 0.784 · next cycle: += to C
One FMA: ~500 transistors. H100 fires ~550 trillion FMAs/sec.
At any instant, billions of transistors are switching between 0 and 0.7V.
80B transistors · on one 814 mm² die
~1.8 GHz · boost clock
~3% active · tensor cores @ decode · the wall
THE WALL · WHAT YOU ACTUALLY GET

3% of peak FP8

1,979 TFLOPS available · the other 97% is tensor cores waiting for memory

now what