STAGE 1 / 7

the corpus

petabyte · text · web

Every model starts with text: roughly 15 trillion tokens, scraped, deduplicated, and filtered, or about 60 TB of compressed text. A single human reading 24/7 at 200 words per minute would need ~140,000 years to finish, counting each token as one word.
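
The reading-time figure can be checked with a few lines of arithmetic. This is a rough sketch: the one-word-per-token case is the simplest reading of the claim, and the 0.75 words-per-token ratio is a common rule of thumb for English text, not a number from this page.

```python
TOKENS = 15e12                      # corpus size quoted above
WORDS_PER_MINUTE = 200              # reading speed quoted above
MINUTES_PER_YEAR = 60 * 24 * 365    # reading 24/7, no breaks


def reading_years(tokens, words_per_token=1.0, wpm=WORDS_PER_MINUTE):
    """Years needed to read `tokens` tokens nonstop at `wpm` words per minute."""
    return tokens * words_per_token / wpm / MINUTES_PER_YEAR


# Simplest assumption: every token is one word.
print(round(reading_years(TOKENS)))
# Assumed rule of thumb: about 0.75 English words per token.
print(round(reading_years(TOKENS, words_per_token=0.75)))
```

Either way the answer lands above 100,000 years, so the order of magnitude holds under both assumptions.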

raw internet HTML → deduplicated, quality-filtered text

Most of the compute cost of data prep is filtering, not scraping. Boilerplate, spam, near-duplicates, and low-quality content get stripped before a single GPU sees a token.
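
To make the filtering step concrete, here is a minimal sketch of exact deduplication by content hash. It only catches documents that are identical after whitespace and case normalization; production pipelines typically add fuzzier near-duplicate detection (MinHash is a common choice) and quality classifiers, none of which are shown. The function names and the sample documents are illustrative, not taken from any real pipeline.

```python
import hashlib


def normalize(doc: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(doc.lower().split())


def dedup(docs):
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


docs = ["Hello  world", "hello world", "Something else"]
print(dedup(docs))
```

Storing only a fixed-size digest per document keeps the memory cost of the `seen` set independent of document length, which is one reason hashing is a natural first pass.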

~15T tokens

post-dedup

~60 TB compressed

the readable internet

~85% English

code · web · papers

STAGE 2 / 7

into tokens

kilobyte · integers

Text is useless to a GPU until it becomes numbers. Every character sequence gets mapped to integers via byte-pair encoding, a lossless compression scheme that learns which character clusters co-occur most often. Common words become one token; rare strings get chopped into several.
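
The merge loop at the heart of byte-pair encoding fits in a few dozen lines. The sketch below is a toy version that works on characters rather than UTF-8 bytes and trains on a four-word list; the function names are made up for illustration and this is not the tokenizer of any particular model.

```python
from collections import Counter


def most_frequent_pair(seqs):
    """Return the most common adjacent symbol pair, or None if there is none."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges; return the merges and the tokenized words."""
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        merges.append(pair)
        seqs = [merge_pair(s, pair) for s in seqs]
    return merges, seqs


words = ["low", "low", "lower", "lowest"]
merges, seqs = train_bpe(words, num_merges=2)
print(merges)
print(seqs)
```

After two merges the frequent word "low" is a single symbol while "lower" and "lowest" are still split into pieces, and joining the pieces of any word reproduces it exactly, which is the lossless property.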

UTF-8 bytes → int32 array of token IDs

The full corpus, post-tokenization, is roughly 15 trillion integers — about 60 TB stored as int32. This lives on NVMe SSDs across hundreds of storage nodes.
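
The storage figure follows directly from the token count and the integer width. The sketch below also shows one plausible reason for 32-bit IDs: a vocabulary of about 100K entries does not fit in 16 bits. That rationale is an inference, not something the page states.

```python
import struct

TOKENS = 15e12          # corpus size quoted above
VOCAB_SIZE = 100_000    # approximate vocabulary size quoted on this page

# 16-bit IDs allow only 65,536 distinct values, too few for a ~100K vocabulary.
fits_in_uint16 = VOCAB_SIZE <= 2 ** 16

# "<i" is a standard-size 32-bit signed integer, which occupies 4 bytes.
BYTES_PER_TOKEN = struct.calcsize("<i")

corpus_terabytes = TOKENS * BYTES_PER_TOKEN / 1e12
print(fits_in_uint16, BYTES_PER_TOKEN, corpus_terabytes)
```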

~100K vocab

BPE subword

12,288 embed dim

GPT-class

~4 bytes per token

stored as int32

STAGE 3 / 7

stacked into batches

megabyte · tensors

The 15T token stream gets shuffled and packed into fixed-shape rectangles. Each training step feeds the cluster a single batch, typically ~16.8 million tokens arranged as 2,048 sequences of 8,192 tokens each.
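
Packing a flat token stream into fixed-shape batches can be sketched as below. This is a simplified illustration: shuffling is left out, a trailing partial batch is simply dropped, and plain Python lists stand in for the GPU tensors a real loader would produce. The name `pack_batches` is hypothetical.

```python
def pack_batches(token_stream, batch_size, seq_len):
    """Yield batches shaped [batch_size][seq_len] from a flat stream of token IDs."""
    per_batch = batch_size * seq_len
    buffer = []
    for tok in token_stream:
        buffer.append(tok)
        if len(buffer) == per_batch:
            # Slice the flat buffer into batch_size rows of seq_len tokens each.
            yield [buffer[r * seq_len:(r + 1) * seq_len] for r in range(batch_size)]
            buffer = []
    # Any leftover tokens that do not fill a whole batch are dropped here.


batches = list(pack_batches(range(20), batch_size=2, seq_len=4))
print(batches)

# Tokens per step at the batch shape quoted above.
tokens_per_step = 2048 * 8192
print(tokens_per_step)
```

With 20 toy tokens and a 2 × 4 batch shape, two full batches come out and the last four tokens are discarded. At the real shape, each step consumes 16,777,216 tokens.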

int32 token stream on NVMe → [2048, 8192] tensor in GPU HBM

The pipeline runs asynchronously: while the GPUs are computing on batch K, the CPUs are already prefetching batch K+1 from disk. If the data loader stalls, thousands of GPUs sit idle burning electricity.
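
The overlap between loading and computing can be sketched with a background thread and a bounded queue. This is a minimal single-process illustration using only the standard library; `load_batch` is a hypothetical stand-in for the disk read, and real training loaders use multiple worker processes and pinned memory, which are not shown.

```python
import queue
import threading


def prefetching_loader(load_batch, num_batches, depth=2):
    """Yield batches in order while a background thread loads up to `depth` ahead."""
    q = queue.Queue(maxsize=depth)   # bounded, so the producer cannot run far ahead
    sentinel = object()              # marks the end of the stream

    def producer():
        for k in range(num_batches):
            q.put(load_batch(k))     # blocks when the queue is full
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()               # blocks only if the loader has fallen behind
        if item is sentinel:
            return
        yield item


def load_batch(k):
    """Hypothetical loader: pretend batch k is four copies of its index."""
    return [k] * 4


for batch in prefetching_loader(load_batch, num_batches=3):
    print(batch)                     # the training step would run here
```

The consumer blocks in `q.get()` only when the queue is empty, which corresponds to the stall described above: the accelerators wait whenever loading is slower than computing.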

B × S × D tensor shape

batch · sequence · dim

~100 MB activations

per layer

80–120 layers

transformer stack

STAGE 4 / 7

across a cluster

exabyte · cluster · fabric

One model is too big for one GPU. Training a frontier LLM means 25,000–100,000 H100s wired together with three networks: NVLink inside a node, InfiniBand between nodes, Ethernet for control. Weights, activations, and gradients shard across the whole mesh.

1 logical model (~500 GB weights) → sharded across 25–100k GPUs

The communication layer often dominates the compute layer. After every step, gradients from every GPU have to be summed across the entire cluster — an all-reduce that moves terabytes in milliseconds.
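
What an all-reduce computes can be shown with a small simulation. The sketch below uses the ring algorithm, one common way to implement the operation; the page does not say which algorithm a given cluster uses, so treat that choice as an assumption. Workers are plain Python lists and "sending" is a list copy.

```python
def ring_all_reduce(grads):
    """Simulate a ring all-reduce: every worker ends with the elementwise sum."""
    n = len(grads)
    length = len(grads[0])
    assert length % n == 0, "sketch assumes the vector splits evenly into n chunks"
    size = length // n
    data = [list(g) for g in grads]

    def chunk(c):
        return range(c * size, (c + 1) * size)

    # Phase 1, reduce-scatter: after n - 1 steps, worker i holds the full sum
    # of chunk (i + 1) % n.
    for step in range(n - 1):
        msgs = [((i - step) % n, i) for i in range(n)]
        payloads = [[data[i][j] for j in chunk(c)] for c, i in msgs]
        for (c, i), payload in zip(msgs, payloads):
            dst = (i + 1) % n
            for j, v in zip(chunk(c), payload):
                data[dst][j] += v

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        msgs = [((i + 1 - step) % n, i) for i in range(n)]
        payloads = [[data[i][j] for j in chunk(c)] for c, i in msgs]
        for (c, i), payload in zip(msgs, payloads):
            dst = (i + 1) % n
            for j, v in zip(chunk(c), payload):
                data[dst][j] = v

    return data


grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(ring_all_reduce(grads))
```

In each step every worker sends one chunk to its neighbor, so the per-worker traffic is about twice the vector size regardless of how many workers sit on the ring. That property is why the ring pattern is popular for large gradient buffers.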

100K GPUs

GPT-5 class fleet

~200 EFLOPS aggregate

FP8 dense peak

~100 ms all-reduce

per step · the bottleneck

STAGE 5 / 7

one GPU

gigabyte · hbm · silicon

Zoom in on a single GPU. 80 GB of HBM3 on the edges, feeding the compute die at 3.35 TB/s. The die: 132 streaming multiprocessors, each with 4 tensor cores. Data flows HBM → L2 → SM registers → tensor cores → back out, millions of times per second.

weight + activation tiles in HBM → matmuls on tensor cores

Training a frontier LLM is fundamentally a bandwidth problem, not a compute problem. Tensor cores can do ~1 PFLOP/s at 16-bit precision (roughly double that in FP8), but HBM only feeds them ~3 TB/s. Every architectural decision (flash attention, tensor parallelism, FP8) exists to get more math done per byte moved.
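
The "math per byte" framing can be made quantitative with a back-of-the-envelope roofline check. The sketch uses the ~1 PFLOP/s and 3.35 TB/s figures from this page; the byte count assumes each matrix is read or written exactly once at 2 bytes per element, which is an idealization, and the two example shapes are illustrative.

```python
PEAK_FLOPS = 1e15        # ~1 PFLOP/s, as quoted above
HBM_BANDWIDTH = 3.35e12  # bytes per second, as quoted above


def ridge_point(peak_flops, bandwidth):
    """FLOPs per byte needed before compute, not memory, becomes the limit."""
    return peak_flops / bandwidth


def matmul_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for [m, k] x [k, n], assuming each matrix moves once."""
    flops = 2 * m * n * k                                # one multiply + one add per FMA
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


ridge = ridge_point(PEAK_FLOPS, HBM_BANDWIDTH)
single_token = matmul_intensity(1, 12288, 12288)         # one token against a weight matrix
full_sequence = matmul_intensity(8192, 12288, 12288)     # 8,192 tokens against the same matrix

print(round(ridge), round(single_token, 2), round(full_sequence))
```

Under these assumptions a kernel needs roughly 300 FLOPs per byte to keep the tensor cores busy. Multiplying one token's activations by a weight matrix achieves about 1, so it is badly memory-bound, while a full 8,192-token sequence clears the bar by a wide margin. Large batches and data reuse are how the hardware gets fed.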

80 GB HBM

@ 3.35 TB/s

3.35 TB/s

HBM bandwidth

132 SMs

528 tensor cores

1,979 TFLOPS FP8

peak compute

STAGE 6 / 7

down to matmul

megaflop · matmul

Almost everything the GPU does during training is matrix multiplication — ~99% of the FLOPs. Activations × weights in the forward pass, gradients × activations in the backward pass. Everything else (softmax, layer norm) is rounding error.

A [M, K] × B [K, N] → C [M, N], via M×N×K fused multiply-adds
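
The M×N×K count is easy to confirm with a naive triple loop that tallies every fused multiply-add. This is a reference sketch in pure Python for checking the arithmetic, nothing like the tiled kernels a GPU actually runs.

```python
def matmul(a, b):
    """Multiply a [M, K] by b [K, N]; return the product and the FMA count."""
    m, k = len(a), len(a[0])
    k2, n = len(b), len(b[0])
    assert k == k2, "inner dimensions must match"
    c = [[0] * n for _ in range(m)]
    fmas = 0
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(k):
                acc = a[i][p] * b[p][j] + acc   # one fused multiply-add
                fmas += 1
            c[i][j] = acc
    return c, fmas


a = [[1, 2, 3], [4, 5, 6]]          # M = 2, K = 3
b = [[7, 8], [9, 10], [11, 12]]     # K = 3, N = 2
c, fmas = matmul(a, b)
print(c, fmas)
```

Here M = 2, N = 2, K = 3, so the loop performs 2 × 2 × 3 = 12 multiply-adds to fill a 2 × 2 result.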

Training is forward pass → loss → backward pass → weight update, repeated ~1 million times. The entire intelligence of an LLM reduces to a sequence of GEMM operations at astronomical scale.
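
The four-phase loop can be shown on the smallest possible model: a single weight fit by gradient descent. This is a toy sketch with a hand-derived gradient and made-up data; a real run applies the same loop to billions of weights, with the gradient computed by backpropagation through matmuls.

```python
def train(xs, ys, lr=0.01, steps=200):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    loss = None
    for _ in range(steps):
        preds = [w * x for x in xs]                                            # forward pass
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)          # loss
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)  # backward pass
        w -= lr * grad                                                         # weight update
    return w, loss


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]          # generated by the true weight w = 3
w, loss = train(xs, ys)
print(round(w, 6))

# Step count implied by the figures on this page: corpus tokens / tokens per step.
steps_per_epoch = 15e12 / (2048 * 8192)
print(round(steps_per_epoch))
```

The learned weight converges to 3. The second calculation gives roughly 894,000 steps for one pass over 15T tokens at about 16.8 million tokens per step, which is in line with the ~1 million figure above.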

4 × 4 × 4 mma

tensor core instruction

~1 GFLOP per token

per layer

528 tensor cores

per H100

STAGE 7 / 7

the wall, in silicon

picosecond · gate

At the bottom of the stack: a single fused multiply-add circuit. Three numbers arrive as voltages on a few dozen wires. Transistors built on a 4 nm-class process switch from 0 V to ~0.7 V in tens of picoseconds, implementing multiply-then-add. The chip holds 80 billion of these switches, coordinated on every clock tick.

3 numbers as voltage on ~24 wires → (a × b + c) as a new voltage pattern

Zoom all the way back out: the words of your question become thousands of tokens, then trillions of matrix elements, then quadrillions of transistor switches, and the result is one more token of response.
