The Memory Wall Limits Everything

The dominant bottleneck in modern AI workloads is not computation but memory bandwidth: the speed at which data can be moved to and from the processors that need it.

"A huge chunk of the time in large model training/inference is not spent computing matrix multiplies, but rather waiting for data to get to the compute resources. The obvious question is why don't architects put more memory closer to the compute. The answer is $$$." — Dylan Patel

Even in 2018, purely compute-bound operations made up 99.8% of a model's FLOPs but only 61% of its runtime. Normalization and pointwise operations perform 250x and 700x fewer FLOPs than matrix multiplications, yet they consumed nearly 40% of the model's runtime. The reason is memory bandwidth: every operation must read its inputs from DRAM, compute, and write the results back. When an operation does very little math per byte moved, you spend all your time shipping data around.
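This ratio of math to data movement is usually called arithmetic intensity (FLOPs per byte moved). A minimal sketch, assuming FP16 elements and idealized data movement (each operand read or written exactly once), shows why matrix multiplication and pointwise ops sit at opposite ends of the spectrum:

```python
def arithmetic_intensity_matmul(n, bytes_per_elem=2):
    """FLOPs per byte for C = A @ B with n x n FP16 matrices."""
    flops = 2 * n**3                       # n^2 outputs, 2n FLOPs each
    bytes_moved = 3 * n**2 * bytes_per_elem  # read A, read B, write C
    return flops / bytes_moved

def arithmetic_intensity_elementwise(n, bytes_per_elem=2):
    """FLOPs per byte for c = a + b over n FP16 elements."""
    flops = n                              # one add per element
    bytes_moved = 3 * n * bytes_per_elem   # read a, read b, write c
    return flops / bytes_moved

# A 4096x4096 matmul does ~1365 FLOPs per byte moved; an elementwise
# add does ~0.17, so the add is hopelessly memory-bound on any GPU.
print(arithmetic_intensity_matmul(4096))
print(arithmetic_intensity_elementwise(10**6))
```

The intensity of the matmul grows with n (the data moved is O(n^2) while the math is O(n^3)), whereas a pointwise op's intensity is a constant, which is why no amount of scale rescues it from the memory wall.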

The economics of memory create a brutal hierarchy. SRAM on chip is fast but costs hundreds of dollars per gigabyte. HBM provides massive bandwidth through 3D-stacked DRAM but runs $10-20 per GB including packaging. Standard DRAM is cheap at a few dollars per GB but far too slow. From NVIDIA's P100 to the H100, compute (FP16 FLOPS) increased 46x, while memory capacity grew only 5x. This widening gap means that even with a $25,000+ GPU, you routinely achieve only 60% FLOPS utilization; the rest of the time, the processor sits idle, waiting for data.
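The consequence of this widening gap can be sketched with a simple roofline model: attainable throughput is the minimum of peak compute and bandwidth times arithmetic intensity. The hardware figures below are rough, illustrative H100-class numbers chosen for the sketch, not exact specs:

```python
def attainable_tflops(ai_flops_per_byte, peak_tflops, mem_bw_tb_s):
    """Roofline model: performance is capped by compute or by bandwidth,
    whichever bound is hit first at this arithmetic intensity."""
    return min(peak_tflops, mem_bw_tb_s * ai_flops_per_byte)

# Illustrative H100-class figures (assumptions for the sketch):
PEAK_TFLOPS = 989.0  # dense FP16 tensor-core throughput
MEM_BW = 3.35        # HBM bandwidth in TB/s

# A pointwise FP16 op moves ~6 bytes per FLOP (read x, read y, write z),
# i.e. arithmetic intensity of ~0.17 FLOPs/byte.
pointwise = attainable_tflops(1 / 6, PEAK_TFLOPS, MEM_BW)

# A large matmul has intensity in the hundreds-to-thousands of FLOPs/byte.
matmul = attainable_tflops(1365.0, PEAK_TFLOPS, MEM_BW)

print(pointwise)  # well under 1 TFLOPS: <0.1% of peak
print(matmul)     # hits the compute roof
```

Under these assumptions the pointwise op is limited to under one-thousandth of the chip's peak, which is exactly the "idle processor waiting for data" failure mode described above.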

The primary weapon against the memory wall is operator fusion: instead of writing intermediate results back to DRAM between each operation, you chain multiple operations together in a single pass. This is why Flash Attention, Triton kernels, and PyTorch 2.0's compiler exist: they are all fundamentally about reducing memory round-trips. Understanding whether you are compute-bound, memory-bound, or overhead-bound is the single most important diagnostic in ML systems engineering.
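The saving from fusion can be counted directly as DRAM traffic. A hedged sketch for y = relu(x + b) over n FP16 elements, assuming no cache reuse between separate kernels and treating b as a full-size tensor for simplicity:

```python
BYTES_PER_ELEM = 2  # FP16

def unfused_traffic(n):
    """DRAM bytes moved when add and relu run as two separate kernels."""
    # Kernel 1: t = x + b   -> read x, read b, write t  (3n elements)
    # Kernel 2: y = relu(t) -> read t, write y          (2n elements)
    return 5 * n * BYTES_PER_ELEM

def fused_traffic(n):
    """DRAM bytes moved when one kernel keeps t in registers."""
    # Read x, read b, apply add + relu on-chip, write y  (3n elements)
    return 3 * n * BYTES_PER_ELEM

n = 10**6
print(fused_traffic(n) / unfused_traffic(n))  # 0.6: fusion cuts traffic 40%
```

The longer the chain of fused pointwise ops, the bigger the win, since every elided intermediate removes both a write and a read; this traffic accounting is the core idea behind Flash Attention and compiler-generated fused kernels.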

Takeaway: Compute is cheap and getting cheaper; moving data is expensive and getting relatively more expensive. The winning architectures are the ones that minimize data movement.


See also: CUDA Is a Moat Not Just a Library | Dennard Scaling Ended and Everything Changed | Goodput Matters More Than Throughput