daita@system:~$ cat ./inference_hardware_built_for_the_wrong_workload.md

Inference Hardware Is Built for the Wrong Workload

Created: 2026-04-26 | Size: 10893 bytes

TL;DR

In a January 2026 paper, Xiaoyu Ma and David Patterson (Google DeepMind) argue that today's accelerators (full-reticle dies stuffed with FLOPS, ringed by HBM, and wired with bandwidth-optimized interconnects) are a mismatch for LLM inference. Training is compute-bound and tolerates latency. Decode is memory-bound, latency-critical, and dominated by tiny messages. The authors lay out four research directions to fix it: High Bandwidth Flash (HBF), Processing-Near-Memory (PNM), 3D memory-logic stacking, and low-latency interconnects, all measured against TCO, power, and CO₂e rather than peak FLOPS. Inference chip sales are projected to grow 4X-6X annually for the next 5-8 years, so the bill for getting this wrong is enormous.

What this is not

This is not a generic "AI needs more compute" piece. The paper's argument is the opposite: LLM inference is starving for memory bandwidth and capacity, not FLOPS. It is also not about training. Training and inference share silicon today, but they are different workloads with different bottlenecks, and treating them as the same problem is exactly what got us here.

Decode and prefill are different machines

LLM inference has two phases that share a chip but stress it differently.

Phase   | Pattern                    | Bottleneck       | Looks like
Prefill | Parallel over the prompt   | Compute (FLOPS)  | Training
Decode  | Sequential, autoregressive | Memory bandwidth | Streaming reads

The KV cache connects them and grows with input plus output length. Reasoning models with long thought-token sequences, RAG, multimodal inputs, and Mixture of Experts (MoE) variants with up to 256 experts (DeepSeek-V3) all push the KV cache and weight memory harder. Of the six trends the paper highlights (MoE, reasoning, multimodal generation, long context, RAG, diffusion), only diffusion increases compute without increasing memory or bandwidth pressure. The other five all bend in the same direction.
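A back-of-the-envelope sizing sketch makes the squeeze concrete. The model dimensions below are assumed, roughly Llama-70B-shaped (80 layers, 8 KV heads, head dimension 128, 16-bit cache), and are illustrative rather than figures from the paper:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bytes_per_elem=2):
    # Two tensors (K and V) per layer per token.
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem

# Roughly Llama-70B-shaped dimensions (assumed for illustration).
per_token = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, context_tokens=1)
print(f"KV cache per token: {per_token / 1024:.0f} KiB")   # ~320 KiB

for ctx in (8_192, 128_000, 1_000_000):
    gib = kv_cache_bytes(80, 8, 128, ctx) / 2**30
    print(f"{ctx:>9,} tokens -> {gib:7.1f} GiB of KV cache per sequence")
```

At these assumed dimensions a single million-token sequence needs roughly 300 GiB of KV cache, several HBM stacks' worth before a single weight is loaded.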

The memory wall has a price tag

Patterson's group quantifies the imbalance. From 2012 to 2022, NVIDIA GPU 64-bit FLOPS grew 80X. Memory bandwidth grew 17X. The gap between what the chip can compute and what it can feed itself is now structural, not a transient.
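A quick arithmetic check on those two growth figures, a sketch in which the FLOP-per-byte estimate for decode assumes 16-bit weights at batch size 1:

```python
# Machine balance drift, using the paper's 2012-2022 growth figures.
flops_growth = 80       # peak FLOPS grew 80X
bandwidth_growth = 17   # memory bandwidth grew 17X

# Machine balance = peak FLOPS / memory bandwidth: how many FLOPs the chip
# must perform per byte it reads just to stay compute-bound.
print(f"Required arithmetic intensity grew ~{flops_growth / bandwidth_growth:.1f}X")

# Decode at batch size 1 does ~2 FLOPs per parameter per token and reads
# ~2 bytes per 16-bit parameter, i.e. roughly 1 FLOP per byte. That number
# does not grow with the hardware, so the widening gap is all stall time.
```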

It is also getting more expensive on the side that matters. From 2023 to 2025, normalized HBM cost per GB and per GB/s both rose to about 1.35X. Over the same period, DDR4 cost per GB fell to 0.54X and per GB/s to 0.45X. HBM is the part of the budget you cannot cut, and it is the part that is climbing.

DRAM density scaling tells the same story. Fourfold capacity gains used to take 3-6 years. They now take more than a decade. Memory is decelerating exactly when the workload that needs memory is accelerating.

Four directions worth funding

The paper proposes four research bets. None of them are "more FLOPS." All of them target memory or interconnect.

1. High Bandwidth Flash for 10X capacity

HBF stacks flash dies in HBM-style packages to combine HBM-class bandwidth with flash-class capacity. One stack delivers 512 GB at 1638 GB/s read bandwidth under 80 W, above 20.5 GB/s per watt, versus 48 GB per HBM4 stack. Flash density doubles every three years; DRAM does not.

The catch is write endurance and microsecond-level page-read latency, so HBF is a fit for frozen weights and slow-changing context, not the active KV cache. That is exactly the right shape for giant MoE weight tables and for 10X larger context corpora (think web indexes, code databases, RAG stores).
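One way to picture that split is a placement rule keyed on write rate and latency tolerance. A minimal sketch; the thresholds and example tensors are assumptions chosen to match the guidance above (the microsecond page-read constraint is from the paper, the specific numbers are not):

```python
def place(tensor):
    writes_often = tensor["writes_per_token"] > 0
    needs_sub_us_reads = tensor["latency_budget_us"] < 1   # HBF page reads take microseconds
    if writes_often or needs_sub_us_reads:
        return "HBM"   # hot, mutable, latency-critical
    return "HBF"       # frozen or slow-changing, capacity-hungry

tensors = [
    {"name": "MoE expert weights",    "writes_per_token": 0, "latency_budget_us": 50},
    {"name": "RAG corpus embeddings", "writes_per_token": 0, "latency_budget_us": 100},
    {"name": "active KV cache",       "writes_per_token": 1, "latency_budget_us": 0.1},
]
for t in tensors:
    print(f"{t['name']:>22} -> {place(t)}")
```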

2. Processing-Near-Memory beats Processing-In-Memory

The paper carefully separates PIM (logic on the memory die) from PNM (logic on a separate die packaged near memory). For datacenter LLMs, PNM wins on six of eight criteria the authors compare: better logic performance per watt and per area, no impact on memory density, commodity memory pricing, looser thermal constraints, and easier software sharding (16-32 GB shards versus PIM's 32-64 MB banks). PIM still has a niche on mobile devices where the sharding problem is small.

The point is structural: do not co-locate logic on a die that has to optimize for memory density and yield. Put logic on a die that is good at being logic, and bring it close enough to share a substrate.
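The sharding gap is easy to feel with one division. A sketch assuming a 1.3 TB serving footprint (weights plus long-context data, an illustrative figure) split at the paper's PNM and PIM granularities:

```python
TOTAL_BYTES = 1.3e12   # assumed serving footprint: weights plus long-context data

for label, shard_bytes in [("PNM shard, 16 GB", 16e9),
                           ("PNM shard, 32 GB", 32e9),
                           ("PIM bank,  32 MB", 32e6),
                           ("PIM bank,  64 MB", 64e6)]:
    pieces = TOTAL_BYTES / shard_bytes
    print(f"{label}: ~{pieces:,.0f} pieces to place, route, and load-balance")
```

Tens of shards is a scheduling problem; tens of thousands of banks is a compiler research program.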

3. 3D stacking, with thermal asterisks

3D memory-logic stacking uses through-silicon vias for wide, dense interfaces at low power. Two flavors: compute-on-HBM-base-die (reuses HBM designs, 2-3X lower power) and custom 3D (more bandwidth via denser interfaces). The blocker is thermal. Stacked chips have less surface area for cooling, and the industry needs a standard interface for memory-logic coupling so not every accelerator vendor has to invent their own.

4. Low-latency interconnects

Training uses big collectives over fat pipes; bandwidth dominates. Decode uses frequent small messages, so latency dominates. The paper argues for high-connectivity topologies (tree, dragonfly, high-dimensional tori); processing-in-network primitives for broadcast, all-reduce, and MoE dispatch; on-chip SRAM landing zones for arriving packets; and reliability codesigned with the interconnect (substituting fake data or prior results when straggler messages time out, rather than blocking the critical path).
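The asymmetry falls out of the simplest transfer-time model, t = latency + size / bandwidth. A sketch with assumed link numbers (2 µs end-to-end latency, 400 GB/s per link) and illustrative message sizes:

```python
LATENCY_US = 2.0      # assumed end-to-end link latency, microseconds
BANDWIDTH = 400e9     # assumed link bandwidth, bytes per second

def transfer_us(size_bytes):
    # t = latency + size / bandwidth
    return LATENCY_US + size_bytes / BANDWIDTH * 1e6

for name, size in [("decode activation slice (8 KB)", 8e3),
                   ("MoE dispatch message (256 KB)", 256e3),
                   ("training all-reduce chunk (256 MB)", 256e6)]:
    t = transfer_us(size)
    print(f"{name:>34}: {t:8.2f} us, {100 * LATENCY_US / t:5.1f}% pure latency")
```

At these assumed numbers the decode-sized messages spend well over 90% of their time in latency, which no amount of extra bandwidth buys back.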

This is the most underappreciated of the four. Interconnect latency is now an inference cost driver, and almost no one budgets for it that way.

Memory hierarchy, reborn

Stack the four directions together and a familiar shape emerges: the memory hierarchy, redrawn for tokens.

The KV cache is the anonymous hot working set. Weights are a read-only mmap. RAG corpora are page-cached cold pages. The four research directions are not four separate problems; they are the silicon, packaging, and network needed to make each tier bandwidth-feasible at LLM scale. If you have ever written a backend that paged data between Redis, Postgres, and S3, you have already shipped this hierarchy, one substrate up. The hard parts are the thermal envelope inside a stack and the latency tax between tiers.

New metrics, not new FLOPS charts

The authors push hard on this point: TCO, power, and CO₂e per useful inference token should replace peak FLOPS as the dashboard. Inference is a unit-economics problem before it is a hero-number problem. They also call for a roofline-based inference simulator that tracks memory capacity and sharding, the kind of academic tool that lets researchers test these ideas without owning a fab. If you remember when FlashAttention-4 hit 1600 TFLOPS on Blackwell and the bottleneck shifted to non-matmul units, this is the same story one layer down: the dashboard everyone watches is no longer the dashboard that decides cost.
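In that spirit, a toy roofline-style estimator of cost per decoded token: time per token is the larger of compute time and weight-streaming time, converted to dollars at an assumed hourly rate. The chip and price numbers (1 PFLOP/s peak, 3.35 TB/s memory bandwidth, $4/hour) are illustrative assumptions, not measurements from the paper:

```python
def time_per_token_s(params, bytes_per_param, peak_flops, mem_bw):
    compute_s = 2 * params / peak_flops            # ~2 FLOPs per parameter per token
    memory_s = params * bytes_per_param / mem_bw   # stream every weight once per token
    return max(compute_s, memory_s)                # roofline: the slower side wins

# Hypothetical accelerator and model: 70B params at 16-bit, 1 PFLOP/s, 3.35 TB/s.
t = time_per_token_s(params=70e9, bytes_per_param=2, peak_flops=1e15, mem_bw=3.35e12)
tokens_per_s = 1 / t
cost_per_hour = 4.0   # assumed all-in $/hour, TCO-style
print(f"{tokens_per_s:,.0f} tokens/s per replica at batch size 1")
print(f"${cost_per_hour / (tokens_per_s * 3600):.7f} per generated token")
```

Under these assumptions the memory term is hundreds of times larger than the compute term, which is exactly why a FLOPS chart tells you nothing about the bill.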

The cheapest fix may be to stop decoding

The paper buries a footnote in plain sight. Of the six trends straining inference hardware, only diffusion increases compute without increasing memory pressure. If diffusion-style or other non-autoregressive LLMs (LLaDA, Mercury, generation by iterative denoising) reach quality parity with autoregressive models, the entire decode-as-streaming-reads picture goes away. Decode would start to look like prefill again: parallel, compute-bound, well-served by training silicon.
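The memory-traffic math behind that claim, sketched for an assumed 70B-parameter model at 16-bit, a 1024-token output, and 32 denoising passes (all illustrative numbers):

```python
weights_bytes = 140e9   # 70B params at 2 bytes each (assumed)
seq_len = 1024          # tokens generated
denoise_steps = 32      # diffusion-style refinement passes (assumed)

# Autoregressive decode at batch size 1: every token re-streams every weight.
ar_bytes_per_token = weights_bytes

# Iterative denoising: each pass covers the whole sequence in parallel,
# so weight traffic amortizes across all tokens in it.
diff_bytes_per_token = weights_bytes * denoise_steps / seq_len

print(f"AR:        {ar_bytes_per_token / 1e9:8.1f} GB of weights per token")
print(f"Diffusion: {diff_bytes_per_token / 1e9:8.1f} GB of weights per token")
```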

That is the fork in the road. Path A: keep autoregressive decode and rebuild the memory hierarchy in silicon, the four directions above. Path B: bet that algorithm research kills the autoregressive bottleneck, at which point current GPUs become correct again. Path A is a 5-8 year capex pivot with concrete deliverables. Path B is a research outcome with no fixed timeline and a higher payoff if it lands. Most hyperscalers will hedge both, and that hedge itself shapes which research gets funded. The cheapest possible fix to inference hardware is not new hardware.

Buyers, not vendors, decide

Patterson is telling vendors what to build. The buyers are already deploying H100 and B200 fleets at trillion-dollar scale, on multi-year contracts and power budgets measured in nuclear plants. NVIDIA still sells out every Blackwell die it ships. Why would any vendor pivot a roadmap toward HBF, PNM, and dragonfly fabrics when training-shaped silicon clears the floor at 75% gross margin?

Three forces could move the needle. Custom silicon (Google TPU, AWS Trainium/Inferentia, Meta MTIA) was already an inflection; an inference-shaped successor from any of them legitimizes the workload split. A regulatory shock on power or water draw forces the TCO math to bite. An upstart (Tenstorrent, Groq, Cerebras) ships memory-shaped silicon at price-performance that is impossible to ignore.

Inference chip sales are projected at 4X-6X annual growth for the next 5-8 years. The operating bill compounds against whichever metric the silicon was optimized for. If hyperscalers keep buying training-shaped chips for a decode-shaped workload, the four research directions stay academic and the inference market eats its own margin. The fix is not a smarter scheduler on top of GPUs designed for training. It is silicon shaped like the workload, and the cheques to demand it.


References

  1. Challenges and Research Directions for Large Language Model Inference Hardware, Ma and Patterson, January 2026 (original source)
  2. AI Inference Chip Market Insights, Verified Market Reports, 2025 (growth projection cited by the authors)
  3. Managed-Retention Memory: A New Class of Memory for the AI Era, Legtchenko et al., 2025 (related Microsoft proposal)
  4. Advancing Performance with NVIDIA SHARP In-Network Computing, Schultz, 2024 (commercial processing-in-network reference)
  5. FlashAttention-4: Attention at Matmul Speed (related post, asymmetric hardware scaling at the kernel level)

daita@system:~$ _