The story of AI hardware in 2023–2024 was simple: Nvidia had a near-monopoly on AI training chips, demand was insatiable, and anyone who could get H100s was winning. By 2026, the story is more complex.
Training large models like those behind ChatGPT and Claude remains GPU-intensive. But the dominant economic challenge has shifted: inference, meaning running models in production for billions of users, is now the primary compute expense. And inference has a different bottleneck than training: memory.
📋 Key Takeaways
- Inference is now memory-bandwidth-bound, not compute-bound — loading model weights is the bottleneck
- SK Hynix, Samsung, and Micron manufacture virtually all HBM — a concentrated supply chain
- AMD MI300X (192GB HBM) is the main competitive challenge to Nvidia H100 (80GB) for inference
- Running a 70B parameter model at 16-bit precision requires ~140GB VRAM, beyond any single consumer GPU
- API costs for GPT-4 class models have fallen ~80% since 2023 through quantization and efficiency gains
Why AI Models Are Memory-Bound
When a large language model generates text, it performs two operations repeatedly:
- Load model weights — parameters defining the model’s behavior are moved from memory to compute
- Perform matrix multiplications — mathematical operations using those weights
For modern models, loading the weights is the bottleneck, not the matrix multiplication. GPUs are so fast at matrix multiplication that they spend significant time waiting for memory transfers. This is the “memory wall.”
A concrete example: A 70B parameter model has weights totaling ~140GB in 16-bit precision. At 3.35 TB/s of bandwidth (H100 with HBM3), streaming those weights once takes roughly 42 ms, and at batch size 1 that happens for every generated token. Compute sits largely idle during that time. (In practice, 140GB of weights would also have to be split across at least two 80GB cards.)
For inference, this means memory bandwidth often matters more than raw FLOPS.
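The arithmetic behind that example can be sketched in a few lines. This is a simplified lower bound, assuming every weight is read exactly once per generated token at batch size 1 and ignoring KV-cache traffic and multi-GPU communication; the function name is illustrative.

```python
def min_seconds_per_token(params_billion, bytes_per_param, bandwidth_tb_s):
    """Lower bound on decode time per token when all weights are read once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return weight_bytes / (bandwidth_tb_s * 1e12)

# 70B parameters, 2 bytes each (16-bit), 3.35 TB/s of memory bandwidth
t = min_seconds_per_token(70, 2, 3.35)
print(f"{t * 1e3:.1f} ms per token, at most {1 / t:.1f} tokens/s")
# prints: 41.8 ms per token, at most 23.9 tokens/s
```

Under these assumptions, no amount of extra FLOPS raises the single-stream ceiling; only more bandwidth or fewer bytes per parameter does.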
High Bandwidth Memory (HBM): The Key Technology
HBM stacks multiple memory dies vertically, connects them with through-silicon vias (TSVs), and places the stacks directly adjacent to the GPU die in the same package. This physical proximity allows a very wide memory interface and bandwidth that conventional off-package memory cannot match.
HBM generation roadmap:
| Generation | Bandwidth (per GPU) | Used In | Year |
|---|---|---|---|
| HBM2e | 2 TB/s | A100 | 2020 |
| HBM3 | 3.35 TB/s | H100 | 2022 |
| HBM3e | 4.8 TB/s | H200 | 2024 |
| HBM4 | 6–9 TB/s | Next-gen | 2027 |
HBM manufacturers — SK Hynix (dominant), Samsung, Micron — have become strategic suppliers as important as the GPU makers themselves. SK Hynix’s dominance in HBM supply gives them significant pricing power and makes HBM availability a meaningful constraint on AI chip production. See our AI Data Centers 2026 article for how this affects infrastructure buildout.
The Inference Chip Landscape
The recognition that inference is memory-bound has opened space for alternative approaches to AI hardware:
Groq (LPU): A “Language Processing Unit” architecture that prioritizes memory bandwidth over raw compute density by keeping weights in fast on-chip SRAM. Groq chips achieve very high inference throughput at low latency, generating tokens faster than GPU-based approaches for certain model sizes.
Cerebras (Wafer-Scale Engine): Keeps the model on a single wafer-sized chip, eliminating chip-to-chip communication overhead. The CS-3 can run large models faster than GPU clusters when the data never has to leave the chip.
AMD MI300X: Offers 192GB of HBM3 memory, significantly more than Nvidia H100’s 80GB, so larger models fit without being split across multiple chips. This makes the MI300X attractive for inference workloads and gives AMD its most competitive position against Nvidia in years.
Nvidia H200/B200: Nvidia’s response. The H200 moves from the H100’s HBM3 to HBM3e, raising capacity to 141GB and bandwidth to 4.8 TB/s. Blackwell B200 further improves both.
Custom silicon: Amazon (Trainium/Inferentia), Meta (MTIA), and Microsoft (Maia 100) are all building custom AI chips, most of them inference-focused, tailored to their specific model architectures.
The VRAM Problem for Developers
For developers running models locally or on cloud instances, GPU VRAM is the practical constraint. Loading a model requires VRAM sufficient to hold the weights plus working memory.
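Much of that working memory is the key-value cache, which grows with context length. A rough sketch of its size follows, using the usual per-token formula. The layer count, key-value head count, and head dimension are assumptions taken from Llama 3.1 8B's published configuration, not figures from this article, and the function name is illustrative.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, batch=1, bytes_per_value=2):
    """Bytes for cached keys and values across all layers.

    The leading factor of 2 accounts for storing both K and V.
    """
    return 2 * layers * kv_heads * head_dim * tokens * batch * bytes_per_value

GIB = 2**30

# Assumed config: 32 layers, 8 KV heads, head dimension 128, 16-bit values
for tokens in (8_192, 131_072):
    size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=tokens)
    print(f"{tokens:>7} tokens: {size / GIB:.1f} GiB")
```

With these assumed values the cache costs 128 KiB per token, so a single long-context request can need as much memory as the weights themselves.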
Model sizes and VRAM requirements (FP16):
| Model | Parameters | VRAM Required (weights) | GPU Options |
|---|---|---|---|
| Llama 3.2 3B | 3B | ~6GB | RTX 3060/4060 |
| Llama 3.1 8B | 8B | ~16GB | RTX 3090/4090 (24GB) |
| Llama 3.1 70B | 70B | ~140GB | 2x H100 / A100 80GB |
| Llama 3.1 405B | 405B | ~810GB | 11+ H100 80GB (6–8x only at 8-bit) |
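The weight figures in this table follow from parameter count times bytes per parameter. A minimal sketch of that estimate is below; it counts weights only, and the `headroom` fraction for working memory is a hypothetical knob rather than a measured value.

```python
import math

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(params_billion, precision="fp16"):
    """Weight footprint in GB (1e9 params x bytes per param / 1e9)."""
    return params_billion * BYTES_PER_PARAM[precision]

def gpus_needed(params_billion, gpu_gb, precision="fp16", headroom=0.0):
    """Smallest GPU count whose combined VRAM covers weights plus headroom."""
    need_gb = weight_gb(params_billion, precision) * (1 + headroom)
    return math.ceil(need_gb / gpu_gb)

for name, params in [("3B", 3), ("8B", 8), ("70B", 70), ("405B", 405)]:
    print(f"{name:>4}: {weight_gb(params):6.0f} GB at FP16, "
          f"{gpus_needed(params, 80)} x 80GB GPUs (weights only)")
```

For example, `gpus_needed(405, 80)` gives 11 at FP16, while `gpus_needed(405, 80, "int8")` gives 6, which is why 405B-class models are usually served on a single 8-GPU node only after quantization.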
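For developers on consumer hardware, the practical frontier is around 8–13B parameters on a single high-end GPU, with the upper end of that range relying on 8-bit or 4-bit quantization to fit in 24GB. Demand for large-VRAM cards, from the consumer Nvidia RTX 4090 (24GB) to the workstation-class RTX 6000 Ada (48GB), has expanded significantly.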
The Economic Consequences
Model size vs. deployment cost: A 700B parameter model may be more capable, but it costs roughly 10x as much to serve as a 70B model, while the capability difference is often less than 2x. This pushes production deployments toward smaller, more efficient models, a trend accelerated by DeepSeek’s efficiency innovations.
Quantization in production: 8-bit and 4-bit precision are now default for most production inference. The accuracy trade-off is acceptable, and memory and speed benefits are substantial.
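To make the trade-off concrete, here is a minimal sketch of symmetric per-tensor 8-bit quantization in plain Python. It illustrates the general idea only; production systems typically use per-channel or per-group scales and calibrated schemes, and the sample weights are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.0, 0.49]  # illustrative values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print(f"max round-trip error: {max_err:.5f} (bound: {scale / 2:.5f})")
```

Each weight now occupies one byte instead of two, halving both the footprint and the bytes streamed per token, at the cost of a rounding error of at most half the scale.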
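Batch processing vs. latency: Each pass over the weights costs the same memory bandwidth whether it produces one token for one request or one token for each of many requests at once. Batching therefore spreads that cost across users, but larger batches also make each request wait longer. Production inference systems constantly balance throughput (tokens/second across all users) against latency (time to first token and time between tokens for each user).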
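A simplified roofline-style model shows the shape of this trade-off. It assumes the weights are streamed once per decode step and shared by the whole batch, roughly 2 FLOPs per parameter per token, and an illustrative round-number compute peak; it ignores KV-cache traffic and scheduling overhead, so treat the outputs as directional only.

```python
def decode_step(batch, weight_bytes, bandwidth, flops_per_token, peak_flops):
    """Return (seconds per step, tokens per second) for one decode step."""
    load_s = weight_bytes / bandwidth                 # shared by the batch
    compute_s = batch * flops_per_token / peak_flops  # grows with batch size
    step_s = max(load_s, compute_s)                   # slower side dominates
    return step_s, batch / step_s

WEIGHT_BYTES = 140e9        # 70B parameters at 16-bit
BANDWIDTH = 3.35e12         # bytes/s, the H100 figure used in this article
FLOPS_PER_TOKEN = 2 * 70e9  # assumption: ~2 FLOPs per parameter per token
PEAK_FLOPS = 1e15           # assumption: illustrative, not a datasheet value

for batch in (1, 64, 512):
    step_s, tok_s = decode_step(batch, WEIGHT_BYTES, BANDWIDTH,
                                FLOPS_PER_TOKEN, PEAK_FLOPS)
    print(f"batch {batch:>3}: {step_s * 1e3:6.1f} ms/step, {tok_s:8.0f} tokens/s")
```

In this model, throughput scales almost for free until compute time overtakes weight-loading time; past that point, every extra request in the batch lengthens the step and raises per-user latency.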
What’s Next: HBM4 and Beyond
HBM4 (expected 2027) targets 6–9 TB/s, roughly 1.25x to 1.9x the 4.8 TB/s of HBM3e on the H200. This will materially improve inference throughput for bandwidth-limited workloads.
Research directions to reduce memory movement itself:
Processing-in-memory (PIM): Computing within the memory chip, rather than moving data to separate compute. Still largely in research phase but could fundamentally change economics.
Sparse attention: Computing only the most relevant attention interactions, reducing memory bandwidth requirements for long-context workloads.
Structured State Space Models (SSMs): Architectures like Mamba process sequences with a fixed-size state rather than an attention cache that grows with context length, dramatically reducing memory for long sequences.
Mixture-of-Experts (MoE): Activates only the relevant experts for each token, which reduces the weight bytes read per token. All experts must still be resident in memory, so MoE lowers bandwidth demand rather than total footprint. DeepSeek’s innovations demonstrated this at scale.
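The MoE effect on memory traffic can be sketched with simple arithmetic. The example numbers below (671B total parameters, 37B active per token) are the publicly reported figures for DeepSeek-V3 and are an assumption here, not taken from this article; the function name is illustrative.

```python
def moe_profile(total_params_b, active_params_b, bytes_per_param=2.0):
    """Compare resident weight memory with weight bytes read per token."""
    return {
        "resident_gb": total_params_b * bytes_per_param,
        "read_per_token_gb": active_params_b * bytes_per_param,
        "read_fraction": active_params_b / total_params_b,
    }

# Assumed figures: 671B total parameters, 37B active per token, 16-bit weights
profile = moe_profile(671, 37)
print(f"resident: {profile['resident_gb']:.0f} GB")
print(f"read per token: {profile['read_per_token_gb']:.0f} GB "
      f"({profile['read_fraction']:.1%} of the weights)")
```

Under these assumptions the model still needs well over a terabyte of memory to hold its weights, but each token touches only a small fraction of them, which is where the bandwidth saving comes from.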
The memory wall won’t be solved by a single breakthrough. It will be addressed through better hardware (HBM4), better algorithms (quantization, sparse methods), and better system design (batching, caching, hardware-software co-design) — making AI capabilities cheaper and more accessible as models grow more powerful.
Also see: AI Data Centers 2026 · DeepSeek’s Efficiency Revolution · AI Market Statistics 2026