Quick Answer: The primary bottleneck for running AI models in production is no longer GPU compute but memory bandwidth. Modern GPUs spend much of each inference step waiting for model weights to arrive from memory. High Bandwidth Memory (HBM) and inference-oriented accelerators (Groq, Cerebras, AMD MI300X) are the technologies reshaping AI economics.

The story of AI hardware in 2023–2024 was simple: Nvidia had a near-monopoly on AI training chips, demand was insatiable, and anyone who could get H100s was winning. By 2026, the story is more complex.

Training the large models behind ChatGPT and Claude remains GPU-intensive. But the dominant economic challenge has shifted: inference, meaning running models in production for billions of users, is now the primary compute expense. And inference has a different bottleneck than training: memory.

📋 Key Takeaways

  • Inference is now memory-bandwidth-bound, not compute-bound — loading model weights is the bottleneck
  • SK Hynix, Samsung, and Micron manufacture virtually all HBM — a concentrated supply chain
  • AMD MI300X (192GB HBM) is the main competitive challenge to Nvidia H100 (80GB) for inference
  • Running a 70B parameter model in 16-bit precision requires ~140GB of VRAM for the weights alone, beyond any single consumer GPU
  • API costs for GPT-4 class models have fallen ~80% since 2023 through quantization and efficiency gains

Why AI Models Are Memory-Bound

When a large language model generates text, it performs two operations repeatedly:

  1. Load model weights — parameters defining the model’s behavior are moved from memory to compute
  2. Perform matrix multiplications — mathematical operations using those weights

For modern models, operation #1 is the bottleneck — not operation #2. GPUs are so fast at matrix multiplication that they spend significant time waiting for memory transfers. This is the “memory wall.”

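A concrete example: A 70B parameter model has weights totaling ~140GB in 16-bit precision. At 3.35 TB/s (the H100's HBM3 bandwidth), streaming those weights once takes roughly 42 ms, and a single-stream decoder has to do that for every generated token. Compute sits largely idle during that time. (In practice 140GB exceeds one H100's 80GB, so the weights are split across GPUs, but every byte still has to be read for each token.)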

This means for inference: more memory bandwidth often matters more than more raw FLOPS.
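The arithmetic behind that example can be sketched in a few lines. This is a back-of-the-envelope model, not a benchmark: it assumes every weight is read exactly once per generated token, that the memory system sustains its peak bandwidth, and that nothing else (activations, caches, kernel overhead) costs time. The function names are made up for illustration.

```python
def weight_read_time_ms(params_billion: float, bytes_per_param: float,
                        bandwidth_tb_s: float) -> float:
    """Milliseconds needed to stream every weight from memory once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    bandwidth_bytes_per_s = bandwidth_tb_s * 1e12
    return weight_bytes / bandwidth_bytes_per_s * 1e3


def max_tokens_per_second(params_billion: float, bytes_per_param: float,
                          bandwidth_tb_s: float) -> float:
    """Upper bound on single-stream decode speed if weight reads dominate."""
    return 1e3 / weight_read_time_ms(params_billion, bytes_per_param, bandwidth_tb_s)


# 70B parameters, 2 bytes each (16-bit), 3.35 TB/s of memory bandwidth
step_ms = weight_read_time_ms(70, 2, 3.35)
ceiling = max_tokens_per_second(70, 2, 3.35)
print(f"{step_ms:.1f} ms per token, at most {ceiling:.1f} tokens/s")
```

Under these assumptions the ceiling is about 24 tokens per second for a single request, no matter how many FLOPS the chip has. Doubling bandwidth doubles the ceiling; doubling compute does not move it.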

High Bandwidth Memory (HBM): The Key Technology

HBM stacks multiple memory dies vertically, connects them with through-silicon vias (TSVs), and places the stacks in the same package as the GPU die. This physical proximity allows a far wider memory interface, and therefore far more bandwidth, than conventional off-package memory.

HBM generation roadmap:

Generation   Bandwidth    Used In    Year
HBM2e        2 TB/s       A100       2020
HBM3         3.35 TB/s    H100       2022
HBM3e        4.8 TB/s     H200       2024
HBM4         6–9 TB/s     Next-gen   2027

HBM manufacturers — SK Hynix (dominant), Samsung, Micron — have become strategic suppliers as important as the GPU makers themselves. SK Hynix’s dominance in HBM supply gives them significant pricing power and makes HBM availability a meaningful constraint on AI chip production. See our AI Data Centers 2026 article for how this affects infrastructure buildout.

The Inference Chip Landscape

The recognition that inference is memory-bound has opened space for alternative approaches to AI hardware:

Groq (LPU): A “Language Processing Unit” architecture that prioritizes memory bandwidth over raw compute density. Groq chips achieve very high inference throughput at low latency — generating tokens faster than GPU-based approaches for certain model sizes.

Cerebras (Wafer-Scale Engine): Keeps the model on a single wafer-sized chip, eliminating chip-to-chip communication overhead. The CS-3 system can serve some large models faster than GPU clusters because weights stay in on-chip memory.

AMD MI300X: 192GB of HBM3 memory — significantly more than Nvidia H100’s 80GB — allowing larger models without splitting across multiple chips. This makes the MI300X attractive for inference workloads and is AMD’s most competitive position against Nvidia in years.

Nvidia H200/B200: Nvidia’s response. The H200 moves to HBM3e, raising capacity to 141GB and bandwidth to 4.8 TB/s. The Blackwell B200 improves on both.

Custom silicon: Amazon (Trainium/Inferentia), Meta (MTIA), and Microsoft (Maia 100) are all building custom inference chips tailored to their specific model architectures.
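To see why memory capacity per chip matters, here is a rough sketch of how many accelerators are needed just to hold a model's weights. It uses the capacities quoted above (80GB, 141GB, 192GB). The `headroom_fraction` parameter is a hypothetical knob for the share of memory reserved for caches and activations; the 20% value is an illustrative assumption, not a measured figure.

```python
import math


def devices_needed(weight_gb: float, device_gb: float,
                   headroom_fraction: float = 0.0) -> int:
    """Smallest device count whose usable memory holds the weights."""
    usable_gb = device_gb * (1 - headroom_fraction)
    return math.ceil(weight_gb / usable_gb)


capacities_gb = {"H100": 80, "H200": 141, "MI300X": 192}
weights_gb = 140  # 70B parameters at 16-bit precision

for name, capacity in capacities_gb.items():
    bare = devices_needed(weights_gb, capacity)
    padded = devices_needed(weights_gb, capacity, headroom_fraction=0.2)
    print(f"{name}: {bare} device(s) for weights alone, {padded} with 20% headroom")
```

With no headroom, the 140GB model needs two H100s but squeezes onto one H200 or one MI300X. Once some memory is reserved for working state, only the 192GB part still fits the model on a single device in this sketch.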

  • 192GB: AMD MI300X HBM capacity
  • 141GB: Nvidia H200 HBM capacity
  • 4.8 TB/s: Nvidia H200 memory bandwidth
  • ~80%: API cost decline since 2023

The VRAM Problem for Developers

For developers running models locally or on cloud instances, GPU VRAM is the practical constraint. Loading a model requires VRAM sufficient to hold the weights plus working memory.
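For transformer models, much of that working memory is the key-value (KV) cache, which stores one key and one value vector per layer for every token in the context. The sketch below estimates its size. The layer count, KV-head count, and head dimension are assumptions chosen to resemble a 70B-class model with grouped-query attention; read the real values from the model's config before relying on the numbers.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for every token in the context."""
    # Leading factor of 2: each layer stores one key and one value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Assumed architecture: 80 layers, 8 KV heads, head dimension 128, 16-bit values
per_token = kv_cache_bytes(80, 8, 128, seq_len=1)
context_8k = kv_cache_bytes(80, 8, 128, seq_len=8192)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{context_8k / 1e9:.2f} GB for one 8,192-token context")
```

The cache grows linearly with both context length and the number of concurrent requests, so a server handling many long conversations can need more memory for cache than a naive weights-only estimate suggests.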

Model sizes and VRAM requirements (FP16):

Model            Parameters   VRAM Required (weights)   GPU Options
Llama 3.2 3B     3B           ~6GB                      RTX 3060 / 4060
Llama 3.1 8B     8B           ~16GB                     RTX 3090 (24GB); RTX 4080 (16GB, tight)
Llama 3.1 70B    70B          ~140GB                    2x H100 / A100 80GB
Llama 3.1 405B   405B         ~810GB                    11+ x H100 80GB (8x with 8-bit weights)
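The VRAM column follows from a simple rule: parameter count times bytes per parameter. A minimal sketch, which counts weights only and ignores caches, activations, and framework overhead:

```python
def weight_memory_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


for name, size_b in [("3B", 3), ("8B", 8), ("70B", 70), ("405B", 405)]:
    print(f"{name}: ~{weight_memory_gb(size_b):.0f}GB at 16-bit")
```

Treat the result as a floor. Actual usage is higher once the context cache and runtime buffers are included.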

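For consumer developers, the practical frontier is around 8–13B parameters on a single high-end consumer GPU, with the upper end of that range requiring quantization to fit in 24GB. The market for large-VRAM desktop GPUs, such as the consumer Nvidia RTX 4090 (24GB) and the workstation-class RTX 6000 Ada (48GB), has expanded significantly.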

💡 Quantization Tip: Reducing model precision from 16-bit to 8-bit halves the memory needed for weights, and 4-bit cuts it to roughly a quarter, with accuracy trade-offs that are acceptable for most tasks. In 2026, production inference of most models runs at 8-bit or 4-bit precision. For developers using local models like those compared in our ChatGPT alternatives guide, quantized versions are the practical default.
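As an illustration of what quantization does to the numbers themselves, here is a minimal sketch of symmetric per-tensor 8-bit quantization in plain Python. It is a teaching example under simple assumptions (one scale for the whole tensor, round to nearest); production schemes typically use per-channel or per-group scales and more careful calibration. The sample weights are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.0, 0.49]  # hypothetical weight values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("stored integers:", quantized)
print(f"largest rounding error: {max_error:.5f} (half a step is {scale / 2:.5f})")
```

Each weight now needs one byte instead of two, and the rounding error per weight is bounded by half the quantization step. Whether that error is acceptable depends on the model and task, which is why quantized models are usually validated against the full-precision baseline.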

The Economic Consequences

Model size vs. deployment cost: A dense 700B parameter model may be more capable but costs roughly 10x as much to serve as a 70B model, because it moves ten times as many weight bytes per token. The capability difference is often less than 2x. This pushes production deployments toward smaller, more efficient models, a trend accelerated by DeepSeek’s efficiency innovations.

Quantization in production: 8-bit and 4-bit precision are now default for most production inference. The accuracy trade-off is acceptable, and memory and speed benefits are substantial.

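Batch processing vs. latency: Each decoding step reads the model weights from memory once, whether it produces a token for one request or for many batched requests. Batching therefore spreads the bandwidth cost across users and raises throughput, but larger batches make each user wait longer. Production inference systems constantly balance throughput (tokens/second across all users) against latency (time to first token and time per token for each user).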
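The batching trade-off can be sketched with a simple two-limit model: a decoding step takes as long as the slower of reading the weights once and doing the arithmetic for every request in the batch. The assumptions are loud ones: roughly 2 FLOPs per parameter per token, a hypothetical round-number peak of 1e15 FLOP/s, and no cache traffic or scheduling overhead. Only the 140GB and 3.35 TB/s figures come from the text above.

```python
def decode_tokens_per_second(batch_size: int,
                             params: float = 70e9,
                             weight_bytes: float = 140e9,
                             bandwidth_bytes_s: float = 3.35e12,
                             peak_flops: float = 1e15) -> float:
    """Aggregate tokens/s for one decoding step shared by a whole batch."""
    memory_time = weight_bytes / bandwidth_bytes_s       # weights read once per step
    compute_time = 2 * params * batch_size / peak_flops  # ~2 FLOPs per parameter per token
    step_time = max(memory_time, compute_time)
    return batch_size / step_time


for batch in (1, 8, 64, 512, 1024):
    print(f"batch {batch:>4}: {decode_tokens_per_second(batch):>8.0f} tokens/s")
```

In this model, throughput scales almost linearly with batch size while the step is bandwidth-bound, then flattens once compute becomes the limit (around a batch of 300 with these numbers). Per-step latency stays flat in the first regime and grows in the second, which is the balance serving systems have to strike.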

What’s Next: HBM4 and Beyond

HBM4 (expected 2027) targets 6–9 TB/s, up to roughly double HBM3e’s 4.8 TB/s. This will materially improve inference throughput for bandwidth-limited workloads.

Research directions to reduce memory movement itself:

Processing-in-memory (PIM): Computing within the memory chip, rather than moving data to separate compute. Still largely in research phase but could fundamentally change economics.

Sparse attention: Computing only the most relevant attention interactions, reducing memory bandwidth requirements for long-context workloads.

Structured State Space Models (SSMs): Architectures like Mamba process sequences with a fixed-size state rather than attention state that grows with sequence length, dramatically reducing memory for long sequences.

Mixture-of-Experts (MoE): Only activating the relevant parameters for each token, reducing the weight bytes that must be read per token. The full set of experts still has to reside in memory. DeepSeek’s innovations demonstrated this at scale.
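A rough sketch of the Mixture-of-Experts effect on memory traffic: all experts must be resident, but only the active ones are read for a given token. The example numbers (671B total, 37B active) are the figures published for DeepSeek-V3 and are used here only as an illustration. The one-byte-per-parameter setting is an assumption, and the function name is made up.

```python
def moe_footprint(total_params_billion: float, active_params_billion: float,
                  bytes_per_param: float = 1.0) -> dict:
    """Resident memory versus weight bytes read per token, in GB."""
    return {
        "resident_gb": total_params_billion * bytes_per_param,
        "read_per_token_gb": active_params_billion * bytes_per_param,
        "read_fraction": active_params_billion / total_params_billion,
    }


fp = moe_footprint(total_params_billion=671, active_params_billion=37)
print(f"resident: {fp['resident_gb']:.0f}GB, "
      f"read per token: {fp['read_per_token_gb']:.0f}GB "
      f"({fp['read_fraction']:.1%} of the weights)")
```

This applies to a single token. In a batch, different tokens can route to different experts, so the bytes read per step rise toward the full model as the batch grows. MoE therefore helps bandwidth most at small batch sizes and does not reduce the capacity needed to hold the model.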

The memory wall won’t be solved by a single breakthrough. It will be addressed through better hardware (HBM4), better algorithms (quantization, sparse methods), and better system design (batching, caching, hardware-software co-design) — making AI capabilities cheaper and more accessible as models grow more powerful.
