To understand why AI models keep getting more capable and why GPU shortages have dominated tech news for three years, you need to understand NVIDIA’s hardware roadmap. The Blackwell generation — launched in late 2024 and scaling through 2025–2026 — is the engine running the current AI wave.
📋 Key Takeaways
- Blackwell delivers up to 20 petaflops of FP4 AI performance per chip; at matching FP8 precision that is roughly 2.5× Hopper (H100), which has no FP4 mode
- The GB200 NVL72 rack system connects 72 Blackwell GPUs via NVLink — treated as a single logical unit for training
- Inference efficiency is the headline improvement: 5× more tokens/second at lower cost than H100
- NVIDIA's CUDA ecosystem moat remains the primary barrier to competition from AMD, Intel, and custom chips
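- At an estimated $30,000–$40,000 per B200 chip, the 72 GPUs alone come to roughly $2.2–2.9 million; with Grace CPUs, NVLink switches, and cooling, a full GB200 NVL72 rack costs $3–4 million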
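As a rough check on the rack price, the sketch below multiplies out the per-chip estimate for the 72 GPUs in one rack. It uses only the figures quoted in the takeaways. Attributing the remainder of the $3–4 million to CPUs, NVLink switches, networking, and cooling is an assumption of this sketch, not a published price breakdown.

```python
# Rough rack-cost arithmetic using only the figures quoted in the takeaways.
GPUS_PER_RACK = 72                     # B200 GPUs in one GB200 NVL72 rack
PRICE_PER_GPU_USD = (30_000, 40_000)   # low and high per-chip estimates


def gpu_cost_range(gpu_count, price_range):
    """Return the (low, high) total GPU cost in USD."""
    low, high = price_range
    return gpu_count * low, gpu_count * high


gpu_low, gpu_high = gpu_cost_range(GPUS_PER_RACK, PRICE_PER_GPU_USD)
print(f"GPUs alone: ${gpu_low / 1e6:.2f}M to ${gpu_high / 1e6:.2f}M")
```

The GPUs account for most, but not all, of the quoted rack price, which is why the rack figure sits above a simple 72 × chip-price multiplication.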
What Blackwell Actually Is
Blackwell is NVIDIA’s GPU microarchitecture name — the same way Intel has generations like “Alder Lake” and “Raptor Lake.” The Blackwell family includes several products targeting different parts of the AI market:
| Product | Use Case | Key Spec |
|---|---|---|
| B200 | Large-scale AI training | 20 petaflops FP4 |
| B100 | Mid-range training/inference | 14.8 petaflops FP4 |
| GB200 NVL72 | Hyperscale training clusters | 72× B200 + 36× Grace CPU |
| RTX 5090 | Consumer/prosumer AI workloads | 3,352 AI TOPS |
| Jetson Thor | Edge AI devices | Automotive, robotics |
The naming convention is straightforward: B = Blackwell, GB = Grace Blackwell (pairing Blackwell GPUs with NVIDIA’s own ARM-based Grace CPUs).
The Performance Numbers That Matter
The most important improvement for the industry isn’t raw training performance — it’s inference efficiency. The AI industry has shifted from primarily training new models to primarily running (inferencing) existing models at massive scale. When 400 million people use ChatGPT weekly, the cost of each query adds up fast.
Blackwell’s 5× inference improvement means AI companies can serve 5× more users from the same hardware — or serve the same number of users at 80% lower cost. This is what makes the economics of AI products more viable.
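The link between a throughput multiplier and a cost reduction is simple arithmetic, sketched below. The function name `cost_reduction` is ours, and the calculation assumes the hardware costs the same per hour before and after. Since a Blackwell chip is priced differently from an H100, the real saving depends on that price gap too.

```python
def cost_reduction(throughput_multiplier):
    """Fractional drop in cost per token when the same hardware
    serves `throughput_multiplier` times as many tokens per second.

    Assumes the hourly cost of the hardware is unchanged.
    """
    if throughput_multiplier <= 0:
        raise ValueError("throughput multiplier must be positive")
    return 1 - 1 / throughput_multiplier


for multiplier in (2, 5, 10):
    saving = cost_reduction(multiplier)
    print(f"{multiplier}x throughput -> {saving:.0%} lower cost per token")
```

Note the diminishing returns: going from 5× to 10× throughput only moves the saving from 80% to 90%.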
The NVLink Innovation
The GB200 NVL72 system is NVIDIA’s most ambitious hardware design. 72 Blackwell GPUs are connected via NVLink at 1.8TB/s bandwidth per chip — so fast that the entire rack operates as a single unified computing unit.
Why this matters: training large AI models requires enormous amounts of inter-GPU communication. In earlier generations, NVLink typically connected only the eight GPUs inside a single server, and traffic beyond that crossed slower InfiniBand or Ethernet links, a bottleneck that limited how efficiently you could scale training. Extending NVLink to all 72 GPUs largely removes this bottleneck within the rack.
For context: at 1.8TB/s, each GPU’s NVLink bandwidth is roughly 14× what a PCIe 5.0 x16 slot can provide (about 128GB/s counting both directions). This is part of what makes it practical to train GPT-4 class models in weeks rather than months.
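To make the bandwidth argument concrete, here is a back-of-the-envelope sketch of how long one gradient synchronization might take over each kind of link. It uses the standard bandwidth-only estimate for a ring all-reduce, in which each GPU moves 2(N − 1)/N times the payload. Several inputs are assumptions of this sketch rather than figures from NVIDIA: the PCIe 5.0 x16 bandwidth of 128 GB/s (both directions combined), the hypothetical 140 GB gradient payload, and the simplification that the quoted link bandwidth is fully usable. Latency and overlap with compute are ignored.

```python
NVLINK_BYTES_PER_S = 1.8e12     # per-GPU NVLink bandwidth quoted in the article
PCIE5_X16_BYTES_PER_S = 128e9   # assumed: PCIe 5.0 x16, both directions combined
GPUS = 72                       # GPUs in one GB200 NVL72 rack


def ring_allreduce_seconds(payload_bytes, gpu_count, link_bytes_per_s):
    """Bandwidth-only time estimate for a ring all-reduce.

    Each GPU sends and receives 2 * (N - 1) / N times the payload.
    Latency and overlap with compute are ignored.
    """
    traffic = 2 * (gpu_count - 1) / gpu_count * payload_bytes
    return traffic / link_bytes_per_s


# Hypothetical payload: gradients for 70B parameters at 2 bytes each.
gradient_bytes = 140e9

nvlink_s = ring_allreduce_seconds(gradient_bytes, GPUS, NVLINK_BYTES_PER_S)
pcie_s = ring_allreduce_seconds(gradient_bytes, GPUS, PCIE5_X16_BYTES_PER_S)

print(f"NVLink:       {nvlink_s:.2f} s per all-reduce")
print(f"PCIe 5.0 x16: {pcie_s:.2f} s per all-reduce")
print(f"Ratio:        {pcie_s / nvlink_s:.1f}x")
```

Because a synchronization like this happens on every training step, a gap of this size per step compounds over millions of steps. The exact ratio depends on which bandwidth figures you compare, so treat the output as an order-of-magnitude guide.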
Who’s Buying and Why
Every major AI company is buying Blackwell as fast as NVIDIA can produce it:
OpenAI and Microsoft: Microsoft has committed to massive Blackwell purchases to support ChatGPT inference at scale. The Azure cloud’s AI capabilities depend directly on NVIDIA hardware availability.
Google: Despite having their own TPU chips, Google also purchases NVIDIA GPUs. Gemini training uses both TPUs and NVIDIA hardware. Google’s TPUs are more efficient for specific workloads but lack CUDA’s ecosystem flexibility.
Meta: Meta runs Llama model training and the AI features in its apps on massive Blackwell clusters. Its open-source AI strategy (see Meta AI review) requires frontier training capability.
Chinese companies: NVIDIA faces export restrictions on its most advanced chips (H100, B200) for China. Chinese AI labs are working around this with older NVIDIA chips (A100, which was permitted before export controls) and developing domestic alternatives. See our Chinese AI Companies 2026 overview for how this plays out.
NVIDIA’s Competitive Moat: CUDA
The hardware specs matter less than most people think. NVIDIA’s real advantage is CUDA — the programming platform that all AI frameworks (PyTorch, TensorFlow, JAX) are optimized for.
Every AI researcher, every company’s ML infrastructure team, every open-source AI library has been built on CUDA for over a decade. Switching to AMD’s ROCm or Intel’s oneAPI means rewriting and reoptimizing billions of lines of code and losing performance from years of CUDA-specific tuning.
This is why AMD can produce competitive GPU hardware on paper but can’t capture meaningful market share: the software ecosystem doesn’t follow the hardware.
The Supply Chain Reality
NVIDIA’s chips are manufactured by TSMC in Taiwan. Blackwell uses a custom TSMC 4NP process (4nm class). TSMC currently produces over 90% of the world’s most advanced chips, a geopolitical concentration that has become a major strategic concern.
The AI data center buildout we’re seeing — $500 billion in announced investment globally — is essentially a race to acquire NVIDIA hardware before competitors do. Cloud providers are paying premium prices and committing years in advance to secure allocation.
What Comes After Blackwell
NVIDIA’s roadmap is consistent: a new architecture every two years. After Blackwell:
Rubin (2026–2027): Uses TSMC’s N3 process, HBM4 memory, higher bandwidth NVLink. Expected to deliver another 2–3× improvement over Blackwell on key AI workloads.
Rubin Ultra (2027): Scaled-up version pairing Rubin GPUs with Vera, NVIDIA’s ARM-based CPU that succeeds Grace.
The pattern is clear: NVIDIA’s lead in AI hardware compounds because each generation enables training more capable AI models, which creates more demand for the next generation.
Also see: AI Data Centers 2026 · AI Memory and Compute 2026 · AI Market Statistics 2026 · Chinese AI Companies 2026