H100 vs H200 vs B200 for LLM Inference: Matching GPU Choice to Model Size and Workload Type

May 28, 2026

Most GPU selection guides start with the budget and end with H100. The right starting point is the model's KV-cache footprint, because that's the number that decides whether your inference fits on one GPU or shards across four.

Teams that pick the GPU before they size the workload pay for it later, in OOM crashes during peak traffic, in batch sizes that shrink the moment context grows, in throughput that collapses from a quantization swap nobody flagged.

So this guide flips the order. You'll see how 70B dense models, MoE models, and long-context workloads each map to a different GPU tier, with the per-hour math and the engineering footnotes that matter.

The Short Answer Up Front

If you're running Llama 3 70B at FP8 with normal context lengths, H100 is the right starting point. If you're decoding longer prompts or serving a 671B MoE like DeepSeek V3, H200's 141 GB of HBM3e earns its price premium.

If you're building for the next two years of model scale, B200's 192 GB and roughly 8.0 TB/s bandwidth start to pull ahead. GB200 NVL72 is the rack-scale option for teams that already know they need it.

| Workload | Best Fit | Why |
| --- | --- | --- |
| 70B dense, FP8, ≤32K context | H100 SXM | Fits weights plus a modest KV-cache budget |
| 70B dense, long context (32K-128K) | H200 SXM | 141 GB absorbs growing KV-cache |
| MoE (DeepSeek V3, Mixtral 8x22B) | H200 SXM or B200 | High capacity for the full expert set, plus bandwidth |
| Long-context inference (100K+) | H200 SXM or B200 | KV-cache dominates VRAM; bandwidth matters |
| Frontier / 100B+ dense, future-proofing | B200 | Highest bandwidth, largest single-GPU VRAM |
| Hyperscale rack-scale serving | GB200 NVL72 | 72-GPU coherent domain for huge models |

How the Four GPUs Stack Up

The Hopper pair (H100 and H200) share the same compute silicon. They differ in memory. Blackwell (B200, GB200 NVL72) is a generational step in both compute and bandwidth, but pricing reflects that.

| Spec | H100 SXM | H200 SXM | B200 | GB200 NVL72 |
| --- | --- | --- | --- | --- |
| VRAM | 80 GB HBM3 | 141 GB HBM3e | 192 GB HBM3e (est.) | 192 GB per GPU (est.) |
| Memory BW | 3.35 TB/s | 4.8 TB/s | ~8.0 TB/s (est.) | ~8.0 TB/s (est.) |
| FP8 | 1,979 TFLOPS | 1,979 TFLOPS | ~4,500 TFLOPS (est.) | ~4,500 TFLOPS (est.) |
| NVLink | 900 GB/s (HGX, bidir per GPU) | 900 GB/s (HGX, bidir per GPU) | 1,800 GB/s (est.) | 1,800 GB/s, rack-scale domain |
| GMI Cloud price | $2.00/GPU-hr | $2.60/GPU-hr | $4.00/GPU-hr | $8.00/GPU-hr effective |

Sources: NVIDIA H100 Datasheet (2023), H200 Product Brief (2024), GTC 2024 keynote disclosures for B200 and GB200. B200 and GB200 numbers will firm up as MLPerf and independent benchmarks land. Check gmicloud.ai/pricing for current rates.
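To see why the memory-bandwidth row matters for decode, here is a back-of-envelope sketch. It assumes (this is a simplification, not a vendor figure) that every decode step streams the full set of FP8 weights from HBM once, so steps per second cannot exceed bandwidth divided by weight bytes. The names `GPUS` and `decode_step_ceiling` are illustrative, and the B200 numbers are the estimates quoted above.

```python
# Spec-sheet figures from the comparison table; B200 values are estimates.
GPUS = {
    "H100 SXM": {"vram_gb": 80, "bw_tb_s": 3.35},
    "H200 SXM": {"vram_gb": 141, "bw_tb_s": 4.8},
    "B200": {"vram_gb": 192, "bw_tb_s": 8.0},
}


def decode_step_ceiling(bw_tb_s: float, bytes_read_gb: float) -> float:
    """Upper bound on decode steps per second when each step reads
    `bytes_read_gb` from HBM exactly once (bandwidth-bound assumption)."""
    return bw_tb_s * 1000.0 / bytes_read_gb


WEIGHTS_GB = 70.0  # 70B parameters at FP8, one byte per parameter

for name, spec in GPUS.items():
    ceiling = decode_step_ceiling(spec["bw_tb_s"], WEIGHTS_GB)
    print(f"{name}: at most ~{ceiling:.0f} decode steps/s per replica")
```

Under this simplification the H200-to-H100 ratio is 4.8 / 3.35, about 1.43, whatever the model size. Real throughput also depends on batch size, KV-cache reads, and kernel efficiency, so treat the result as a ceiling, not a benchmark.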

Workload 1: 70B Dense Models (Llama 3 70B Class)

A 70B dense model at FP8 weighs roughly 70 GB. That fits on a single H100 with about 10 GB left for KV-cache and overhead. At 4K context with FP16 KV-cache, a Llama-70B-class model with grouped-query attention needs roughly 1.3 GB of KV-cache per request, so that headroom covers a handful of concurrent requests.
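The sizing above is simple enough to script. The sketch below assumes one byte per parameter at FP8 and two at FP16, counts 1 GB as 1e9 bytes, and ignores quantization scales and runtime overhead; `weight_gb` and `headroom_gb` are illustrative helper names.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes), ignoring quantization scales."""
    return params_billion * BYTES_PER_PARAM[dtype]


def headroom_gb(vram_gb: float, params_billion: float, dtype: str) -> float:
    """VRAM left for KV-cache, activations, and runtime overhead."""
    return vram_gb - weight_gb(params_billion, dtype)


for dtype in ("fp16", "fp8"):
    left = headroom_gb(80, 70, dtype)
    status = "fits" if left > 0 else "does not fit"
    print(f"70B at {dtype} on one H100: {weight_gb(70, dtype):.0f} GB weights, "
          f"{status}, {left:+.0f} GB headroom")
```

The FP16 case comes out 60 GB short on a single H100, which is why FP8 is the baseline for single-GPU 70B serving here.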

Why H100 Is the Default Here

H100 hits the price-performance sweet spot for 70B FP8 serving. At $2.00/GPU-hour on GMI Cloud, you're paying for the exact compute the workload needs without renting memory you won't touch. NVIDIA's TensorRT-LLM stack is mature on Hopper, and most production inference patterns assume Hopper as the floor.

When to Reach for H200 Instead

The moment context length pushes past 16K-32K, KV-cache starts eating the VRAM that batch sizes need. H200's extra 61 GB translates to bigger batches and better decode throughput.

NVIDIA reports up to 1.9x faster Llama 2 70B inference on H200 than on H100, tested with TensorRT-LLM, FP8, batch 64, 128/2048 input/output tokens.
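To put rough numbers on how the extra HBM turns into batch size, the sketch below counts how many requests fit at once on each GPU. It assumes a Llama-70B-class layout (80 layers, 8 KV heads, head_dim 128), FP16 KV-cache, 70 GB of FP8 weights, and a hypothetical 4 GB reserve for activations and runtime overhead; none of these values come from a measured deployment.

```python
# Keys + values, 80 layers, 8 KV heads, head_dim 128, 2 bytes per element.
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2  # 327,680 bytes per token


def max_concurrent_requests(vram_gb, context_tokens, weights_gb=70, reserve_gb=4):
    """Requests of `context_tokens` length whose KV-cache fits next to the weights."""
    budget_bytes = int((vram_gb - weights_gb - reserve_gb) * 1e9)
    per_request_bytes = KV_BYTES_PER_TOKEN * context_tokens
    return max(budget_bytes // per_request_bytes, 0)


for context in (4096, 16384, 32768):
    h100 = max_concurrent_requests(80, context)
    h200 = max_concurrent_requests(141, context)
    print(f"{context:>6} tokens: H100 fits {h100}, H200 fits {h200}")
```

With these assumptions a single H100 runs out of KV budget between 16K and 32K, while an H200 still holds several requests. An FP8 KV-cache would roughly double every count.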

Workload 2: MoE Models (DeepSeek V3, Mixtral 8x22B)

MoE models flip the memory equation. DeepSeek V3 carries 671B total parameters but routes only about 37B active per token. The whole expert table still has to live in VRAM, so total capacity matters more than active-parameter compute.

Why H100 Falls Short for Big MoE

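80 GB doesn't come close to fitting a 671B-parameter expert table: at FP8 the weights alone are about 671 GB. Eight H100s total 640 GB, which is still short before any KV-cache, so in practice you shard across 16 H100s on two nodes, and inter-GPU expert routing eats bandwidth on every token. That's a working setup, but it's not the efficient one.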
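The GPU count for weights alone is a ceiling division. The sketch below is a lower bound under stated assumptions: FP8 weights at one byte per parameter, and a hypothetical `usable_fraction` of 0.9 to leave room for KV-cache and runtime overhead. Real deployments usually round up further to a tensor- or expert-parallel degree that matches the node size.

```python
import math


def min_gpus_for_weights(params_billion, bytes_per_param, vram_gb, usable_fraction=0.9):
    """Smallest GPU count whose usable VRAM holds the full weight set."""
    weights_gb = params_billion * bytes_per_param
    return math.ceil(weights_gb / (vram_gb * usable_fraction))


# 671B total parameters at FP8 (1 byte per parameter).
for name, vram_gb in (("H100", 80), ("H200", 141), ("B200", 192)):
    print(f"{name}: at least {min_gpus_for_weights(671, 1.0, vram_gb)} GPUs")
```

Even with `usable_fraction=1.0`, the H100 answer is 9, so a single 8-GPU H100 node cannot hold the model at FP8.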

Where H200 and B200 Pull Ahead

H200's 141 GB lets you fit larger expert tables on fewer GPUs, cutting routing overhead. B200's roughly 8.0 TB/s HBM bandwidth (est.) speeds up streaming expert weights during decode, and its 1,800 GB/s NVLink (est.) helps the all-to-all communication patterns MoE inference depends on. For Mixtral 8x22B (141B total, about 141 GB at FP8), a pair of H200s or a single B200 keeps every expert in HBM with capacity left for KV-cache.

Workload 3: Long-Context Inference (100K+ Tokens)

Long context is where KV-cache becomes the bottleneck. The formula matters here:

KV per request ≈ 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element

For the Llama 2 70B architecture (80 layers, 8 KV heads, 128 head_dim) at FP16 KV and 128K context, that's roughly 43 GB per request. Run 8 concurrent requests at that context length and you need about 340 GB for KV-cache alone, before model weights. H100's 80 GB can't host even one such request alongside 70 GB of FP8 weights.
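The formula translates directly into code. This sketch plugs in the architecture values quoted above (80 layers, 8 KV heads, head_dim 128) with a 2-byte FP16 KV-cache; the function name is illustrative, and 128K is taken as 128 × 1024 tokens.

```python
def kv_bytes_per_request(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_element):
    """KV-cache bytes for one request. The leading 2 counts one key tensor
    and one value tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element


LLAMA_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)

per_request = kv_bytes_per_request(**LLAMA_70B, seq_len=128 * 1024, bytes_per_element=2)
print(f"per request: {per_request / 1e9:.1f} GB")
print(f"8 concurrent requests: {8 * per_request / 1e9:.0f} GB")
```

Setting `bytes_per_element=1` models an FP8 KV-cache and halves both figures, which is often the first lever to pull before moving up a GPU tier.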

What 141 GB and 192 GB Buy You

With 70 GB of FP8 weights loaded, H200's 141 GB leaves about 70 GB for KV-cache and B200's 192 GB leaves about 120 GB, with higher bandwidth on top. That is the difference between not fitting a 100K+ request at all and serving a few concurrently. Larger batches or 256K contexts still call for KV-cache quantization or more GPUs. If you're building agents that hold long tool histories, this is where the spec sheets start mattering more than the price tag.

Decision Framework

| Your situation | Start here |
| --- | --- |
| 70B dense, FP8, latency-sensitive | H100 SXM |
| 70B dense, 32K-128K context | H200 SXM |
| 200B+ MoE, multi-expert routing | H200 SXM or B200 |
| 100K+ context inference at batch | H200 SXM or B200 |
| Frontier models, two-year horizon | B200 |
| Hyperscale single-namespace serving | GB200 NVL72 |

Engineering Reality: What the Spec Sheet Hides

Here's what every GPU selection guide skips. KV-cache size scales linearly with sequence length and batch size, so a workload that fits at 8K context can OOM at 32K.

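When you hit the VRAM ceiling, the serving stack has to give something up. vLLM, whose PagedAttention manages KV-cache in blocks, preempts requests and either recomputes them later or swaps their blocks to host memory, depending on version and configuration, and decode latency can spike from ~30ms to 200ms+ per token. Some stacks just OOM instead. Either way, batch sizes shrink under load, which kills tokens-per-second long before any benchmark spreadsheet warned you.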

Quantization changes the math sideways. Moving from FP16 to FP8 roughly halves the weight footprint and can come close to doubling throughput on Hopper, but accuracy on long-context tasks can drift by 1-3 points on standard evals.

INT8 weight-only quantization saves memory but can leave compute idle if your kernels aren't tuned for it. Test with TensorRT-LLM's FP8 path or vLLM's AWQ/GPTQ kernels before you commit.

Multi-GPU 70B+ inference depends on NVLink topology. On HGX/DGX nodes you get 900 GB/s bidirectional aggregate per GPU. Drop to PCIe interconnect and tensor-parallel latency can double. Always verify the node topology, not just the GPU SKU.
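The interconnect gap can be estimated with the standard bandwidth-only cost model for a ring all-reduce, in which each GPU moves 2 × (N − 1) / N of the message. The sketch below assumes 450 GB/s per direction for NVLink (half of the 900 GB/s bidirectional figure), about 64 GB/s per direction for PCIe Gen5 x16, a hidden size of 8192, and a 4096-token prefill. It ignores latency, which tends to dominate the small messages sent during decode, so read it as a prefill-side illustration only.

```python
def ring_allreduce_seconds(message_bytes, num_gpus, link_gb_per_s):
    """Bandwidth-only ring all-reduce time; per-hop latency is ignored."""
    traffic_factor = 2 * (num_gpus - 1) / num_gpus
    return traffic_factor * message_bytes / (link_gb_per_s * 1e9)


HIDDEN_SIZE = 8192       # assumed hidden size for a 70B-class model
PREFILL_TOKENS = 4096    # hypothetical prompt length
message_bytes = PREFILL_TOKENS * HIDDEN_SIZE * 2  # FP16 activations

NVLINK_GB_S = 450        # per direction, assumed from 900 GB/s bidirectional
PCIE_GEN5_X16_GB_S = 64  # per direction, approximate

nvlink = ring_allreduce_seconds(message_bytes, 8, NVLINK_GB_S)
pcie = ring_allreduce_seconds(message_bytes, 8, PCIE_GEN5_X16_GB_S)
print(f"NVLink: {nvlink * 1e3:.2f} ms, PCIe: {pcie * 1e3:.2f} ms per all-reduce")
```

Tensor parallelism runs all-reduces in every transformer layer, so the per-call gap compounds across the model. How much of it shows up in end-to-end latency depends on how well the stack overlaps communication with compute.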

Cost-Performance Math

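| Workload | GPU | $/hr | Notes |
| --- | --- | --- | --- |
| 70B FP8, 8K context | 1x H100 | $2.00 | Fits, modest batch |
| 70B FP8, 64K context | 1x H200 | $2.60 | Larger KV budget |
| Mixtral 8x22B | 2x H200 or 1x B200 | $5.20 vs $4.00 | B200 wins on density |
| 128K context at batch 16 | 1x B200 | $4.00 | Bandwidth + capacity; check the KV budget first, since FP16 KV for a 70B GQA model at this size far exceeds 192 GB |
| 671B MoE, low latency | 4-8x B200 or GB200 NVL72 | $16-$32+ | Rack-scale shines |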

GMI Cloud lists H100 at $2.00/GPU-hour, H200 at $2.60, B200 at $4.00, and GB200 NVL72 at $8.00 effective per GPU. The per-tier math stays transparent, so picking the wrong tier shows up as visible waste. See gmicloud.ai/gpu-instances for current node configurations and reserved-instance terms.
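The per-tier math is easy to keep in a small script. The prices below are the listed per-GPU-hour rates; the throughput passed to `cost_per_million_tokens` is a hypothetical placeholder, not a measured number, so substitute your own benchmark result.

```python
PRICE_PER_GPU_HOUR = {"H100": 2.00, "H200": 2.60, "B200": 4.00, "GB200 NVL72": 8.00}


def hourly_cost(gpu, count=1):
    """Hourly cost in USD for `count` GPUs of one tier."""
    return PRICE_PER_GPU_HOUR[gpu] * count


def cost_per_million_tokens(hourly_usd, tokens_per_second):
    """USD per million generated tokens at a sustained throughput."""
    return hourly_usd / (tokens_per_second * 3600) * 1_000_000


# Mixtral 8x22B options from the table above.
print(f"2x H200: ${hourly_cost('H200', 2):.2f}/hr")
print(f"1x B200: ${hourly_cost('B200'):.2f}/hr")

# Hypothetical throughput of 1,000 tokens/s on one H100.
print(f"${cost_per_million_tokens(hourly_cost('H100'), 1000):.2f} per million tokens")
```

Comparing tiers on cost per million tokens rather than cost per hour is what exposes a wrong choice: a pricier GPU that sustains a larger batch can come out cheaper per token.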

FAQ

Is H200 always 1.9x faster than H100 for 70B inference? No. The 1.9x figure is NVIDIA's official Llama 2 70B test with TensorRT-LLM, FP8, batch 64, and 128/2048 input/output tokens. Independent cloud provider tests typically show 1.4-1.6x in production-realistic conditions. Your mileage depends on context length, batch size, and quantization choices.

Can B200 replace 4x H100 for the same workload? Sometimes. On compute-bound FP8 inference, one B200's estimated 4,500 TFLOPS is about 2.3x one H100's 1,979 TFLOPS, so it roughly matches 2-3 H100s, not 4, and its 192 GB is well short of the 320 GB that four H100s hold. The higher bandwidth helps memory-bound decode. But software maturity matters. TensorRT-LLM and vLLM kernels on Blackwell are still stabilizing, so benchmark before you migrate.

When does GB200 NVL72 make sense vs a B200 cluster? When your model or batch needs a coherent memory domain larger than 8 GPUs. The 72-GPU NVL72 rack acts as one giant accelerator, which matters for trillion-parameter dense or huge MoE serving. For most teams running 70B-200B class models, B200 nodes are the practical ceiling.

Should I quantize before upgrading the GPU? Often yes. Moving from FP16 to FP8 on H100 roughly halves the weight footprint and can come close to doubling throughput, which is often enough to delay a tier upgrade. Run the eval suite first, because quantization can shift accuracy on long-context or reasoning-heavy workloads.
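For the B200-versus-H100 question in this FAQ, the spec ratios are a quick sanity check. This sketch divides the B200 estimates by the H100 datasheet figures from the comparison table; `h100_equivalents` is an illustrative name, and the B200 values remain estimates.

```python
H100 = {"fp8_tflops": 1979, "bw_tb_s": 3.35, "vram_gb": 80}
B200 = {"fp8_tflops": 4500, "bw_tb_s": 8.0, "vram_gb": 192}  # estimates


def h100_equivalents(metric):
    """How many H100s one B200 matches on a single spec-sheet metric."""
    return B200[metric] / H100[metric]


for metric in H100:
    print(f"{metric}: one B200 is about {h100_equivalents(metric):.2f} H100s")
```

All three ratios land between 2 and 2.5, so a one-for-four swap only works when the workload was not using the full memory or compute of the four H100s.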

Colin Mo
