Which GPUs Are Best Optimized for LLM Inference Workloads?
April 08, 2026
The H200 SXM is the best GPU for decode-bound LLM inference workloads where large contexts and high concurrency are the limiting factors. The H100 SXM wins when you're compute-bound, running prefill-heavy workloads at moderate batch sizes where model weights fit comfortably in 80 GB.
If you're choosing between them and you're not sure which constraint applies to you, this article will show you exactly how to find out. GMI Cloud runs both H100 and H200 SXM nodes on-demand, so you can benchmark your actual workload before committing to reserved capacity.
Why LLM Decode Is Memory-Bandwidth-Bound
Before you can pick the right GPU, you need to understand the actual bottleneck. LLM inference has two phases with different performance profiles, and most of the optimization decisions hinge on which one dominates your workload.
The prefill phase processes your input prompt. It's embarrassingly parallel and compute-intensive. The arithmetic intensity (FLOPS per byte of memory accessed) is high enough that modern GPUs are compute-bound during prefill.
Tensor cores stay saturated, and more TFLOPS translates directly to faster prompt processing.
The decode phase generates each output token, one at a time. Here's where the bottleneck shifts. To generate a single token, the GPU must load the entire model weight matrix from VRAM, perform a relatively small amount of arithmetic, and repeat for the next token.
The arithmetic intensity drops to a level where the GPU's compute units sit idle most of the time, waiting for data to arrive from memory.
This is the roofline model in practice: when arithmetic intensity falls below the ridge point (peak compute throughput divided by memory bandwidth, measured in FLOPs per byte), you're memory-bandwidth-bound. For LLM decode with typical model sizes and batch sizes, you're well below that ridge point.
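As a back-of-envelope illustration (not a benchmark), the ridge point and decode intensity can be sketched in a few lines of Python using H100 SXM datasheet numbers; the 2-FLOPs-per-parameter decode estimate is a common simplification that ignores attention and KV-cache traffic:

```python
# Roofline check for batch-1 decode on an H100 SXM (datasheet numbers;
# a back-of-envelope sketch, not a measured benchmark).
PEAK_FP16_TFLOPS = 989   # tensor-core peak, per datasheet (with sparsity)
MEM_BW_TBPS = 3.35       # HBM3 bandwidth, TB/s

# Ridge point: the arithmetic intensity (FLOPs/byte) where the compute
# and memory limits meet. Below it, a kernel is bandwidth-bound.
ridge_point = (PEAK_FP16_TFLOPS * 1e12) / (MEM_BW_TBPS * 1e12)

# Decode at batch 1: each token streams every FP16 weight (2 bytes) and
# does ~2 FLOPs per parameter (multiply + add).
flops_per_byte_decode = 2 / 2  # ~1 FLOP per byte

print(f"ridge point: {ridge_point:.0f} FLOPs/byte")
print(f"decode intensity at batch 1: {flops_per_byte_decode:.0f} FLOPs/byte")
```

At roughly 1 FLOP per byte against a ridge point near 300, batch-1 decode sits two orders of magnitude into bandwidth-bound territory.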
What Memory Bandwidth Actually Means for Tokens per Second
If decode throughput is dominated by memory bandwidth, then the number of tokens per second you can generate scales nearly linearly with memory bandwidth (until batch size grows large enough to shift the bottleneck).
Here's the direct comparison:
| GPU | Memory BW | Relative Decode Speed (approx.) |
|---|---|---|
| H200 SXM | 4.8 TB/s | ~1.4x H100 baseline |
| H100 SXM | 3.35 TB/s | Baseline (1.0x) |
| A100 80GB | 2.0 TB/s | ~0.60x H100 |
| L4 | 300 GB/s | ~0.09x H100 |
Sources: NVIDIA H100 Tensor Core GPU Datasheet (2023); NVIDIA H200 Tensor Core GPU Product Brief (2024); NVIDIA A100 Tensor Core GPU Datasheet; NVIDIA L4 Tensor Core GPU Datasheet. Relative decode speed is an approximation for single-GPU, memory-bound decode phase.
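A rough way to see how the table's relative speeds fall out of bandwidth alone: in the memory-bound regime, decode throughput is capped by how fast the weights can stream from VRAM. This sketch ignores KV-cache reads and kernel overhead, so treat it as an upper bound:

```python
# Upper-bound decode throughput for a memory-bound model: every token must
# stream all weights from VRAM once, so tokens/s <= bandwidth / weight bytes.
def max_decode_tps(bandwidth_tbps: float, params_b: float,
                   bytes_per_param: int = 2) -> float:
    weight_bytes = params_b * 1e9 * bytes_per_param  # FP16 = 2 bytes/param
    return bandwidth_tbps * 1e12 / weight_bytes

# Llama 2 70B in FP16 on the GPUs from the table above:
for name, bw in [("H200 SXM", 4.8), ("H100 SXM", 3.35), ("A100 80GB", 2.0)]:
    print(f"{name}: ~{max_decode_tps(bw, 70):.0f} tok/s ceiling at batch 1")
```

The H200/H100 ratio of these ceilings is ~1.43, which lines up with the ~1.4x relative decode speed in the table.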
NVIDIA's published benchmark confirms the direction: the H200 delivers up to 1.9x inference speedup over H100 on Llama 2 70B (NVIDIA H200 Tensor Core GPU Product Brief, 2024, TensorRT-LLM, FP8, batch 64, 128/2048 tokens).
That speedup comes from bandwidth, not compute, since both GPUs share identical compute throughput: 1,979 FP8 TFLOPS and 989 FP16 TFLOPS (datasheet figures, with sparsity).
KV-Cache Sizing: The Math Behind the Memory Constraint
Beyond model weights, KV-cache is the other major VRAM consumer. Understanding how KV-cache scales tells you whether your workload fits in 80 GB or requires the H200's 141 GB.
The formula is:
KV cache per request ≈ 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element
Let's work through a concrete example with Llama 2 70B: 80 layers, 8 KV heads, 128 head dimension, FP16 (2 bytes per element).
At 4K context (seq_len = 4,096):
2 × 80 × 8 × 128 × 4,096 × 2 = ~1.3 GB per active request
At 32K context (seq_len = 32,768):
2 × 80 × 8 × 128 × 32,768 × 2 = ~10.7 GB per active request
Now add the model weights. Llama 2 70B in FP16 takes approximately 140 GB, which already exceeds a single H100's 80 GB before any KV-cache is allocated. Add batch 32 at 4K context: that's ~43 GB for KV-cache plus 140 GB for weights, totaling ~183 GB.
That blows past the H100's 80 GB and exceeds even the H200's 141 GB.
You'll notice the numbers grow dramatically when you increase context length. At 32K context and batch 16, KV-cache alone is ~172 GB, more than either GPU's entire VRAM before weights are counted.
Even against the weights alone, the H200 has roughly 1 GB of headroom, so serving FP16 Llama 2 70B at any meaningful batch size or context length requires KV-cache compression, a quantized model, or a multi-GPU tensor-parallel split.
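The formula above can be wrapped in a small helper for your own sizing checks (Llama 2 70B shape as given in the text; treat the outputs as sizing estimates, not allocator-exact figures):

```python
# KV-cache sizing from the formula above, plus a total-VRAM estimate.
# Defaults are the Llama 2 70B shape: 80 layers, 8 KV heads, head_dim 128.
def kv_cache_gb(seq_len: int, batch: int = 1, layers: int = 80,
                kv_heads: int = 8, head_dim: int = 128,
                bytes_per_el: int = 2) -> float:
    per_request = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_el
    return per_request * batch / 1e9

WEIGHTS_FP16_GB = 140  # ~70B params x 2 bytes

print(f"4K, batch 1:  ~{kv_cache_gb(4096):.1f} GB KV-cache")
print(f"32K, batch 1: ~{kv_cache_gb(32768):.1f} GB KV-cache")
total = WEIGHTS_FP16_GB + kv_cache_gb(4096, batch=32)
print(f"weights + KV at 4K, batch 32: ~{total:.0f} GB")
```

Swap in your own model's layer count, KV-head count, and head dimension to check fit against 80 GB or 141 GB.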
How Batch Size Shifts the Bottleneck
Here's the thing that most articles miss: batch size changes which bottleneck applies. At small batch sizes (batch 1-8), each decode step loads the full model for very little arithmetic work, and the GPU is deeply memory-bandwidth-bound.
As batch size grows, you're doing more arithmetic per weight load. Tokens from all requests in the batch are processed simultaneously on the same weight read. The effective arithmetic intensity rises.
At large enough batch sizes (typically 128+ for 70B models), the workload shifts toward compute-bound territory. At that point, FP8 TFLOPS starts to matter more, and the distinction between H100 and H200 narrows because they share the same compute specs.
The practical implication: if you're running real-time inference with low concurrency (batch 1-16), memory bandwidth is your primary lever. If you're running offline batch jobs with large batch sizes (batch 64+), you'll benefit from both bandwidth and compute, and the H100 becomes more competitive on a $/token basis.
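A toy model of this shift, assuming one weight read is shared across the whole batch and ~2 FLOPs per parameter per token (it ignores KV-cache reads and attention FLOPs, so the exact crossover batch will differ in practice):

```python
# How batch size raises effective arithmetic intensity during decode.
# Simplified model: weights are read once per step and shared by the batch.
PEAK_FP8_TFLOPS = 1979
MEM_BW_TBPS = 3.35  # H100 SXM
ridge = PEAK_FP8_TFLOPS / MEM_BW_TBPS  # FLOPs/byte where the bottleneck flips

def decode_intensity(batch: int, bytes_per_param: int = 1) -> float:
    # ~2 FLOPs per parameter per token; FP8 weights are 1 byte each.
    return 2 * batch / bytes_per_param

for b in (1, 8, 64, 128, 512):
    bound = "compute" if decode_intensity(b) > ridge else "bandwidth"
    print(f"batch {b:>3}: ~{decode_intensity(b):>4.0f} FLOPs/byte -> {bound}-bound")
```

Real serving stacks cross over at different batch sizes than this toy model suggests, because KV-cache traffic grows with batch while weight traffic doesn't; the qualitative trend is what matters.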
Full GPU Spec Comparison: Bandwidth and Throughput
| GPU | VRAM | Memory BW | FP8 TFLOPS | FP16 TFLOPS | MIG | NVLink |
|---|---|---|---|---|---|---|
| H200 SXM | 141 GB HBM3e | 4.8 TB/s | 1,979 | 989 | Up to 7 | 900 GB/s bidir. agg./GPU (HGX/DGX) |
| H100 SXM | 80 GB HBM3 | 3.35 TB/s | 1,979 | 989 | Up to 7 | 900 GB/s bidir. agg./GPU (HGX/DGX) |
| A100 80GB | 80 GB HBM2e | 2.0 TB/s | N/A | 312 | Up to 7 | 600 GB/s |
| L4 | 24 GB GDDR6 | 300 GB/s | 242 | 121 | No | None (PCIe) |
| B200 (est.) | 192 GB HBM3e (est.) | 8.0 TB/s (est.) | ~4,500 (est.) | N/A | TBD | 1,800 GB/s (est.) |
Sources: NVIDIA H100 Tensor Core GPU Datasheet (2023); NVIDIA H200 Tensor Core GPU Product Brief (2024); NVIDIA A100 Tensor Core GPU Datasheet; NVIDIA L4 Tensor Core GPU Datasheet. B200 specs are estimates based on GTC 2024 disclosures.
Decision Tree: Decode-Bound vs. Compute-Bound
Use this to figure out which constraint you're actually facing, and which GPU resolves it.
Step 1: What's your primary batch size in production?
- Batch 1-16 (real-time, low concurrency): You're decode-bound. Memory bandwidth drives your $/token.
- Batch 64+: You're approaching compute-bound territory. Both metrics matter.

Step 2: Does your model fit in 80 GB at your target context length?
- Use the KV-cache formula above. Add model weights + peak KV-cache at max batch size.
- Fits in 80 GB: H100 SXM is your starting point.
- Exceeds 80 GB: H200 SXM or multi-GPU H100 with tensor parallelism.

Step 3: Is your workload context-length sensitive?
- Short context (under 4K), small-medium models: H100 SXM is efficient and cost-effective.
- Long context (16K+) or very high concurrency: H200 SXM's bandwidth and VRAM headroom justify the premium.

Step 4: Are you prefill-heavy (document processing, batch summarization)?
- Prefill is compute-bound. H100 and H200 share identical compute specs, so H100 SXM wins on $/TFLOPS.
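Steps 2-4 of the tree can be sketched as a function (Step 1's batch size tells you which lever dominates, rather than which GPU to pick; the thresholds are this article's rules of thumb, and `pick_gpu` is an illustrative name, not a library API):

```python
# The decision tree above as a function. Thresholds mirror Steps 2-4.
def pick_gpu(weights_gb: float, peak_kv_gb: float,
             max_context: int, prefill_heavy: bool) -> str:
    if prefill_heavy:                       # Step 4: prefill is compute-bound
        return "H100 SXM"                   # same TFLOPS, lower $/TFLOPS
    if weights_gb + peak_kv_gb > 80:        # Step 2: VRAM fit at peak load
        return "H200 SXM or multi-GPU H100"
    if max_context >= 16384:                # Step 3: long-context workloads
        return "H200 SXM"
    return "H100 SXM"                       # fits in 80 GB, short context

# A 7B-class FP16 model (~14 GB weights) serving 4K contexts:
print(pick_gpu(weights_gb=14, peak_kv_gb=4, max_context=4096,
               prefill_heavy=False))
```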
Putting It Together: H100 vs. H200 at a Glance
| Workload Type | Recommended GPU | Key Reason |
|---|---|---|
| Real-time chat, low concurrency | H200 SXM | Bandwidth drives decode speed |
| Long context (16K-128K) | H200 SXM | VRAM + bandwidth headroom |
| Batch document processing | H100 SXM | Prefill is compute-bound |
| High-batch offline inference | H100 SXM | Compute bottleneck, identical specs |
| Model exceeds 80 GB (FP16) | H200 SXM | Only option without multi-GPU split |
| Budget-constrained, 70B models | A100 80GB | Lower bandwidth, but fits the model |
Infrastructure That Maps to This Analysis
GMI Cloud's H100 and H200 SXM GPU instances are configured in 8-GPU nodes with NVLink 4.0 (900 GB/s bidirectional aggregate per GPU, HGX/DGX platforms) and 3.2 Tbps InfiniBand for multi-node jobs. Pre-installed software includes TensorRT-LLM, vLLM, and Triton Inference Server.
H100 SXM runs at approximately $2.00/GPU-hour and H200 SXM at approximately $2.60/GPU-hour. Check gmicloud.ai/pricing for current rates on both on-demand and reserved configurations.
Frequently Asked Questions
Q: What is arithmetic intensity, and why does it determine my bottleneck? A: Arithmetic intensity is FLOPS performed per byte read from memory. Every GPU has a ridge point: above it, you're compute-bound; below it, you're memory-bandwidth-bound.
For LLM decode at low batch sizes, arithmetic intensity falls well below the ridge point for all modern data center GPUs, so bandwidth dominates.
Q: How do I calculate whether I'm compute-bound or memory-bandwidth-bound? A: Profile your serving workload with Nsight Compute (the successor to the deprecated nvprof) and compare achieved FLOPS and memory throughput against the GPU's peaks. If memory bandwidth utilization is at or near max while compute sits well below peak, you're bandwidth-bound.
If SM utilization is near peak while memory bandwidth is well below max, you're compute-bound.
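One way to turn those profiler readings into a verdict is a simple utilization comparison; `classify` and the 0.8 threshold here are illustrative choices, not a standard API:

```python
# Classify a kernel from measured counters (e.g., exported from Nsight
# Compute). The 0.8 "near peak" threshold is an arbitrary rule of thumb.
def classify(achieved_tflops: float, achieved_bw_tbps: float,
             peak_tflops: float, peak_bw_tbps: float) -> str:
    if achieved_bw_tbps / peak_bw_tbps > 0.8:
        return "bandwidth-bound"
    if achieved_tflops / peak_tflops > 0.8:
        return "compute-bound"
    return "neither bound (latency or launch overhead?)"

# H100 SXM peaks from this article; measured values are made-up examples:
print(classify(100, 3.2, 1979, 3.35))   # decode-like profile
print(classify(1800, 1.0, 1979, 3.35))  # prefill-like profile
```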
Q: Does KV-cache quantization help with VRAM pressure? A: Yes. Quantizing KV-cache from FP16 to FP8 halves the VRAM requirement, and INT4 quantization reduces it by 4x. The tradeoff is potential accuracy degradation at very long contexts.
KV-cache quantization is a practical tool for extending H100 viability to longer context workloads.
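To put numbers on that tradeoff, here is the KV-cache formula from earlier evaluated at 32K context and batch 16 for each element width (`kv_gb` is an illustrative helper; 0.5 bytes stands in for INT4):

```python
# VRAM saved by quantizing KV-cache, per the FAQ answer above.
# Llama 2 70B shape (80 layers, 8 KV heads, head_dim 128) at 32K, batch 16.
def kv_gb(bytes_per_el: float, seq_len: int = 32768, batch: int = 16,
          layers: int = 80, kv_heads: int = 8, head_dim: int = 128) -> float:
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_el * batch / 1e9

fp16 = kv_gb(2)    # baseline
fp8  = kv_gb(1)    # half the footprint
int4 = kv_gb(0.5)  # a quarter of the footprint
print(f"FP16 {fp16:.0f} GB -> FP8 {fp8:.0f} GB -> INT4 {int4:.0f} GB")
```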
Q: What happens to decode performance as context length grows? A: KV-cache size grows linearly with context length. At very long contexts, VRAM fills up faster, you can support fewer concurrent requests, and effective throughput drops.
Memory bandwidth is still the primary bottleneck, but VRAM capacity becomes the secondary constraint that caps your maximum batch size.
Q: Why does the H200 outperform the H100 if they have the same TFLOPS? A: Because LLM decode is memory-bandwidth-bound at typical batch sizes. Both GPUs perform the same FLOPs per forward pass. The H200 completes each forward pass faster because it reads weights from VRAM at 4.8 TB/s versus 3.35 TB/s.
More bandwidth equals more tokens per second.
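The same point as arithmetic: each token's latency is floored by the time to stream the weights from VRAM. A back-of-envelope sketch using this article's numbers, not a measured benchmark:

```python
# Per-token latency floor for batch-1 decode: weight bytes / bandwidth.
WEIGHT_BYTES = 70e9 * 2  # Llama 2 70B in FP16

latency_ms = {
    name: WEIGHT_BYTES / (bw_tbps * 1e12) * 1e3
    for name, bw_tbps in [("H100 SXM", 3.35), ("H200 SXM", 4.8)]
}
for name, ms in latency_ms.items():
    print(f"{name}: >= {ms:.1f} ms per token (weight-streaming floor)")
```

The ratio of the two floors is ~1.43, which is exactly the bandwidth ratio and why identical TFLOPS produce different decode speeds.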
Q: Is the B200 worth waiting for? A: B200 specifications are estimates based on GTC 2024 disclosures. If confirmed, its 8.0 TB/s (est.) bandwidth would nearly double the H200's throughput for decode-bound workloads. For current production needs, H100 and H200 are the right hardware.
Add B200 to your 2026 roadmap once production availability and confirmed specs are published.
Colin Mo
Build AI Without Limits
GMI Cloud helps you architect, deploy, optimize, and scale your AI strategies
