Which GPUs Are Best Optimized for LLM Inference Workloads?
April 08, 2026
The H200 SXM is the best GPU for decode-bound LLM inference workloads where large contexts and high concurrency are the limiting factors. The H100 SXM wins when you're compute-bound, running prefill-heavy workloads at moderate batch sizes where model weights fit comfortably in 80 GB.
If you're choosing between them and you're not sure which constraint applies to you, this article will show you exactly how to find out. GMI Cloud runs both H100 and H200 SXM nodes on-demand, so you can benchmark your actual workload before committing to reserved capacity.
Why LLM Decode Is Memory-Bandwidth-Bound
Before you can pick the right GPU, you need to understand the actual bottleneck. LLM inference has two phases with different performance profiles, and most of the optimization decisions hinge on which one dominates your workload.
The prefill phase processes your input prompt. It's embarrassingly parallel and compute-intensive. The arithmetic intensity (FLOPS per byte of memory accessed) is high enough that modern GPUs are compute-bound during prefill.
Tensor cores stay saturated, and more TFLOPS translates directly to faster prompt processing.
The decode phase generates each output token, one at a time. Here's where the bottleneck shifts. To generate a single token, the GPU must load the entire model weight matrix from VRAM, perform a relatively small amount of arithmetic, and repeat for the next token.
The arithmetic intensity drops to a level where the GPU's compute units sit idle most of the time, waiting for data to arrive from memory.
This is the roofline model in practice: when arithmetic intensity falls below the ridge point (peak compute throughput divided by memory bandwidth, measured in FLOPs per byte), you're memory-bandwidth-bound. For LLM decode with typical model sizes and batch sizes, you're well below that ridge point.
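As a back-of-envelope illustration (not a benchmark), the ridge point and decode intensity can be sketched in a few lines of Python using H100 SXM datasheet numbers; the 2-FLOPs-per-parameter decode estimate is a common simplification that ignores attention and KV-cache traffic:

```python
# Roofline check for batch-1 decode on an H100 SXM (datasheet numbers;
# a back-of-envelope sketch, not a measured benchmark).
PEAK_FP16_TFLOPS = 989   # tensor-core peak, per datasheet (with sparsity)
MEM_BW_TBPS = 3.35       # HBM3 bandwidth, TB/s

# Ridge point: the arithmetic intensity (FLOPs/byte) where the compute
# and memory limits meet. Below it, a kernel is bandwidth-bound.
ridge_point = (PEAK_FP16_TFLOPS * 1e12) / (MEM_BW_TBPS * 1e12)

# Decode at batch 1: each token streams every FP16 weight (2 bytes) and
# does ~2 FLOPs per parameter (multiply + add).
flops_per_byte_decode = 2 / 2  # ~1 FLOP per byte

print(f"ridge point: {ridge_point:.0f} FLOPs/byte")
print(f"decode intensity at batch 1: {flops_per_byte_decode:.0f} FLOPs/byte")
```

At roughly 1 FLOP per byte against a ridge point near 300, batch-1 decode sits two orders of magnitude into bandwidth-bound territory.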
What Memory Bandwidth Actually Means for Tokens per Second
If decode throughput is dominated by memory bandwidth, then the number of tokens per second you can generate scales nearly linearly with memory bandwidth (until batch size grows large enough to shift the bottleneck).
Here's the direct comparison:
| GPU | Memory BW | Relative Decode Speed (approx.) |
|---|---|---|
| H200 SXM | 4.8 TB/s | ~1.4x H100 baseline |
| H100 SXM | 3.35 TB/s | Baseline (1.0x) |
| A100 80GB | 2.0 TB/s | ~0.60x H100 |
| L4 | 300 GB/s | ~0.09x H100 |
Sources: NVIDIA H100 Tensor Core GPU Datasheet (2023); NVIDIA H200 Tensor Core GPU Product Brief (2024); NVIDIA A100 Tensor Core GPU Datasheet; NVIDIA L4 Tensor Core GPU Datasheet. Relative decode speed is an approximation for single-GPU, memory-bound decode phase.
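A rough way to see how the table's relative speeds fall out of bandwidth alone: in the memory-bound regime, decode throughput is capped by how fast the weights can stream from VRAM. This sketch ignores KV-cache reads and kernel overhead, so treat it as an upper bound:

```python
# Upper-bound decode throughput for a memory-bound model: every token must
# stream all weights from VRAM once, so tokens/s <= bandwidth / weight bytes.
def max_decode_tps(bandwidth_tbps: float, params_b: float,
                   bytes_per_param: int = 2) -> float:
    weight_bytes = params_b * 1e9 * bytes_per_param  # FP16 = 2 bytes/param
    return bandwidth_tbps * 1e12 / weight_bytes

# Llama 2 70B in FP16 on the GPUs from the table above:
for name, bw in [("H200 SXM", 4.8), ("H100 SXM", 3.35), ("A100 80GB", 2.0)]:
    print(f"{name}: ~{max_decode_tps(bw, 70):.0f} tok/s ceiling at batch 1")
```

The H200/H100 ratio of these ceilings is ~1.43, which lines up with the ~1.4x relative decode speed in the table.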
NVIDIA's published benchmark confirms the direction: the H200 delivers up to 1.9x inference speedup over H100 on Llama 2 70B (NVIDIA H200 Tensor Core GPU Product Brief, 2024, TensorRT-LLM, FP8, batch 64, 128/2048 tokens).
That speedup comes from bandwidth, not compute, since both GPUs share identical compute throughput: 1,979 FP8 TFLOPS and 989 FP16 TFLOPS (datasheet figures, with sparsity).
KV-Cache Sizing: The Math Behind the Memory Constraint
Beyond model weights, KV-cache is the other major VRAM consumer. Understanding how KV-cache scales tells you whether your workload fits in 80 GB or requires the H200's 141 GB.
The formula is:
KV cache per request ≈ 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element
Let's work through a concrete example with Llama 2 70B: 80 layers, 8 KV heads, 128 head dimension, FP16 (2 bytes per element).
At 4K context (seq_len = 4,096):
2 × 80 × 8 × 128 × 4,096 × 2 = ~1.3 GB per active request
At 32K context (seq_len = 32,768):
2 × 80 × 8 × 128 × 32,768 × 2 = ~10.7 GB per active request
Now add the model weights. Llama 2 70B in FP16 takes approximately 140 GB, which already exceeds a single H100's 80 GB before any KV-cache is allocated. Add batch 32 at 4K context: that's ~43 GB for KV-cache plus 140 GB for weights, totaling ~183 GB.
That blows past the H100's 80 GB and exceeds even the H200's 141 GB.
You'll notice the numbers grow dramatically when you increase context length. At 32K context and batch 16, KV-cache alone is ~172 GB, more than either GPU's entire VRAM before weights are counted.
Even against the weights alone, the H200 has roughly 1 GB of headroom, so serving FP16 Llama 2 70B at any meaningful batch size or context length requires KV-cache compression, a quantized model, or a multi-GPU tensor-parallel split.
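The formula above can be wrapped in a small helper for your own sizing checks (Llama 2 70B shape as given in the text; treat the outputs as sizing estimates, not allocator-exact figures):

```python
# KV-cache sizing from the formula above, plus a total-VRAM estimate.
# Defaults are the Llama 2 70B shape: 80 layers, 8 KV heads, head_dim 128.
def kv_cache_gb(seq_len: int, batch: int = 1, layers: int = 80,
                kv_heads: int = 8, head_dim: int = 128,
                bytes_per_el: int = 2) -> float:
    per_request = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_el
    return per_request * batch / 1e9

WEIGHTS_FP16_GB = 140  # ~70B params x 2 bytes

print(f"4K, batch 1:  ~{kv_cache_gb(4096):.1f} GB KV-cache")
print(f"32K, batch 1: ~{kv_cache_gb(32768):.1f} GB KV-cache")
total = WEIGHTS_FP16_GB + kv_cache_gb(4096, batch=32)
print(f"weights + KV at 4K, batch 32: ~{total:.0f} GB")
```

Swap in your own model's layer count, KV-head count, and head dimension to check fit against 80 GB or 141 GB.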
How Batch Size Shifts the Bottleneck
Here's the thing that most articles miss: batch size changes which bottleneck applies. At small batch sizes (batch 1-8), each decode step loads the full model for very little arithmetic work, and the GPU is deeply memory-bandwidth-bound.
As batch size grows, you're doing more arithmetic per weight load. Tokens from all requests in the batch are processed simultaneously on the same weight read. The effective arithmetic intensity rises.
At large enough batch sizes (typically 128+ for 70B models), the workload shifts toward compute-bound territory. At that point, FP8 TFLOPS starts to matter more, and the distinction between H100 and H200 narrows because they share the same compute specs.
The practical implication: if you're running real-time inference with low concurrency (batch 1-16), memory bandwidth is your primary lever. If you're running offline batch jobs with large batch sizes (batch 64+), you'll benefit from both bandwidth and compute, and the H100 becomes more competitive on a $/token basis.
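A toy model of this shift, assuming one weight read is shared across the whole batch and ~2 FLOPs per parameter per token (it ignores KV-cache reads and attention FLOPs, so the exact crossover batch will differ in practice):

```python
# How batch size raises effective arithmetic intensity during decode.
# Simplified model: weights are read once per step and shared by the batch.
PEAK_FP8_TFLOPS = 1979
MEM_BW_TBPS = 3.35  # H100 SXM
ridge = PEAK_FP8_TFLOPS / MEM_BW_TBPS  # FLOPs/byte where the bottleneck flips

def decode_intensity(batch: int, bytes_per_param: int = 1) -> float:
    # ~2 FLOPs per parameter per token; FP8 weights are 1 byte each.
    return 2 * batch / bytes_per_param

for b in (1, 8, 64, 128, 512):
    bound = "compute" if decode_intensity(b) > ridge else "bandwidth"
    print(f"batch {b:>3}: ~{decode_intensity(b):>4.0f} FLOPs/byte -> {bound}-bound")
```

Real serving stacks cross over at different batch sizes than this toy model suggests, because KV-cache traffic grows with batch while weight traffic doesn't; the qualitative trend is what matters.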
Full GPU Spec Comparison: Bandwidth and Throughput
| GPU | VRAM | Memory BW | FP8 TFLOPS | FP16 TFLOPS | MIG | NVLink |
|---|---|---|---|---|---|---|
| H200 SXM | 141 GB HBM3e | 4.8 TB/s | 1,979 | 989 | Up to 7 | 900 GB/s bidir. agg./GPU (HGX/DGX) |
| H100 SXM | 80 GB HBM3 | 3.35 TB/s | 1,979 | 989 | Up to 7 | 900 GB/s bidir. agg./GPU (HGX/DGX) |
| A100 80GB | 80 GB HBM2e | 2.0 TB/s | N/A | 312 | Up to 7 | 600 GB/s |
| L4 | 24 GB GDDR6 | 300 GB/s | 242 | 121 | No | None (PCIe) |
| B200 (est.) | 192 GB HBM3e (est.) | 8.0 TB/s (est.) | ~4,500 (est.) | N/A | TBD | 1,800 GB/s (est.) |
Sources: NVIDIA H100 Tensor Core GPU Datasheet (2023); NVIDIA H200 Tensor Core GPU Product Brief (2024); NVIDIA A100 Tensor Core GPU Datasheet; NVIDIA L4 Tensor Core GPU Datasheet. B200 specs are estimates based on GTC 2024 disclosures.
Decision Tree: Decode-Bound vs. Compute-Bound
Use this to figure out which constraint you're actually facing, and which GPU resolves it.
Step 1: What's your primary batch size in production?
- Batch 1-16 (real-time, low concurrency): You're decode-bound. Memory bandwidth drives your $/token.
- Batch 64+: You're approaching compute-bound territory. Both metrics matter.

Step 2: Does your model fit in 80 GB at your target context length?
- Use the KV-cache formula above. Add model weights + peak KV-cache at max batch size.
- Fits in 80 GB: H100 SXM is your starting point.
- Exceeds 80 GB: H200 SXM or multi-GPU H100 with tensor parallelism.

Step 3: Is your workload context-length sensitive?
- Short context (under 4K), small-medium models: H100 SXM is efficient and cost-effective.
- Long context (16K+) or very high concurrency: H200 SXM's bandwidth and VRAM headroom justify the premium.

Step 4: Are you prefill-heavy (document processing, batch summarization)?
- Prefill is compute-bound. H100 and H200 share identical compute specs, so H100 SXM wins on $/TFLOPS.
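Steps 2-4 of the tree can be sketched as a function (Step 1's batch size tells you which lever dominates, rather than which GPU to pick; the thresholds are this article's rules of thumb, and `pick_gpu` is an illustrative name, not a library API):

```python
# The decision tree above as a function. Thresholds mirror Steps 2-4.
def pick_gpu(weights_gb: float, peak_kv_gb: float,
             max_context: int, prefill_heavy: bool) -> str:
    if prefill_heavy:                       # Step 4: prefill is compute-bound
        return "H100 SXM"                   # same TFLOPS, lower $/TFLOPS
    if weights_gb + peak_kv_gb > 80:        # Step 2: VRAM fit at peak load
        return "H200 SXM or multi-GPU H100"
    if max_context >= 16384:                # Step 3: long-context workloads
        return "H200 SXM"
    return "H100 SXM"                       # fits in 80 GB, short context

# A 7B-class FP16 model (~14 GB weights) serving 4K contexts:
print(pick_gpu(weights_gb=14, peak_kv_gb=4, max_context=4096,
               prefill_heavy=False))
```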
Putting It Together: H100 vs. H200 at a Glance
| Workload Type | Recommended GPU | Key Reason |
|---|---|---|
| Real-time chat, low concurrency | H200 SXM | Bandwidth drives decode speed |
| Long context (16K-128K) | H200 SXM | VRAM + bandwidth headroom |
| Batch document processing | H100 SXM | Prefill is compute-bound |
| High-batch offline inference | H100 SXM | Compute bottleneck, identical specs |
| Model exceeds 80 GB (FP16) | H200 SXM | Only option without multi-GPU split |
| Budget-constrained, 70B models | A100 80GB | Lower bandwidth, but fits the model |
Infrastructure That Maps to This Analysis
GMI Cloud's H100 and H200 SXM GPU instances are configured in 8-GPU nodes with NVLink 4.0 (900 GB/s bidirectional aggregate per GPU, HGX/DGX platforms) and 3.2 Tbps InfiniBand for multi-node jobs. Pre-installed software includes TensorRT-LLM, vLLM, and Triton Inference Server.
H100 SXM runs at approximately $2.00/GPU-hour and H200 SXM at approximately $2.60/GPU-hour. Check gmicloud.ai/pricing for current rates on both on-demand and reserved configurations.
Frequently Asked Questions
Q: What is arithmetic intensity, and why does it determine my bottleneck? A: Arithmetic intensity is FLOPS performed per byte read from memory. Every GPU has a ridge point: above it, you're compute-bound; below it, you're memory-bandwidth-bound.
For LLM decode at low batch sizes, arithmetic intensity falls well below the ridge point for all modern data center GPUs, so bandwidth dominates.
Q: How do I calculate whether I'm compute-bound or memory-bandwidth-bound? A: Profile your serving workload with Nsight Compute (the successor to the deprecated nvprof) and compare achieved FLOPS and memory throughput against the GPU's peaks. If memory bandwidth utilization is at or near max while compute sits well below peak, you're bandwidth-bound.
If SM utilization is near peak while memory bandwidth is well below max, you're compute-bound.
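One way to turn those profiler readings into a verdict is a simple utilization comparison; `classify` and the 0.8 threshold here are illustrative choices, not a standard API:

```python
# Classify a kernel from measured counters (e.g., exported from Nsight
# Compute). The 0.8 "near peak" threshold is an arbitrary rule of thumb.
def classify(achieved_tflops: float, achieved_bw_tbps: float,
             peak_tflops: float, peak_bw_tbps: float) -> str:
    if achieved_bw_tbps / peak_bw_tbps > 0.8:
        return "bandwidth-bound"
    if achieved_tflops / peak_tflops > 0.8:
        return "compute-bound"
    return "neither bound (latency or launch overhead?)"

# H100 SXM peaks from this article; measured values are made-up examples:
print(classify(100, 3.2, 1979, 3.35))   # decode-like profile
print(classify(1800, 1.0, 1979, 3.35))  # prefill-like profile
```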
Q: Does KV-cache quantization help with VRAM pressure? A: Yes. Quantizing KV-cache from FP16 to FP8 halves the VRAM requirement, and INT4 quantization reduces it by 4x. The tradeoff is potential accuracy degradation at very long contexts.
KV-cache quantization is a practical tool for extending H100 viability to longer context workloads.
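To put numbers on that tradeoff, here is the KV-cache formula from earlier evaluated at 32K context and batch 16 for each element width (`kv_gb` is an illustrative helper; 0.5 bytes stands in for INT4):

```python
# VRAM saved by quantizing KV-cache, per the FAQ answer above.
# Llama 2 70B shape (80 layers, 8 KV heads, head_dim 128) at 32K, batch 16.
def kv_gb(bytes_per_el: float, seq_len: int = 32768, batch: int = 16,
          layers: int = 80, kv_heads: int = 8, head_dim: int = 128) -> float:
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_el * batch / 1e9

fp16 = kv_gb(2)    # baseline
fp8  = kv_gb(1)    # half the footprint
int4 = kv_gb(0.5)  # a quarter of the footprint
print(f"FP16 {fp16:.0f} GB -> FP8 {fp8:.0f} GB -> INT4 {int4:.0f} GB")
```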
Q: What happens to decode performance as context length grows? A: KV-cache size grows linearly with context length. At very long contexts, VRAM fills up faster, you can support fewer concurrent requests, and effective throughput drops.
Memory bandwidth is still the primary bottleneck, but VRAM capacity becomes the secondary constraint that caps your maximum batch size.
Q: Why does the H200 outperform the H100 if they have the same TFLOPS? A: Because LLM decode is memory-bandwidth-bound at typical batch sizes. Both GPUs perform the same FLOPs per forward pass. The H200 completes each forward pass faster because it reads weights from VRAM at 4.8 TB/s versus 3.35 TB/s.
More bandwidth equals more tokens per second.
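The same point as arithmetic: each token's latency is floored by the time to stream the weights from VRAM. A back-of-envelope sketch using this article's numbers, not a measured benchmark:

```python
# Per-token latency floor for batch-1 decode: weight bytes / bandwidth.
WEIGHT_BYTES = 70e9 * 2  # Llama 2 70B in FP16

latency_ms = {
    name: WEIGHT_BYTES / (bw_tbps * 1e12) * 1e3
    for name, bw_tbps in [("H100 SXM", 3.35), ("H200 SXM", 4.8)]
}
for name, ms in latency_ms.items():
    print(f"{name}: >= {ms:.1f} ms per token (weight-streaming floor)")
```

The ratio of the two floors is ~1.43, which is exactly the bandwidth ratio and why identical TFLOPS produce different decode speeds.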
Q: Is the B200 worth waiting for? A: B200 specifications are estimates based on GTC 2024 disclosures. If confirmed, its 8.0 TB/s (est.) bandwidth would nearly double the H200's throughput for decode-bound workloads. For current production needs, H100 and H200 are the right hardware.
Add B200 to your 2026 roadmap once production availability and confirmed specs are published.
Colin Mo
Build AI Without Limits
GMI Cloud helps you architect, deploy, optimize, and scale your AI strategies
