
Top GPUs for LLM Text Inference: Why Memory Bandwidth Decides Everything

April 27, 2026

When shopping for GPUs to serve a 70B-parameter LLM, the conversation usually starts with TFLOPS and VRAM. But the number that actually determines tokens per second is memory bandwidth, and it's the third line on most spec sheets. LLM text inference during the decode phase is memory-bound, not compute-bound. Every generated token requires reading the model weights and the full KV-cache from VRAM, and bandwidth limits how fast that read happens. Optimizing for the right spec means faster responses for your users and lower cost per token for your business. This article covers:

  • The two inference phases (prefill vs decode) and why they stress GPUs in opposite ways
  • KV-cache math: how to calculate your memory bandwidth requirement
  • GPU rankings by effective decode throughput, from H200 down to L4

Two Phases, Two Different Bottlenecks

LLM inference has two distinct phases, and they stress GPUs in opposite ways; confusing them leads to choosing the wrong GPU. The prefill phase is compute-heavy. The decode phase is bandwidth-heavy. Since decode runs once per output token (sometimes hundreds or thousands of them), it dominates total inference time for most applications.

Prefill: The Compute-Bound Phase

Prefill processes all input tokens simultaneously:

  • What happens: The model reads the entire input prompt and processes it through all layers in one forward pass. This involves dense matrix multiplications across every attention head and feed-forward layer.

  • Why it's compute-bound: All input tokens process in parallel. The GPU's TFLOPS determine how fast this completes. More TFLOPS = faster prefill.

  • Duration: For a 2,000-token input on a 70B model, prefill takes 50-200ms depending on GPU and optimization. It's a one-time cost per request.

  • H100 vs H200: Both deliver identical 1,979 TFLOPS (FP8). Prefill speed is the same on both GPUs. Bandwidth advantage doesn't help here.
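
To make the compute-bound claim concrete, here is a minimal back-of-envelope sketch in Python, using the common approximation of ~2 FLOPs per parameter per input token for a forward pass. The inputs match the example above; the 100%-of-peak result is an idealized floor, not a measurement:

```python
# Prefill is compute-bound: peak TFLOPS, not bandwidth, sets the latency floor.
# Uses the standard ~2 FLOPs per parameter per token forward-pass approximation.

PARAMS = 70e9              # 70B-parameter model
INPUT_TOKENS = 2_000       # prompt length from the example above
PEAK_FP8_FLOPS = 1979e12   # H100 SXM / H200, dense FP8 -- identical on both

flops = 2 * PARAMS * INPUT_TOKENS        # ~2.8e14 FLOPs for the whole prompt
ideal_ms = flops / PEAK_FP8_FLOPS * 1e3  # ~141 ms at 100% of peak

print(f"Idealized single-GPU prefill: {ideal_ms:.0f} ms")
# Real stacks run below peak, and tensor parallelism across GPUs divides the
# work further, which is how optimized deployments reach the 50-200ms range.
```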

Decode: The Bandwidth-Bound Phase That Decides Everything

Decode generates output tokens one at a time:

  • What happens: Each token requires reading the entire KV-cache from VRAM. The KV-cache stores the key and value tensors for all previous tokens across all layers. As the sequence grows, the KV-cache grows, and each token read gets larger.

  • KV-cache formula: KV per request = 2 x num_layers x num_kv_heads x head_dim x seq_len x bytes_per_element. For Llama 2 70B (80 layers, 8 KV heads, 128 head_dim) at FP16 with 4K context: 2 x 80 x 8 x 128 x 4096 x 2 bytes = approximately 1.3 GB per request (see the sketch after this list).

  • Why bandwidth is the bottleneck: Each output token requires reading the model weights (~70 GB in FP8 for a 70B model, at one byte per parameter) plus the KV-cache from VRAM. At H100's 3.35 TB/s, reading 70 GB takes approximately 21ms. At H200's 4.8 TB/s, it takes approximately 15ms. That 6ms difference per token multiplies across hundreds of output tokens.

  • NVIDIA's benchmark: H200 delivers up to 1.9x inference speedup over H100 on Llama 2 70B (tested with TensorRT-LLM, FP8, batch 64, 128/2048 tokens). This speedup comes almost entirely from the bandwidth advantage during decode, not from additional compute.
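
A minimal sketch that wires the KV-cache formula and the bandwidth arithmetic above together. This is batch size 1, so the weight read is charged to a single request; NVIDIA's batch-64 benchmark amortizes that read across 64 requests, which is where the larger absolute throughput comes from:

```python
# Per-token VRAM traffic during decode for Llama 2 70B
# (80 layers, 8 KV heads via GQA, head_dim 128), batch size 1.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem):
    # The factor of 2 covers the separate key and value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

GB = 1e9
kv = kv_cache_bytes(80, 8, 128, 4096, 2)  # FP16 at 4K context: ~1.3 GB
weights = 70 * GB                          # 70B params at FP8, 1 byte each

for name, bandwidth in [("H100", 3.35e12), ("H200", 4.8e12)]:
    # Every decode step streams the weights plus the KV-cache from VRAM.
    ms_per_token = (weights + kv) / bandwidth * 1e3
    print(f"{name}: {ms_per_token:.1f} ms/token, "
          f"~{1e3 / ms_per_token:.0f} tokens/s upper bound")
```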

Two Optimizations That Amplify Bandwidth Advantage

Hardware bandwidth sets the ceiling. Software optimizations push you closer to it:

  • FP8 quantization halves KV-cache memory usage. At FP8 instead of FP16, the KV-cache for Llama 70B at 4K context drops from ~1.3 GB to ~0.7 GB per request. This means more concurrent requests fit in VRAM, and each token read transfers fewer bytes, directly improving tokens/sec. FP8 also halves weight reads (~70 GB vs ~140 GB), further increasing effective bandwidth utilization.

  • Speculative decoding generates multiple tokens per decode step. An 8B draft model predicts several tokens ahead, and the 70B main model verifies them in one forward pass. When predictions are correct (70-85% of the time), multiple tokens confirm per step instead of one. This effectively multiplies tokens/sec by 2-3x without additional bandwidth.

  • Combined effect: FP8 + speculative decoding on H200 can deliver 4-6x the tokens/sec of an unoptimized H100 deployment. The bandwidth advantage and software optimizations multiply, not just add.
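
A toy throughput model showing how the two optimizations compound with the bandwidth advantage. The 2x speculative-decoding multiplier is an assumed value at the low end of the 2-3x range above, and the byte counts are the illustrative figures from this article, so treat the output as a rough sanity check rather than a benchmark:

```python
# Toy model: bandwidth-bound decode throughput with and without FP8 and
# speculative decoding. All inputs are illustrative, not measured.

def tokens_per_sec(bandwidth, weight_bytes, kv_bytes, spec_multiplier=1.0):
    # Upper bound: bandwidth divided by bytes streamed per decode step,
    # scaled by the average tokens accepted per speculative step.
    return bandwidth / (weight_bytes + kv_bytes) * spec_multiplier

GB = 1e9
baseline = tokens_per_sec(3.35e12, 140 * GB, 1.3 * GB)  # H100, FP16, untuned
tuned = tokens_per_sec(4.8e12, 70 * GB, 0.7 * GB,       # H200, FP8 weights+KV,
                       spec_multiplier=2.0)              # assumed 2x acceptance

print(f"Unoptimized H100 (FP16):         {baseline:5.1f} tokens/s")
print(f"H200 + FP8 + speculative decode: {tuned:5.1f} tokens/s")
print(f"Combined speedup:                {tuned / baseline:.1f}x")
```

With these assumed inputs the model lands at roughly 5.7x, inside the 4-6x range cited above.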

GPU Rankings for LLM Text Inference

Ranked by effective decode throughput (what determines user-perceived speed):

  • H200 SXM (141 GB HBM3e, 4.8 TB/s, from $2.60/hr): Best for 70B+ models with long context. The 4.8 TB/s bandwidth directly translates to highest tokens/sec during decode. 141 GB VRAM accommodates FP16 models or FP8 models with massive KV-cache budgets (32K-128K context windows).

  • H100 SXM (80 GB HBM3, 3.35 TB/s, from $2.00/hr): Best for 70B models in FP8 with moderate context lengths (up to 8K). 80 GB VRAM fits Llama 70B FP8 weights (~70 GB) with a modest KV-cache budget. Lower hourly cost makes it the better value when bandwidth isn't the binding constraint.

  • A100 80GB (80 GB HBM2e, 2.0 TB/s): Viable for 7B-34B models or batch workloads where latency isn't critical. No FP8 support means running at FP16, which doubles both VRAM usage and the bytes streamed per token, halving decode throughput. Legacy option for teams with existing Ampere infrastructure.

  • L4 (24 GB GDDR6, 300 GB/s): Development and testing only for LLM inference. 24 GB fits 7B models in INT4/INT8 quantization. 300 GB/s bandwidth limits decode speed severely. Not suitable for production LLM serving.
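
To see how the rankings fall out of the bandwidth numbers alone, here is a rough sketch of the theoretical batch-1 decode ceiling for each card. Model sizes and precisions are chosen to fit each card per the entries above, and spec-sheet bandwidth is never fully achieved in practice, so treat these as ceilings rather than expected throughput:

```python
# Theoretical batch-1 decode ceilings implied by spec-sheet bandwidth:
# tokens/s <= bandwidth / bytes streamed per token (weights dominate).

GB = 1e9
gpus = [
    # (name, bandwidth in B/s, model + precision that fits, weight bytes)
    ("H200 SXM",  4.8e12,  "70B FP8",  70 * GB),
    ("H100 SXM",  3.35e12, "70B FP8",  70 * GB),
    ("A100 80GB", 2.0e12,  "34B FP16", 68 * GB),
    ("L4",        300e9,   "7B INT8",   7 * GB),
]

for name, bw, model, weights in gpus:
    print(f"{name:10s} running {model:9s}: {bw / weights:5.1f} tokens/s ceiling")
# The L4's small-model ceiling looks respectable at batch 1, but 300 GB/s
# leaves no headroom for batching or longer contexts -- hence dev/test only.
```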

Bandwidth-Optimized LLM Inference Infrastructure

GMI Cloud offers H200 SXM (4.8 TB/s, 141 GB) from $2.60/GPU-hour and H100 SXM (3.35 TB/s, 80 GB) from $2.00/GPU-hour, pre-configured with TensorRT-LLM and vLLM for FP8 quantization and speculative decoding out of the box. Nodes include 8 GPUs with NVLink 4.0 (900 GB/s bidirectional aggregate per GPU on HGX/DGX platforms) and 3.2 Tbps InfiniBand for multi-GPU inference on models exceeding single-GPU VRAM. The unified MaaS model library offers 45+ pre-deployed LLMs for teams that prefer per-request pricing without managing bandwidth optimization. GMI Cloud is an NVIDIA Preferred Partner built on the NVIDIA Reference Platform Cloud Architecture. Check gmicloud.ai/pricing for current rates.

Colin Mo
