Picking a GPU for LLM inference isn't about who has the highest TFLOPS. It's about which card hits your SLA targets (time-to-first-token, inter-token latency), handles your concurrency, fits your context window, and keeps cost per thousand tokens within budget.
For most production teams in 2025, that choice lands on one of three NVIDIA data center GPUs: H100 for cost-efficient 7B–34B serving, H200 for 70B+ and long-context workloads, or B200 for next-generation density.
This guide covers NVIDIA SXM GPUs only. AMD MI300X, Google TPUs, and AWS Trainium follow different selection criteria and aren't covered here.
Three-Minute Workload Triage
Before you touch a spec sheet, answer three questions: how big is your model, how long is your context, and what's your latency target? The table below sorts you into the right GPU tier.
| Dimension | Tier A: H100 | Tier B: H200 | Tier C: B200 |
| --- | --- | --- | --- |
| Model size | 7B–34B (FP8/INT8) | 34B–70B+ (FP8) | 100B–200B+ |
| Context window | ≤8K typical | 32K–128K | 128K+ |
| Dominant phase | Prefill (batched) | Decode (streaming) | Both at scale |
| P95 TTFT target | <200 ms OK | <100 ms required | <50 ms at 128K+ |
| Concurrency | Medium–high | High | High at extreme scale |
| Budget priority | Cost per token | Throughput per GPU | Future-proofing |
Rule of thumb: if your answers span two tiers, lean toward the higher-memory option. OOM in production is far more expensive than a slightly over-provisioned GPU.
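The triage table and the rule of thumb can be sketched as a quick lookup. This is a rough heuristic mirroring the table above, not a sizing tool; the function name and thresholds are illustrative:

```python
def triage_gpu_tier(model_params_b: float, context_k: int, p95_ttft_ms: float) -> str:
    """Rough GPU-tier triage mirroring the table above.

    Scores each dimension into a tier (A/B/C) and, per the rule of
    thumb, resolves splits toward the higher-memory option.
    """
    tiers = []
    # Model size: 7B-34B -> A, 34B-70B -> B, 100B+ -> C
    tiers.append("A" if model_params_b <= 34 else "B" if model_params_b <= 70 else "C")
    # Context window (K tokens): <=8K -> A, up to 128K -> B, beyond -> C
    tiers.append("A" if context_k <= 8 else "B" if context_k <= 128 else "C")
    # Latency: <200 ms OK -> A, <100 ms required -> B, <50 ms -> C
    tiers.append("A" if p95_ttft_ms >= 200 else "B" if p95_ttft_ms >= 100 else "C")
    # Answers spanning tiers lean toward the higher-memory tier.
    best = max(tiers)  # "C" > "B" > "A" lexicographically
    return {"A": "H100", "B": "H200", "C": "B200"}[best]
```

For example, a 70B model at 32K context with a 100 ms TTFT target lands on H200; bump any single dimension into Tier C and the heuristic recommends B200.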
What You're Actually Buying: Memory and Bandwidth
LLM inference has two phases. Prefill processes your input prompt in parallel and is mostly compute-bound (TFLOPS matter). Decode generates tokens one at a time and is mostly memory-bandwidth-bound (TB/s matters). For interactive workloads like chatbots and code assistants, decode dominates wall-clock time.
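Because decode streams the model weights from HBM for every generated token, memory bandwidth sets a hard per-stream ceiling. A back-of-envelope sketch (it deliberately ignores KV-cache reads, batching, and kernel efficiency, so real per-stream numbers are lower and aggregate throughput with batching is much higher):

```python
def decode_tokens_per_sec_ceiling(mem_bw_tbs: float, weight_gb: float) -> float:
    """Per-stream decode ceiling: each generated token must stream at
    least the full weights from HBM, so the upper bound is
    bandwidth / bytes-per-token. TB/s is converted to GB/s."""
    return (mem_bw_tbs * 1000) / weight_gb

# H200 (4.8 TB/s) serving a 70B model at FP8 (~70 GB of weights):
# ceiling is ~68 tokens/s per stream before any other overhead.
ceiling = decode_tokens_per_sec_ceiling(4.8, 70)
```

This is why the bandwidth column on a spec sheet predicts inter-token latency better than the TFLOPS column does.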
The hidden cost is KV-cache. Every concurrent request allocates a key-value cache that grows linearly with context length and model depth. For Llama 2 70B (80 layers, 8 GQA KV heads, head dim 128) with FP16 KV at 4K context, that's ~1.3 GB per request. At 32 concurrent users that's ~43 GB. Scale to 32K context and a single request needs ~10.7 GB, so even ten concurrent long-context requests push KV-cache alone past 100 GB. It's KV-cache, not model weights, that causes most production OOM events.
VRAM Budget: One-Minute Math
Here's the formula that matters:
Total VRAM = weights + KV-cache + framework overhead + 15–25% safety margin
Weights are fixed at your chosen precision (70B model: ~140 GB at FP16, ~70 GB at FP8, ~35 GB at INT4). KV-cache scales with concurrency and context length using this formula: KV per request ≈ 2 × layers × kv_heads × head_dim × seq_len × bytes_per_element. Framework overhead (CUDA context, PagedAttention metadata, fragmentation) typically runs 2–5 GB.
| Llama 2 70B, FP8 weights, FP16 KV | H100 (80 GB) | H200 (141 GB) |
| --- | --- | --- |
| Weights (~70 GB) + KV-cache (32 conc., 4K ctx, ~43 GB) + overhead (~3 GB) | ~116 GB: OOM | ~116 GB: ~25 GB free |
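The one-minute math above can be scripted. A minimal sketch, assuming the Llama 2 70B configuration (80 layers, 8 GQA KV heads, head dim 128); the function names are illustrative:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, kv_bytes: int = 2) -> float:
    """KV-cache per request (GB): 2 (K and V) x layers x kv_heads
    x head_dim x seq_len x bytes per element."""
    return 2 * layers * kv_heads * head_dim * seq_len * kv_bytes / 1e9

def total_vram_gb(weights_gb: float, kv_per_req_gb: float, concurrency: int,
                  overhead_gb: float = 3.0, margin: float = 0.15) -> float:
    """Total VRAM budget including the 15% safety margin."""
    return (weights_gb + kv_per_req_gb * concurrency + overhead_gb) * (1 + margin)

# Llama 2 70B, FP16 KV at 4K context: ~1.34 GB per request.
kv = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=4096)
# FP8 weights (~70 GB), 32 concurrent requests: ~133 GB with margin --
# over an H100's 80 GB, inside an H200's 141 GB.
budget = total_vram_gb(weights_gb=70, kv_per_req_gb=kv, concurrency=32)
```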
Two decision rules follow from this math. First, if the total fits in one GPU with 15%+ headroom, stay single-GPU. It's the lowest latency and simplest ops.
Second, if it doesn't fit, quantize harder before adding GPUs. Tensor parallelism across two cards works, but you're paying an interconnect tax on every forward pass.
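That decision rule can be sketched as a small heuristic. The precision factors and headroom here are illustrative assumptions (relative to FP16 weight size), not a sizing tool, and quality impact grows as you walk down the precision list:

```python
def plan_deployment(weights_fp16_gb: float, kv_gb: float, gpu_vram_gb: float,
                    overhead_gb: float = 3.0, headroom: float = 0.15):
    """Try progressively harder quantization before reaching for tensor
    parallelism. Returns (precision, gpu_count)."""
    usable = gpu_vram_gb * (1 - headroom)  # keep 15% free per the rule above
    for precision, factor in [("fp16", 1.0), ("fp8", 0.5), ("int4", 0.25)]:
        if weights_fp16_gb * factor + kv_gb + overhead_gb <= usable:
            return (precision, 1)  # fits on a single GPU at this precision
    return ("fp8", 2)              # fall back to 2-way TP, paying interconnect tax
```

For a 70B model (~140 GB at FP16) with ~43 GB of KV-cache, this lands on single-GPU FP8 on an H200 but forces 2-way tensor parallelism on an 80 GB H100.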
How to Choose: H100, H200, or B200
H100 SXM: The Default Production Choice
| Spec | H100 SXM |
| --- | --- |
| VRAM | 80 GB HBM3 |
| Memory BW | 3.35 TB/s |
| FP8 | 1,979 TFLOPS (with sparsity) |
| NVLink 4.0 | 900 GB/s bidir. aggregate per GPU (HGX/DGX) |
| TDP | 700 W |
| MIG | Up to 7 instances |
The H100 is today's production standard for a reason. It has the most mature software ecosystem (TensorRT-LLM, vLLM, Triton, every major quantization toolkit), the widest cloud availability with competitive spot pricing, and MIG support for multi-tenant serving of smaller models.
In MLPerf Inference v4.0, the first round to include Llama 2 70B, H100-based systems were the most widely submitted data center platform (mlcommons.org/benchmarks/inference-datacenter).
Where it fits: 7B–34B models at production scale, batched prefill and embedding workloads, and any deployment where cost per token is the primary constraint. Also works for 70B FP8 at low concurrency if you can tolerate tight VRAM margins.
Where it struggles: 70B+ with high concurrency and long context. The 80 GB ceiling means KV-cache fills fast, forcing either 2-way TP (doubling your interconnect overhead) or aggressive quantization that may hurt quality.
H200 SXM: Buying Back VRAM and Bandwidth
| Spec | H200 SXM |
| --- | --- |
| VRAM | 141 GB HBM3e (+76% vs H100) |
| Memory BW | 4.8 TB/s (+43% vs H100) |
| FP8 | 1,979 TFLOPS (same compute) |
| NVLink 4.0 | 900 GB/s bidir. aggregate per GPU (HGX/DGX) |
| TDP | 700 W (same power) |
| Software | Identical CUDA stack, drop-in replacement |
The H200 isn't a compute upgrade. It's a memory and bandwidth upgrade. The 141 GB of HBM3e means you can serve 70B FP8 on a single GPU with 55+ GB left for KV-cache and concurrency headroom. The 4.8 TB/s bandwidth directly accelerates the decode phase that determines your inter-token latency.
Per NVIDIA's official H200 product brief (2024), H200 achieves up to 1.9x inference speedup on Llama 2 70B vs. H100 (TensorRT-LLM, FP8, batch 64, 128/2048 tokens). Independent cloud provider tests confirm 1.4–1.6x gains under production loads.
Where it fits: 70B+ models, 32K–128K context windows, decode-dominant workloads (chatbots, code assistants), and any deployment where P95 TTFT or inter-token latency is SLA-critical. Also valuable when it replaces 2-way TP on H100: one H200 vs. two H100s means less hardware, less interconnect overhead, and simpler ops.
Where to be cautious: higher per-unit cost (roughly ~$2.50 vs. ~$2.10/GPU-hour on GMI Cloud; check gmicloud.ai/pricing for current rates) and tighter supply in early-mid 2025. If your models already fit in 80 GB and your bottleneck is compute rather than memory, the extra spend doesn't buy you much.
B200: Next-Generation Density (Plan, Don't Rush)
| Spec | B200 |
| --- | --- |
| VRAM | 192 GB HBM3e |
| Memory BW | 8.0 TB/s |
| FP8 | ~4,500 TFLOPS (est.*) |
| FP4 | ~9,000 TFLOPS (est.*) |
| NVLink 5.0 | 1,800 GB/s |
| TDP | 1,000 W |
* Estimates based on NVIDIA GTC 2024 disclosures. Final production numbers may differ. We'll update when MLPerf or independent benchmarks land.
Blackwell's specs make single-GPU serving of 100B+ models realistic. NVLink 5.0 doubles interconnect bandwidth for multi-GPU scaling, and native FP4 support opens extreme quantization for throughput-first use cases.
But it's early: TensorRT-LLM and vLLM kernel maturity is still catching up, supply is limited, and 1,000W TDP requires upgraded power and cooling infrastructure.
Where it fits: teams with a 2025–2026 model roadmap above 100B parameters or 128K+ context, and the appetite to absorb early-adopter risk. If you're building infrastructure today for models you'll deploy in 12 months, B200 deserves evaluation.
What about A100, L4, or consumer GPUs? A100 80GB and L4 remain viable for smaller models (7B–13B) or existing fleets.
Consumer cards (RTX 4090/5090) are fine for development but carry data center compliance risks under NVIDIA's GeForce EULA (nvidia.com/en-us/drivers/geforce-license). Neither is recommended for new production inference builds.
Cost per Thousand Tokens: The Number That Matters
Here's how to calculate the metric your CFO actually cares about:
Cost / 1K tokens = (GPU-hour price ÷ tokens_per_second ÷ 3,600) × 1,000
If an H100 at $2.10/hour generates 800 tokens/s on your model, that's $0.00073 per 1K tokens. An H200 at $2.50/hour generating 1,400 tokens/s (thanks to higher bandwidth) costs $0.00050 per 1K tokens. The more expensive GPU is actually 32% cheaper per token.
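Scripted with the same numbers, so you can swap in your own measured throughput and current hourly rates:

```python
def cost_per_1k_tokens(gpu_hour_usd: float, tokens_per_sec: float) -> float:
    """Cost / 1K tokens = (GPU-hour price / tokens_per_second / 3,600) x 1,000."""
    return gpu_hour_usd / tokens_per_sec / 3600 * 1000

h100 = cost_per_1k_tokens(2.10, 800)    # ~$0.00073 per 1K tokens
h200 = cost_per_1k_tokens(2.50, 1400)   # ~$0.00050 per 1K tokens
savings_pct = (1 - h200 / h100) * 100   # ~32% cheaper per token
```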
Three levers to push that number down:
• Quantize: FP16 to FP8 nearly doubles decode throughput with <1% quality loss on most tasks. INT4 (GPTQ/AWQ) pushes further for throughput-first workloads.
• Continuous batching: dynamic request grouping with PagedAttention keeps GPU utilization high and prevents KV-cache fragmentation.
• Right-size: don't put a 7B model on an H200. Match GPU tier to model size.
Deployment: Five Things to Get Right
• Measure what matters: track P95 TTFT, P95 inter-token latency, and OOM rate, not just raw tokens/s.
• Single GPU first: it's the lowest latency and simplest to operate. Only split when VRAM math forces you.
• When you split, stay on one node: NVLink TP within an 8-GPU node adds minimal latency. Cross-node TP over InfiniBand is a last resort.
• Lock your versions: driver, CUDA, and inference engine version mismatches are the top cause of "works in staging, dies in prod."
• Pre-configured beats DIY: platforms like GMI Cloud (gmicloud.ai) ship H100/H200 nodes with CUDA 12.x, TensorRT-LLM, vLLM, Triton, and tuned NCCL pre-installed. That's days of setup you skip.
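For the first bullet, you don't need a metrics stack to start: a dependency-free nearest-rank percentile over your latency samples is enough to track P95 TTFT and P95 inter-token latency from day one. A minimal sketch:

```python
import math

def p95(samples: list[float]) -> float:
    """P95 via the nearest-rank method: the value at ordinal rank
    ceil(0.95 * n) in the sorted sample list."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)  # 0-indexed
    return ordered[rank]

# Feed it per-request TTFT (ms) and inter-token gaps (ms) separately;
# alert on the P95s, not the mean.
```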
Conclusion
H100: the optimal balance for most 7B–34B production inference. Mature ecosystem, widest availability, lowest cost per token when the model fits in 80 GB.
H200: the better answer when you're running 70B+ models, long context, high concurrency, or tight latency SLAs. The extra VRAM and bandwidth often pay for themselves through fewer GPUs and simpler operations.
B200: a forward-looking choice for teams planning around 100B+ models and 128K+ context. Strong specs on paper, but accept the early-adopter tradeoffs in ecosystem maturity and supply.
Explore H100 and H200 instances built for production inference at gmicloud.ai.