You're deploying an LLM into production. The model works in your notebook, and now you need to figure out which GPU actually runs it at target latency, target concurrency, and a cost your CFO won't reject. That's what this guide is for.
We cover five NVIDIA data center GPUs, leading with the two that dominate production today: H100 SXM and H200 SXM. We then cover A100 80GB, L4, and B200 as alternatives at different price points. For each card, you'll get the key specs, what workload it fits, and the tradeoffs you should know about.
We also include a cost-performance decision framework with real $/GPU-hour anchors, a VRAM budget breakdown that accounts for KV-cache (not just weights), and how GMI Cloud delivers this infrastructure on demand. (This guide focuses on NVIDIA GPUs. AMD MI300X, Google TPUs, and AWS Trainium follow different selection criteria and aren't covered here.)
Head-to-Head: Key Specs at a Glance
| Spec | H100 SXM | H200 SXM | A100 80GB | L4 | B200 |
|---|---|---|---|---|---|
| VRAM | 80 GB HBM3 | 141 GB HBM3e | 80 GB HBM2e | 24 GB GDDR6 | 192 GB HBM3e |
| BW (TB/s) | 3.35 | 4.8 | 2.0 | 0.3 | 8.0 |
| FP8 (TFLOPS) | 1,979 | 1,979 | N/A | 242 | ~4,500* |
| INT8 (TOPS) | 3,958 | 3,958 | 624 | 485 | ~9,000* |
| TDP | 700W | 700W | 400W | 72W | 1,000W |
| NVLink | 900 GB/s¹ | 900 GB/s¹ | 600 GB/s | No (PCIe) | 1,800 GB/s |
| MIG | Yes (7) | Yes (7) | Yes (7) | No | TBD |
* B200 figures are estimates from NVIDIA GTC 2024; final production numbers may differ. ¹ NVLink bandwidth is bidirectional aggregate per GPU on HGX/DGX platforms. Sources: NVIDIA L4 Datasheet, A100 Datasheet, H100 Datasheet, H200 Product Brief (2024).
The Five GPUs, Ranked by Use Case
1. NVIDIA H100 SXM: The Production Standard
The H100 is where production inference lives today. FP8 Transformer Engine nearly doubles throughput over FP16. MIG lets you partition one GPU into up to 7 instances for multi-tenant serving.
The software ecosystem is the most mature in the industry: full support across TensorRT-LLM, vLLM, Triton, and every major quantization toolkit (GPTQ, AWQ, SmoothQuant). In MLPerf Inference, H100-based systems were the most widely submitted data center platform for GPT-J (v3.1) and Llama 2 70B (v4.0) (see mlcommons.org/benchmarks/inference-datacenter).
80 GB HBM3 at 3.35 TB/s is sufficient for 7B–34B models with generous batch headroom. For 70B models at FP8, it's tight: you'll likely need 2-way tensor parallelism or aggressive quantization once KV-cache scales with concurrency.
Best for: 7B–70B models (with quantization), latency-sensitive online serving, multi-tenant deployments via MIG. The safest choice for most production inference today.
2. NVIDIA H200 SXM: The Memory-Bandwidth Upgrade
Same Hopper architecture as H100, but 141 GB HBM3e at 4.8 TB/s. The extra VRAM and bandwidth are specifically designed for the decode bottleneck in large-model inference. Per NVIDIA's official H200 product brief (2024), H200 achieves up to 1.9x speedup on Llama 2 70B vs. H100 (TensorRT-LLM, FP8, batch 64, 128/2048 tokens). Independent cloud provider tests confirm 1.4–1.6x gains under production loads.
Zero migration cost from H100: identical CUDA stack, same NVLink topology, drop-in replacement. What previously required 2-way TP on H100 often fits on a single H200, eliminating inter-GPU overhead.
Best for: 70B+ models, 32K–128K context windows, decode-dominant workloads (chatbots, code assistants), P95 latency-critical SLAs.
3. NVIDIA A100 80GB: The Proven Mid-Tier
The A100 is one generation behind Hopper but still runs a huge share of production inference worldwide. 80 GB HBM2e at 2.0 TB/s handles 7B–34B models comfortably. It lacks native FP8 support (you'll use INT8 or FP16 instead), and the lower memory bandwidth makes it noticeably slower than H100 on decode-heavy workloads.
That said, A100 availability is excellent, pricing is often 30–40% below H100, and MIG support (up to 7 instances) makes it viable for multi-tenant serving of smaller models.
Best for: 7B–34B models, teams with existing A100 inventory, workloads where cost/token matters more than absolute latency.
4. NVIDIA L4: The Budget Inference Card
The L4 is a low-power PCIe card designed specifically for inference. At 72W TDP and 24 GB GDDR6, it's built for 7B models (INT8/INT4) in cost-sensitive or edge-adjacent deployments. It doesn't have NVLink, so multi-GPU scaling is limited. But if you're serving a quantized 7B model at moderate concurrency, the L4's cost per token is hard to beat.
Best for: 7B INT8/INT4 models, low-concurrency online serving, batch/offline inference, edge or power-constrained environments.
Why not RTX 4090/5090? Consumer cards offer strong raw performance at 24 GB, but lack ECC memory, NVLink, and MIG. NVIDIA's GeForce EULA also contains data center use restrictions (see nvidia.com/en-us/drivers/geforce-license). They're fine for development. Don't build production on them.
5. NVIDIA B200: The Next-Generation Bet
Blackwell architecture. 192 GB HBM3e at 8.0 TB/s with native FP4 support. NVLink 5.0 doubles interconnect bandwidth to 1,800 GB/s.
These specs make single-GPU serving of 100B+ models feasible. It's still in early availability, and TensorRT-LLM/vLLM kernel maturity is catching up. The 1,000W TDP also requires upgraded infrastructure.
* B200 TFLOPS estimates based on NVIDIA GTC 2024 keynote. We'll update when MLPerf/independent benchmarks land.
Best for: 200B+ models, maximum single-GPU capacity, teams building 2025–2026 infrastructure.
Cost-Performance Decision Framework
Picking a GPU isn't just about specs. It's about cost per token at your target latency. Here's a framework you can actually execute:
Step 1: Size your model. Determine weight footprint at your target precision (FP16, FP8, INT8, INT4). A 70B model is ~140 GB at FP16, ~70 GB at FP8, ~35 GB at INT4.
Step 2: Budget your VRAM. Add KV-cache: roughly KV per request ≈ 2 × layers × kv_heads × head_dim × seq_len × bytes_per_element. For Llama 2 70B (80 layers, 8 KV heads via GQA, head_dim 128) with FP16 KV at 4K context, that's ~1.3 GB/request. At 32 concurrent users: ~43 GB. At 32K context: ~10.7 GB per request. Add 2–5 GB for framework overhead.
Step 3: Check if it fits. If total VRAM (weights + KV + overhead) exceeds the GPU's capacity, you either quantize harder, reduce concurrency, or move up a tier.
Step 4: Price it out. On GMI Cloud, H100 starts at ~$2.10/GPU-hour and H200 at ~$2.50/GPU-hour (check gmicloud.ai/pricing for current rates). If one H200 replaces two H100s for your workload, you're saving ~$1.70/hour. Over 8,760 hours/year, that's roughly $14,900 saved per replaced pair. The math gets obvious fast.
Step 5: Validate with benchmarks. Measure tokens/s (prefill and decode separately), P95 TTFT, and P95 inter-token latency at your target batch size and context length. Always report: model, precision, seq_len, batch size, serving engine (TensorRT-LLM vs. vLLM vs. TGI), and parallelism strategy. Without these, benchmark numbers aren't reproducible.
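The steps above can be condensed into a quick sizing sketch. The model dimensions are Llama 2 70B's published config (80 layers, 8 KV heads via GQA, head_dim 128); the $/hour figures are the illustrative anchors from this guide, not live quotes:

```python
# Steps 1-4 as a back-of-envelope script. All inputs are assumptions
# you should replace with your own model config and current pricing.

def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_elem, concurrency):
    """KV-cache = 2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes."""
    per_request = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
    return per_request * concurrency / 1e9

def fits(weights_gb, kv_gb, overhead_gb, vram_gb):
    """Step 3: does weights + KV + framework overhead fit in VRAM?"""
    return weights_gb + kv_gb + overhead_gb <= vram_gb

# Llama 2 70B with GQA: 80 layers, 8 KV heads, head_dim 128, FP16 KV (2 bytes)
kv = kv_cache_gb(layers=80, kv_heads=8, head_dim=128,
                 seq_len=4096, bytes_per_elem=2, concurrency=32)
weights_fp8 = 70.0   # ~1 byte/param at FP8
overhead = 4.0       # middle of the 2-5 GB framework-overhead range

print(f"KV-cache at 32 concurrent, 4K ctx: {kv:.1f} GB")   # ~43 GB
print(f"Fits on H100 (80 GB)?  {fits(weights_fp8, kv, overhead, 80)}")
print(f"Fits on H200 (141 GB)? {fits(weights_fp8, kv, overhead, 141)}")

# Step 4: price the two options (anchor rates from this guide)
print(f"2x H100: ${2 * 2.10:.2f}/hr vs 1x H200: ${2.50:.2f}/hr")
```

Swap in your own layer count, KV-head count, and context length; the point is that concurrency, not weights, is usually what blows the budget.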
Quick Decision Tree
| Your situation | Start here |
|---|---|
| 7B–70B, FP8, latency-sensitive, multi-tenant (MIG) | H100 SXM |
| 70B+, long context (32K+), decode-bound, OOM on H100 | H200 SXM |
| 7B–34B, existing fleet, cost/token priority | A100 80GB |
| 7B model, INT8/INT4, low concurrency, budget-first | L4 |
| 100B+, future-proofing, willing to adopt early | B200 |
Why GMI Cloud
This guide targets AI/ML engineers, CTOs, AI startups, and researchers who need production GPU infrastructure without building a data center. GMI Cloud (gmicloud.ai) delivers H100 and H200 SXM instances on-demand. Each node packs 8 GPUs with NVLink 4.0 (900 GB/s bidirectional aggregate per GPU on HGX) and 3.2 Tbps InfiniBand for multi-node TP.
Instances come pre-configured with CUDA 12.x, TensorRT-LLM, vLLM, Triton, and tuned NCCL. No driver debugging. H100 instances start at ~$2.10/GPU-hour, H200 at ~$2.50/GPU-hour (check gmicloud.ai/pricing for current rates).
When you need bare-metal performance without 6–12 month cluster lead times, or when bursty workloads make fixed CapEx hard to justify, cloud GPU instances are the pragmatic path.
Conclusion
There's no single "best GPU" for LLM inference. But there's a clear starting point. H100 for today's production standard. H200 when memory bandwidth is your bottleneck. A100 for proven mid-tier value. L4 for budget inference. B200 for tomorrow's workloads.
Bottom line: size your model, budget your VRAM (don't forget KV-cache), price out the cost per token, and validate with real benchmarks. That's how you pick a GPU. Not by spec sheets.
Explore GPU instances at gmicloud.ai. Match your workload to the right hardware.
FAQ
Can I run a 70B model on a single GPU?
At FP8, weights occupy ~70 GB. On the H200 (141 GB), that leaves ~65 GB for KV-cache and overhead, which handles production concurrency. On the H100 (80 GB), ~7 GB headroom isn't enough. You'd need 2-way TP or INT4 quantization.
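As a rough sketch of that headroom math (the ~1.3 GB/request KV figure assumes Llama 2 70B's GQA dimensions with FP16 KV at 4K context; the overhead value is illustrative):

```python
# Headroom check for a 70B model at FP8 weights. Inputs are assumptions
# for illustration, not measured values.
WEIGHTS_FP8_GB = 70.0
OVERHEAD_GB = 4.0        # framework/activation overhead (assumed)
KV_PER_REQUEST_GB = 1.3  # Llama 2 70B, FP16 KV, 4K context

for gpu, vram_gb in [("H100", 80), ("H200", 141)]:
    headroom = vram_gb - WEIGHTS_FP8_GB - OVERHEAD_GB
    max_requests = int(headroom // KV_PER_REQUEST_GB)
    print(f"{gpu}: {headroom:.0f} GB for KV -> ~{max_requests} concurrent 4K requests")
```

A handful of concurrent requests on the H100 versus dozens on the H200 is the practical difference the FAQ answer describes.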
When should I use L4 vs. H100 for a 7B model?
If your 7B model runs well at INT8/INT4 and you don't need NVLink scaling, the L4 at 72W and lower cost gives you better ROI. If you need FP8 precision, MIG multi-tenancy, or you're also serving larger models on the same fleet, go H100.
Is the H200 worth the premium over H100?
For decode-bound workloads (most interactive LLM inference), the 1.4–1.9x throughput gain often pays for itself through fewer GPUs. At ~$2.50 vs. ~$2.10/GPU-hour, one H200 replacing two H100s saves ~$1.70/hour. For prefill-dominated or sub-80GB workloads, H100 delivers better cost per token.