A100 vs H100 vs H200: The Numbers Most Comparisons Leave Out
April 27, 2026
GPU cloud pricing comparisons usually start and end with the hourly rate. But an H200 at $2.60/hour that finishes a job 1.9x faster than an H100 at $2.00/hour actually costs less per inference. Getting this math right can mean the difference between a cloud bill that scales linearly and one that spirals. Most GPU comparisons stop at spec sheets and never reach the number that matters: cost per actual workload completed. This article covers:
- Raw spec comparison across A100, H100, and H200
- How hourly pricing translates to per-inference economics
- A decision tree for matching GPU to your workload
Three Dimensions Separate These GPUs
Hourly pricing is just one variable. The real comparison requires evaluating VRAM capacity (what models fit), memory bandwidth (how fast inference runs), and the resulting cost per completed inference. A GPU that's 30% more expensive per hour but 90% faster per job wins on total cost. Missing this calculation is how teams end up overspending on "cheaper" hardware.
Spec Comparison: What the Datasheets Say
Start with the raw numbers. These come from NVIDIA's official datasheets:
| Spec | A100 80GB | H100 SXM | H200 SXM |
|---|---|---|---|
| Architecture | Ampere | Hopper | Hopper |
| VRAM | 80 GB HBM2e | 80 GB HBM3 | 141 GB HBM3e |
| Memory BW | 2.0 TB/s | 3.35 TB/s | 4.8 TB/s |
| FP8 | Not supported | 1,979 TFLOPS (dense) | 1,979 TFLOPS (dense) |
| INT8 | 624 TOPS (dense) | 3,958 TOPS (with sparsity) | 3,958 TOPS (with sparsity) |
| NVLink | 600 GB/s | 900 GB/s* | 900 GB/s* |
| TDP | 400W | 700W | 700W |
*bidirectional aggregate per GPU on HGX/DGX platforms
Two things jump out. First, H100 and H200 share identical compute (same TFLOPS), but H200's memory is 76% larger and 43% faster. Second, A100 lacks FP8 support entirely, which locks you out of the quantization optimization that gives H100/H200 their biggest throughput gains.
Pricing: Hourly Rates vs Cost Per Inference
Here's where most comparisons get lazy. They quote hourly rates and stop:
- A100 80GB: Legacy pricing varies by provider, and many are phasing the A100 out; check current rates directly.
- H100 SXM: From $2.00/GPU-hour on optimized cloud platforms.
- H200 SXM: From $2.60/GPU-hour. That's a 30% premium over H100.
But hourly rate isn't the real cost. The real cost is: (hourly rate) / (inferences completed per hour). If H200's 4.8 TB/s bandwidth lets it complete 1.5-1.9x more inferences per hour than H100 on the same model, the per-inference cost on H200 can be lower despite the higher hourly rate.
NVIDIA's official benchmarks show H200 delivering up to 1.9x inference speedup on Llama 2 70B versus H100 (tested with TensorRT-LLM, FP8, batch 64, 128/2048 tokens). At that speedup, H200's $2.60/hour produces more throughput per dollar than H100's $2.00/hour.
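To make that concrete, here is a minimal sketch of the per-inference math. The hourly rates are the ones quoted above; the requests-per-hour figure is an illustrative assumption rather than a measured benchmark, and the 1.9x factor is NVIDIA's reported Llama 2 70B speedup.

```python
# Per-inference cost math using this article's hourly rates.
# The 900 requests/hour baseline is an illustrative assumption, not a benchmark.

def cost_per_1k_requests(hourly_rate_usd: float, requests_per_hour: float) -> float:
    """Dollars spent per 1,000 completed inference requests on one GPU."""
    return hourly_rate_usd / requests_per_hour * 1_000

h100 = cost_per_1k_requests(hourly_rate_usd=2.00, requests_per_hour=900)
h200 = cost_per_1k_requests(hourly_rate_usd=2.60, requests_per_hour=900 * 1.9)

print(f"H100: ${h100:.2f} per 1k requests")  # ~$2.22
print(f"H200: ${h200:.2f} per 1k requests")  # ~$1.52, cheaper despite the 30% hourly premium
```

Swap in your own measured throughput and the same two lines of arithmetic answer the comparison for your workload.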
Workload Matching: Which GPU for Which Job
The right GPU depends on your model and workload pattern:
- A100 80GB fits teams running 7B-34B models on a budget, with existing Ampere-era code that hasn't been ported to FP8. It's also viable for fine-tuning workloads where memory capacity matters more than inference speed. But without FP8, you're leaving 1.5-2x throughput on the table for inference.
- H100 SXM is the workhorse for 70B-class models with FP8 quantization. 80 GB of VRAM holds Llama 70B in FP8 (~70 GB of weights) with modest headroom left for KV-cache. MIG support lets you partition one H100 into up to 7 smaller instances for serving multiple lightweight models.
- H200 SXM dominates for 70B+ models with long context windows. Its 141 GB of VRAM is the only single-GPU option here that holds Llama 70B's FP16 weights (~140 GB, leaving little headroom), and it runs FP8 models with large KV-cache budgets for 32K+ context; the estimator sketched after this list shows the arithmetic. The 4.8 TB/s bandwidth makes it the clear winner for decode-heavy workloads where memory bandwidth is the bottleneck.
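The "what fits" claims above reduce to simple arithmetic. The sketch below uses standard back-of-the-envelope approximations for weight and KV-cache memory; the Llama-70B-like shape (80 layers, 8 KV heads, head dimension 128) is an assumed example configuration, so substitute your own model's values.

```python
# Rough VRAM budgeting. Formulas are standard approximations; the Llama-70B-like
# shape below (80 layers, 8 KV heads, head dim 128) is an assumed example config.

def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (ignores embeddings and runtime overhead)."""
    return params_billion * bytes_per_param  # 1e9 params * bytes, divided by 1e9 bytes/GB

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_value: int) -> float:
    """Per-sequence KV cache: 2 (K and V) * layers * kv_heads * head_dim * tokens."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value / 1e9

fp8_weights = weight_gb(70, 1)                      # ~70 GB
fp16_weights = weight_gb(70, 2)                     # ~140 GB
kv_32k_fp16 = kv_cache_gb(80, 8, 128, 32_768, 2)    # ~10.7 GB per 32K-token sequence

print(f"70B FP8 weights : {fp8_weights:.0f} GB  -> fits in 80 GB (H100) with ~10 GB spare")
print(f"70B FP16 weights: {fp16_weights:.0f} GB -> needs H200's 141 GB")
print(f"32K-token KV    : {kv_32k_fp16:.1f} GB per sequence (FP16 cache)")
```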
Decision Tree: Size, Precision, Budget
The choice reduces to three questions:
- Model size? Under 34B parameters: A100 or H100. Around 70B parameters: H100 (FP8) or H200. Over 100B: H200 or multi-GPU.
- FP8 viable? If yes, H100 and H200 both gain 1.5-2x throughput. If not (legacy code paths or accuracy requirements rule FP8 out), the Hopper advantage shrinks and the A100 becomes the sensible budget pick.
- Budget priority? Optimizing $/inference favors H200: a higher hourly rate, but faster completion. Optimizing $/hour for bursty, unpredictable workloads favors H100 as the lowest entry point on current hardware. A compact version of this decision logic is sketched below.
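One way to encode those three questions as code: the thresholds below are the ones stated in this article, and the function and its arguments are illustrative rather than any real API.

```python
# A compact encoding of the decision tree above. Thresholds come from this
# article; the function and its arguments are illustrative only.

def pick_gpu(model_params_b: float, fp8_viable: bool, optimize_for: str) -> str:
    """optimize_for: 'cost_per_inference' or 'cost_per_hour'."""
    if model_params_b > 100:
        return "H200 (or multi-GPU)"
    if model_params_b >= 70:
        if not fp8_viable:
            return "H200"                      # FP16 70B weights need ~140 GB
        return "H200" if optimize_for == "cost_per_inference" else "H100"
    # Under ~34B parameters
    if not fp8_viable:
        return "A100"                          # budget pick when FP8 gains don't apply
    return "H100"

print(pick_gpu(70, fp8_viable=True, optimize_for="cost_per_inference"))  # H200
print(pick_gpu(13, fp8_viable=False, optimize_for="cost_per_hour"))      # A100
```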
Running These GPUs on Optimized Infrastructure
GMI Cloud offers H100 SXM from $2.00/GPU-hour and H200 SXM from $2.60/GPU-hour, with GB200 at $8.00/GPU-hour for next-generation Blackwell workloads. As an NVIDIA Preferred Partner built on NVIDIA Reference Platform Cloud Architecture, nodes run 8 GPUs with NVLink 4.0 (900 GB/s bidirectional aggregate per GPU) and 3.2 Tbps InfiniBand interconnect, and come pre-configured with TensorRT-LLM, vLLM, and Triton Inference Server. Teams that don't want to manage GPUs at all can use the unified MaaS model library with 100+ pre-deployed models on per-request pricing. Check gmicloud.ai/pricing for current rates.
Colin Mo
