
A100 vs H100 vs H200: The Numbers Most Comparisons Leave Out

April 27, 2026

GPU cloud pricing comparisons usually start and end with the hourly rate. But an H200 at $2.60/hour that finishes a job 1.9x faster than an H100 at $2.00/hour actually costs less per inference. Getting this math right can mean the difference between a cloud bill that scales linearly and one that spirals. Most GPU comparisons stop at spec sheets and never reach the number that matters: cost per actual workload completed. This article covers:

  • Raw spec comparison across A100, H100, and H200
  • How hourly pricing translates to per-inference economics
  • A decision tree for matching GPU to your workload

Three Dimensions Separate These GPUs

Hourly pricing is just one variable. The real comparison requires evaluating VRAM capacity (what models fit), memory bandwidth (how fast inference runs), and the resulting cost per completed inference. A GPU that's 30% more expensive per hour but 90% faster per job wins on total cost. Missing this calculation is how teams end up overspending on "cheaper" hardware.

Spec Comparison: What the Datasheets Say

Start with the raw numbers. These come from NVIDIA's official datasheets:

| Spec | A100 80GB | H100 SXM | H200 SXM |
|---|---|---|---|
| Architecture | Ampere | Hopper | Hopper |
| VRAM | 80 GB HBM2e | 80 GB HBM3 | 141 GB HBM3e |
| Memory BW | 2.0 TB/s | 3.35 TB/s | 4.8 TB/s |
| FP8 | Not supported | 1,979 TFLOPS | 1,979 TFLOPS |
| INT8 | 624 TOPS | 3,958 TOPS | 3,958 TOPS |
| NVLink | 600 GB/s | 900 GB/s* | 900 GB/s* |
| TDP | 400W | 700W | 700W |

*bidirectional aggregate per GPU on HGX/DGX platforms

Two things jump out. First, H100 and H200 share identical compute (same TFLOPS), but H200's memory is 76% larger and 43% faster. Second, A100 lacks FP8 support entirely, which locks you out of the quantization optimization that gives H100/H200 their biggest throughput gains.

Pricing: Hourly Rates vs Cost Per Inference

Here's where most comparisons get lazy. They quote hourly rates and stop:

  • A100 80GB: Legacy pricing varies. Many providers are phasing out A100. Contact for current rates.
  • H100 SXM: From $2.00/GPU-hour on optimized cloud platforms.
  • H200 SXM: From $2.60/GPU-hour. That's a 30% premium over H100.

But hourly rate isn't the real cost. The real cost is: (hourly rate) / (inferences completed per hour). If H200's 4.8 TB/s bandwidth lets it complete 1.5-1.9x more inferences per hour than H100 on the same model, the per-inference cost on H200 can be lower despite the higher hourly rate.

NVIDIA's official benchmarks show H200 delivering up to 1.9x inference speedup on Llama 2 70B versus H100 (tested with TensorRT-LLM, FP8, batch 64, 128/2048 tokens). At that speedup, H200's $2.60/hour produces more throughput per dollar than H100's $2.00/hour.
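The per-inference math is worth writing out. A quick sketch using the rates quoted above (the absolute throughput number is an illustrative placeholder; only the 1.9x ratio from NVIDIA's benchmark matters):

```python
# Per-inference cost = hourly rate / inferences completed per hour.
h100_rate, h200_rate = 2.00, 2.60      # $/GPU-hour, from the article
h100_throughput = 1000.0               # inferences/hour (illustrative baseline)
h200_throughput = h100_throughput * 1.9  # NVIDIA's Llama 2 70B speedup figure

h100_cost = h100_rate / h100_throughput  # $0.00200 per inference
h200_cost = h200_rate / h200_throughput  # ~$0.00137 per inference

savings = 1 - h200_cost / h100_cost
print(f"H200 is {savings:.0%} cheaper per inference")  # H200 is 32% cheaper per inference
```

A 30% hourly premium divided by a 1.9x speedup nets out to roughly 32% lower cost per inference, which is why stopping at the hourly rate gets the comparison backwards.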

Workload Matching: Which GPU for Which Job

The right GPU depends on your model and workload pattern:

  • A100 80GB fits teams running 7B-34B models on a budget, with existing Ampere-era code that hasn't been ported to FP8. It's also viable for fine-tuning workloads where memory capacity matters more than inference speed. But without FP8, you're leaving 1.5-2x throughput on the table for inference.

  • H100 SXM is the workhorse for 70B-class models with FP8 quantization. 80 GB VRAM fits Llama 70B in FP8 (~70 GB weights) with modest KV-cache headroom for short-to-medium contexts. MIG support lets you partition one H100 into up to 7 smaller instances for serving multiple lightweight models.

  • H200 SXM dominates for 70B+ models with long context windows. The 141 GB VRAM fits Llama 70B in FP16 (~140 GB) or runs FP8 models with massive KV-cache budgets for 32K+ context. The 4.8 TB/s bandwidth makes it the clear winner for decode-heavy workloads where memory bandwidth is the bottleneck.
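The "what fits" question above comes down to weights plus KV-cache. Here is a back-of-envelope estimator; the Llama-70B-class architecture figures (80 layers, 8 GQA KV heads, head dim 128) are assumptions for illustration, and the formula ignores activations and runtime overhead:

```python
def vram_budget_gb(params_b: float, weight_bytes: float,
                   layers: int, kv_heads: int, head_dim: int,
                   kv_bytes: float, context_len: int, batch: int) -> float:
    """Rough VRAM need in GB: weights + KV-cache (ignores activations/overhead)."""
    weights = params_b * 1e9 * weight_bytes
    # K and V caches per token: 2 * layers * kv_heads * head_dim * bytes
    kv_per_token = 2 * layers * kv_heads * head_dim * kv_bytes
    kv_cache = kv_per_token * context_len * batch
    return (weights + kv_cache) / 1e9

# Llama-70B-class in FP8 (1 byte/param), 8 concurrent 32K-context sequences
need = vram_budget_gb(70, weight_bytes=1,
                      layers=80, kv_heads=8, head_dim=128,
                      kv_bytes=1, context_len=32_768, batch=8)
print(f"~{need:.0f} GB")  # ~113 GB: over an 80 GB H100, inside a 141 GB H200
```

Each 32K sequence adds roughly 5.4 GB of FP8 KV-cache on top of ~70 GB of weights, which is why long-context serving of 70B models lands on H200 rather than H100.
<test>
def vram_budget_gb(params_b, weight_bytes, layers, kv_heads, head_dim, kv_bytes, context_len, batch):
    weights = params_b * 1e9 * weight_bytes
    kv_per_token = 2 * layers * kv_heads * head_dim * kv_bytes
    return (weights + kv_per_token * context_len * batch) / 1e9
assert abs(vram_budget_gb(70, 1, 80, 8, 128, 1, 32_768, 8) - 112.95) < 0.5
assert vram_budget_gb(70, 1, 80, 8, 128, 1, 32_768, 8) > 80
assert vram_budget_gb(70, 1, 80, 8, 128, 1, 32_768, 8) < 141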

Decision Tree: Size, Precision, Budget

The choice reduces to three questions:

  • Model size? Under 34B parameters: A100 or H100. 70B parameters: H100 (FP8) or H200. Over 100B: H200 or multi-GPU.

  • FP8 viable? If yes, H100 and H200 both gain 1.5-2x throughput. If no (legacy code or accuracy requirements), the A100 stays competitive, since H100 and H200 lose much of their throughput edge without FP8.

  • Budget priority? Optimizing $/inference favors H200 at higher hourly cost but faster completion. Optimizing $/hour for bursty, unpredictable workloads favors H100 as the lowest entry point on current hardware.
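The three questions above can be encoded as a toy lookup; this is a sketch of the article's decision tree, not an official sizing tool, and the boundaries (34B, 70B, 100B) are taken directly from the bullets:

```python
def pick_gpu(model_params_b: float, fp8_ok: bool, optimize: str) -> str:
    """Toy encoding of the three questions: model size, FP8 viability, budget priority."""
    if model_params_b > 100:
        return "H200 (or multi-GPU)"
    if model_params_b >= 70:
        if not fp8_ok:
            return "H200"  # 70B in FP16 needs ~140 GB, beyond a single H100
        # FP8 fits on either; budget priority breaks the tie
        return "H200" if optimize == "per_inference" else "H100"
    # <= 34B class: A100 if FP8 is off the table, else H100
    return "H100" if fp8_ok else "A100"

print(pick_gpu(70, fp8_ok=True, optimize="per_hour"))       # H100
print(pick_gpu(70, fp8_ok=True, optimize="per_inference"))  # H200
print(pick_gpu(13, fp8_ok=False, optimize="per_hour"))      # A100
```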

Running These GPUs on Optimized Infrastructure

GMI Cloud offers H100 SXM from $2.00/GPU-hour and H200 SXM from $2.60/GPU-hour, with GB200 at $8.00/GPU-hour for next-generation Blackwell workloads. As an NVIDIA Preferred Partner built on NVIDIA Reference Platform Cloud Architecture, nodes run 8 GPUs with NVLink 4.0 (900 GB/s bidirectional aggregate per GPU) and 3.2 Tbps InfiniBand interconnect. Nodes come pre-configured with TensorRT-LLM, vLLM, and Triton Inference Server. Teams that don't want to manage GPUs at all can use the unified MaaS model library with 100+ pre-deployed models on per-request pricing. Check gmicloud.ai/pricing for current rates.

Colin Mo
