A100 vs H100 vs H200: The Numbers Most Comparisons Leave Out
April 27, 2026
GPU cloud pricing comparisons usually start and end with the hourly rate. But an H200 at $2.60/hour that finishes a job 1.9x faster than an H100 at $2.00/hour actually costs less per inference. Getting this math right can mean the difference between a cloud bill that scales linearly and one that spirals. Most GPU comparisons stop at spec sheets and never reach the number that matters: cost per actual workload completed. This article covers:
- Raw spec comparison across A100, H100, and H200
- How hourly pricing translates to per-inference economics
- A decision tree for matching GPU to your workload
Three Dimensions Separate These GPUs
Hourly pricing is just one variable. The real comparison requires evaluating VRAM capacity (what models fit), memory bandwidth (how fast inference runs), and the resulting cost per completed inference. A GPU that's 30% more expensive per hour but 90% faster per job wins on total cost. Missing this calculation is how teams end up overspending on "cheaper" hardware.
Spec Comparison: What the Datasheets Say
Start with the raw numbers. These come from NVIDIA's official datasheets:
| Spec | A100 80GB | H100 SXM | H200 SXM |
|---|---|---|---|
| Architecture | Ampere | Hopper | Hopper |
| VRAM | 80 GB HBM2e | 80 GB HBM3 | 141 GB HBM3e |
| Memory BW | 2.0 TB/s | 3.35 TB/s | 4.8 TB/s |
| FP8 | Not supported | 1,979 TFLOPS (dense) | 1,979 TFLOPS (dense) |
| INT8 | 624 TOPS (dense) | 3,958 TOPS (with sparsity) | 3,958 TOPS (with sparsity) |
| NVLink | 600 GB/s | 900 GB/s* | 900 GB/s* |
| TDP | 400W | 700W | 700W |
*bidirectional aggregate per GPU on HGX/DGX platforms
Two things jump out. First, H100 and H200 share identical compute (same TFLOPS), but H200's memory is 76% larger and 43% faster. Second, A100 lacks FP8 support entirely, which locks you out of the quantization optimization that gives H100/H200 their biggest throughput gains.
Pricing: Hourly Rates vs Cost Per Inference
Here's where most comparisons get lazy. They quote hourly rates and stop:
- A100 80GB: Legacy pricing varies by provider, and many are phasing the A100 out; check current rates directly.
- H100 SXM: From $2.00/GPU-hour on optimized cloud platforms.
- H200 SXM: From $2.60/GPU-hour. That's a 30% premium over H100.
But hourly rate isn't the real cost. The real cost is: (hourly rate) / (inferences completed per hour). If H200's 4.8 TB/s bandwidth lets it complete 1.5-1.9x more inferences per hour than H100 on the same model, the per-inference cost on H200 can be lower despite the higher hourly rate.
NVIDIA's official benchmarks show H200 delivering up to 1.9x inference speedup on Llama 2 70B versus H100 (tested with TensorRT-LLM, FP8, batch 64, 128/2048 tokens). At that speedup, H200's $2.60/hour produces more throughput per dollar than H100's $2.00/hour.
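To make that concrete, here is a minimal sketch of the per-inference math. The hourly rates are the ones quoted above; the requests-per-hour figure is an illustrative assumption rather than a measured benchmark, and the 1.9x factor is NVIDIA's reported Llama 2 70B speedup.

```python
# Per-inference cost math using this article's hourly rates.
# The 900 requests/hour baseline is an illustrative assumption, not a benchmark.

def cost_per_1k_requests(hourly_rate_usd: float, requests_per_hour: float) -> float:
    """Dollars spent per 1,000 completed inference requests on one GPU."""
    return hourly_rate_usd / requests_per_hour * 1_000

h100 = cost_per_1k_requests(hourly_rate_usd=2.00, requests_per_hour=900)
h200 = cost_per_1k_requests(hourly_rate_usd=2.60, requests_per_hour=900 * 1.9)

print(f"H100: ${h100:.2f} per 1k requests")  # ~$2.22
print(f"H200: ${h200:.2f} per 1k requests")  # ~$1.52, cheaper despite the 30% hourly premium
```

Swap in your own measured throughput and the same two lines of arithmetic answer the comparison for your workload.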
Workload Matching: Which GPU for Which Job
The right GPU depends on your model and workload pattern:
- A100 80GB fits teams running 7B-34B models on a budget, with existing Ampere-era code that hasn't been ported to FP8. It's also viable for fine-tuning workloads where memory capacity matters more than inference speed. But without FP8, you're leaving 1.5-2x throughput on the table for inference.
- H100 SXM is the workhorse for 70B-class models with FP8 quantization. 80 GB of VRAM holds Llama 70B in FP8 (~70 GB of weights) with modest headroom left for KV-cache. MIG support lets you partition one H100 into up to 7 smaller instances for serving multiple lightweight models.
- H200 SXM dominates for 70B+ models with long context windows. Its 141 GB of VRAM is the only single-GPU option here that holds Llama 70B's FP16 weights (~140 GB, leaving little headroom), and it runs FP8 models with large KV-cache budgets for 32K+ context; the estimator sketched after this list shows the arithmetic. The 4.8 TB/s bandwidth makes it the clear winner for decode-heavy workloads where memory bandwidth is the bottleneck.
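The "what fits" claims above reduce to simple arithmetic. The sketch below uses standard back-of-the-envelope approximations for weight and KV-cache memory; the Llama-70B-like shape (80 layers, 8 KV heads, head dimension 128) is an assumed example configuration, so substitute your own model's values.

```python
# Rough VRAM budgeting. Formulas are standard approximations; the Llama-70B-like
# shape below (80 layers, 8 KV heads, head dim 128) is an assumed example config.

def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (ignores embeddings and runtime overhead)."""
    return params_billion * bytes_per_param  # 1e9 params * bytes, divided by 1e9 bytes/GB

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_value: int) -> float:
    """Per-sequence KV cache: 2 (K and V) * layers * kv_heads * head_dim * tokens."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value / 1e9

fp8_weights = weight_gb(70, 1)                      # ~70 GB
fp16_weights = weight_gb(70, 2)                     # ~140 GB
kv_32k_fp16 = kv_cache_gb(80, 8, 128, 32_768, 2)    # ~10.7 GB per 32K-token sequence

print(f"70B FP8 weights : {fp8_weights:.0f} GB  -> fits in 80 GB (H100) with ~10 GB spare")
print(f"70B FP16 weights: {fp16_weights:.0f} GB -> needs H200's 141 GB")
print(f"32K-token KV    : {kv_32k_fp16:.1f} GB per sequence (FP16 cache)")
```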
Decision Tree: Size, Precision, Budget
The choice reduces to three questions:
- Model size? Under 34B parameters: A100 or H100. Around 70B parameters: H100 (FP8) or H200. Over 100B: H200 or multi-GPU.
- FP8 viable? If yes, H100 and H200 both gain 1.5-2x throughput. If not (legacy code paths or accuracy requirements rule FP8 out), the Hopper advantage shrinks and the A100 becomes the sensible budget pick.
- Budget priority? Optimizing $/inference favors H200: a higher hourly rate, but faster completion. Optimizing $/hour for bursty, unpredictable workloads favors H100 as the lowest entry point on current hardware. A compact version of this decision logic is sketched below.
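One way to encode those three questions as code: the thresholds below are the ones stated in this article, and the function and its arguments are illustrative rather than any real API.

```python
# A compact encoding of the decision tree above. Thresholds come from this
# article; the function and its arguments are illustrative only.

def pick_gpu(model_params_b: float, fp8_viable: bool, optimize_for: str) -> str:
    """optimize_for: 'cost_per_inference' or 'cost_per_hour'."""
    if model_params_b > 100:
        return "H200 (or multi-GPU)"
    if model_params_b >= 70:
        if not fp8_viable:
            return "H200"                      # FP16 70B weights need ~140 GB
        return "H200" if optimize_for == "cost_per_inference" else "H100"
    # Under ~34B parameters
    if not fp8_viable:
        return "A100"                          # budget pick when FP8 gains don't apply
    return "H100"

print(pick_gpu(70, fp8_viable=True, optimize_for="cost_per_inference"))  # H200
print(pick_gpu(13, fp8_viable=False, optimize_for="cost_per_hour"))      # A100
```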
Running These GPUs on Optimized Infrastructure
GMI Cloud offers H100 SXM from $2.00/GPU-hour and H200 SXM from $2.60/GPU-hour, with GB200 at $8.00/GPU-hour for next-generation Blackwell workloads. As an NVIDIA Preferred Partner built on NVIDIA Reference Platform Cloud Architecture, nodes run 8 GPUs with NVLink 4.0 (900 GB/s bidirectional aggregate per GPU) and 3.2 Tbps InfiniBand interconnect, and come pre-configured with TensorRT-LLM, vLLM, and Triton Inference Server. Teams that don't want to manage GPUs at all can use the unified MaaS model library with 100+ pre-deployed models on per-request pricing. Check gmicloud.ai/pricing for current rates.
Colin Mo
