GPU Price-to-Performance for Cloud AI Inference: A Buyer's Breakdown
April 27, 2026
Picking the lowest $/GPU-hour option looks like cost optimization. Three months later, the cloud bill is 3x the projection. The problem: a cheaper GPU that's slower per inference ends up costing more per completed job. Real price-to-performance analysis means calculating cost per useful unit of work, not cost per hour of hardware. Pulling the right levers can mean the difference between a $50K/month cloud bill and a $5K one for the same workload. This article covers:
- GPU selection: matching hardware to your model size and workload pattern
- FP8 quantization: the single easiest optimization most teams haven't applied
- Runtime optimization: continuous batching and speculative decoding for 4-8x efficiency gains
Three Levers Control Your Cost Per Inference
Price-to-performance isn't a GPU spec. It's the result of three independent decisions: which GPU you choose (hardware match), what precision you run at (quantization strategy), and how your serving stack schedules requests (runtime optimization). Each lever multiplies the effect of the others. Pulling all three correctly can deliver 4-8x better cost efficiency than a naive deployment.
Lever 1: GPU Selection: Match Hardware to Workload
Choosing the wrong GPU wastes money in both directions. Oversizing means paying for unused VRAM. Undersizing means queuing requests while the GPU struggles. Here's how each option maps to workload types:
- H100 SXM (80 GB HBM3, 3.35 TB/s, from $2.00/hr): Best match for 70B-class models in FP8. Weights (~70 GB in FP8) plus a constrained KV-cache and activations fit within 80 GB. If your primary workload is Llama 70B, Qwen 72B, or similar, H100 delivers the lowest $/inference for this model class (see the VRAM sketch after this list).
- H200 SXM (141 GB HBM3e, 4.8 TB/s, from $2.60/hr): Wins for 70B+ models with long context (16K-128K tokens), or when running FP16 instead of FP8. The 43% bandwidth advantage over H100 translates to higher tokens/sec, which can offset the 30% price premium. Also fits models too large for H100 in a single GPU.
- A100 80GB (80 GB HBM2e, 2.0 TB/s): Only makes sense for 7B-34B models or legacy Ampere-optimized workloads. No FP8 support means you miss the biggest single optimization available on Hopper-class GPUs.
- L4 (24 GB GDDR6, 300 GB/s): Budget option for lightweight models under 7B parameters with INT8/INT4 quantization. Lowest hourly cost but also lowest throughput. Good for development, testing, or low-traffic production endpoints.
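Before committing to a GPU tier, it helps to sanity-check the memory math. The sketch below is a back-of-envelope estimate only: the flat KV-cache budget and the omission of activation overhead are simplifying assumptions, not vendor figures.

```python
# Back-of-envelope VRAM check: model weights plus a KV-cache budget vs. GPU memory.
# The KV-cache budget and VRAM figures are illustrative assumptions, not specs.

GPU_VRAM_GB = {"H100": 80, "H200": 141, "A100-80GB": 80, "L4": 24}
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def required_vram_gb(params_billion: float, precision: str, kv_cache_gb: float = 8.0) -> float:
    """Weight memory (params x bytes/param) plus a flat KV-cache budget."""
    return params_billion * BYTES_PER_PARAM[precision] + kv_cache_gb

for precision in ("fp16", "fp8"):
    need = required_vram_gb(70, precision)
    fits = "fits" if need <= GPU_VRAM_GB["H100"] else "does not fit"
    print(f"70B @ {precision}: ~{need:.0f} GB needed -> {fits} on a single H100 (80 GB)")
```

Swap in your own parameter count, precision, and measured KV-cache footprint before making a purchasing decision.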
Lever 2: Precision: FP8 Quantization Changes the Economics
FP8 quantization is the single easiest way to improve price-to-performance (a minimal serving sketch follows this list):
- Halves VRAM usage: A 70B model in FP16 needs ~140 GB for weights alone (a multi-GPU split, or an H200). In FP8, the same weights take ~70 GB and fit on a single 80 GB H100. Fitting on a cheaper GPU directly reduces hourly cost.
- 1.5-2x throughput gain: FP8 doubles the effective memory bandwidth for weight reads. On H100/H200, this translates to 1.5-2x more tokens per second, meaning 1.5-2x more inferences per GPU-hour.
- Near-zero accuracy loss: Extensive testing through 2025 shows FP8 quantization produces negligible accuracy degradation on mainstream LLMs. Unless your use case requires bit-exact precision, FP8 should be your default.
- A100 doesn't support FP8: This is the key reason the A100's lower hourly rate often loses on price-to-performance. An H100 running FP8 at $2.00/hr can deliver more throughput than a cheaper A100 running FP16.
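In practice, enabling FP8 can be close to a one-line change in the serving engine. The snippet below is a minimal sketch assuming vLLM's `quantization="fp8"` load-time option on a Hopper GPU; the model ID and memory settings are illustrative, so verify the flags against the vLLM version you actually deploy.

```python
# Minimal vLLM sketch: load a 70B-class model with dynamic FP8 weight quantization.
# Model name and limits below are illustrative assumptions, not a tested configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # any 70B-class checkpoint
    quantization="fp8",            # on-the-fly FP8 quantization (Hopper: H100/H200)
    max_model_len=8192,            # cap context so the KV-cache fits the 80 GB budget
    gpu_memory_utilization=0.95,   # leave a small safety margin for activations
)

outputs = llm.generate(
    ["Explain the trade-off between FP8 and FP16 inference in two sentences."],
    SamplingParams(temperature=0.2, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

Whichever route you take (on-the-fly conversion or a pre-quantized FP8 checkpoint), validate accuracy on your own evaluation set before making it the default.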
Lever 3: Runtime Optimization: Squeeze the Last Gains
The serving stack determines how efficiently your GPU processes requests:
- Continuous batching overlaps requests so new queries start without waiting for long-running ones to finish. This delivers 2-4x throughput improvement over static batching. vLLM and TensorRT-LLM both support it. Any serving stack still on static batching has free throughput waiting to be claimed.
- Speculative decoding uses a small draft model (8B parameters) to predict tokens, then verifies with the main model (70B). The 8B model runs fast, and correct predictions (70-85% of tokens) confirm multiple tokens per decode step. Result: 2-3x throughput boost with zero accuracy loss.
- KV-cache optimization reduces memory pressure and improves concurrency. Paged attention (used in vLLM) eliminates KV-cache fragmentation, allowing more concurrent requests per GPU. The per-sequence KV-cache size is 2 x layers x kv_heads x head_dim x seq_len x bytes_per_element (see the worked example below).
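To make the formula concrete, here is that calculation for a Llama-70B-style configuration (80 layers, 8 grouped-query KV heads, head dimension 128); treat the result as an estimate rather than a measured footprint.

```python
# Per-sequence KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes.
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_element: int) -> int:
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_element

# Llama-70B-class shapes: 80 layers, 8 KV heads (GQA), head_dim 128, 8K-token context.
fp16 = kv_cache_bytes(80, 8, 128, seq_len=8192, bytes_per_element=2)
fp8 = kv_cache_bytes(80, 8, 128, seq_len=8192, bytes_per_element=1)
print(f"KV-cache per 8K-token sequence: {fp16 / 2**30:.2f} GiB (FP16), {fp8 / 2**30:.2f} GiB (FP8)")
```

At roughly 2.5 GiB per 8K-token sequence in FP16, a handful of long-context requests can consume more memory than many teams expect, which is why paged attention and a smaller KV-cache data type matter so much for concurrency.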
Combining All Three Levers
The multiplication effect is real. Here's a simplified comparison for Llama 70B inference:
| Configuration | GPU | Precision | Runtime | Relative $/Inference |
|---|---|---|---|---|
| Baseline | A100 | FP16 | Static batch | 1.0x (worst) |
| GPU upgrade | H100 | FP16 | Static batch | ~0.6x |
| + Quantization | H100 | FP8 | Static batch | ~0.35x |
| + Runtime | H100 | FP8 | Continuous batch + speculative | ~0.12x |
| Best case | H200 | FP8 | Continuous batch + speculative | ~0.08x (best) |
Moving from worst case to best case represents roughly 10-12x improvement in cost per inference. That's the difference between a $50K/month cloud bill and a $5K/month bill for the same workload.
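The same comparison can be expressed directly as cost per useful unit of work. The sketch below converts $/GPU-hour into $/1M generated tokens; the throughput figures are illustrative placeholders, not benchmarks, so substitute numbers measured on your own workload.

```python
# Convert $/GPU-hour into $/1M generated tokens -- cost per unit of work, not per hour.
def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# Assumed sustained throughput for a 70B model (placeholders, not measurements).
scenarios = {
    "H100 FP16, static batching":               (2.00, 400.0),
    "H100 FP8, continuous batch + speculative": (2.00, 2000.0),
}
for name, (price, tps) in scenarios.items():
    print(f"{name}: ${cost_per_million_tokens(price, tps):.2f} per 1M tokens")
```

Under these placeholder assumptions the gap is roughly 5x, in line with the 0.6x-to-0.12x spread in the table; the point is that the division should always be by completed work, never by hours.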
Price-to-Performance on Optimized Infrastructure
GMI Cloud offers H100 from $2.00/GPU-hour and H200 from $2.60/GPU-hour, pre-configured with TensorRT-LLM, vLLM, and Triton Inference Server for immediate access to all three optimization levers. As an NVIDIA Preferred Partner built on NVIDIA Reference Platform Cloud Architecture, nodes include 8 GPUs with NVLink 4.0 (900 GB/s bidirectional aggregate per GPU on HGX/DGX platforms) and 3.2 Tbps InfiniBand. Teams that prefer per-request pricing can use the unified MaaS model library where optimization is handled by the platform. Check gmicloud.ai/pricing for current rates.
Colin Mo