GPU Cloud Pricing for LLM Inference in 2026: AWS vs GCP vs Azure vs GMI Cloud Compared

May 28, 2026

Sticker-price comparison between GPU clouds is misleading by design. The flagship H100 instances on AWS, GCP, and Azure are quoted as 8-GPU bundles, so you have to derive a per-GPU rate that the pricing page doesn't state. Egress, idle CPU, and region multipliers make the math worse. Specialized GPU clouds mostly quote per-GPU-hour directly. This article normalizes rates from AWS, GCP, Azure, and GMI Cloud, then lists the hidden costs.

The Direct Answer: H100 Per-GPU-Hour, 2026

On-demand H100 pricing in 2026, normalized to a single GPU-hour, lands roughly here.

Provider	Instance	Effective $/GPU-hour	Notes
GMI Cloud	H100 SXM (on-demand)	~$2.00	Per-GPU billing, no bundle
Lambda Labs	1x H100 PCIe	~$2.49	Per-GPU billing
CoreWeave	H100 HGX	~$4.25	8-GPU node, derived
RunPod	H100 SXM Secure	~$3.39	Per-GPU, community pods lower
AWS	p5.48xlarge (8x H100)	~$5.00 to $7.50	Bundled CPU/RAM/NVMe, region-dependent
GCP	a3-highgpu-8g (8x H100)	~$5.50 to $8.00	Bundled, sustained-use discount applies
Azure	ND H100 v5 (8x H100)	~$5.00 to $7.00	Bundled, reservations cut deeply

Always verify current rates on each provider's pricing page. Hyperscaler ranges reflect region and commitment variance. The next sections explain why those bundled numbers look so different.

Why Hyperscaler H100 Pricing Looks Higher Than It Is

Hyperscalers price H100 as an integrated instance, not as a GPU line item. That's a deliberate product choice, and it changes the math.

AWS p5.48xlarge

The p5.48xlarge ships 8 H100 80GB, 192 vCPUs, 2 TB of RAM, and 30 TB of local NVMe. Public on-demand pricing has hovered around $40 to $55 per hour depending on region. Divide by 8 GPUs and you land near $5 to $7 per GPU-hour, but you're also paying for CPU and RAM you may not need for pure inference.

GCP a3-highgpu-8g

GCP's A3 family attaches 8 H100s to a Sapphire Rapids host with 208 vCPUs and 1.87 TB of RAM. On-demand list is in the $44 to $64 range per hour by region. Where sustained-use discounts apply, they shave 20-30% over a month, but GCP limits them to specific machine families, so confirm A3 eligibility before you budget for them. A 1-year committed-use discount can cut up to roughly 40%. The on-demand per-GPU floor sits around $5.50 once you normalize, before egress.

Azure ND H100 v5

Azure's ND H100 v5 carries 8 H100 80GB, 96 vCPUs, and 1.9 TB of RAM. On-demand pricing sits near $40 to $56 per hour. Reserved 1-year and 3-year commitments cut 30-50%, which is where Azure's H100 economics actually compete. Spot capacity exists but availability is uneven.
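
The normalization used in the three subsections above is a single division, and it is easy to script so the numbers can be refreshed when list prices move. The sketch below is illustrative only: the `NodeQuote` class is a hypothetical helper, and the hourly ranges are the approximate figures quoted in this article, not live pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeQuote:
    """Approximate on-demand price range for a whole multi-GPU node."""
    provider: str
    instance: str
    gpus: int
    low_usd_per_hour: float
    high_usd_per_hour: float

    def per_gpu_range(self) -> tuple[float, float]:
        # Divide the node price by the GPU count to get an effective per-GPU rate.
        return (self.low_usd_per_hour / self.gpus,
                self.high_usd_per_hour / self.gpus)


# Hourly ranges copied from the text above; treat them as placeholders.
QUOTES = [
    NodeQuote("AWS", "p5.48xlarge", 8, 40.0, 55.0),
    NodeQuote("GCP", "a3-highgpu-8g", 8, 44.0, 64.0),
    NodeQuote("Azure", "ND H100 v5", 8, 40.0, 56.0),
]

for q in QUOTES:
    low, high = q.per_gpu_range()
    print(f"{q.provider:<6} {q.instance:<14} ${low:.2f} to ${high:.2f} per GPU-hour")
```

The derived rate still includes the bundled CPU, RAM, and NVMe, so it is an upper bound on what the GPU alone costs you, not a like-for-like per-GPU price.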

Specialized GPU Clouds Quote Per-GPU Directly

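GPU-native clouds like GMI Cloud, Lambda Labs, and RunPod publish per-GPU-hour rates without forcing you to derive them. CoreWeave is the partial exception here: its HGX H100 is sold as an 8-GPU node, so its per-GPU figure is derived the same way as the hyperscalers'. Per-GPU quoting changes how you budget.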

GMI Cloud lists H100 SXM at $2.00/GPU-hour and H200 SXM at $2.60/GPU-hour. Lambda Labs hovers near $2.49 for 1x H100 PCIe. CoreWeave's HGX H100 derived rate runs roughly $4.25/GPU-hour. RunPod's Secure Cloud H100 sits near $3.39.

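The pattern is consistent: per-GPU clouds undercut hyperscaler bundle math on H100. Measured against the roughly $6.50 midpoint of the $5-$8 hyperscaler range, the gap runs from about 35% (CoreWeave) to about 70% (GMI Cloud).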
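
How large the gap looks depends on which hyperscaler rate you use as the reference. A minimal sketch, assuming the rates quoted in this article and a hypothetical `discount_pct` helper, that prints the gap against the low end, midpoint, and high end of the $5-$8 range:

```python
# Per-GPU-hour rates quoted in this article (approximate, not live pricing).
SPECIALIZED = {
    "GMI Cloud": 2.00,
    "Lambda Labs": 2.49,
    "RunPod": 3.39,
    "CoreWeave": 4.25,
}
HYPERSCALER_LOW = 5.00
HYPERSCALER_HIGH = 8.00
HYPERSCALER_MID = (HYPERSCALER_LOW + HYPERSCALER_HIGH) / 2


def discount_pct(rate: float, reference: float) -> float:
    """Percentage by which `rate` sits below `reference`."""
    return (1 - rate / reference) * 100


for name, rate in SPECIALIZED.items():
    vs_low = discount_pct(rate, HYPERSCALER_LOW)
    vs_mid = discount_pct(rate, HYPERSCALER_MID)
    vs_high = discount_pct(rate, HYPERSCALER_HIGH)
    print(f"{name:<12} {vs_low:5.1f}% / {vs_mid:5.1f}% / {vs_high:5.1f}% "
          "below low / mid / high")
```

Quoting a single percentage hides that spread, so it is worth stating the reference rate whenever you cite a savings figure internally.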

H200 and B200: The Next-Generation Picture

H200 and B200 capacity is still scarce in 2026, and pricing reflects that.

GPU	GMI Cloud $/hr	Hyperscaler equivalent	Best for
H100 SXM	$2.00	AWS/GCP/Azure 8x bundles $5-$8/GPU-hr	Most 7B-70B inference
H200 SXM	$2.60	Limited availability; preview pricing varies	70B+ long context, decode-bound
B200	$4.00	GA pricing emerging; spot scarce	100B+ models, future-proofing

H200 carries 141 GB HBM3e and 4.8 TB/s memory bandwidth, which delivers up to 1.9x inference speedup on Llama 2 70B versus H100 per NVIDIA's official testing (TensorRT-LLM, FP8, batch 64, 128/2048 tokens). B200 specs are GTC 2024 disclosures and will firm up as MLPerf data lands.
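
Whether H200 is cheaper per unit of work than H100 depends on the speedup you actually measure. The sketch below uses the GMI Cloud rates from the table and treats the speedup as an input. The 1.9x figure is NVIDIA's "up to" number for one configuration, so treat it as a best case, and note that both function names are hypothetical.

```python
H100_RATE = 2.00  # $/GPU-hour, from the table above
H200_RATE = 2.60  # $/GPU-hour, from the table above


def breakeven_speedup(h100_rate: float, h200_rate: float) -> float:
    """Speedup over H100 at which H200 costs the same per unit of work."""
    return h200_rate / h100_rate


def cost_per_h100_equivalent_hour(h200_rate: float, speedup: float) -> float:
    """H200 cost to do the work one H100 does in an hour, given a speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return h200_rate / speedup


print(f"Break-even speedup: {breakeven_speedup(H100_RATE, H200_RATE):.2f}x")
for speedup in (1.0, 1.3, 1.6, 1.9):  # 1.9 is the best-case vendor figure
    cost = cost_per_h100_equivalent_hour(H200_RATE, speedup)
    print(f"speedup {speedup:.1f}x -> ${cost:.2f} per H100-equivalent hour")
```

At these rates H200 pays for itself once your own workload runs more than about 1.3x faster than on H100. Below that, H100 is the cheaper path unless you need the extra memory.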

Engineering Reality: Where the Bill Actually Comes From

The hourly rate is the easy number. Here's what eats budgets after you sign up.

Idle CPU and RAM in bundled instances. A p5.48xlarge gives you 192 vCPUs whether your inference workload needs them or not. If you're running vLLM with a 70B model, you're paying for 150+ idle cores. Per-GPU billing dodges this entirely.

Egress fees. AWS, GCP, and Azure charge $0.05 to $0.12 per GB out to internet after free tiers. A high-traffic inference endpoint serving 10 TB/month adds $500-$1,200 you didn't see on the GPU pricing page. Most specialized GPU clouds bundle generous egress or charge significantly less.

Region multipliers. us-east-1 is the published anchor. ap-southeast, eu-west, and sa-east regions often add 10-30% on top of base H100 pricing. Multi-region deployments compound this.

Commitment math. A 1-year AWS Savings Plan on p5 cuts roughly 30-40%, and 3-year reservations cut 50-60%. That's how hyperscalers close the gap with per-GPU clouds, but you trade flexibility for the discount. If your traffic is bursty or you're still iterating on model choice, commit pricing locks you in.

Networking and storage adders. Instance families with InfiniBand-equivalent fabrics on AWS (EFAv2) and GCP (GPUDirect-TCPX) sometimes carry placement-group requirements or per-hour networking fees. Local NVMe is usually included, but persistent block storage with the IOPS needed for checkpoint loading is billed separately at $0.10-$0.30/GB-month.
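
The line items above can be rolled into one rough monthly estimate. This is a sketch under stated assumptions, not a billing calculator: `monthly_bill` is a hypothetical helper, the month is taken as 730 hours, free egress tiers are ignored, and every rate is a placeholder drawn from the ranges in this section.

```python
def monthly_bill(
    *,
    gpu_rate: float,           # effective $/GPU-hour in the anchor region
    gpus: int,
    hours: float,              # billed hours in the month
    egress_gb: float,
    egress_rate: float,        # $/GB out to internet
    storage_gb: float,
    storage_rate: float,       # $/GB-month for persistent block storage
    region_multiplier: float = 1.0,  # e.g. 1.2 for a region priced 20% higher
) -> dict[str, float]:
    """Break a month's spend into compute, egress, and storage."""
    compute = gpu_rate * region_multiplier * gpus * hours
    egress = egress_gb * egress_rate
    storage = storage_gb * storage_rate
    return {
        "compute": compute,
        "egress": egress,
        "storage": storage,
        "total": compute + egress + storage,
    }


# Example: one 8-GPU node for a full month, 10 TB of egress, 2 TB of storage.
bill = monthly_bill(
    gpu_rate=6.00, gpus=8, hours=730,
    egress_gb=10_000, egress_rate=0.12,
    storage_gb=2_000, storage_rate=0.30,
    region_multiplier=1.2,
)
for item, usd in bill.items():
    print(f"{item:<8} ${usd:,.2f}")
```

Idle CPU and RAM do not appear as a separate line because they are already inside the bundled hourly rate. The only way to see that cost is to compare against a per-GPU quote.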

Decision Tree: Which Pricing Model Fits Your Workload

Your situation	Pricing path
Bursty inference, weeks-long projects, no long commit	Per-GPU-hour on GMI Cloud, Lambda, or RunPod
Steady production traffic, 1+ year horizon	1-year reserved on AWS/GCP/Azure or reserved on GMI Cloud
Need single H100 for testing	Specialized GPU clouds. The hyperscaler instances compared here come as 8-GPU nodes
70B+ long-context inference	H200 on GMI Cloud at $2.60/GPU-hour
Already invested in AWS/GCP/Azure ecosystem	Use hyperscaler with reservations; calculate true per-GPU rate
Multi-cloud strategy, want cost anchor	Use GMI Cloud per-GPU rate as your normalized baseline

The pattern is straightforward. Per-GPU clouds win on flexibility and headline rate. Hyperscalers win when you have steady volume, deep commits, and existing ecosystem lock-in worth preserving.
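
"Steady volume" can be made concrete as a break-even utilization. Committed capacity is billed for every hour whether or not it is used, while on-demand capacity is billed only for the hours you rent. The sketch below assumes that simplified model and a 730-hour month. The function names are hypothetical, and the $6.00 and $3.00 rates are illustrative picks from the ranges in this article.

```python
HOURS_PER_MONTH = 730  # assumed average month


def on_demand_monthly(rate: float, utilization: float,
                      hours: float = HOURS_PER_MONTH) -> float:
    """On-demand cost when a GPU is rented only for the busy fraction of the month."""
    return rate * utilization * hours


def committed_monthly(rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Committed cost: every hour is billed regardless of use."""
    return rate * hours


def breakeven_utilization(committed_rate: float, on_demand_rate: float) -> float:
    """Utilization above which the commitment is cheaper; > 1.0 means never."""
    return committed_rate / on_demand_rate


# Hyperscaler on-demand at $6.00 vs the same capacity committed at $3.00.
print(breakeven_utilization(3.00, 6.00))
# The same $3.00 commitment vs a $2.00 per-GPU on-demand rate.
print(breakeven_utilization(3.00, 2.00))
```

A result above 1.0 means the commitment cannot beat that on-demand rate on price at any utilization, so any case for it has to rest on ecosystem or capacity guarantees instead.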

How GMI Cloud Fits the 2026 Pricing Map

GMI Cloud publishes H100 SXM at $2.00/GPU-hour, H200 SXM at $2.60/GPU-hour, and B200 at $4.00/GPU-hour on-demand. Nodes carry 8 GPUs with NVLink 4.0 at 900 GB/s bidirectional aggregate per GPU on HGX platforms, plus 3.2 Tbps InfiniBand between nodes.

The stack ships pre-configured with CUDA 12.x, TensorRT-LLM, vLLM, and Triton. That skips the build-out time that adds hidden weeks to hyperscaler deployments.

For teams that don't want to manage GPUs at all, the Inference Engine offers 100+ pre-deployed models with per-request billing from $0.000001 to $0.50 per call. Featured models include seedream-5.0-lite at $0.035/req and minimax-tts-speech-2.6-turbo at $0.06/req.

That's a different cost shape: zero idle GPU time, pay only when you call. Check gmicloud.ai/pricing for current rates.
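
A rough way to compare the two cost shapes is the request volume at which per-request spend equals one dedicated GPU-hour. The sketch below is billing arithmetic only. It assumes, without evidence from this article, that a single $2.00/hour H100 could serve the same model at that request rate, and `breakeven_requests_per_hour` is a hypothetical helper.

```python
GPU_RATE = 2.00  # $/GPU-hour for on-demand H100, as quoted above

# Per-request prices quoted above for two featured models.
PER_REQUEST_PRICES = {
    "seedream-5.0-lite": 0.035,
    "minimax-tts-speech-2.6-turbo": 0.06,
}


def breakeven_requests_per_hour(gpu_rate: float, price_per_request: float) -> float:
    """Requests per hour at which per-request billing matches one GPU-hour."""
    if price_per_request <= 0:
        raise ValueError("price_per_request must be positive")
    return gpu_rate / price_per_request


for model, price in PER_REQUEST_PRICES.items():
    rph = breakeven_requests_per_hour(GPU_RATE, price)
    print(f"{model}: about {rph:.0f} requests/hour to match one GPU-hour")
```

Below the break-even volume, per-request billing avoids paying for idle time. Above it, a dedicated GPU is worth pricing out, provided the model fits and you can keep it busy.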

Bottom Line

GPU cloud pricing in 2026 has two shapes: hyperscaler bundles and per-GPU-hour clouds. AWS, GCP, and Azure sell 8-GPU instances with bundled CPU/RAM/NVMe, with effective H100 rates of $5-$8/GPU-hour on-demand before commits. GMI Cloud, Lambda, and RunPod publish per-GPU rates from $2.00 to $3.39, and CoreWeave's derived HGX rate sits near $4.25.

The right choice depends on commit appetite, ecosystem lock-in, and whether you need single-GPU flexibility. Normalize the math first, then decide.

FAQ

Why is AWS H100 pricing so much higher than GMI Cloud's? AWS sells the p5.48xlarge as an 8-GPU bundle with 192 vCPUs and 2 TB of RAM included. The per-hour list price covers the whole node, so per-GPU math lands at $5-$7. GMI Cloud bills per-GPU-hour at $2.00 with no forced CPU bundle. Different pricing model, not a like-for-like discount.

Do hyperscaler reservations close the gap with specialized GPU clouds? Partly. A 3-year AWS or Azure reservation on H100 capacity can cut 50-60%, which pulls the $5-$7 on-demand per-GPU range down to roughly $2.00-$3.50. That mostly still sits at or above GMI Cloud's on-demand $2.00, and you give up flexibility. Reservations work when traffic is steady and you're certain of model choice.

What hidden costs should I model into my GPU cloud budget? Egress fees ($0.05-$0.12/GB on hyperscalers), idle CPU/RAM in bundled instances, region multipliers (10-30% for non-US regions), persistent block storage ($0.10-$0.30/GB-month), and networking adders for InfiniBand-equivalent fabrics. Budget 15-25% on top of GPU-hours for these in hyperscaler environments. Specialized GPU clouds typically compress that overhead.

Is per-GPU billing always cheaper than hyperscaler bundles? On headline rate, yes for H100 in 2026. On total cost of ownership, it depends. If you already run on AWS with reservations, S3-resident training data, and IAM/VPC infrastructure, the lift-and-shift cost can offset the per-GPU savings. Run the math with your actual utilization, egress, and commit profile before switching.

Colin Mo
