GPU Cloud Pricing Looks Simple Until You Try to Compare It
May 12, 2026
Every GPU cloud publishes a pricing page. One quotes per-GPU-hour. Another quotes per-million-tokens. A third quotes per-request. Comparing them side by side requires math that most pricing pages don't encourage you to do.
The result: teams pick providers based on headline rates and discover the real cost after the first invoice. This article breaks down why GPU cloud pricing for LLM inference resists easy comparison, how to normalize costs to a common unit, and where GMI Cloud fits in the landscape.
Three Pricing Models, Three Different Units
GPU cloud pricing for inference falls into three categories. Each measures cost in a different unit, making direct comparison impossible without conversion.
Per-GPU-hour charges a flat rate for GPU time regardless of utilization. An H100 at $2.10/hour costs the same whether it processes 1,000 requests or 100,000. This model rewards high utilization and punishes idle time.
Per-token charges based on input and output tokens processed. A provider might charge $0.30 per million input tokens and $0.60 per million output tokens. This model is transparent per-request but hides infrastructure efficiency.
Per-request charges a fixed price per API call regardless of token count. Video and image generation are commonly priced this way. A single video generation call might cost $0.15 whether the output is 3 seconds or 10 seconds.
The problem: a team comparing Provider A at $2.10/GPU-hour with Provider B at $0.40/million tokens has no basis for comparison without knowing throughput, batch size, and utilization rate.
The Normalization Math Most Teams Skip
To compare pricing models, normalize everything to one common unit: cost per million output tokens at your expected workload.
From per-GPU-hour to per-token:
The formula is: cost per million tokens = (GPU hourly rate / tokens per second / 3,600) × 1,000,000.
An H100 running Llama 70B in FP8 with vLLM at batch size 32 generates roughly 2,000-3,000 tokens per second. At $2.10/hour, that's approximately $0.19-$0.29 per million output tokens. The same GPU at 50% utilization doubles the effective cost to $0.38-$0.58 per million tokens.
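That conversion is easy to script as a sanity check before committing to a provider. A minimal sketch in Python, using the example rate and throughput above (your own throughput will vary by model, quantization, and batch size):

```python
def gpu_hour_to_cost_per_million(hourly_rate: float, tokens_per_sec: float,
                                 utilization: float = 1.0) -> float:
    """Effective $ per million output tokens for a GPU billed by the hour."""
    effective_tps = tokens_per_sec * utilization
    return hourly_rate / (effective_tps * 3600) * 1_000_000

# H100 at $2.10/hour, 2,000-3,000 tok/s (Llama 70B, FP8, vLLM, batch 32)
for tps in (2_000, 3_000):
    full = gpu_hour_to_cost_per_million(2.10, tps)
    half = gpu_hour_to_cost_per_million(2.10, tps, utilization=0.5)
    print(f"{tps} tok/s: ${full:.2f}/M at 100% util, ${half:.2f}/M at 50% util")
```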
From per-token to per-GPU-hour equivalent:
Reverse the formula: if a MaaS provider charges $0.40/million output tokens and you generate 100 million tokens per month, the monthly cost is $40. A self-hosted H100 at $2.10/hour running 24/7 costs about $1,533/month, so the break-even volume is $1,533 ÷ $0.40 per million ≈ 3.8 billion output tokens per month. That is roughly 50-75% of what a single H100 can produce at the throughput above (about 5-8 billion tokens per month at full utilization). Below that volume, MaaS wins.
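The break-even volume falls straight out of the same numbers. A short sketch, assuming one GPU running 24/7 at roughly 730 hours per month:

```python
GPU_HOURLY = 2.10        # $/GPU-hour for the H100 example above
MAAS_PER_MILLION = 0.40  # $/million output tokens from the MaaS provider
HOURS_PER_MONTH = 730    # 24/7 operation

gpu_monthly = GPU_HOURLY * HOURS_PER_MONTH       # ~$1,533/month fixed cost
breakeven_m = gpu_monthly / MAAS_PER_MILLION     # volume in millions of tokens
print(f"GPU fixed cost: ${gpu_monthly:,.0f}/month")
print(f"Break-even: {breakeven_m:,.0f}M tokens/month (~{breakeven_m / 1000:.1f}B)")
```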
The utilization variable is what makes or breaks self-hosted economics. The table below shows how effective cost per million tokens changes with utilization on an H100:
| GPU Utilization | Effective $/M Output Tokens | vs MaaS at $0.40/M |
|---|---|---|
| 90% | ~$0.22 | 45% cheaper |
| 70% | ~$0.28 | 30% cheaper |
| 50% | ~$0.39 | Roughly equal |
| 30% | ~$0.65 | 63% more expensive |
At 50% utilization, self-hosted and MaaS pricing roughly converge. Below that, MaaS is almost always cheaper.
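The table values can be reproduced with the same conversion. A sketch assuming the upper end of the throughput range (~3,000 tok/s at full utilization) and the $0.40/M MaaS comparison point:

```python
HOURLY, PEAK_TPS, MAAS = 2.10, 3_000, 0.40

for util in (0.9, 0.7, 0.5, 0.3):
    cost = HOURLY / (PEAK_TPS * util * 3600) * 1_000_000
    delta = (cost / MAAS - 1) * 100
    print(f"{util:.0%} utilization: ${cost:.2f}/M output tokens ({delta:+.0f}% vs MaaS)")
```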
What the Headline Rate Doesn't Include
Even after normalizing units, several cost components sit outside the headline rate.
Networking and egress. Transferring model outputs to your application incurs egress fees on most platforms. Some charge $0.05-$0.12 per GB; others include egress in the GPU rate. At scale, egress can add 5-15% to your effective cost.
Idle time between requests. Per-GPU-hour billing charges for every second the GPU is allocated, including gaps between requests. During off-peak hours, utilization can drop below 30%, inflating the effective per-token cost. Auto-scaling helps but introduces cold-start latency.
Storage for model weights. Large models require persistent storage. A 70B-parameter model in FP8 (one byte per parameter) occupies roughly 70 GB. Storing multiple model versions or fine-tuned checkpoints adds storage costs that don't appear on the GPU pricing page.
Engineering overhead. Self-hosted GPU instances require runtime maintenance, monitoring setup, and scaling logic. These engineering hours have a cost that per-token MaaS pricing absorbs into the per-request rate.
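A rough way to budget for these is to add them as line items on top of the headline GPU bill. The sketch below uses placeholder figures for egress and storage (the $0.08/GB and $0.10/GB-month rates are illustrative assumptions, not any provider's quote):

```python
def monthly_total(gpu_hourly: float, hours: float,
                  egress_gb: float, egress_per_gb: float,
                  storage_gb: float, storage_per_gb_month: float) -> float:
    """Headline GPU bill plus egress and model-weight storage line items."""
    return (gpu_hourly * hours
            + egress_gb * egress_per_gb
            + storage_gb * storage_per_gb_month)

# Illustrative: one H100 24/7, 2 TB of egress, three 70 GB model versions stored
headline = 2.10 * 730
total = monthly_total(2.10, 730, egress_gb=2_000, egress_per_gb=0.08,
                      storage_gb=3 * 70, storage_per_gb_month=0.10)
print(f"Headline ${headline:,.0f}/month -> effective ${total:,.0f}/month "
      f"(+{(total / headline - 1) * 100:.0f}%)")
```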
How Major Providers Price LLM Inference
Applying the normalization framework to current market pricing reveals where each provider sits.
Hyperscalers (AWS, GCP, Azure): On-demand H100 pricing ranges from ~$3.00/GPU-hour (GCP) to ~$6.98/GPU-hour (Azure). Per-token pricing is available through managed services (Bedrock, Vertex AI). Strength: ecosystem integration and enterprise support. Cost trade-off: highest per-GPU-hour rates in the market.
Specialized GPU clouds (RunPod, Lambda, Vast.ai, ThunderCompute): H100 pricing ranges from ~$1.38/hour (ThunderCompute) to ~$1.99/hour (RunPod). Strength: lower hourly rates and faster provisioning. Cost trade-off: less enterprise tooling, variable availability.
MaaS / API providers (Together AI, Fireworks, SiliconFlow): Per-token pricing with no GPU management. Strength: zero infrastructure overhead, cost scales linearly with usage. Cost trade-off: less control over batching, quantization, and model versions.
GMI Cloud: Dual pricing model. GPU instances: H100 SXM at ~$2.10/GPU-hour, H200 SXM at ~$2.50/GPU-hour. Inference Engine: 100+ pre-deployed models with per-request pricing ($0.000001-$0.50/request). The dual model lets teams start with per-request pricing for low-volume workloads and migrate to dedicated GPUs as utilization justifies it.
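Held at a fixed assumed throughput, the per-GPU-hour rates above normalize to per-token costs like this. The sketch ignores real differences in interconnects, software stacks, and availability, and the 2,500 tok/s figure is just the midpoint of the range used earlier:

```python
ASSUMED_TPS = 2_500  # midpoint of the 2,000-3,000 tok/s range used above

headline_rates = {   # $/GPU-hour, from the figures quoted in this section
    "Azure H100 (on-demand)": 6.98,
    "GCP H100 (on-demand)": 3.00,
    "GMI Cloud H100 SXM": 2.10,
    "RunPod H100": 1.99,
    "ThunderCompute H100": 1.38,
}

for provider, rate in headline_rates.items():
    per_million = rate / (ASSUMED_TPS * 3600) * 1_000_000
    print(f"{provider:24s} ${per_million:.2f}/M output tokens at full utilization")
```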
A Decision Framework for Pricing Model Selection
The right pricing model depends on three variables: monthly volume, utilization predictability, and engineering capacity.
| Monthly Token Volume | Utilization Pattern | Recommended Model | Reason |
|---|---|---|---|
| Under ~4B tokens (below single-GPU break-even) | Unpredictable | MaaS / per-request | No idle cost, zero overhead |
| ~4-8B tokens | Moderate, growing | On-demand GPU | Cheaper per-token above 50% utilization |
| 8B+ tokens (multiple GPUs) | Steady, high | Reserved GPU | 30-50% discount on committed capacity |
| Mixed (text + media) | Variable by type | Hybrid (MaaS + GPU) | Per-request for media, GPU for text |
The hybrid approach deserves attention: use MaaS for workloads with unpredictable volume (image generation, video) and dedicated GPUs for steady-state text inference. This avoids over-provisioning for bursty workloads while keeping per-token costs low for high-volume ones.
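Expressed as code, the framework is a small routing function. This is a sketch: the break-even default comes from the example math earlier and should be recomputed for your own rates, throughput, and utilization:

```python
def recommend_pricing_model(monthly_tokens_m: float, steady: bool,
                            mixed_media: bool = False,
                            breakeven_m: float = 3_833.0) -> str:
    """Pick a pricing model from monthly volume (millions of output tokens)
    and workload shape. breakeven_m is the volume where one dedicated GPU
    matches MaaS per-token pricing (~3.8B tokens in the example above)."""
    if mixed_media:
        return "Hybrid: per-request for media, dedicated GPU for text"
    if monthly_tokens_m < breakeven_m:
        return "MaaS / per-request"   # below break-even, idle GPU time dominates
    if steady:
        return "Reserved GPU"         # steady high volume earns committed-capacity discounts
    return "On-demand GPU"            # above break-even but still growing or variable
```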
GMI Cloud Pricing Infrastructure
GMI Cloud is worth evaluating for teams that want both pricing models on one platform.
GPU instances: H100 SXM (80 GB HBM3, 3.35 TB/s, ~$2.10/GPU-hour) and H200 SXM (141 GB HBM3e, 4.8 TB/s, ~$2.50/GPU-hour). Each node: 8 GPUs, NVLink 4.0 (900 GB/s bidirectional per GPU on HGX/DGX platforms), 3.2 Tbps InfiniBand. Pre-installed: TensorRT-LLM, vLLM, Triton, CUDA 12.x, NCCL.
Inference Engine: 100+ models with per-request pricing. Text models, video generation ($0.03-$0.50/request), image generation ($0.007-$0.134/request), and audio TTS ($0.005-$0.10/request). No GPU provisioning required.
Teams should run the normalization math above against their own workload before committing. Check gmicloud.ai/pricing for current rates.
Colin Mo
