GPU Cloud Pricing Looks Simple Until You Try to Compare It
May 12, 2026
Every GPU cloud publishes a pricing page. One quotes per-GPU-hour. Another quotes per-million-tokens. A third quotes per-request. Comparing them side by side requires math that most pricing pages don't encourage you to do.
The result: teams pick providers based on headline rates and discover the real cost after the first invoice. This article breaks down why GPU cloud pricing for LLM inference resists easy comparison, how to normalize costs to a common unit, and where GMI Cloud fits in the landscape.
Three Pricing Models, Three Different Units
GPU cloud pricing for inference falls into three categories. Each measures cost in a different unit, making direct comparison impossible without conversion.
Per-GPU-hour charges a flat rate for GPU time regardless of utilization. An H100 at $2.10/hour costs the same whether it processes 1,000 requests or 100,000. This model rewards high utilization and punishes idle time.
Per-token charges based on input and output tokens processed. A provider might charge $0.30 per million input tokens and $0.60 per million output tokens. This model is transparent per-request but hides infrastructure efficiency.
Per-request charges a fixed price per API call regardless of token count. Video and image generation are commonly priced this way. A single video generation call might cost $0.15 whether the output is 3 seconds or 10 seconds.
The problem: a team comparing Provider A at $2.10/GPU-hour with Provider B at $0.40/million tokens has no basis for comparison without knowing throughput, batch size, and utilization rate.
The Normalization Math Most Teams Skip
To compare pricing models, normalize everything to one common unit: cost per million output tokens at your expected workload.
From per-GPU-hour to per-token:
The formula is: cost per million tokens = (GPU hourly rate / tokens per second / 3,600) × 1,000,000.
An H100 running Llama 70B in FP8 with vLLM at batch size 32 generates roughly 2,000-3,000 tokens per second. At $2.10/hour, that's approximately $0.19-$0.29 per million output tokens. The same GPU at 50% utilization doubles the effective cost to $0.38-$0.58 per million tokens.
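That conversion is easy to script as a sanity check before committing to a provider. A minimal sketch in Python, using the example rate and throughput above (your own throughput will vary by model, quantization, and batch size):

```python
def gpu_hour_to_cost_per_million(hourly_rate: float, tokens_per_sec: float,
                                 utilization: float = 1.0) -> float:
    """Effective $ per million output tokens for a GPU billed by the hour."""
    effective_tps = tokens_per_sec * utilization
    return hourly_rate / (effective_tps * 3600) * 1_000_000

# H100 at $2.10/hour, 2,000-3,000 tok/s (Llama 70B, FP8, vLLM, batch 32)
for tps in (2_000, 3_000):
    full = gpu_hour_to_cost_per_million(2.10, tps)
    half = gpu_hour_to_cost_per_million(2.10, tps, utilization=0.5)
    print(f"{tps} tok/s: ${full:.2f}/M at 100% util, ${half:.2f}/M at 50% util")
```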
From per-token to per-GPU-hour equivalent:
Reverse the formula: if a MaaS provider charges $0.40/million output tokens and you generate 100 million tokens per month, the monthly cost is $40. A self-hosted H100 at $2.10/hour running 24/7 costs about $1,533/month, so the break-even volume is $1,533 ÷ $0.40 per million ≈ 3.8 billion output tokens per month. That is roughly 50-75% of what a single H100 can produce at the throughput above (about 5-8 billion tokens per month at full utilization). Below that volume, MaaS wins.
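The break-even volume falls straight out of the same numbers. A short sketch, assuming one GPU running 24/7 at roughly 730 hours per month:

```python
GPU_HOURLY = 2.10        # $/GPU-hour for the H100 example above
MAAS_PER_MILLION = 0.40  # $/million output tokens from the MaaS provider
HOURS_PER_MONTH = 730    # 24/7 operation

gpu_monthly = GPU_HOURLY * HOURS_PER_MONTH       # ~$1,533/month fixed cost
breakeven_m = gpu_monthly / MAAS_PER_MILLION     # volume in millions of tokens
print(f"GPU fixed cost: ${gpu_monthly:,.0f}/month")
print(f"Break-even: {breakeven_m:,.0f}M tokens/month (~{breakeven_m / 1000:.1f}B)")
```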
The utilization variable is what makes or breaks self-hosted economics. The table below shows how effective cost per million tokens changes with utilization on an H100:
| GPU Utilization | Effective $/M Output Tokens | vs MaaS at $0.40/M |
|---|---|---|
| 90% | ~$0.22 | 45% cheaper |
| 70% | ~$0.28 | 30% cheaper |
| 50% | ~$0.39 | Roughly equal |
| 30% | ~$0.65 | 63% more expensive |
At 50% utilization, self-hosted and MaaS pricing roughly converge. Below that, MaaS is almost always cheaper.
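The table values can be reproduced with the same conversion. A sketch assuming the upper end of the throughput range (~3,000 tok/s at full utilization) and the $0.40/M MaaS comparison point:

```python
HOURLY, PEAK_TPS, MAAS = 2.10, 3_000, 0.40

for util in (0.9, 0.7, 0.5, 0.3):
    cost = HOURLY / (PEAK_TPS * util * 3600) * 1_000_000
    delta = (cost / MAAS - 1) * 100
    print(f"{util:.0%} utilization: ${cost:.2f}/M output tokens ({delta:+.0f}% vs MaaS)")
```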
What the Headline Rate Doesn't Include
Even after normalizing units, several cost components sit outside the headline rate.
Networking and egress. Transferring model outputs to your application incurs egress fees on most platforms. Some charge $0.05-$0.12 per GB; others include egress in the GPU rate. At scale, egress can add 5-15% to your effective cost.
Idle time between requests. Per-GPU-hour billing charges for every second the GPU is allocated, including gaps between requests. During off-peak hours, utilization can drop below 30%, inflating the effective per-token cost. Auto-scaling helps but introduces cold-start latency.
Storage for model weights. Large models require persistent storage. A 70B-parameter model in FP8 (one byte per parameter) occupies roughly 70 GB. Storing multiple model versions or fine-tuned checkpoints adds storage costs that don't appear on the GPU pricing page.
Engineering overhead. Self-hosted GPU instances require runtime maintenance, monitoring setup, and scaling logic. These engineering hours have a cost that per-token MaaS pricing absorbs into the per-request rate.
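A rough way to budget for these is to add them as line items on top of the headline GPU bill. The sketch below uses placeholder figures for egress and storage (the $0.08/GB and $0.10/GB-month rates are illustrative assumptions, not any provider's quote):

```python
def monthly_total(gpu_hourly: float, hours: float,
                  egress_gb: float, egress_per_gb: float,
                  storage_gb: float, storage_per_gb_month: float) -> float:
    """Headline GPU bill plus egress and model-weight storage line items."""
    return (gpu_hourly * hours
            + egress_gb * egress_per_gb
            + storage_gb * storage_per_gb_month)

# Illustrative: one H100 24/7, 2 TB of egress, three 70 GB model versions stored
headline = 2.10 * 730
total = monthly_total(2.10, 730, egress_gb=2_000, egress_per_gb=0.08,
                      storage_gb=3 * 70, storage_per_gb_month=0.10)
print(f"Headline ${headline:,.0f}/month -> effective ${total:,.0f}/month "
      f"(+{(total / headline - 1) * 100:.0f}%)")
```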
How Major Providers Price LLM Inference
Applying the normalization framework to current market pricing reveals where each provider sits.
Hyperscalers (AWS, GCP, Azure): On-demand H100 pricing ranges from ~$3.00/GPU-hour (GCP) to ~$6.98/GPU-hour (Azure). Per-token pricing is available through managed services (Bedrock, Vertex AI). Strength: ecosystem integration and enterprise support. Cost trade-off: highest per-GPU-hour rates in the market.
Specialized GPU clouds (RunPod, Lambda, Vast.ai, ThunderCompute): H100 pricing ranges from ~$1.38/hour (ThunderCompute) to ~$1.99/hour (RunPod). Strength: lower hourly rates and faster provisioning. Cost trade-off: less enterprise tooling, variable availability.
MaaS / API providers (Together AI, Fireworks, SiliconFlow): Per-token pricing with no GPU management. Strength: zero infrastructure overhead, cost scales linearly with usage. Cost trade-off: less control over batching, quantization, and model versions.
GMI Cloud: Dual pricing model. GPU instances: H100 SXM at ~$2.10/GPU-hour, H200 SXM at ~$2.50/GPU-hour. Inference Engine: 100+ pre-deployed models with per-request pricing ($0.000001-$0.50/request). The dual model lets teams start with per-request pricing for low-volume workloads and migrate to dedicated GPUs as utilization justifies it.
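Held at a fixed assumed throughput, the per-GPU-hour rates above normalize to per-token costs like this. The sketch ignores real differences in interconnects, software stacks, and availability, and the 2,500 tok/s figure is just the midpoint of the range used earlier:

```python
ASSUMED_TPS = 2_500  # midpoint of the 2,000-3,000 tok/s range used above

headline_rates = {   # $/GPU-hour, from the figures quoted in this section
    "Azure H100 (on-demand)": 6.98,
    "GCP H100 (on-demand)": 3.00,
    "GMI Cloud H100 SXM": 2.10,
    "RunPod H100": 1.99,
    "ThunderCompute H100": 1.38,
}

for provider, rate in headline_rates.items():
    per_million = rate / (ASSUMED_TPS * 3600) * 1_000_000
    print(f"{provider:24s} ${per_million:.2f}/M output tokens at full utilization")
```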
A Decision Framework for Pricing Model Selection
The right pricing model depends on three variables: monthly volume, utilization predictability, and engineering capacity.
| Monthly Token Volume | Utilization Pattern | Recommended Model | Reason |
|---|---|---|---|
| Under ~4B tokens (below single-GPU break-even) | Unpredictable | MaaS / per-request | No idle cost, zero overhead |
| ~4-8B tokens | Moderate, growing | On-demand GPU | Cheaper per-token above 50% utilization |
| 8B+ tokens (multiple GPUs) | Steady, high | Reserved GPU | 30-50% discount on committed capacity |
| Mixed (text + media) | Variable by type | Hybrid (MaaS + GPU) | Per-request for media, GPU for text |
The hybrid approach deserves attention: use MaaS for workloads with unpredictable volume (image generation, video) and dedicated GPUs for steady-state text inference. This avoids over-provisioning for bursty workloads while keeping per-token costs low for high-volume ones.
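Expressed as code, the framework is a small routing function. This is a sketch: the break-even default comes from the example math earlier and should be recomputed for your own rates, throughput, and utilization:

```python
def recommend_pricing_model(monthly_tokens_m: float, steady: bool,
                            mixed_media: bool = False,
                            breakeven_m: float = 3_833.0) -> str:
    """Pick a pricing model from monthly volume (millions of output tokens)
    and workload shape. breakeven_m is the volume where one dedicated GPU
    matches MaaS per-token pricing (~3.8B tokens in the example above)."""
    if mixed_media:
        return "Hybrid: per-request for media, dedicated GPU for text"
    if monthly_tokens_m < breakeven_m:
        return "MaaS / per-request"   # below break-even, idle GPU time dominates
    if steady:
        return "Reserved GPU"         # steady high volume earns committed-capacity discounts
    return "On-demand GPU"            # above break-even but still growing or variable
```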
GMI Cloud Pricing Infrastructure
GMI Cloud is worth evaluating for teams that want both pricing models on one platform.
GPU instances: H100 SXM (80 GB HBM3, 3.35 TB/s, ~$2.10/GPU-hour) and H200 SXM (141 GB HBM3e, 4.8 TB/s, ~$2.50/GPU-hour). Each node: 8 GPUs, NVLink 4.0 (900 GB/s bidirectional per GPU on HGX/DGX platforms), 3.2 Tbps InfiniBand. Pre-installed: TensorRT-LLM, vLLM, Triton, CUDA 12.x, NCCL.
Inference Engine: 100+ models with per-request pricing. Text models, video generation ($0.03-$0.50/request), image generation ($0.007-$0.134/request), and audio TTS ($0.005-$0.10/request). No GPU provisioning required.
Teams should run the normalization math above against their own workload before committing. Check gmicloud.ai/pricing for current rates.
Colin Mo
