Cost Per Million Tokens Explained: LLM Inference Cost on GPU Cloud

May 28, 2026

Most LLM cost calculators quote the GPU hourly rate as if your model hits peak throughput 100 percent of the time. Real production utilization sits closer to 30 to 60 percent, so the real cost per million tokens comes in roughly 2x to 3x higher than the spreadsheet predicted.

You'll see the gap when sprint budgets evaporate by week three, when the CFO asks why managed API quotes look cheaper, and when you can't tell if the issue is the GPU, the model, or the batch strategy. The formula itself is simple. What's hard is feeding it honest throughput.

This article walks through the cost equation, four worked H100 and H200 examples, the variables that move TPS by 3x, and when self-hosting beats per-request pricing.

The Formula in One Line

Here's the math everyone needs:

Cost per 1M tokens = (GPU $/hour) ÷ (TPS × 3,600) × 1,000,000

TPS is tokens per second your deployment actually serves, not the marketing peak. The 3,600 converts hours to seconds. Multiply by one million to get the standard pricing unit you'll compare against managed APIs.

Two things break this formula in practice. First, TPS isn't a single number. It depends on batch size, sequence length, numeric precision, and KV-cache pressure. Second, utilization rarely hits 100 percent, so you should divide the formula output by your real duty cycle.
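Here's a minimal Python sketch of the formula with the duty-cycle adjustment folded in. The function name, parameter names, and sample inputs ($2.00 per hour, 85 TPS, 40 percent duty cycle) are illustrative assumptions, not measured values.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tps: float, duty_cycle: float = 1.0) -> float:
    """USD cost to serve one million tokens.

    gpu_hourly_usd: what you pay per GPU-hour
    tps:            sustained tokens per second while the GPU is busy
    duty_cycle:     fraction of paid time the GPU does useful work (0 < d <= 1)
    """
    if tps <= 0:
        raise ValueError("tps must be positive")
    if not 0 < duty_cycle <= 1:
        raise ValueError("duty_cycle must be in (0, 1]")
    # Tokens actually served per paid hour, after idle time is removed.
    tokens_per_paid_hour = tps * 3600 * duty_cycle
    return gpu_hourly_usd / tokens_per_paid_hour * 1_000_000


print(f"{cost_per_million_tokens(2.00, 85):.2f}")                  # 6.54
print(f"{cost_per_million_tokens(2.00, 85, duty_cycle=0.4):.2f}")  # 16.34
```

Leaving duty_cycle at its default of 1.0 reproduces the one-line formula. Passing your measured duty cycle gives the number to budget against.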

Four Worked Examples on H100 and H200

GMI Cloud lists H100 SXM at around $2.00 per GPU-hour and H200 SXM at around $2.60 per GPU-hour (check gmicloud.ai/pricing for current rates). The throughput numbers below are approximate and depend on TensorRT-LLM tuning, numeric precision, batch size, and prompt length.

| Example | Model class | GPU | Precision | Approx. TPS | Cost per 1M tokens |
|---|---|---|---|---|---|
| A | 70B dense | 1x H100 | FP8 | ~85 | ~$6.54 |
| B | 70B dense | 1x H200 | FP8 | ~140 | ~$5.16 |
| C | 70B-class (e.g., Qwen 72B) | 1x H100 | INT4 | ~120 | ~$4.63 |
| D | 7B dense | 1x H100 | FP8 | ~400 | ~$1.39 |

Example A math: $2.00 / (85 x 3,600) x 1,000,000 = $6.54 per 1M tokens.

Example B math: $2.60 / (140 x 3,600) x 1,000,000 = $5.16. Bigger memory bandwidth (4.8 TB/s vs 3.35 TB/s, per the NVIDIA H200 Product Brief 2024) lifts decode throughput on memory-bound workloads.

Example C math: $2.00 / (120 x 3,600) x 1,000,000 = $4.63. INT4 halves weight reads relative to FP8, so a 70B-class model decodes faster on the H100 than the FP8 version.

Example D math: $2.00 / (400 x 3,600) x 1,000,000 = $1.39. Small models change the cost picture: this is roughly 4.7x cheaper than Example A on the same GPU.
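If you want to rerun the four examples with your own rates, a short script does the job. This is a sketch: the Scenario class and its field names are made up for illustration, and the TPS figures are the approximate ones quoted above, not benchmarks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    label: str
    gpu_hourly_usd: float  # listed on-demand rate (approximate)
    tps: float             # approximate sustained tokens per second

    def cost_per_million(self) -> float:
        return self.gpu_hourly_usd / (self.tps * 3600) * 1_000_000


SCENARIOS = [
    Scenario("A: 70B dense, H100, FP8", 2.00, 85),
    Scenario("B: 70B dense, H200, FP8", 2.60, 140),
    Scenario("C: 70B-class, H100, INT4", 2.00, 120),
    Scenario("D: 7B dense, H100, FP8", 2.00, 400),
]

baseline = SCENARIOS[0].cost_per_million()
for s in SCENARIOS:
    cost = s.cost_per_million()
    # The ratio shows how far each example sits from Example A.
    print(f"{s.label:<26} ${cost:.2f} per 1M tokens ({baseline / cost:.1f}x vs A)")
```

Swap in your own hourly rate and benchmarked TPS to see where your deployment lands.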

Why TPS Isn't One Number

Three variables move throughput hard, and ignoring them is where most spreadsheets blow up.

  • Quantization. FP8 halves weight memory versus FP16. INT4 halves it again. Decode speeds up proportionally on memory-bound steps.
  • Batch size. Batch 1 to batch 32 can lift aggregate TPS 5x to 10x, but per-request latency climbs.
  • Sequence length. Long prompts inflate KV-cache reads. Decoding at 2K context is materially cheaper than at 32K on the same GPU.

That's why the table above lists TPS as approximate. Your real number depends on the workload mix you actually serve.
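To see why quantization moves decode speed roughly in proportion, try a back-of-envelope ceiling. It assumes decode is purely memory-bound and that every generated token reads all the weights once. It ignores KV-cache reads, activations, and kernel overhead, so it's a rough upper bound for a single stream, not a prediction. The function name and the 70B parameter count are illustrative.

```python
def decode_tps_ceiling(params_billions: float, bytes_per_param: float, bandwidth_tb_s: float) -> float:
    """Rough single-stream decode ceiling in tokens per second.

    Assumes each token requires one full pass over the weights and that
    memory bandwidth is the only bottleneck.
    """
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes


H100_BANDWIDTH_TB_S = 3.35  # HBM bandwidth figure quoted in this article

for name, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    ceiling = decode_tps_ceiling(70, bytes_per_param, H100_BANDWIDTH_TB_S)
    print(f"{name}: ~{ceiling:.0f} tokens/s per stream")
```

Halving the bytes per parameter doubles the ceiling, which is the proportional speedup the quantization bullet describes. Batching amortizes one weight read across many requests, so aggregate TPS can exceed this per-stream figure.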

Engineering Reality You Can't Skip

Engineers shipping this in production hit five things the formula hides.

KV-cache memory. Each request reserves cache proportional to 2 (keys and values) x layers x KV heads x head-dim x sequence length x bytes per value. A Llama-70B request at 4K context with FP16 KV, assuming the grouped-query layout of 80 layers, 8 KV heads, and head dimension 128, eats roughly 1.3 GB. Run 32 concurrent requests and you've burned about 43 GB before counting weights. Once VRAM fills, batch size collapses and TPS drops with it.
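The KV-cache arithmetic is easy to check for your own model. The sketch below assumes a grouped-query attention layout of 80 layers, 8 KV heads, and head dimension 128, which matches the published Llama 70B configurations as far as I know. Treat those numbers, and the function name, as assumptions to replace with your model's config.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """KV-cache size in bytes for one request.

    The leading 2 accounts for storing both keys and values.
    bytes_per_value is 2 for FP16 and 1 for an FP8 cache.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed 70B-class layout with grouped-query attention.
per_request = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=4096)
concurrent = 32

print(f"per request:  {per_request / 1e9:.2f} GB")               # 1.34 GB
print(f"{concurrent} requests:  {concurrent * per_request / 1e9:.1f} GB")  # 42.9 GB
```

Under these assumptions a 4K-token request comes to about 1.3 GB. Different head counts, context lengths, or KV precisions shift the figure, so recompute it before sizing batches.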

Batch sizing impact. Continuous batching in vLLM and TensorRT-LLM helps, but only if your traffic shape supports it. Spiky, low-concurrency workloads underfill batches and run closer to per-request throughput, which is often 3x to 5x worse than the published peak.

FP8 vs FP16. FP8 halves the bytes read per weight relative to FP16, which effectively doubles how many parameters Hopper can stream per second. That's why H100 and H200 quote FP8 numbers prominently (NVIDIA H100 Datasheet, 2023). Switching from FP16 to FP8 typically lifts TPS 1.5x to 1.8x on decode-bound models, with quality loss that depends on calibration.

Prompt-length variance. Prefill dominates time for long prompts. A workload averaging 8K input tokens behaves nothing like one averaging 500. Benchmark with your actual prompt distribution, not the vendor's reference scenario.

Utilization decay. Production duty cycle of 40 percent means your real cost per 1M tokens is 2.5x the formula output. Bake that in.
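You can get duty cycle and effective cost from two things you already have: the tokens you served over a billing window and your benchmarked TPS. The sketch below uses a hypothetical week of traffic. The function names and the token count are made up for illustration.

```python
def duty_cycle(tokens_served: int, benchmarked_tps: float,
               hours_billed: float, gpu_count: int = 1) -> float:
    """Fraction of paid capacity actually used over the billing window."""
    capacity_tokens = benchmarked_tps * 3600 * hours_billed * gpu_count
    return tokens_served / capacity_tokens


def effective_cost_per_million(tokens_served: int, gpu_hourly_usd: float,
                               hours_billed: float, gpu_count: int = 1) -> float:
    """What you really paid per 1M tokens, idle time included."""
    spend_usd = gpu_hourly_usd * hours_billed * gpu_count
    return spend_usd / tokens_served * 1_000_000


WEEK_HOURS = 168
tokens = 20_563_200  # hypothetical tokens served in one week on one GPU

print(f"duty cycle: {duty_cycle(tokens, 85, WEEK_HOURS):.0%}")
print(f"effective cost: ${effective_cost_per_million(tokens, 2.00, WEEK_HOURS):.2f} per 1M")
```

With these inputs the duty cycle works out to 40 percent and the effective cost to about $16.34, which is 2.5x the $6.54 the formula gives at full utilization.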

Self-Hosted vs Managed API: The Break-Even

Managed inference APIs typically run $0.50 to $5 per 1M tokens for small-class models such as GPT mini variants and $5 to $30 per 1M tokens for frontier-class Claude, Gemini, or GPT models. DeepSeek's reasoning-class models often price below frontier tier but above nano tier. Self-hosting wins or loses on volume and utilization.

| Scenario | When self-host wins | When managed API wins |
|---|---|---|
| 7B model, steady 24/7 traffic | Self-host (~$1.39 / 1M at 100%) | Managed only if traffic is bursty |
| 70B FP8, 40% duty cycle | Break-even near $13-$16 / 1M | Managed API at $5-$10 / 1M wins |
| Frontier-class | Hard to self-host at scale | Managed API almost always |
| Spiky, low-volume | Idle GPU burns budget | Per-request billing wins |

The break-even math: take your formula output, divide by realistic utilization. If a managed API price sits below that adjusted number, the API is cheaper. If above, self-hosting on H100 or H200 wins on a per-million-tokens basis, assuming you can actually keep the GPU busy.
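The break-even check fits in a few lines. This sketch also inverts the comparison to ask what duty cycle you'd need for self-hosting to match a given API price. The function names and the $10 and $5 API prices are illustrative assumptions.

```python
def break_even_duty_cycle(formula_cost: float, api_price: float) -> float:
    """Duty cycle at which self-hosted cost equals the API price.

    A result above 1.0 means self-hosting can't match the API
    even at full utilization.
    """
    return formula_cost / api_price


def cheaper_option(formula_cost: float, duty_cycle: float, api_price: float) -> str:
    """Compare utilization-adjusted self-hosted cost against an API quote."""
    adjusted_cost = formula_cost / duty_cycle
    return "self-host" if adjusted_cost < api_price else "managed API"


# 70B FP8 on H100 (~$6.54 per 1M at 100%) against a hypothetical $10 API quote.
print(f"{break_even_duty_cycle(6.54, 10.0):.0%}")  # 65%
print(cheaper_option(6.54, 0.40, 10.0))            # managed API
# 7B on H100 (~$1.39 per 1M at 100%) against a hypothetical $5 API quote.
print(cheaper_option(1.39, 0.90, 5.0))             # self-host
```

The break-even duty cycle is the number to compare against your measured utilization. If it's higher than what you can sustain, the API wins.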

Where GMI Cloud Fits

GMI Cloud (gmicloud.ai) sits on both sides of this equation. On-demand H100 SXM at around $2.00 per GPU-hour and H200 SXM at around $2.60 per GPU-hour give you the formula inputs for self-hosted math. Pre-configured TensorRT-LLM, vLLM, and Triton stacks remove most of the throughput-tuning ramp.

For workloads where utilization stays low or traffic is bursty, the Inference Engine offers 100+ pre-deployed models behind per-request pricing. Run the formula, check your duty cycle, then pick the path that fits your traffic.

Bottom Line

The formula is one line. The honest version requires real TPS measured on your workload, real utilization over a representative week, and acceptance that batch size and quantization can swing your number 3x.

Calculate once with vendor peak TPS, calculate again with benchmarked TPS, then compare both against managed API quotes. The gap is usually where the build-vs-buy decision lives.

FAQ

How accurate is the cost-per-1M-tokens formula for production?

It's directionally correct but assumes 100 percent utilization. Multiply your formula output by (1 / duty cycle) to get a realistic production number. A 40 percent duty cycle means your real cost is 2.5x the formula output, which is usually where managed APIs start looking competitive.

What TPS should I use if I haven't benchmarked yet?

Start with vendor reference numbers, then haircut by 40 to 60 percent for your first estimate. Real TPS depends on batch size, sequence length, and prompt distribution. Run a 24-hour load test with your actual traffic shape before locking the budget.
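The haircut turns one vendor number into a cost range you can budget against. This sketch assumes a hypothetical vendor reference of 200 TPS and a $2.00 hourly rate. Both are placeholders, as is the function name.

```python
def haircut_tps_range(vendor_tps: float, low_haircut: float = 0.40,
                      high_haircut: float = 0.60) -> tuple[float, float]:
    """Return (pessimistic_tps, optimistic_tps) after applying the haircut."""
    return vendor_tps * (1 - high_haircut), vendor_tps * (1 - low_haircut)


def cost_per_million(gpu_hourly_usd: float, tps: float) -> float:
    return gpu_hourly_usd / (tps * 3600) * 1_000_000


pessimistic_tps, optimistic_tps = haircut_tps_range(200)  # hypothetical vendor figure
print(f"TPS range:  {pessimistic_tps:.0f} to {optimistic_tps:.0f}")
print(f"cost range: ${cost_per_million(2.00, optimistic_tps):.2f} "
      f"to ${cost_per_million(2.00, pessimistic_tps):.2f} per 1M tokens")
```

Budget against the high end of the cost range until your load test replaces the estimate with a measured TPS.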

Does H200 always beat H100 on cost per 1M tokens?

Often yes for 70B and larger models because the 4.8 TB/s memory bandwidth helps decode-bound workloads (NVIDIA H200 Product Brief, 2024). For 7B to 13B models that fit comfortably in H100's 80 GB VRAM, H100 at the lower hourly rate usually wins on pure cost per token.
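You can reduce the H200 question to one ratio: the H200 wins on cost per token only if its throughput gain exceeds its price premium. The sketch below uses the approximate rates and TPS figures from this article, and the function name is illustrative.

```python
def required_speedup(price_new: float, price_old: float) -> float:
    """Throughput multiple the pricier GPU must deliver to match cost per token."""
    return price_new / price_old


H100_RATE, H200_RATE = 2.00, 2.60  # approximate $/GPU-hour quoted in this article

threshold = required_speedup(H200_RATE, H100_RATE)
observed_70b = 140 / 85            # approximate 70B FP8 TPS, H200 vs H100
bandwidth_ratio = 4.8 / 3.35       # H200 vs H100 memory bandwidth (TB/s)

print(f"H200 must be at least {threshold:.2f}x faster to break even")
print(f"70B FP8 example speedup:  {observed_70b:.2f}x")
print(f"memory bandwidth ratio:   {bandwidth_ratio:.2f}x")
print("H200 wins" if observed_70b > threshold else "H100 wins")
```

At these rates the threshold is 1.3x. If your model gains less than that on H200, as small compute-bound models may, H100 stays cheaper per token.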

When does self-hosting beat managed APIs?

When your GPU duty cycle clears roughly 50 to 70 percent and your model fits a single H100 or H200 with reasonable batching. Below that utilization, per-request billing on a managed inference platform almost always wins. Check current pricing at gmicloud.ai/pricing before committing either way.

Colin Mo
