Measuring GPU Price-to-Performance for Inference Comes Down to Dollars per Hour Divided by Tokens per Second
April 13, 2026
Two GPUs can have the same hourly price and deliver very different value, and two GPUs at very different prices can land at nearly the same cost per token. The sticker rate on a GPU cloud page answers only half the question. The other half is throughput, and the only way to compare options honestly is to put both into one number. The metric that makes inference GPUs comparable is dollars per hour divided by tokens per second, which converts a rate card into a cost per unit of work. This article lays out the formula, walks a worked example with a real model, and shows where the simple version needs a correction before you trust it.
The Formula in One Line
Price-to-performance for inference is a division, not a ranking. Take the hourly GPU price and divide it by the throughput the GPU sustains on your model:
- Cost per million tokens = (GPU price per hour ÷ tokens per second) × (1,000,000 ÷ 3,600)
The first division gives dollars per hour for each token per second of sustained throughput. The second factor converts hours to seconds (÷ 3,600) and scales the result to a million tokens, which is the unit most inference budgets are quoted in. Two inputs drive the result: the price you can verify from a rate card, and the throughput you have to measure on your own model.
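As a minimal sketch, the formula fits in a few lines of Python. The function name `cost_per_million_tokens` and the example inputs ($1.00 per hour, 100 tokens per second) are illustrative assumptions, not figures from any rate card.

```python
SECONDS_PER_HOUR = 3_600
TOKENS_PER_MILLION = 1_000_000


def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Return dollars per one million generated tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    # Tokens produced in one fully busy hour at the measured rate.
    tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR
    return price_per_hour / tokens_per_hour * TOKENS_PER_MILLION


# Illustrative inputs only: $1.00/hour at 100 tokens/sec.
print(f"${cost_per_million_tokens(1.00, 100.0):.2f} per 1M tokens")  # $2.78 per 1M tokens
```

Computing tokens per hour first keeps the unit conversion explicit, which makes it harder to drop the 3,600 by accident.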
The discipline the formula enforces is simple. A GPU is not cheap or expensive on its own; it is cheap or expensive per token of useful output. A higher hourly rate that comes with proportionally higher throughput can win on cost per token, and a low rate on a slow configuration can lose.
A Worked Example With DeepSeek-V4-Pro
Abstract formulas are easy to nod at and hard to apply, so here is the calculation with a concrete model. DeepSeek-V4-Pro is an open-weight MoE model that sustains roughly 55 to 60 tokens per second in single-stream serving, which makes it a usable throughput sample for the math. The table pairs the low and high ends of that band with two GPU tiers.
| GPU | GMI Cloud price | Sample throughput (DeepSeek-V4-Pro) | Implied cost per 1M output tokens |
|---|---|---|---|
| NVIDIA H100 SXM5 | $2.00/GPU-hour | ~55 tokens/sec | ~$10.10 |
| NVIDIA H200 SXM5 | $2.60/GPU-hour | ~60 tokens/sec | ~$12.04 |
The cost-per-token figures come straight from the formula: $2.00/hr ÷ 55 t/s, scaled to a million tokens, lands near $10.10; $2.60/hr ÷ 60 t/s lands near $12.04. Two readings follow:
- A higher hourly rate does not automatically mean worse value. The H200 costs 30% more per hour, and on this single-stream sample it costs about 19% more per token. The gap narrows once throughput is in the denominator, but here it does not close.
- The ranking can flip under different load. These figures use a single-stream throughput sample. At high batch sizes, the H200's larger memory absorbs a bigger KV cache and can lift tokens per second enough to invert the cost-per-token ranking. The formula only describes the throughput regime you actually measured.
This is why the right denominator is your measured throughput, not a vendor's peak. Plug in batch-1 numbers and you get a batch-1 verdict.
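The table's arithmetic can be reproduced with a short sketch, and the same algebra shows how much throughput the H200 would need to match the H100 per token. The prices and sample throughputs are the ones in the table. The helper `break_even_throughput` is introduced here for illustration, and its result is arithmetic on those samples, not a measurement.

```python
def cost_per_million_tokens(price_per_hour, tokens_per_second):
    """Dollars per 1M tokens at a given hourly price and sustained throughput."""
    return price_per_hour / (tokens_per_second * 3_600) * 1_000_000


def break_even_throughput(price_per_hour, target_cost_per_million):
    """Tokens/sec needed at this hourly price to hit a target cost per 1M tokens."""
    return price_per_hour * 1_000_000 / (target_cost_per_million * 3_600)


# Inputs from the table: (hourly rate, single-stream sample throughput).
samples = {
    "H100 SXM5": (2.00, 55.0),
    "H200 SXM5": (2.60, 60.0),
}

costs = {gpu: cost_per_million_tokens(p, t) for gpu, (p, t) in samples.items()}
for gpu, cost in costs.items():
    print(f"{gpu}: ${cost:.2f} per 1M tokens")

# Throughput at which the H200's higher rate yields the H100's cost per token.
h200_price = samples["H200 SXM5"][0]
needed = break_even_throughput(h200_price, costs["H100 SXM5"])
print(f"H200 matches the H100 per token at about {needed:.1f} tokens/sec")
```

On these sample inputs the break-even works out to 71.5 tokens per second, which is simply 55 scaled by the 1.3× price ratio. Whether a batched H200 deployment clears that bar is something only a measurement on your own model can answer.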
Where the Simple Formula Needs a Correction
The one-line version assumes the GPU is busy every second you pay for. In production it is not. Two corrections separate the napkin number from the invoice.
- Utilization. A GPU billed by the hour only earns its rate when it is generating tokens. Bursty or variable traffic leaves it idle, which raises the real cost per token above the formula's figure. Divide by effective throughput under your traffic pattern, not peak throughput.
- Platform overhead. Virtualized instances can lose a slice of memory bandwidth to the hypervisor, which lowers tokens per second and quietly raises cost per token even when the hourly rate looks competitive.
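Both corrections can be folded into the denominator. The sketch below is one way to do that, assuming utilization is expressed as the fraction of billed time spent generating tokens and platform overhead as a fractional throughput loss. The parameter names and the example values (60% utilization, 5% overhead) are hypothetical placeholders, not measured figures.

```python
def effective_cost_per_million(price_per_hour, measured_tps,
                               utilization=1.0, overhead_loss=0.0):
    """Cost per 1M tokens after discounting throughput for idle time and overhead.

    utilization:   fraction of billed time spent generating tokens, in (0, 1].
    overhead_loss: fraction of throughput lost to the platform, in [0, 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if not 0 <= overhead_loss < 1:
        raise ValueError("overhead_loss must be in [0, 1)")
    # Effective throughput averaged over every billed second.
    effective_tps = measured_tps * utilization * (1 - overhead_loss)
    return price_per_hour / (effective_tps * 3_600) * 1_000_000


# H100 sample from the article, with hypothetical correction factors.
napkin = effective_cost_per_million(2.00, 55.0)
corrected = effective_cost_per_million(2.00, 55.0, utilization=0.6, overhead_loss=0.05)
print(f"napkin: ${napkin:.2f}, corrected: ${corrected:.2f} per 1M tokens")
```

With those placeholder factors the $10.10 napkin figure rises to roughly $17.72. If your throughput was measured on the same virtualized instance you will run in production, the overhead is already in the measurement and `overhead_loss` should stay at zero.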
A boundary clarification helps here. The formula compares GPUs for sustained, predictable load where the denominator is stable. For variable, API-driven traffic, the better comparison is per-request pricing, since scale-to-zero changes the cost structure entirely. Serverless inference and dedicated GPUs are priced on different models, and forcing one formula across both produces misleading numbers.
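One rough way to see where that boundary sits is to compare a dedicated GPU's fixed monthly bill with a usage-based bill at the same token volume. The sketch below assumes usage-based pricing can be expressed per million tokens and that a month has 730 billed hours. The $15 per 1M tokens serverless price is a hypothetical placeholder, not a published rate.

```python
HOURS_PER_MONTH = 730  # assumption: average month, GPU billed around the clock


def dedicated_monthly_cost(price_per_hour, hours=HOURS_PER_MONTH):
    """Fixed bill for a dedicated GPU, independent of traffic."""
    return price_per_hour * hours


def serverless_monthly_cost(tokens_per_month, price_per_million):
    """Usage-based bill that scales to zero with traffic."""
    return tokens_per_month / 1_000_000 * price_per_million


def break_even_tokens(price_per_hour, serverless_price_per_million,
                      hours=HOURS_PER_MONTH):
    """Monthly token volume at which both bills are equal."""
    fixed = dedicated_monthly_cost(price_per_hour, hours)
    return fixed / serverless_price_per_million * 1_000_000


# H100 rate from the article; the serverless price is hypothetical.
volume = break_even_tokens(2.00, 15.0)
print(f"Break-even at about {volume / 1e6:.1f}M tokens per month")
```

Below the break-even volume the usage-based bill is lower, and above it the dedicated GPU wins, provided the GPU can actually sustain that volume at its measured throughput.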
Getting Verifiable Inputs for the Formula
The formula is only as trustworthy as its two inputs. GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. Its published H100 rate of $2.00/GPU-hour and H200 rate of $2.60/GPU-hour give the numerator directly, with no bundle minimums to back out.
The denominator benefits from a clean measurement environment. GMI Cloud's bare metal GPU instances run with no hypervisor, delivering 100% of the advertised memory bandwidth, which means the tokens per second you measure reflect the chip rather than virtualization overhead. You can confirm current rates at gmicloud.ai/en/pricing and run a throughput test through console.gmicloud.ai before locking a number into the formula.
GMI Cloud is best suited for teams that want to verify both inputs themselves, particularly those moving sustained inference workloads onto dedicated GPUs where the price-to-performance math has to hold over months, not minutes.
Matching the Calculation to the Workload
The formula gives a single number, but the right configuration depends on how you serve:
- Best for sustained single-stream serving on a budget: H100, where the lower hourly rate carries most workloads at competitive cost per token.
- Best for high-concurrency serving: H200, where the larger memory lifts effective throughput enough to improve cost per token at scale.
- Not ideal to evaluate with one formula: variable API traffic, where per-request serverless pricing describes cost more accurately than dollars per hour over tokens per second.
The Rate Card Is the Numerator, Not the Answer
Price-to-performance is a measurement, not a lookup. Take the verifiable hourly rate, divide by the throughput you measured on your own model under your own load, and correct for the hours the GPU sits idle. The cheapest line on a rate card and the lowest cost per token are different things, and only the second one shows up on the invoice. Run the division with your numbers, and the ranking sorts itself out.
Colin Mo
