Cost Per Million Tokens Comes From One Formula, and the Numbers That Feed It Move More Than the GPU Rate Does
April 13, 2026
A team compares two GPUs by hourly price, picks the cheaper one, and ends up with a higher cost per million tokens because the cheaper card generated fewer tokens per second. The hourly rate is only half of the unit-cost equation. The other half is throughput, and throughput moves with quantization, batch size, and model choice far more than the rate card suggests. Cost per million tokens equals the GPU hourly rate divided by the millions of tokens generated per hour, which means a slower card at a lower rate can still cost more per token. This article gives you the formula, works a real example at GMI Cloud rates, and shows which levers actually move the result.
The Formula, Stated Plainly
The unit you want is dollars per million output tokens. It comes from two inputs and a unit conversion:
- The GPU hourly rate, in $/GPU-hour.
- The sustained generation speed, in tokens per second (TPS).
- A conversion from seconds to an hour and from tokens to a million.
The arithmetic is:
Cost per 1M tokens = (GPU $/hour) / (TPS x 3600) x 1,000,000
The denominator, TPS x 3600, is tokens generated per hour. Divide the hourly rate by that, then scale to a million. Everything that matters in inference cost lives in those two numbers, the rate and the throughput, and the throughput is the one most teams underestimate.
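As a minimal sketch, the formula translates directly into a small Python function. The function and constant names here are illustrative choices, not part of any GMI Cloud tooling.

```python
SECONDS_PER_HOUR = 3600
TOKENS_PER_MILLION = 1_000_000


def cost_per_million_tokens(gpu_rate_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per one million output tokens for a single GPU.

    gpu_rate_per_hour: price in $/GPU-hour.
    tokens_per_second: sustained generation speed (TPS) on that GPU.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR
    return gpu_rate_per_hour / tokens_per_hour * TOKENS_PER_MILLION
```

Because the rate sits in the numerator and throughput in the denominator, doubling both leaves the result unchanged: a card that costs twice as much per hour breaks even only if it also generates twice the tokens per second.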
A Worked Example at Real Rates
Take an H100 on GMI Cloud at $2.00/GPU-hour, serving a 70B model that sustains 40 tokens per second per request stream.
- Tokens per hour at a single stream: 40 x 3600 = 144,000.
- Cost per million tokens: $2.00 / 144,000 x 1,000,000 = about $13.90.
That number looks high because it assumes one stream per GPU. In production you batch many requests on the same card, and the per-token cost falls in inverse proportion to the aggregate tokens per second the GPU generates. Push effective throughput to 800 tokens per second across a batch and the same $2.00 rate yields:
- $2.00 / (800 x 3600) x 1,000,000 = about $0.69 per million tokens.
The rate did not change. The throughput rose 20x, and the unit cost fell by the same factor, well over an order of magnitude.
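To see how the unit cost moves between the two endpoints of the worked example, the sketch below sweeps aggregate throughput at the same $2.00 rate. The 40 and 800 tokens-per-second points come from the example; the intermediate values are hypothetical and only there to show the shape of the curve.

```python
def cost_per_million_tokens(rate_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per 1M tokens: rate / (TPS x 3600) x 1,000,000."""
    return rate_per_hour / (tokens_per_second * 3600) * 1_000_000


H100_RATE = 2.00  # $/GPU-hour, the rate used in the worked example


def sweep(rate_per_hour: float, tps_values: list[float]) -> list[tuple[float, float]]:
    """Return (TPS, cost per 1M tokens rounded to cents) for each throughput."""
    return [(tps, round(cost_per_million_tokens(rate_per_hour, tps), 2))
            for tps in tps_values]


# 100, 200 and 400 tok/s are hypothetical intermediate points.
for tps, cost in sweep(H100_RATE, [40, 100, 200, 400, 800]):
    print(f"{tps:>4} tok/s -> ${cost:.2f} per 1M tokens")
```

Each doubling of throughput halves the unit cost, so most of the absolute savings arrive in the first few doublings.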
The Three Levers That Move Throughput
Batching: The Largest Single Lever
A GPU serving one request at a time wastes most of its parallel capacity. Continuous batching packs many concurrent requests through the same forward passes, raising aggregate tokens per second dramatically. This is why per-token cost in production is far below the single-stream figure. The practical question is how much batching your latency budget allows, since larger batches raise throughput but can add queueing delay.
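One way to frame the batching trade-off is to pick the cheapest batch size that still meets a per-stream speed floor. The sketch below assumes you have benchmarked your own deployment. The measurements in it are hypothetical placeholders, and treating the latency budget as a minimum tokens per second per stream is a simplification that ignores queueing delay.

```python
from typing import NamedTuple, Optional


class BatchPoint(NamedTuple):
    batch_size: int        # concurrent requests in the batch
    per_stream_tps: float  # tokens/s each request sees at that batch size


def cheapest_batch(points: list[BatchPoint],
                   rate_per_hour: float,
                   min_per_stream_tps: float) -> Optional[tuple[int, float, float]]:
    """Return (batch_size, aggregate_tps, cost_per_1M) for the cheapest
    point whose per-stream speed meets the floor, or None if none does."""
    best = None
    for p in points:
        if p.per_stream_tps < min_per_stream_tps:
            continue  # too slow per request for the latency budget
        aggregate_tps = p.batch_size * p.per_stream_tps
        cost = rate_per_hour / (aggregate_tps * 3600) * 1_000_000
        if best is None or cost < best[2]:
            best = (p.batch_size, aggregate_tps, cost)
    return best


# Hypothetical benchmark results; replace with your own measurements.
MEASURED = [BatchPoint(1, 40.0), BatchPoint(8, 36.0),
            BatchPoint(32, 25.0), BatchPoint(64, 14.0)]

print(cheapest_batch(MEASURED, rate_per_hour=2.00, min_per_stream_tps=20.0))
```

With these placeholder numbers, a floor of 20 tokens per second per stream selects batch size 32 at 800 aggregate tokens per second. Relaxing the floor would allow the larger batch and a lower unit cost, at the price of slower individual responses.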
Quantization: Smaller Weights, Faster Tokens
Serving a model in FP8 instead of FP16 roughly halves the weight bytes read from memory per token. Decoding is typically memory-bound, so that raises tokens per second while also shrinking the footprint. A model that fits and runs in FP8 on an H100 can post meaningfully lower cost per million tokens than the same model in FP16, provided accuracy holds for your task.
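A rough way to see why precision matters is the common back-of-envelope bound for memory-bound decoding: each generated token requires reading every weight once, so single-stream speed cannot exceed memory bandwidth divided by weight bytes. The sketch below applies that bound. The 13B parameter count is a hypothetical model, the 3.35 TB/s bandwidth is an assumed figure for an H100 SXM, and the bound ignores KV-cache reads, activations, and kernel overhead, so real speedups are usually smaller than 2x.

```python
def decode_tps_upper_bound(bandwidth_bytes_per_s: float,
                           num_params: float,
                           bytes_per_param: float) -> float:
    """Rough single-stream ceiling for memory-bound decoding:
    bandwidth divided by the bytes of weights read per token."""
    weight_bytes = num_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


BANDWIDTH = 3.35e12  # bytes/s; assumed H100 SXM figure, check your hardware
PARAMS = 13e9        # hypothetical 13B dense model

fp16_bound = decode_tps_upper_bound(BANDWIDTH, PARAMS, bytes_per_param=2.0)
fp8_bound = decode_tps_upper_bound(BANDWIDTH, PARAMS, bytes_per_param=1.0)

print(f"FP16 ceiling: {fp16_bound:.0f} tok/s")
print(f"FP8 ceiling:  {fp8_bound:.0f} tok/s")
print(f"ratio: {fp8_bound / fp16_bound:.1f}x")
```

The absolute numbers are ceilings, not predictions. The useful part is the ratio: halving bytes per parameter doubles the bound, which is the mechanism behind the lower cost per million tokens.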
Model Choice: The Input You Control Before Hardware
The model itself sets a throughput ceiling. A dense 70B model and a Mixture-of-Experts model with far fewer active parameters generate tokens at very different speeds on the same card. Choosing an efficient model is often the cheapest way to lower cost per token, before any hardware change.
API Pricing as a Comparison Baseline
If you do not want to manage GPUs at all, managed model APIs price the same tokens directly, which gives you a baseline to compare your self-hosted math against.
| Option | Rate basis | Quantifiable cost | Notes |
|---|---|---|---|
| Self-hosted, H100 | $2.00/GPU-hour | depends on TPS, ~$0.69 to $13.90 per 1M | You own batching and utilization |
| Self-hosted, H200 | $2.60/GPU-hour | lower per token if higher TPS offsets rate | Extra bandwidth raises throughput |
| DeepSeek-V4-Pro (MaaS) | per token | $1.39 per 1M input | MoE, 55 to 60 TPS, no infra to run |
| GPT-5.4-nano (MaaS) | per token | $0.20/M input, $1.25/M output | 400K context reasoning model |
The quantifiable column is the point: self-hosted cost is a function you compute from rate and throughput, while managed APIs hand you a fixed per-million number. If your utilization is high, self-hosting on a $2.00 H100 can beat API pricing; if your traffic is thin, the API avoids paying for idle GPUs.
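Inverting the formula gives the aggregate throughput a self-hosted GPU must sustain to match a given per-million API price. The sketch below is illustrative only: it compares against a single per-million price and ignores any separate charge for input tokens.

```python
def break_even_tps(gpu_rate_per_hour: float, api_price_per_million: float) -> float:
    """Aggregate tokens/s at which self-hosted unit cost equals the API price.

    Solves rate / (TPS x 3600) x 1,000,000 = api_price for TPS.
    """
    if api_price_per_million <= 0:
        raise ValueError("api_price_per_million must be positive")
    return gpu_rate_per_hour / api_price_per_million * 1_000_000 / 3600


# Rates and the $1.25/M output price are taken from the table above.
for name, rate in [("H100", 2.00), ("H200", 2.60)]:
    tps = break_even_tps(rate, api_price_per_million=1.25)
    print(f"{name} at ${rate:.2f}/hr needs {tps:.0f} tok/s to match $1.25 per 1M")
```

At $2.00 per hour, the H100 has to sustain roughly 444 tokens per second in aggregate to match a $1.25 per million output price. Above that throughput self-hosting is cheaper per token, and below it the API is.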
A Boundary Worth Drawing
Cost per million tokens and total inference cost are not the same measurement, and treating them as one leads to bad calls. Cost per million tokens is a unit rate that assumes a given utilization. Total cost includes the GPUs you pay for while idle, the engineering time to run batching and serving, and the throughput you fail to reach in practice. A low per-token figure on paper means little if your cards sit at 30% utilization. Compute the unit cost, then multiply by realistic volume and divide by realistic utilization before comparing options.
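The adjustment described above can be sketched in a few lines. Utilization here means the fraction of paid GPU-hours actually spent generating at the sustained throughput. The monthly volume is a hypothetical input, and the model ignores engineering time and billing granularity.

```python
def effective_cost_per_million(unit_cost: float, utilization: float) -> float:
    """Unit cost after accounting for paid-but-idle GPU time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return unit_cost / utilization


def total_cost(unit_cost: float, millions_of_tokens: float, utilization: float) -> float:
    """Unit cost x volume / utilization, as described in the text."""
    return effective_cost_per_million(unit_cost, utilization) * millions_of_tokens


UNIT_COST = 2.00 / (800 * 3600) * 1_000_000  # the ~$0.69 batched H100 figure

print(f"effective: ${effective_cost_per_million(UNIT_COST, 0.30):.2f} per 1M")
# 500M tokens per month is a hypothetical volume.
print(f"monthly:   ${total_cost(UNIT_COST, 500, 0.30):,.2f}")
```

At 30% utilization the $0.69 paper figure becomes about $2.31 per million tokens, which is the number to hold against a managed API price.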
Where the Rate and the Throughput Are Both Visible
Once you have the formula, the platform question is which provider gives you both a transparent rate and the throughput conditions to hit it.
GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. It publishes hourly rates, $2.00 for the H100 and $2.60 for the H200, so the numerator of the formula is fixed and auditable. GMI Cloud's bare metal instances run with no hypervisor, delivering 100% of the advertised memory bandwidth, which is what raises the tokens-per-second denominator that decides cost per token. For teams that prefer to skip infrastructure, its Serverless Inference offers per-token pricing on 100+ models, including DeepSeek-V4-Pro and GPT-5.4-nano, billed only for requests served.
GMI Cloud is best suited for teams that want to choose between owning the throughput math on dedicated GPUs and buying a fixed per-token rate through managed inference. You can check current rates and model pricing at gmicloud.ai/en/pricing and docs.gmicloud.ai.
Best-Fit Guidance
- Best for high, steady volume: self-hosted H100 or H200, where high utilization drives per-token cost down.
- Best for thin or spiky traffic: managed per-token APIs that avoid idle GPU cost.
- Best for accuracy-tolerant models: FP8 quantization plus continuous batching, the two largest throughput levers.
- Not ideal: self-hosting at low utilization, where the formula's denominator collapses and unit cost rises.
The Denominator Is Where the Savings Live
The hourly rate is the number on the invoice, but tokens per second is the number that decides what you actually pay per million. Batch aggressively within your latency budget, quantize where accuracy allows, and pick an efficient model, and the same GPU rate yields a unit cost many times lower. Start from the formula, fill in your real throughput, and let the denominator, not the rate card, tell you which option is cheapest for your volume.
Colin Mo
