The H100 Cost Per Million Tokens Only Beats an API After You Cross a Utilization Line Most Teams Underestimate
April 13, 2026
A team compares a published API rate against the hourly cost of an H100, sees the GPU looks cheaper per token at full tilt, and commits to self-hosting. Then real traffic arrives in bursts, the card sits half-idle, and the per-token math inverts. Self-hosting on an H100 is cheaper per million tokens than a managed API only above a throughput threshold; below it, the idle GPU time makes the API the lower-cost option. This article works the cost-per-million-tokens math for an H100 against API pricing, finds where the break-even sits, and shows why utilization, not the hourly rate, is the variable that decides the answer.
Why Per-Token Cost Is a Function of Utilization
An API charges you per token. You pay for exactly what you consume, and idle time costs nothing. A self-hosted H100 charges you per hour. The GPU bills the same whether it is saturated or idle, so your cost per token depends entirely on how many tokens you push through each hour.
This is the asymmetry the comparison usually misses. The API rate is fixed per token. The self-hosted rate per token is the hourly GPU cost divided by tokens produced that hour. At high throughput, that division yields a small number. At low throughput, the same numerator divided by few tokens yields a large one.
So there is no single "H100 cost per million tokens." There is a curve. The hourly rate sets the numerator; your sustained throughput sets where you land on the curve.
Working the H100 Break-Even
Start from the hourly cost and a throughput figure. An H100 at $2.00 per GPU-hour, serving a model at a sustained throughput you measure rather than assume, produces some number of tokens per hour. Divide the hourly cost by that hourly output, expressed in millions of tokens, to get cost per million tokens.
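As a minimal sketch of that division, the Python below converts an hourly rate and a sustained throughput into a cost per million tokens. The function name and the four throughput values are illustrative assumptions, not measurements of any model on an H100; only the $2.00 rate comes from this article.

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """Self-hosted cost per million tokens at a given sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("sustained throughput must be positive")
    tokens_per_hour = tokens_per_second * 3600
    # Cost per token, scaled up to one million tokens.
    return hourly_rate_usd / tokens_per_hour * 1_000_000


H100_HOURLY_USD = 2.00  # rate quoted in this article

# Hypothetical sustained throughputs (tokens/second), for illustration only.
for tps in (100, 500, 2000, 5000):
    cost = cost_per_million_tokens(H100_HOURLY_USD, tps)
    print(f"{tps:>5} tok/s -> ${cost:.2f} per million tokens")
```

With these assumed inputs, the same $2.00 card costs about $5.56 per million tokens at 100 tokens per second and about $0.11 at 5,000.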
The pattern, not a single quoted figure, is what matters:
- At high sustained throughput, the $2.00 hourly cost spreads across millions of tokens, and the per-million cost falls below typical API output rates.
- At low or bursty throughput, the same $2.00 spreads across far fewer tokens, and the per-million cost climbs above API rates.
- The break-even is the throughput at which self-hosted per-million cost equals the API's blended per-million price, meaning its input and output rates weighted by your token mix.
The table sets the H100 self-hosting anchor beside two managed model rates from the GMI Cloud library, so the comparison is between the same provider's self-hosted and API-priced paths rather than across mismatched sources.
| Path | Unit of billing | Rate | Cost behavior |
|---|---|---|---|
| Self-hosted H100 SXM5 | Per GPU-hour | $2.00/GPU-hour | Hourly bill is fixed regardless of idle time, so per-token cost falls as throughput rises |
| DeepSeek-V4-Pro (API) | Per million input tokens | $1.39/M input | Pay-per-token, zero cost when idle |
| GPT-5.4-mini (API) | Per million tokens | $0.40/M input, $2.50/M output | Pay-per-token, zero cost when idle |
The reading: GMI Cloud's H100 SXM5 at $2.00 per GPU-hour only undercuts a per-token API rate once your sustained throughput is high enough to divide that hourly cost across enough tokens. A model like DeepSeek-V4-Pro at $1.39 per million input tokens or GPT-5.4-mini at $0.40 per million input tokens sets the per-token bar that self-hosting has to beat through utilization.
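To make that bar concrete, the sketch below solves for the sustained throughput at which the $2.00 hourly rate matches each per-token rate in the table. The function name is hypothetical, and comparing one flat API rate against total tokens per second is a simplification, since real traffic mixes input and output tokens.

```python
def break_even_tokens_per_second(hourly_rate_usd: float, api_usd_per_million: float) -> float:
    """Sustained throughput at which self-hosting matches an API price."""
    if api_usd_per_million <= 0:
        raise ValueError("API price must be positive")
    # Millions of tokens per hour that the hourly rate buys at the API price.
    million_tokens_per_hour = hourly_rate_usd / api_usd_per_million
    return million_tokens_per_hour * 1_000_000 / 3600


H100_HOURLY_USD = 2.00

# Per-million rates taken from the table above.
API_RATES_USD_PER_MILLION = {
    "DeepSeek-V4-Pro input": 1.39,
    "GPT-5.4-mini input": 0.40,
    "GPT-5.4-mini output": 2.50,
}

for name, rate in API_RATES_USD_PER_MILLION.items():
    tps = break_even_tokens_per_second(H100_HOURLY_USD, rate)
    print(f"{name:<24} break-even at about {tps:,.0f} tok/s sustained")
```

Under these assumptions the break-even lands near 400 tokens per second against the $1.39 rate, about 1,389 against $0.40, and about 222 against $2.50. Those figures are sustained averages over every billed hour, not peak throughput.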
The Costs the Hourly Rate Leaves Out
Two clarifications keep the break-even honest. The first changes which API rate you should compare against; the second pushes the threshold higher than the raw rate suggests.
The first is the difference between input and output tokens. API pricing usually charges output tokens at a higher rate than input, and output is the memory-bound, slower-to-generate side. A self-hosting comparison that uses only an input rate understates what the API charges for generation-heavy workloads, which makes the API look cheaper than it is. Those same workloads also run at lower self-hosted throughput, so both sides of the comparison should use your actual input and output mix.
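One way to account for the input and output split is a blended rate weighted by the share of output tokens. The sketch below uses the GPT-5.4-mini rates from the table; the function name and the three output shares are assumptions chosen for illustration.

```python
def blended_api_usd_per_million(
    input_usd_per_million: float,
    output_usd_per_million: float,
    output_fraction: float,
) -> float:
    """API price per million tokens, weighted by the share of output tokens."""
    if not 0.0 <= output_fraction <= 1.0:
        raise ValueError("output_fraction must be between 0 and 1")
    input_fraction = 1.0 - output_fraction
    return input_fraction * input_usd_per_million + output_fraction * output_usd_per_million


# GPT-5.4-mini rates from the table above.
INPUT_RATE, OUTPUT_RATE = 0.40, 2.50

# Hypothetical output shares: retrieval-heavy, mixed, generation-heavy.
for output_fraction in (0.10, 0.25, 0.50):
    blended = blended_api_usd_per_million(INPUT_RATE, OUTPUT_RATE, output_fraction)
    print(f"{output_fraction:.0%} output tokens -> ${blended:.3f} per million blended")
```

With these inputs the blended rate moves from $0.610 at 10% output to $1.450 at 50% output, which is why the input rate alone is a poor stand-in for generation-heavy traffic.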
The second is that the hourly GPU rate is not your only self-hosting cost. Throughput depends on your serving stack being tuned, on the model fitting in VRAM with its KV cache, and on the GPU running at full bandwidth. A virtualized instance that loses bandwidth to hypervisor overhead produces fewer tokens per hour, which raises your real cost per million and pushes break-even further out.
Those caveats also mark a boundary worth drawing. Self-hosting and managed API inference solve different problems. A managed API fits variable or low-volume traffic where you would rather pay per token than babysit utilization. Self-hosting fits sustained, high-volume, predictable traffic where you can keep the GPU saturated and amortize the hourly rate. Choosing self-hosting for spiky traffic is the most common way teams end up above the API on cost while believing they are below it.
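The effect of spiky traffic can be sketched by treating utilization as the fraction of each billed hour the GPU spends serving at its peak rate, with the rest idle. That model, the function names, and the 2,000 tokens-per-second peak are all assumptions for illustration; the $2.00 and $1.39 figures come from this article.

```python
def effective_cost_per_million(
    hourly_rate_usd: float, peak_tokens_per_second: float, utilization: float
) -> float:
    """Cost per million tokens when the GPU is busy only part of each hour."""
    if peak_tokens_per_second <= 0 or not 0.0 < utilization <= 1.0:
        raise ValueError("need positive peak throughput and 0 < utilization <= 1")
    tokens_per_hour = peak_tokens_per_second * 3600 * utilization
    return hourly_rate_usd / tokens_per_hour * 1_000_000


def break_even_utilization(
    hourly_rate_usd: float, peak_tokens_per_second: float, api_usd_per_million: float
) -> float:
    """Busy fraction needed to match an API price; above 1.0 means never."""
    saturated_cost = hourly_rate_usd / (peak_tokens_per_second * 3600) * 1_000_000
    return saturated_cost / api_usd_per_million


H100_HOURLY_USD = 2.00
PEAK_TPS = 2000  # hypothetical peak throughput, not a measured figure

for utilization in (1.0, 0.5, 0.2, 0.1):
    cost = effective_cost_per_million(H100_HOURLY_USD, PEAK_TPS, utilization)
    print(f"{utilization:.0%} busy -> ${cost:.2f} per million tokens")

needed = break_even_utilization(H100_HOURLY_USD, PEAK_TPS, 1.39)
print(f"Busy fraction needed to match $1.39/M: {needed:.1%}")
```

Under these assumptions a card that costs about $0.28 per million tokens when saturated costs about $2.78 when it is busy 10% of the time, and it needs to stay busy roughly 20% of every billed hour just to match the $1.39 rate.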
Where Both Sides of the Break-Even Live
The useful platform is one where you can run either side of this decision without re-architecting when your traffic profile changes.
GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. The same platform exposes both paths in this comparison: managed per-token model APIs through serverless inference, and self-hosted H100 capacity through dedicated and bare metal GPUs at $2.00 per GPU-hour.
Two facts decide which side you should be on:
- Serverless inference bills per request and scales to zero, so below the break-even throughput you are not paying for idle GPUs.
- Bare metal H100 with no hypervisor delivers 100% of advertised memory bandwidth, which lifts tokens per hour and therefore lowers the throughput needed to beat an API rate.
GMI Cloud is best suited for AI teams that want to start on per-token APIs and move specific high-volume workloads to self-hosted H100s once utilization clears the break-even, without changing providers. You can compare per-token model rates and the $2.00 H100 rate side by side at gmicloud.ai/en/pricing and console.gmicloud.ai, with serving details at docs.gmicloud.ai.
Best for and Not Ideal for the Self-Hosted H100
- Best for sustained, high-volume generation: self-hosted H100, where steady throughput drives per-million cost below API rates.
- Best for variable or low-volume traffic: managed API models like GPT-5.4-mini or DeepSeek-V4-Pro, where you pay only for tokens used.
- Not ideal for bursty workloads: self-hosted H100, where idle hours inflate the real per-million cost.
- Not ideal for teams that cannot measure their sustained throughput: self-hosting, because the break-even is undefined without that number.
Find Your Throughput Before You Pick a Side
The H100 cost per million tokens is not a fixed figure you can look up; it is the hourly rate divided by the throughput you actually sustain. Measure your real tokens per hour at your real traffic shape, divide the $2.00 rate by that figure in millions of tokens, and compare the result against the blended API rate for the model you would otherwise call. The side that wins is decided by your utilization, not by the hourly sticker, and the honest comparison starts from your own throughput number.
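A minimal sketch of that measurement, assuming you can export per-request token counts with completion timestamps from your serving logs. The record shape, function names, and sample data are hypothetical; the key choice is dividing by the whole billed window, idle time included, rather than by busy time.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    finished_at_s: float  # completion time in seconds from any fixed reference
    tokens: int           # input plus output tokens processed for the request


def sustained_tokens_per_hour(
    records: list[RequestRecord], window_start_s: float, window_end_s: float
) -> float:
    """Tokens served per wall-clock hour over the window, idle time included."""
    if window_end_s <= window_start_s:
        raise ValueError("window must have positive length")
    total_tokens = sum(
        r.tokens for r in records if window_start_s <= r.finished_at_s < window_end_s
    )
    window_hours = (window_end_s - window_start_s) / 3600
    return total_tokens / window_hours


def self_hosted_usd_per_million(hourly_rate_usd: float, tokens_per_hour: float) -> float:
    if tokens_per_hour <= 0:
        raise ValueError("no tokens served in the window")
    return hourly_rate_usd / tokens_per_hour * 1_000_000


# Hypothetical two-hour window: a burst in the first minutes, then mostly idle.
records = [
    RequestRecord(finished_at_s=60, tokens=400_000),
    RequestRecord(finished_at_s=120, tokens=500_000),
    RequestRecord(finished_at_s=6000, tokens=100_000),
]

tokens_per_hour = sustained_tokens_per_hour(records, 0, 7200)
cost = self_hosted_usd_per_million(2.00, tokens_per_hour)
print(f"{tokens_per_hour:,.0f} tokens/hour -> ${cost:.2f} per million tokens")
```

In this made-up window the GPU serves one million tokens across two billed hours, so the sustained rate is 500,000 tokens per hour and the self-hosted cost works out to $4.00 per million.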
Colin Mo
