A Higher Hourly Rate Can Be the Cheaper Way to Run Inference, and the Reason Is Throughput per Dollar
April 13, 2026
A team compares two GPUs, sees one costs twice as much per hour, and stops there. Then it benchmarks both on the same model and finds the expensive card finishes the same volume of work in less than half the time. The hourly rate is the price of renting the hardware, but the cost of inference is the price of completing the work, and a higher-bandwidth accelerator can lower that second number even when it raises the first. This article explains why accelerator-optimized instances change the cost math, how to compute throughput per dollar, and when paying more per hour actually saves total spend.
Why Hourly Rate and Total Cost Are Different Numbers
Inference is work measured in tokens, images, or requests. The hourly rate prices time, not work. The two only line up when throughput is held constant, which it almost never is across GPU tiers.
A higher-bandwidth GPU moves model weights from memory to compute faster, and for memory-bound decoding that translates directly into more tokens per second. If a card costs more per hour but raises output per hour by a larger factor than it raises the rate, the cost per unit of work falls. If output rises only in proportion to the rate, the cost per unit stays the same.
The number that matters is throughput per dollar: how much work you get for each dollar of GPU time. It is the only figure that lets you compare a cheap card and an expensive one on equal terms.
How to Compute Throughput per Dollar
The calculation is simple once you have a throughput measurement for your actual model.
- Measure tokens per second (or images per minute) for your model on each GPU.
- Multiply tokens per second by 3,600 (or a per-minute figure by 60) to get work per GPU-hour.
- Divide work per hour by the hourly rate to get work per dollar.
The GPU with the most work per dollar is the cheapest for that workload, regardless of which has the lower sticker rate. A card at $4.00/hr that produces three times the throughput of a $2.00/hr card costs twice as much per hour but delivers 1.5 times the work per dollar.
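The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a benchmarking tool: the function names are hypothetical, and the 1,000 and 3,000 tokens-per-second figures are placeholder measurements chosen to match the $2.00 versus $4.00 example, not results from any real GPU.

```python
def work_per_dollar(tokens_per_second: float, hourly_rate_usd: float) -> float:
    """Tokens produced per dollar of GPU time."""
    if tokens_per_second < 0 or hourly_rate_usd <= 0:
        raise ValueError("throughput must be >= 0 and hourly rate must be > 0")
    tokens_per_hour = tokens_per_second * 3600  # work per GPU-hour
    return tokens_per_hour / hourly_rate_usd    # work per dollar


def cost_per_million_tokens(tokens_per_second: float, hourly_rate_usd: float) -> float:
    """Dollars needed to produce one million tokens at full utilization."""
    return 1_000_000 / work_per_dollar(tokens_per_second, hourly_rate_usd)


# Placeholder measurements (assumed, not benchmarked).
baseline = work_per_dollar(tokens_per_second=1000.0, hourly_rate_usd=2.00)
premium = work_per_dollar(tokens_per_second=3000.0, hourly_rate_usd=4.00)
ratio = premium / baseline

print(f"baseline: {baseline:,.0f} tokens per dollar")
print(f"premium:  {premium:,.0f} tokens per dollar")
print(f"premium delivers {ratio:.2f}x the work per dollar")
```

Swapping in your own measured throughput for the placeholder values is the whole exercise; the arithmetic does not change.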
This is why a generic ranking by hourly price misleads. It compares time, not the work the time produces.
The same logic explains why vendor benchmarks rarely settle the question. A throughput number measured on a different model, at a different batch size, or with a different inference engine does not transfer to your workload. Tokens per second moves with the model architecture, the precision, the context length, and the concurrency you actually run. The only throughput figure that belongs in your cost calculation is the one you measure on your own model under your own traffic, which is why the method matters more than any published comparison.
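A measurement harness for this can be very small. The sketch below assumes a hypothetical `generate` callable that sends one prompt to your inference engine and returns the number of output tokens it produced; that callable, the function name, and the `clock` parameter are illustrative and not part of any particular library. It runs prompts one at a time, so it understates what a batched or concurrent server can do, and a production benchmark would replay traffic at your real concurrency instead.

```python
import time
from typing import Callable, Iterable


def measure_tokens_per_second(
    generate: Callable[[str], int],
    prompts: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Time a run of prompts and return output tokens per second.

    `generate` is a stand-in for a call into your own inference stack and
    must return the count of tokens generated for the given prompt.
    """
    start = clock()
    total_tokens = 0
    for prompt in prompts:
        total_tokens += generate(prompt)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive; run more prompts")
    return total_tokens / elapsed
```

Run it with the same prompts, precision, and context lengths on each GPU tier, and feed the resulting tokens-per-second figures into the cost comparison.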
Where Accelerator-Optimized Instances Earn Their Premium
Accelerator-optimized GPUs raise the hourly rate to buy more memory bandwidth, more capacity, and newer-architecture precision support. Each of these can lift throughput enough to lower cost per unit.
| GPU | Memory bandwidth | VRAM | GMI Cloud price | Where the premium pays off |
|---|---|---|---|---|
| NVIDIA H100 SXM5 | 3.35 TB/s | 80GB HBM3 | $2.00/GPU-hour | Balanced 7B to 70B serving baseline |
| NVIDIA H200 SXM5 | 4.80 TB/s | 141GB HBM3e | $2.60/GPU-hour | Long context, large batch, fewer cards for same load |
| NVIDIA B200 | 8.0 TB/s | 180GB HBM3e | $4.00/GPU-hour | High-throughput serving, newer-architecture precision |
GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. Reading the table by bandwidth rather than price changes the conclusion:
- The H200's step up from 80GB and 3.35 TB/s to 141GB and 4.80 TB/s can let one card absorb a batch that previously needed two H100s, and one H200 at $2.60/hr costs less than two H100s at $2.00/hr each ($4.00/hr combined).
- The B200 at 8.0 TB/s is built for sustained high throughput, where its output per hour can drive cost per token below lower-rate cards.
- GMI Cloud's bare metal instances run with no hypervisor between the workload and the GPU, so virtualization overhead does not reduce the memory bandwidth, and therefore the throughput, that the premium pays for.
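The prices in the table also fix how much faster each premium card must be before it breaks even with the H100. The sketch below computes that break-even speedup from the listed rates and compares it with the ratio of listed memory bandwidths. The last function rests on a loud assumption: that throughput scales exactly with memory bandwidth, which is an idealization for purely memory-bound decoding and will not hold precisely for a real workload. The function names are hypothetical.

```python
# Prices and bandwidths copied from the table above; H100 is the baseline.
GPUS = {
    "H100": {"rate_usd": 2.00, "bandwidth_tbs": 3.35},
    "H200": {"rate_usd": 2.60, "bandwidth_tbs": 4.80},
    "B200": {"rate_usd": 4.00, "bandwidth_tbs": 8.0},
}


def break_even_speedup(gpu: str, baseline: str = "H100") -> float:
    """Throughput multiple over the baseline at which cost per unit is equal."""
    return GPUS[gpu]["rate_usd"] / GPUS[baseline]["rate_usd"]


def bandwidth_ratio(gpu: str, baseline: str = "H100") -> float:
    """Listed memory bandwidth relative to the baseline card."""
    return GPUS[gpu]["bandwidth_tbs"] / GPUS[baseline]["bandwidth_tbs"]


def idealized_work_per_dollar_gain(gpu: str, baseline: str = "H100") -> float:
    # ASSUMPTION: throughput scales exactly with memory bandwidth.
    # Real gains depend on model, batch size, precision, and engine.
    return bandwidth_ratio(gpu, baseline) / break_even_speedup(gpu, baseline)


for name in ("H200", "B200"):
    print(
        f"{name}: needs {break_even_speedup(name):.2f}x H100 throughput to break even, "
        f"bandwidth ratio {bandwidth_ratio(name):.2f}x, "
        f"idealized work-per-dollar gain {idealized_work_per_dollar_gain(name):.2f}x"
    )
```

Under that idealization the H200 needs 1.30x the H100's throughput to break even and has about 1.43x the bandwidth, while the B200 needs 2.00x and has about 2.39x. Whether your workload actually reaches those multiples is what the benchmark has to show.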
GMI Cloud is best suited for AI teams whose inference volume is high enough to keep an accelerator-optimized GPU saturated, where throughput per dollar, not the hourly rate, decides the bill. You can confirm current tier pricing at gmicloud.ai/en/pricing and review GPU specs at docs.gmicloud.ai before benchmarking.
When the Cheaper Hourly Rate Still Wins
A higher rate is not always the better buy, and the boundary is worth drawing. Accelerator-optimized instances pay off when the workload is large enough or busy enough to keep the extra bandwidth saturated. They lose when the GPU sits underused.
A small model at low traffic cannot generate enough work to convert extra bandwidth into savings. In that case the higher-bandwidth card spends most of its time idle at a higher rate, and the cheaper GPU wins on both numbers.
Throughput per dollar only favors the premium card when utilization is high enough that the cheaper card, or a small number of them, could not carry the same load. Below that break-even point, the lower hourly rate is also the lower total cost.
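One way to see where that boundary sits is to price the dedicated cards needed to serve a given demand. In the sketch below, the per-card throughputs (1,000 and 3,000 tokens per second) are placeholder values, not measurements, and the function name is hypothetical. Because dedicated capacity is billed per card-hour whether or not it is busy, the bill depends on how many cards the demand forces you to rent.

```python
import math


def hourly_fleet_cost(demand_tps: float, per_card_tps: float, hourly_rate_usd: float) -> float:
    """Hourly cost of enough dedicated cards to serve `demand_tps`.

    At least one card is always rented, even if demand is tiny.
    """
    cards = max(1, math.ceil(demand_tps / per_card_tps))
    return cards * hourly_rate_usd


# Placeholder cards (assumed throughput, not benchmarked).
CHEAP = {"per_card_tps": 1000.0, "hourly_rate_usd": 2.00}
PREMIUM = {"per_card_tps": 3000.0, "hourly_rate_usd": 4.00}

for demand in (500.0, 1500.0, 3000.0):
    cheap_cost = hourly_fleet_cost(demand, **CHEAP)
    premium_cost = hourly_fleet_cost(demand, **PREMIUM)
    print(f"demand {demand:>6.0f} tok/s: cheap ${cheap_cost:.2f}/hr, premium ${premium_cost:.2f}/hr")
```

With these placeholder numbers the cheaper card wins at 500 tokens per second, the two tie at 1,500, and the premium card wins at 3,000, where one of them replaces three cheaper cards.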
This is where serverless and dedicated capacity diverge in cost behavior, and the distinction is worth keeping clear. A dedicated high-bandwidth GPU billed by the hour rewards steady, high utilization, because you pay for every hour whether or not it is busy. Serverless inference that scales to zero changes the calculation for bursty traffic, since you stop paying when the GPU is idle. The premium tier wins on dedicated capacity only when you can keep it saturated; for spiky workloads, the elasticity of serverless can matter more than the raw throughput of any single card.
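The trade-off between the two billing models reduces to a break-even busy fraction. The sketch below assumes a simplified serverless model that charges a flat rate per busy GPU-hour and nothing while idle; the $8.00 serverless rate is a made-up placeholder, not a GMI Cloud price, and real serverless pricing is often per token or per request and may include cold-start costs. The function names are hypothetical.

```python
def dedicated_cost(hours: float, hourly_rate_usd: float) -> float:
    """Dedicated capacity bills every hour, busy or idle."""
    return hours * hourly_rate_usd


def serverless_cost(hours: float, busy_fraction: float, busy_hour_rate_usd: float) -> float:
    """Simplified serverless model: pay only for the busy share of the time."""
    if not 0.0 <= busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be between 0 and 1")
    return hours * busy_fraction * busy_hour_rate_usd


def break_even_busy_fraction(dedicated_rate_usd: float, serverless_busy_rate_usd: float) -> float:
    """Busy fraction above which dedicated capacity becomes cheaper."""
    return dedicated_rate_usd / serverless_busy_rate_usd


DEDICATED_RATE = 4.00        # dedicated rate in $/GPU-hour (B200 row of the table)
SERVERLESS_BUSY_RATE = 8.00  # ASSUMPTION: hypothetical $/busy GPU-hour

threshold = break_even_busy_fraction(DEDICATED_RATE, SERVERLESS_BUSY_RATE)
print(f"dedicated wins above {threshold:.0%} busy time")

for busy in (0.25, 0.75):
    d = dedicated_cost(24, DEDICATED_RATE)
    s = serverless_cost(24, busy, SERVERLESS_BUSY_RATE)
    print(f"busy {busy:.0%}: dedicated ${d:.2f}/day, serverless ${s:.2f}/day")
```

With these placeholder rates, traffic that keeps the GPU busy a quarter of the day is cheaper on serverless, and traffic that keeps it busy three quarters of the day is cheaper on dedicated capacity.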
Matching the Tier to the Workload
The reliable rule is to let measured throughput, not the rate card, choose the tier.
- Best for balanced production serving: H100, when 80GB and 3.35 TB/s keep the model fed without overspending.
- Best for long context or high concurrency: H200, where extra bandwidth lets fewer cards carry more load.
- Best for sustained high-throughput serving: B200, when output per hour drives cost per unit down.
- Not ideal for small models at low traffic: any premium tier, where the bandwidth you pay for sits idle.
Price the Work, Not the Hour
The hourly rate is the input that is easiest to compare and the most misleading on its own. The number that decides cost is throughput per dollar, and you only get it by benchmarking your model on each tier. Measure the work first, then divide by the rate. The instance that looks expensive by the hour is frequently the cheapest by the token, and the only way to prove it is to run the workload and count the output.
Colin Mo
