GPU Cloud Inference Benchmark 2026: Tokens-Per-Second Per Dollar

April 13, 2026

Most GPU cloud comparisons stop at the hourly rate, which tells you what a card costs but not what it produces. For inference, the number that decides your bill is tokens per second per dollar: how much useful output each rented hour actually buys. The cheapest GPU per hour is rarely the cheapest per million tokens, because a faster card can finish more work in the same hour and a slower one can waste a low rate on low throughput. This article lays out a reproducible way to measure tokens-per-second-per-dollar across providers, shows how to normalize it so the ranking is fair, and works through the method on two GPU classes you can rent at known rates.

Why Tokens Per Second Per Dollar Is the Right Unit

Hourly price and raw tokens per second are each half of a number. A card at a low rate that serves few tokens per second can cost more per million tokens than a pricier card that serves many. The metric that combines them is simple to state:

tokens-per-second-per-dollar = sustained tokens per second / GPU hourly rate

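Invert it, convert the hour to seconds, and you get cost per token, which is what finance actually cares about: hourly rate / (sustained tokens per second × 3,600). Multiply by one million for the familiar cost-per-million-tokens figure. The point of the metric is to stop comparing rate cards and start comparing the output each rate card buys under a fixed workload.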
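As a minimal sketch of that arithmetic, the helpers below convert a sustained throughput and an hourly rate into the ratio above, tokens per dollar, and cost per million tokens. The function names are illustrative, and the 1,000 tokens/second figure in the usage lines is a made-up input, not a measured result.

```python
SECONDS_PER_HOUR = 3600


def tps_per_dollar(tokens_per_second: float, hourly_rate_usd: float) -> float:
    """The article's ratio: sustained tokens/second divided by the hourly rate."""
    if hourly_rate_usd <= 0:
        raise ValueError("hourly_rate_usd must be positive")
    return tokens_per_second / hourly_rate_usd


def tokens_per_dollar(tokens_per_second: float, hourly_rate_usd: float) -> float:
    """Tokens produced for each dollar of rental at the sustained rate."""
    return tps_per_dollar(tokens_per_second, hourly_rate_usd) * SECONDS_PER_HOUR


def cost_per_million_tokens(tokens_per_second: float, hourly_rate_usd: float) -> float:
    """Dollars spent to produce one million tokens at the sustained rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR
    return hourly_rate_usd / tokens_per_hour * 1_000_000


# Hypothetical input: 1,000 tokens/second sustained on a $2.00/GPU-hour card.
example_ratio = tps_per_dollar(1000, 2.00)
example_tokens = tokens_per_dollar(1000, 2.00)
example_cost = cost_per_million_tokens(1000, 2.00)
print(f"{example_ratio:.0f} tok/s per $/hour, "
      f"{example_tokens:,.0f} tokens per dollar, "
      f"${example_cost:.3f} per million tokens")
```

Reporting cost per million tokens next to the ratio tends to be easier to hand to a finance team, since it maps directly onto a bill.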

Building a Benchmark You Can Reproduce

A benchmark is only useful if someone else can rerun it and get the same ranking. That requires pinning everything except the variable you are testing.

Fix the model and precision. Serve the identical model at the identical quantization on every provider, since switching between FP8 and FP16 can change throughput as much as switching GPUs does.
Fix the inference engine and version. vLLM, TensorRT-LLM, and TGI produce different token rates on the same hardware, so the engine has to be constant.
Fix the workload shape. Input length, output length, and concurrency define the test. A long-context, high-batch run and a short-prompt, single-stream run rank GPUs differently.
Measure sustained, not peak. Warm up first, then record steady-state tokens per second over a fixed window, not the first-second burst.
Record utilization. A card billed by the hour only earns its rate when it is busy, so log how much of each hour the test actually loaded the GPU.

With those fixed, the only moving parts are the GPU class and its hourly rate, which is exactly what the metric is meant to isolate.
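The warm-up and steady-state steps above can be sketched as a small harness. This is an illustration under stated assumptions, not a specific engine's API: `run_batch` is a hypothetical callable you would write yourself to send one round of requests at the pinned concurrency and return the number of output tokens generated, and the `Workload` fields are simply the items the checklist says to pin.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Workload:
    """Everything held constant across providers (field names are illustrative)."""
    model: str
    precision: str
    engine: str
    engine_version: str
    input_tokens: int
    output_tokens: int
    concurrency: int


def sustained_tokens_per_second(
    run_batch: Callable[[], int],
    warmup_s: float,
    window_s: float,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Discard a warm-up period, then average throughput over a fixed window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")

    # Warm-up: run traffic but ignore the tokens, so caches, compiled kernels,
    # and batching queues reach steady state before measurement starts.
    warmup_end = clock() + warmup_s
    while clock() < warmup_end:
        run_batch()

    # Measurement: count output tokens until the window has fully elapsed.
    start = clock()
    tokens = 0
    while True:
        tokens += run_batch()
        elapsed = clock() - start
        if elapsed >= window_s:
            break
    return tokens / elapsed
```

The clock is injectable so the harness can be checked with a fake timer before it is pointed at a real endpoint. GPU utilization logging would sit alongside this loop and is not shown here.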

Working the Method on H100 and H200

To show the calculation rather than assert a winner, anchor it on two cards with known rates. GMI Cloud lists the H100 SXM5 at $2.00/GPU-hour and the H200 SXM5 at $2.60/GPU-hour.

GPU	VRAM	Memory bandwidth	GMI Cloud rate	What raises its tokens-per-dollar
NVIDIA H100 SXM5	80GB HBM3	3.35 TB/s	$2.00/GPU-hour	Lower rate, strong on 7B-70B at moderate context
NVIDIA H200 SXM5	141GB HBM3e	4.80 TB/s	$2.60/GPU-hour	Higher bandwidth and VRAM, strong on long context and large batch

The reading is workload-dependent, which is the whole point of measuring rather than guessing:

At short context and moderate batch, the H100's lower rate often wins on tokens-per-dollar, because the H200's extra bandwidth is underused.
At long context or high concurrency, the H200's 4.80 TB/s and 141GB can serve enough additional tokens per second to overtake the H100 on tokens-per-dollar despite the higher rate.
The crossover point is a measurement, not an opinion. It moves with model size, context length, and batch size, which is why a fixed-workload benchmark is the only honest comparison.
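One way to frame the crossover is as a break-even ratio: the pricier card wins on tokens-per-dollar only when its throughput advantage exceeds its price premium. The sketch below uses the two rates quoted above; the throughput numbers in the usage lines are hypothetical placeholders for your own measurements.

```python
H100_RATE_USD = 2.00  # $/GPU-hour, as quoted in this article
H200_RATE_USD = 2.60  # $/GPU-hour, as quoted in this article


def breakeven_speedup(cheaper_rate: float, pricier_rate: float) -> float:
    """Throughput multiple the pricier card needs to match tokens-per-dollar."""
    if cheaper_rate <= 0 or pricier_rate <= 0:
        raise ValueError("rates must be positive")
    return pricier_rate / cheaper_rate


def better_value(h100_tps: float, h200_tps: float) -> str:
    """Name the card with the higher tokens/second per dollar-hour."""
    h100_score = h100_tps / H100_RATE_USD
    h200_score = h200_tps / H200_RATE_USD
    if h100_score == h200_score:
        return "tie"
    return "H100" if h100_score > h200_score else "H200"


# At these rates the H200 must sustain 1.3x the H100's throughput to break even.
print(f"break-even speedup: {breakeven_speedup(H100_RATE_USD, H200_RATE_USD):.2f}x")

# Hypothetical measurements, for illustration only.
print(better_value(h100_tps=1000, h200_tps=1200))  # 1.2x speedup, below break-even
print(better_value(h100_tps=1000, h200_tps=1500))  # 1.5x speedup, above break-even
```

The 1.3x threshold follows directly from $2.60 / $2.00; whether a given workload clears it is what the benchmark has to establish.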

GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. With the H100 at $2.00/GPU-hour and the H200 at $2.60/GPU-hour available now, both denominators in the tokens-per-dollar calculation are known values you can plug into your own measured throughput.

The Deployment Layer Changes the Numerator

The benchmark assumes the GPU delivers its full bandwidth, and that assumption is not free. Virtualized instances can lose a slice of advertised bandwidth to hypervisor overhead, which lowers sustained tokens per second and therefore tokens-per-dollar on the same nominal card. GMI Cloud's bare metal GPU instances run with no hypervisor, so none of the memory bandwidth that token throughput depends on is lost to a virtualization layer. When you compare providers, confirm whether the rate buys bare metal or virtualized capacity, because that decides how much of the rated bandwidth reaches your inference engine.

A boundary worth drawing: a benchmark number is specific to its workload shape. A tokens-per-dollar ranking measured at 512-token prompts does not transfer to a 32K-token long-context workload. Publish the workload alongside the ranking, or the number is not reproducible.
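One possible way to keep the workload attached to the number is to publish each result as a single record. The sketch below is only a suggested shape: every field name is an assumption, and the model, engine, and throughput values are placeholders rather than measurements.

```python
import json

SECONDS_PER_HOUR = 3600


def benchmark_record(workload: dict, gpu: str, hourly_rate_usd: float,
                     sustained_tps: float, deployment: str) -> str:
    """Serialize a result together with the workload shape that produced it."""
    record = {
        "gpu": gpu,
        "deployment": deployment,  # e.g. "bare_metal" or "virtualized"
        "hourly_rate_usd": hourly_rate_usd,
        "sustained_tokens_per_second": sustained_tps,
        "tokens_per_dollar": sustained_tps * SECONDS_PER_HOUR / hourly_rate_usd,
        "workload": workload,
    }
    return json.dumps(record, indent=2, sort_keys=True)


# Placeholder values for illustration; substitute your own pinned settings.
example_workload = {
    "model": "example-model",
    "precision": "fp8",
    "engine": "example-engine",
    "engine_version": "0.0.0",
    "input_tokens": 512,
    "output_tokens": 128,
    "concurrency": 8,
}
print(benchmark_record(example_workload, "H100 SXM5", 2.00, 1000.0, "bare_metal"))
```

Anyone reading the record can see at a glance that a 512-token result says nothing about a 32K-token workload.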

Best Fit by Workload Shape

Best for short-to-moderate context serving on a budget: H100 at $2.00/GPU-hour, where the lower rate maximizes tokens-per-dollar.
Best for long context or high concurrency: H200 at $2.60/GPU-hour, where extra bandwidth and VRAM lift sustained throughput enough to justify the rate.
Best for reproducible cross-provider tests: any platform that exposes the GPU class and rate transparently, so the denominator is known.
Not ideal for one-number rankings across all workloads: a single benchmark, since the winner changes with context length and batch size.

GMI Cloud is best suited for AI teams that want to run their own tokens-per-dollar benchmark against known, available-now GPU rates rather than trusting a vendor's headline throughput claim. You can confirm current rates at gmicloud.ai/en/pricing before plugging them into the calculation.

Measure Your Workload, Then Rank the Hardware

A tokens-per-second-per-dollar benchmark is only as honest as the workload behind it. Fix the model, precision, engine, and traffic shape; measure sustained throughput on bare metal; divide by the known hourly rate; and let the ranking fall out of your numbers rather than a vendor's. The provider that wins on a generic benchmark may lose on yours, because your context length and concurrency are the variables that move the crossover. Run the method on your own traffic, and the cheapest card per token will be the one your workload actually rewards.

Colin Mo
