
MLPerf Inference Numbers Describe a Tuned Lab Run, Not the Throughput You Will See in Production

April 13, 2026

A team reads that an H100 system posted tens of thousands of tokens per second on an MLPerf submission, sizes its fleet against that figure, and then watches real traffic deliver a fraction of it. The gap is not a vendor lie. MLPerf measures a carefully tuned configuration under fixed conditions, and your production workload almost never matches those conditions. MLPerf is a reliable way to compare GPUs against each other, and an unreliable way to predict your own throughput without adjusting for batch size, sequence length, and serving scenario. This article explains what the MLPerf Inference numbers actually report, how to translate a leaderboard result into a planning estimate for your own model, and which GPU classes map to the submission tiers most teams care about.

What MLPerf Inference Actually Measures

MLPerf Inference is a benchmark suite from MLCommons. It defines a set of reference models and tasks, then measures how fast a submitted hardware and software system completes them under standardized rules. The result is comparable across vendors because everyone runs the same task definition.

Two details decide whether a number transfers to your situation.

The first is the scenario. MLPerf reports several, and they are not interchangeable.

  • Offline measures maximum throughput when all queries are available at once. This is the largest, most quotable number, and it assumes batching freedom you rarely have with live traffic.
  • Server measures throughput while meeting a latency bound on Poisson-distributed arrivals. It is closer to an API serving pattern and almost always lower than Offline.
  • Single-stream and multi-stream report latency rather than throughput, covering the latency-sensitive paths relevant to edge and interactive use.
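
To see why a latency bound pulls the Server figure below the Offline one, here is a toy sketch, not the MLPerf load generator. It assumes a single FIFO server with a fixed service time and no batching; the function name, the 10 ms service time, and the arrival rates are illustrative assumptions.

```python
import random

def p99_latency(arrival_rate, service_time, n_queries=20000, seed=0):
    """Toy model: one FIFO server, Poisson arrivals, fixed service time.

    Returns the 99th-percentile latency (queueing plus service) in seconds.
    """
    rng = random.Random(seed)
    arrival = 0.0       # arrival time of the current query
    server_free = 0.0   # time at which the server finishes its current work
    latencies = []
    for _ in range(n_queries):
        # Exponential gaps between arrivals give a Poisson arrival process.
        arrival += rng.expovariate(arrival_rate)
        start = max(arrival, server_free)
        server_free = start + service_time
        latencies.append(server_free - arrival)
    latencies.sort()
    return latencies[int(0.99 * (len(latencies) - 1))]

SERVICE_TIME = 0.010  # assumed 10 ms per query, so the offline ceiling is 100 queries/s

for rate in (50, 80, 95):
    print(f"{rate} queries/s -> p99 latency {p99_latency(rate, SERVICE_TIME) * 1000:.1f} ms")
```

In this model the tail latency climbs steeply as the arrival rate approaches the offline ceiling, so the highest rate that still meets a latency bound sits below that ceiling.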

The second is the model and sequence length. A throughput figure is tied to a specific reference model at a specific input and output length. Change the model size or push the context window and the number moves, often by a lot, because longer sequences enlarge the key-value cache and shift the workload from compute-bound to memory-bound.
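
The KV cache growth can be put in numbers with a short sketch. It uses the standard per-token accounting for transformer attention (one key and one value vector per layer); the model shape below is a hypothetical example resembling a large model with grouped-query attention, not a figure from any MLPerf submission.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: each layer stores one key vector and one value vector per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def kv_cache_gib(seq_len, batch_size, **shape):
    """Total KV cache in GiB for a batch of sequences of equal length."""
    total_bytes = seq_len * batch_size * kv_cache_bytes_per_token(**shape)
    return total_bytes / 2**30

# Hypothetical shape: 80 layers, 8 KV heads, head dimension 128, 16-bit cache.
SHAPE = dict(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

for seq_len in (1024, 4096, 16384):
    print(f"seq_len={seq_len}: {kv_cache_gib(seq_len, 16, **SHAPE):.1f} GiB at batch 16")
```

The cache grows linearly with both sequence length and batch size, which is why a longer context shrinks the batch a GPU can hold and moves the number.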

How to Translate a Leaderboard Result Into Your Own Estimate

A published tokens-per-second figure is a starting anchor, not a forecast. Three adjustments turn it into something you can plan against.

Match the Scenario to Your Traffic

If you serve a live API, read the Server result, not the Offline one. Offline throughput describes a batch job with no latency constraint. Using it to size an interactive endpoint overstates your capacity, sometimes by a wide margin.

Normalize for Sequence Length

MLPerf reports its language tasks at defined input and output lengths. If your prompts and completions are longer, expect lower throughput, because each request occupies memory longer and the KV cache grows. Scale your estimate down in rough proportion to how much your total sequence length exceeds the benchmark's.
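
A minimal sketch of that proportional scaling, assuming a simple inverse-linear relationship; real degradation depends on the model and serving stack, and the function name and example values are illustrative.

```python
def scale_for_sequence_length(benchmark_tps, benchmark_seq_len, your_seq_len):
    """Scale a benchmark tokens/s figure down for a longer total sequence length.

    Assumes throughput falls in rough inverse proportion to total
    (input + output) sequence length. Shorter sequences are not scaled up,
    so the benchmark figure stays a ceiling.
    """
    if your_seq_len <= benchmark_seq_len:
        return benchmark_tps
    return benchmark_tps * benchmark_seq_len / your_seq_len

# Hypothetical: a 10,000 tokens/s result at 2,048 total tokens, serving 4,096.
print(scale_for_sequence_length(10_000, 2_048, 4_096))
```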

Normalize for Precision and Batch

Submissions are tuned: optimal batch size, the lowest precision the task permits, a serving stack tuned for that exact run. If your production stack runs higher precision or smaller batches to hold latency down, your effective throughput drops below the headline. Treat the leaderboard number as a ceiling and budget below it.

A practical rule: take the Server-scenario result at a sequence length near yours, then discount it for the precision and batch reality of your deployment. The remaining figure is a planning estimate, not a guarantee.
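
The rule can be written as one small function. This is a sketch under stated assumptions: the two discount factors are hypothetical placeholders you would replace with ratios measured on your own stack, and the sequence-length term uses the rough inverse-proportional scaling described earlier.

```python
def planning_estimate(server_tps, benchmark_seq_len, your_seq_len,
                      precision_factor, batch_factor):
    """Turn a Server-scenario tokens/s result into a planning estimate.

    precision_factor and batch_factor are fractions in (0, 1] describing how
    much throughput your precision and batch settings retain relative to the
    tuned submission. Both are assumptions to be measured, not MLPerf outputs.
    """
    for name, factor in (("precision_factor", precision_factor),
                         ("batch_factor", batch_factor)):
        if not 0 < factor <= 1:
            raise ValueError(f"{name} must be in (0, 1], got {factor}")
    # Never scale up: the published result is treated as a ceiling.
    seq_factor = min(1.0, benchmark_seq_len / your_seq_len)
    return server_tps * seq_factor * precision_factor * batch_factor

# Hypothetical inputs: 10,000 tokens/s at 2,048 tokens, serving 4,096 tokens,
# retaining 70% for precision and 80% for batch size.
estimate = planning_estimate(10_000, 2_048, 4_096,
                             precision_factor=0.7, batch_factor=0.8)
print(f"planning estimate: about {estimate:.0f} tokens/s")
```

Keeping each discount as a separate argument makes it obvious which assumption to revisit when measured throughput disagrees with the estimate.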

Mapping MLPerf Tiers to Rentable GPUs

MLPerf submissions cluster around a few GPU classes. The table below pairs each class with the inference profile it fits and the rate at which you can actually rent it, so a leaderboard result maps to a line item rather than a spec you cannot access.

GPU | VRAM | Memory bandwidth | MLPerf submission profile | GMI Cloud price
NVIDIA H100 SXM5 | 80GB HBM3 | 3.35 TB/s | Baseline modern inference tier, broad submission coverage | $2.00/GPU-hour
NVIDIA H200 SXM5 | 141GB HBM3e | 4.80 TB/s | Higher throughput on memory-bound, long-sequence tasks | $2.60/GPU-hour
NVIDIA B200 | 180GB HBM3e | 8.0 TB/s | Newer-architecture precision, top single-node results | $4.00/GPU-hour

A few readings worth making explicit:

  • H100 is the comparison baseline. Most published results normalize against it, so it is the easiest anchor for a first estimate.
  • H200 widens the gap on memory-bound tasks. Its higher bandwidth and larger VRAM lift throughput most where the KV cache dominates, which is exactly where headline-to-production gaps are widest.
  • B200 leads single-node throughput through newer-architecture precision support, but only if your stack actually uses the low-precision formats that produce that gain.
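
To turn the table into a line item, the hourly prices can be converted to a cost per million tokens. The prices below are the ones listed in the table; the throughput value is a hypothetical planning estimate, and the function names are illustrative.

```python
# Hourly prices from the table above, in USD per GPU-hour.
PRICE_PER_GPU_HOUR = {"H100": 2.00, "H200": 2.60, "B200": 4.00}

def cost_per_million_tokens(price_per_gpu_hour, tokens_per_second):
    """USD per one million generated tokens on a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_gpu_hour / tokens_per_hour * 1_000_000

def break_even_speedup(gpu, baseline="H100"):
    """Throughput multiple over the baseline at which cost per token is equal."""
    return PRICE_PER_GPU_HOUR[gpu] / PRICE_PER_GPU_HOUR[baseline]

# Hypothetical planning estimate of 2,000 tokens/s on one H100.
h100_cost = cost_per_million_tokens(PRICE_PER_GPU_HOUR["H100"], 2_000)
print(f"H100: ${h100_cost:.3f} per million tokens")

for gpu in ("H200", "B200"):
    print(f"{gpu} must deliver {break_even_speedup(gpu):.2f}x H100 throughput to break even")
```

At these prices a pricier class only lowers cost per token if your own workload, not the leaderboard, shows a speedup above the break-even multiple.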

Why the Same GPU Reports Different Numbers Across Platforms

A benchmark result belongs to a system, not just a chip. The same H100 can post different production throughput depending on the layer between the silicon and your model. Virtualized instances can lose a slice of memory bandwidth to hypervisor overhead, and most of the decoding phase of inference is memory-bound, so that slice shows up directly as fewer tokens per second.
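
A back-of-the-envelope sketch shows why lost bandwidth maps so directly onto lost tokens. It assumes decoding is fully bandwidth-bound and that each decode step reads the model weights once; the 70 GB weight footprint and the 5% bandwidth loss are hypothetical values chosen for illustration, not measurements.

```python
H100_BANDWIDTH_BYTES_PER_S = 3.35e12  # 3.35 TB/s, from the table in this article

def decode_ceiling_tokens_per_s(bandwidth_bytes_per_s, bytes_read_per_step,
                                batch_size=1):
    """Upper bound on decode throughput when memory bandwidth is the limit.

    Each decode step streams bytes_read_per_step from memory and produces
    one token per sequence in the batch.
    """
    return batch_size * bandwidth_bytes_per_s / bytes_read_per_step

WEIGHT_BYTES = 70e9  # hypothetical: about 70 GB of weights read per step

full = decode_ceiling_tokens_per_s(H100_BANDWIDTH_BYTES_PER_S, WEIGHT_BYTES)
reduced = decode_ceiling_tokens_per_s(0.95 * H100_BANDWIDTH_BYTES_PER_S, WEIGHT_BYTES)

print(f"full bandwidth:    {full:.1f} tokens/s per sequence")
print(f"5% less bandwidth: {reduced:.1f} tokens/s per sequence")
```

Because the ceiling is linear in bandwidth, any fraction of bandwidth lost below the model removes the same fraction of the decode ceiling.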

This is where the rented hardware has to match the benchmark's assumptions. GMI Cloud's bare metal GPU instances run with no hypervisor, delivering 100% of the advertised memory bandwidth that MLPerf-style throughput depends on. A leaderboard number measured on bare metal only transfers if you run on bare metal too.

Where to Run the GPU Class You Picked

Once MLPerf has told you which GPU class fits your model and latency target, the open question is where to run it without rebuilding your stack as traffic grows.

GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. The H100, H200, and B200 classes are available at the prices listed, validated against NVIDIA Reference Architecture and backed by a 99.99% platform availability SLA.

The platform keeps two needs separate, because conflating them is how MLPerf estimates go wrong in practice:

  • Serverless inference fits variable, API-based traffic where arrivals are unpredictable and scale-to-zero avoids paying for idle GPUs. This is the Server-scenario world.
  • Dedicated GPU clusters and bare metal fit sustained, high-throughput jobs where you can batch aggressively and approach Offline-style numbers.

GMI Cloud is best suited for AI teams that have used a benchmark to choose a GPU class and now need to deploy it at the same hardware fidelity the benchmark assumed. You can confirm current rates and the model library at gmicloud.ai/en/pricing and console.gmicloud.ai, and the documentation at docs.gmicloud.ai covers the serving stack.

Best for and Not Ideal for, Read Through MLPerf

  • Best for sizing modern LLM serving: H100, the common baseline most submissions normalize against.
  • Best for long-context or large-batch throughput: H200, where extra bandwidth absorbs a heavy KV cache.
  • Best for top single-node results with newer precision: B200, when your stack uses the formats it accelerates.
  • Not ideal for predicting interactive latency from Offline numbers: any GPU, because Offline throughput ignores the latency bound your users feel.

Read the Scenario Before You Read the Score

The headline number on an MLPerf submission is the most tuned figure in the report, and the least likely to match your traffic. The useful move is to find the Server-scenario result at a sequence length near yours, discount it for your precision and batch reality, and only then map it to a rentable GPU class. A benchmark earns its value when you read it as a comparison between chips, not as a promise about your invoice.

Colin Mo
