A Llama 70B Throughput Number on an H100 Means Little Until You Pin Down What Was Being Measured

April 13, 2026

Someone quotes a tokens-per-second figure for Llama 70B on an H100, and a team uses it to model its inference bill. Then production traffic arrives and the real number is half of what was promised. The gap is not dishonesty. It is that throughput depends on quantization, batch size, context length, and whether the figure counts one stream or many in parallel. A throughput benchmark is only a cost basis if you know the precision, the batch size, and the concurrency it was measured under. This article unpacks what moves Llama 70B throughput on an H100, then shows how to turn a defensible number into a per-million-token cost.

Why One Number Is Never Enough

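Tokens per second is a single figure standing in for several decisions. Change any one of them and the number moves, sometimes by a factor of two or more. Before quoting or trusting a benchmark, four variables have to be named: precision, batch size, context length, and whether the figure describes one stream or the aggregate across many concurrent requests.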

Quantization Changes Both Speed and Fit

Llama 70B in FP16 needs roughly 140GB for weights, which does not fit on a single 80GB H100. In practice the model is served quantized. An FP8 version drops to around 70GB and fits on one card with room for a modest KV cache, and lower precision also raises effective throughput because there are fewer bytes to move per token. So almost every real H100 throughput number for Llama 70B is a quantized number, and quoting it without the precision is meaningless.
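The fit argument above is plain arithmetic, and a minimal sketch makes it easy to rerun for other precisions. The function name and the 80GB constant are illustrative choices, and the estimate covers weights only, so KV cache, activations, and runtime overhead still have to fit in whatever is left.

```python
H100_MEMORY_GB = 80  # capacity of the 80GB H100 discussed above


def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    Weights only: KV cache and activations are not included.
    """
    return params_billion * bytes_per_param


if __name__ == "__main__":
    for name, bytes_per_param in [("FP16", 2), ("FP8", 1)]:
        need = weight_memory_gb(70, bytes_per_param)
        headroom = H100_MEMORY_GB - need
        verdict = "fits" if headroom > 0 else "does not fit"
        print(f"{name}: {need:.0f} GB of weights, {verdict} on one card "
              f"(headroom {headroom:.0f} GB)")
```

The headroom figure is the part that matters for throughput: whatever remains after the weights is what the KV cache, and therefore the batch size, has to live in.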

Batch Size Trades Latency for Throughput

Inference throughput and latency pull in opposite directions. A single request gives the lowest latency but leaves the GPU underused. Batching many requests together raises total tokens per second sharply, because each pass over the model weights serves every request in the batch, but each individual request waits longer. A high aggregate throughput figure almost always reflects a large batch, which is the right metric for cost-per-token but the wrong one for a latency-sensitive chat experience.

Context Length Loads the Cache

Longer prompts grow the KV cache, which consumes memory that could otherwise hold a larger batch and bandwidth that would otherwise go to generating tokens. A throughput number measured at 512 tokens of context will be higher than the same setup measured at 8K context. The benchmark has to state the context length or it cannot be reproduced.
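To put rough numbers on how context length loads the cache, the sketch below estimates KV cache size per sequence. The architecture defaults (80 layers, 8 key/value heads, head dimension 128) are assumptions taken from the published Llama 2 and Llama 3 70B configurations, and the cache is assumed to be stored in 16-bit values. Check both against the model and serving stack actually in use.

```python
def kv_cache_bytes_per_token(layers: int = 80, kv_heads: int = 8,
                             head_dim: int = 128,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers.

    Defaults are assumed Llama 70B values with a 16-bit cache.
    The leading 2 counts one key and one value vector per layer.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gb(context_tokens: int, batch_size: int = 1) -> float:
    """Total KV cache in GB (1e9 bytes) for a batch of equal-length sequences."""
    return kv_cache_bytes_per_token() * context_tokens * batch_size / 1e9


if __name__ == "__main__":
    for context in (512, 2048, 8192):
        print(f"{context:>5} tokens: {kv_cache_gb(context):.2f} GB per sequence, "
              f"{kv_cache_gb(context, batch_size=16):.1f} GB at batch 16")
```

Under these assumptions an 8K-token sequence needs about 2.7GB of cache, so only a handful of such sequences fit in the headroom left after FP8 weights. That cap on batch size is one reason a long-context benchmark reports lower aggregate throughput.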

Reading Throughput as a Cost Basis

The value of a defensible throughput number is that it converts an hourly GPU rate into a per-token cost. The table below lists the inputs that drive that conversion.

| Variable | Typical setting for cost benchmarking | Effect on tokens/sec | What it determines |
| --- | --- | --- | --- |
| Precision | FP8 quantized | Higher than FP16 | Whether 70B fits on one 80GB H100 |
| Batch size | Large, throughput-oriented | Much higher aggregate | Cost per million tokens |
| Context length | Stated, e.g. 2K | Lower as it grows | Reproducibility of the number |
| GPU rate | $2.00/GPU-hour (H100) | n/a | The numerator of cost-per-token |

The conversion itself is simple once the inputs are fixed. Aggregate tokens per second times 3,600 gives tokens per GPU-hour. Divide the hourly rate by that, scaled to a million, and you have a cost per million tokens. At GMI Cloud's H100 rate of $2.00/GPU-hour, every doubling of effective throughput halves that cost, which is why batch size and quantization dominate the economics far more than the sticker rate does.
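The conversion described above can be sketched in a few lines. The function name is illustrative, and the throughput values in the loop are hypothetical placeholders chosen to show the doubling effect, not measured Llama 70B results. Only the $2.00/GPU-hour rate comes from the article.

```python
def cost_per_million_tokens(gpu_hourly_rate: float,
                            tokens_per_second: float) -> float:
    """Convert an hourly GPU rate and aggregate throughput to $ per 1M tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_gpu_hour = tokens_per_second * 3600
    return gpu_hourly_rate / tokens_per_gpu_hour * 1_000_000


if __name__ == "__main__":
    H100_RATE = 2.00  # $/GPU-hour, the rate quoted in the article
    # Hypothetical aggregate throughputs, for illustration only.
    for tps in (500, 1000, 2000):
        cost = cost_per_million_tokens(H100_RATE, tps)
        print(f"{tps:>5} tok/s -> ${cost:.3f} per million tokens")
```

Each doubling of the throughput input halves the output, while a change in the hourly rate only scales it linearly. That is the sense in which batch size and quantization outweigh the sticker price.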

An Open-Model Reference Point

For teams comparing self-hosted Llama 70B against managed model endpoints, an open-weight reference helps anchor expectations. DeepSeek-V4-Pro, available on GMI Cloud's serverless inference at $1.39 per million input tokens, is an MoE model that activates 49B of its 1.6T parameters per token and serves at roughly 55 to 60 tokens per second. It is a useful comparison class because it shows the alternative to renting a GPU and running the throughput math yourself: paying per token for a managed endpoint where the provider absorbs the batching and quantization decisions. The choice between them is the real decision behind any throughput benchmark.

The Boundary Between a Benchmark and a Bill

A throughput benchmark and a production cost are related but not the same, and treating them as identical is where budgets break. A benchmark is a peak figure measured under chosen conditions, usually large batches and short context. A production bill reflects real traffic, which includes idle time between requests, latency caps that limit batch size, and long prompts that load the cache. The benchmark sets the ceiling; utilization sets what you actually pay. A GPU billed by the hour only reaches its benchmarked cost-per-token when it stays busy, so bursty traffic with idle gaps pays more per real token than the benchmark implies.
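The gap between ceiling and bill can be approximated by dividing the benchmarked cost by the fraction of billed time the GPU spends doing useful work. The sketch below is a deliberately simple model, assuming that throughput while busy equals the benchmark figure and that idle time produces no tokens. The function name, the utilization values, and the 1,000 tokens-per-second figure are hypothetical.

```python
def effective_cost_per_million(gpu_hourly_rate: float,
                               benchmark_tokens_per_second: float,
                               utilization: float) -> float:
    """$ per 1M tokens actually served, given the busy fraction of billed time.

    utilization is in (0, 1]; 1.0 reproduces the benchmark cost.
    Assumes benchmark throughput whenever the GPU is busy.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_billed_hour = benchmark_tokens_per_second * 3600 * utilization
    return gpu_hourly_rate / tokens_per_billed_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical benchmark throughput; the hourly rate is the article's.
    for busy in (1.0, 0.5, 0.25):
        cost = effective_cost_per_million(2.00, 1000, busy)
        print(f"utilization {busy:>4.0%}: ${cost:.3f} per million tokens")
```

In practice latency caps and long prompts also lower throughput during busy periods, so this model likely understates the real cost. It is still useful as a first check on whether a dedicated card beats a per-token endpoint for a given traffic shape.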

Where to Run the Benchmark and the Workload

Once you have a throughput number you trust, the next question is where to run Llama 70B so the measured economics hold in production.

GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. The H100 at $2.00/GPU-hour is available now with CUDA 12.x, TensorRT-LLM, and vLLM preconfigured, validated against NVIDIA Reference Architecture and backed by a 99.99% platform availability SLA. GMI Cloud's bare metal H100 instances run with no hypervisor, so a benchmark measures the full advertised bandwidth rather than a virtualized fraction, which keeps the lab number and the production number close.

For traffic that does not stay busy enough to justify a dedicated card, GMI Cloud's serverless inference scales to zero, which protects the cost-per-token math from idle time. You can confirm current pricing and the model library at gmicloud.ai/en/pricing and console.gmicloud.ai.

Matching the Measurement to Your Real Traffic

The right way to use a Llama 70B throughput number depends on what your traffic looks like.

  • Best for steady, batchable traffic: a dedicated H100, where large batches push cost-per-token toward the benchmark.
  • Best for bursty or unpredictable traffic: serverless or a managed endpoint, where scale-to-zero beats an idle owned card.
  • Best for a quick self-hosting cost estimate: an FP8 70B benchmark at stated batch and context, converted with the $2.00 rate.
  • Not ideal for latency-critical single-stream chat: a throughput-maximized large-batch configuration, which trades per-request latency for aggregate speed.

Pin the Conditions Before You Trust the Number

A tokens-per-second figure for Llama 70B on an H100 is only as useful as the conditions attached to it. The reliable path is to fix precision, batch size, and context length first, derive cost-per-token from there, and then check your real utilization against the assumed busy time. The benchmark is the start of the estimate, not the end of it. The number that survives contact with production is the one whose conditions you wrote down.

Colin Mo
