Independent Speed Data for 2026: LLM Inference Speed Benchmarks on TTFT, Throughput, and Cost

May 28, 2026

Provider speed claims are mostly marketing benchmarks run under ideal conditions. The numbers on a landing page came from a warm instance, a short prompt, and one concurrent user.

Production tells a different story. Your real users hit cold pods, send 8K-token RAG contexts, and arrive in bursts that flatten throughput curves. The gap between marketing TPS and your p95 isn't rounding error. It's a budget risk that turns a "sub-second" chat into a three-second wait.

This article shows you how to read independent benchmarks, which four metrics predict user experience, and how to run a load test that survives your traffic.

The Four Metrics That Actually Matter

Before you compare providers, agree on what you're measuring. Most marketing tables conflate three different things and skip the one that breaks budgets.

Metric	What It Measures	Why It Matters
TTFT (Time to First Token)	Latency from request to first generated token	Chat UX. Sub-second is the bar users notice
Output throughput	Tokens streamed per second after the first one	Drives total time for long generations
Total request latency	TTFT + (output_tokens / throughput)	The real number for batch jobs and doc processing
Cost per million tokens	Dollars per 1M input tokens and per 1M output tokens (usually priced separately)	The pricing axis across providers
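
To make the total request latency row concrete, here is a minimal sketch of that formula in Python. The function name and the sample numbers are illustrative assumptions, not measurements from any provider.

```python
def total_request_latency(ttft_s: float, output_tokens: int, throughput_tps: float) -> float:
    """Total latency = TTFT + output_tokens / throughput, as in the table row."""
    if throughput_tps <= 0:
        raise ValueError("throughput_tps must be positive")
    return ttft_s + output_tokens / throughput_tps


# Hypothetical numbers: same 0.4 s TTFT, same 500-token answer,
# but two different streaming speeds.
fast_stream = total_request_latency(0.4, 500, 250.0)  # 0.4 s wait + 2 s of streaming
slow_stream = total_request_latency(0.4, 500, 50.0)   # 0.4 s wait + 10 s of streaming
print(f"fast: {fast_stream:.1f} s, slow: {slow_stream:.1f} s")
```

Two endpoints with identical TTFT can differ by several seconds on a long generation, which is why chat and batch workloads should not be scored on the same metric.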

Why p50 lies and p95 doesn't

Most providers publish p50 (median) latency. That's the experience half your users beat. Production cares about p95 and p99, the slow tail that fills your support queue. A provider with a 0.4-second p50 and a 4-second p99 will look great in a demo and break in production. Always ask for the tail.
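
A small sketch of why the median hides the tail, using the nearest-rank percentile definition. The latency samples are made up for illustration, and other percentile definitions (for example, interpolated ones) would give slightly different values.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical TTFT samples in seconds: 90 fast requests, 10 slow ones.
ttft_samples = [0.4] * 90 + [4.0] * 10

p50 = percentile(ttft_samples, 50)  # 0.4: the median looks healthy
p95 = percentile(ttft_samples, 95)  # 4.0: the tail does not
p99 = percentile(ttft_samples, 99)  # 4.0
print(p50, p95, p99)
```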

What Independent Benchmarks Actually Measure

Three public sources are worth reading, and the first two do the work providers won't. Each has a different lens, and you'll need all three to triangulate.

Artificial Analysis (artificialanalysis.ai): runs continuous, public latency and throughput tests against major hosted LLM APIs. Useful for cross-provider comparison on the same model.
MLPerf Inference (mlcommons.org): hardware-level benchmarks submitted by vendors under audited conditions. Useful for understanding what a GPU class can do, less useful for picking an API.
Provider-published numbers: treat as upper bounds, not expected values. They're real measurements, just on cherry-picked conditions.

The fourth source is your own load test. It's the only one that uses your prompts, your concurrency, your region, and your model size. Public benchmarks narrow the candidate list. Custom tests pick the winner.

The 2026 Provider Speed Landscape

Different providers optimize for different points on the speed curve. Here's how the field shapes up, with the methodology you'd use to verify any claim.

Provider	Speed Position (per public benchmarks)	What to Verify
Groq	LPU architecture, top output throughput on small-class models	TTFT under your prompt distribution, not just demo
Cerebras	CS-3 wafer-scale, high throughput for select models	Model availability for your use case
Fireworks AI	Strong throughput across hosted open models	Tail latency under burst load
Together AI	Wide model catalog, competitive throughput	Regional latency, sustained vs burst
DeepInfra	Cost-leaning, broader model selection	p95 latency, not just headline TPS
GMI Cloud	Inference Engine with 100+ models, per-request billing	Cross-check on Artificial Analysis when listed
OpenAI / Anthropic / Google	Proprietary models, vertically integrated stacks	Tier-dependent rate limits, region routing

Reading the table: Groq and Cerebras lead headline throughput numbers on the models they host, per Artificial Analysis. Fireworks, Together, and DeepInfra serve broader open-model catalogs and compete on a TTFT + cost-per-token tradeoff. Proprietary-model APIs (OpenAI, Anthropic, Google) bundle latency with model quality, so you're rarely comparing speed alone.

Don't pick from this table. Use it to narrow the shortlist, then test.

Engineering Reality: What Marketing Benchmarks Don't Show

Public numbers come from clean conditions. Your traffic doesn't. Here's what bites in production.

Load profile mismatch. Marketing TPS uses a single user, fixed prompt length, and warm instance. Your traffic has a Zipfian prompt distribution, concurrent sessions, and cold-start spikes. Throughput drops 30 to 60 percent under realistic mixes.
Regional variance. A provider's "300 ms TTFT" is often us-east-1 to us-east-1. Add a Singapore client and you're at 600 to 900 ms before the model touches the prompt.
Time-of-day variance. Peak hours (10am to 4pm PT for US providers) compress capacity. Off-peak benchmarks routinely beat peak by 20 to 40 percent on the same endpoint.
Warm vs cold instances. First-token latency on a cold pod can be 5 to 10x a warm one. If your usage isn't constant, you'll see this every morning.
Setting up your own load test. Use k6, locust, or vegeta. Simulate 50 to 200 concurrent users, sample real prompt lengths from your logs, sustain 10 minutes per test cell (one provider at one concurrency level), and run three times across the day. Record p50, p95, and p99 for TTFT, output TPS, and full request latency.
What not to trust in marketing tables. Single-prompt TPS, "up to" framing, and any latency number without a region and concurrency disclosed.
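
Whichever load tool you use, the per-request measurement comes down to timestamping a token stream. The sketch below is one way to do that in Python, under stated assumptions: `chunks` is a lazy iterator from whatever streaming client you use (the request is sent when iteration begins), and each chunk is counted as one token, which is only an approximation for most APIs. `StreamTiming` and `measure_stream` are hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class StreamTiming:
    ttft_s: float      # request start to first chunk
    output_tps: float  # chunks per second after the first one
    total_s: float     # request start to last chunk
    tokens: int


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> StreamTiming:
    """Time a streaming response. Assumes one chunk is roughly one token."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in chunks:
        last = clock()
        if first is None:
            first = last
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_s = last - first
    # Throughput is measured after the first token, matching the metric table.
    tps = (count - 1) / decode_s if decode_s > 0 else 0.0
    return StreamTiming(first - start, tps, last - start, count)


# Deterministic demo with a fake clock instead of a real API call.
fake_times = iter([0.0, 0.5, 0.6, 0.7, 0.8, 0.9])
timing = measure_stream(["a", "b", "c", "d", "e"], clock=lambda: next(fake_times))
print(timing)
```

Injecting the clock keeps the function testable offline. In a real run you would collect one `StreamTiming` per request and compute percentiles across the whole test cell.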

If a provider can't tell you their p95 under 100 concurrent users in your region, you don't have a benchmark. You have a brochure.

How to Run a Defensible Speed Comparison

Five steps move you from marketing pages to a decision.

Define your traffic shape. Pull a week of production logs. Compute prompt-length p50/p95, concurrency p50/p95, and output-length distribution.
Pick three to five candidates. Start with Artificial Analysis for headline numbers on your target model class (GPT mini class, Gemini Flash class, DeepSeek's reasoning models, or whatever you're shopping for).
Run identical tests. Same prompts, same concurrency, same region client, same 10-minute sustain. Vary nothing else.
Score on the metric that matches your workload. Chat apps weight TTFT. Batch pipelines weight total latency. Cost-sensitive workloads weight $/M tokens at p95.
Validate cost. Multiply the input and output token counts you observed by published per-token pricing, then divide by the number of successful requests. The cheapest sticker price often loses on $/successful-request when you factor in retries.
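
The first step, defining your traffic shape, can be sketched with the standard library. The log records and field names (`prompt_tokens`, `output_tokens`) are hypothetical, so map them to whatever your logging actually emits. Concurrency percentiles need request start and end timestamps and are left out of this sketch.

```python
from statistics import quantiles


def shape(values):
    """Return p50 and p95 using interpolated percentiles over the observed range."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94]}


# Hypothetical log sample: mostly short prompts plus one long RAG context.
logs = [
    {"prompt_tokens": 100, "output_tokens": 40},
    {"prompt_tokens": 200, "output_tokens": 60},
    {"prompt_tokens": 300, "output_tokens": 80},
    {"prompt_tokens": 400, "output_tokens": 120},
    {"prompt_tokens": 8000, "output_tokens": 900},
]

prompt_shape = shape([r["prompt_tokens"] for r in logs])
output_shape = shape([r["output_tokens"] for r in logs])
print("prompt:", prompt_shape)
print("output:", output_shape)
```

A wide gap between p50 and p95 prompt length is the signal that a fixed-length benchmark prompt will not represent your traffic.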

If GMI Cloud's Inference Engine is on your shortlist, you'd benchmark it the same way: pick a model from the 100+ catalog at gmicloud.ai, run the same load profile you use against other providers, and compare on TTFT, throughput, and per-request cost.

When to Trust a Benchmark, and When to Re-Run It

Independent benchmarks age fast. Providers ship optimizations, traffic patterns shift, and new hardware lands. A six-month-old Artificial Analysis chart is a starting point, not an endpoint.

Re-run your own tests when: you change models, you cross 10x your previous load, you expand into a new region, or a provider announces a major stack update. Otherwise, sample your production latency continuously through OpenTelemetry traces and alert on p95 drift.
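
The drift check itself can be very small. This sketch assumes you already export a p95 latency value per time window from your tracing backend. The 20 percent tolerance is an arbitrary placeholder, not a recommendation, and `p95_drifted` is a hypothetical helper name.

```python
def p95_drifted(baseline_p95_s: float, current_p95_s: float, tolerance: float = 0.20) -> bool:
    """True when the current p95 exceeds the baseline by more than the tolerance ratio."""
    if baseline_p95_s <= 0:
        raise ValueError("baseline_p95_s must be positive")
    return (current_p95_s - baseline_p95_s) / baseline_p95_s > tolerance


# Hypothetical windows: baseline p95 of 1.0 s from your last load test.
steady = p95_drifted(1.0, 1.1)   # 10 percent slower, within tolerance
drifted = p95_drifted(1.0, 1.5)  # 50 percent slower, worth an alert and a re-test
print(steady, drifted)
```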

For teams running open-model inference at scale, GMI Cloud's Inference Engine exposes per-request billing and pre-deployed model endpoints, which simplifies side-by-side benchmarking against other hosted APIs. You'd still want to validate with your own load test before committing serious volume.

FAQ

What's the difference between TTFT and total latency?

TTFT is the wait before the first token streams. Total latency is TTFT plus the time to stream every remaining token. Chat UX lives or dies on TTFT, while batch jobs and document processing care about total. Pick the metric that matches your workload.

Why do provider benchmarks always look better than my tests?

Marketing tests use a single concurrent user, short prompts, warm instances, and same-region clients. Your production hits multi-user load, long contexts, cold pods, and cross-region routing. The 30 to 60 percent gap is expected, not a bug.

Is Artificial Analysis a reliable source?

Artificial Analysis publishes continuous, methodology-disclosed tests across major LLM APIs and is among the most cited public sources in 2026. It's a strong starting point for cross-provider comparison. You should still re-test under your own conditions before picking a primary provider.

Should I optimize for cheapest cost per million tokens?

Not directly. Headline cost-per-token is a sticker price. The number that matters is dollars per successful request at your p95 latency target. A cheaper provider that fails 5 percent of requests under load can cost more than a slightly pricier one that holds at p99, once you pay for the retries.
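
A rough sketch of that comparison follows. It assumes failed attempts are billed in full and retried until they succeed, which may not match a given provider's billing, and every price and failure rate below is hypothetical.

```python
def cost_per_successful_request(input_tokens: int, output_tokens: int,
                                usd_per_m_input: float, usd_per_m_output: float,
                                failure_rate: float) -> float:
    """Expected spend per successful request when failed attempts are billed and retried."""
    if not 0 <= failure_rate < 1:
        raise ValueError("failure_rate must be in [0, 1)")
    per_attempt = (input_tokens * usd_per_m_input
                   + output_tokens * usd_per_m_output) / 1_000_000
    # Expected attempts per success is 1 / (1 - failure_rate).
    return per_attempt / (1 - failure_rate)


# Hypothetical request shape: 8,000 input tokens, 500 output tokens.
cheap_sticker = cost_per_successful_request(8000, 500, 0.20, 0.60, failure_rate=0.05)
pricier_sticker = cost_per_successful_request(8000, 500, 0.21, 0.62, failure_rate=0.0)
print(f"cheap sticker:   ${cheap_sticker:.5f} per success")
print(f"pricier sticker: ${pricier_sticker:.5f} per success")
```

Under this model a 5 percent failure rate adds only about 5 percent to effective cost, so it flips the ranking only when sticker prices are close. Timeouts, user-visible errors, and retry latency are separate costs this sketch does not capture.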

Source note: Speed claims should be cross-checked against Artificial Analysis (artificialanalysis.ai) and MLPerf Inference (mlcommons.org/benchmarks/inference-datacenter). For GMI Cloud Inference Engine model availability and per-request pricing, check gmicloud.ai/pricing for current rates.

Colin Mo
