Independent Speed Data for 2026: LLM Inference Speed Benchmarks on TTFT, Throughput, and Cost
May 28, 2026
Provider speed claims are mostly marketing benchmarks run under ideal conditions. The numbers on a landing page came from a warm instance, a short prompt, and one concurrent user.
Production tells a different story. Your real users hit cold pods, send 8K-token RAG contexts, and arrive in bursts that flatten throughput curves. The gap between marketing TPS and your p95 isn't rounding error. It's a budget risk that turns a "sub-second" chat into a three-second wait.
This article shows you how to read independent benchmarks, which four metrics predict user experience, and how to run a load test that survives your traffic.
The Four Metrics That Actually Matter
Before you compare providers, agree on what you're measuring. Most marketing tables conflate three different things and skip the one that breaks budgets.
| Metric | What It Measures | Why It Matters |
|---|---|---|
| TTFT (Time to First Token) | Latency from request to first generated token | Chat UX. Sub-second is the bar users notice |
| Output throughput | Tokens streamed per second after the first one | Drives total time for long generations |
| Total request latency | TTFT + (output_tokens / throughput) | The real number for batch jobs and doc processing |
| Cost per million tokens | Dollars per 1M input + 1M output tokens | The pricing axis across providers |
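
The total-latency row is simple arithmetic, and working it through shows why a low TTFT alone doesn't settle a comparison. Here is a minimal sketch; the function name and the two provider profiles are hypothetical, and the numbers are illustrative, not measurements.

```python
def total_request_latency(ttft_s: float, output_tokens: int, throughput_tps: float) -> float:
    """Estimate end-to-end latency: TTFT plus the time to stream the output."""
    if throughput_tps <= 0:
        raise ValueError("throughput_tps must be positive")
    return ttft_s + output_tokens / throughput_tps


# Two made-up provider profiles answering the same 800-token request.
# Numbers are illustrative assumptions, not benchmark results.
fast_start = total_request_latency(ttft_s=0.3, output_tokens=800, throughput_tps=80.0)
fast_stream = total_request_latency(ttft_s=0.9, output_tokens=800, throughput_tps=400.0)

print(f"fast TTFT, slow stream: {fast_start:.1f} s")
print(f"slow TTFT, fast stream: {fast_stream:.1f} s")
```

In this example the provider with the worse TTFT finishes the long generation several seconds sooner, which is why batch workloads should score on total latency rather than first-token time.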
Why p50 lies and p95 doesn't
Most providers publish p50 (median) latency. That's the experience half your users beat. Production cares about p95 and p99, the slow tail that fills your support queue. A provider with a 0.4-second p50 and a 4-second p99 will look great in a demo and break in production. Always ask for the tail.
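To see how a median can hide a bad tail, here is a small sketch using a nearest-rank percentile over made-up latency samples. The sample mix (90 fast requests, 10 slow ones) is an assumption chosen to mirror the 0.4-second and 4-second figures above, and nearest-rank is only one of several percentile definitions a benchmark might use.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 100 hypothetical request latencies in seconds: 90 fast, 10 slow.
latencies_s = [0.4] * 90 + [4.0] * 10

summary = {p: percentile(latencies_s, p) for p in (50, 95, 99)}
for p, value in summary.items():
    print(f"p{p}: {value} s")
```

The median reports 0.4 seconds even though one request in ten takes ten times longer, and only the p95 and p99 rows expose that.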
What Independent Benchmarks Actually Measure
Three public sources do the work providers won't. Each has a different lens, and you'll need all three to triangulate.
- Artificial Analysis (artificialanalysis.ai): runs continuous, public latency and throughput tests against major hosted LLM APIs. Useful for cross-provider comparison on the same model.
- MLPerf Inference (mlcommons.org): hardware-level benchmarks submitted by vendors under audited conditions. Useful for understanding what a GPU class can do, less useful for picking an API.
- Provider-published numbers: treat as upper bounds, not expected values. They're real measurements, just taken under cherry-picked conditions.
The fourth source is your own load test. It's the only one that uses your prompts, your concurrency, your region, and your model size. Public benchmarks narrow the candidate list. Custom tests pick the winner.
The 2026 Provider Speed Landscape
Different providers optimize for different points on the speed curve. Here's how the field shapes up, with the methodology you'd use to verify any claim.
| Provider | Speed Position (per public benchmarks) | What to Verify |
|---|---|---|
| Groq | LPU architecture, top output throughput on small-class models | TTFT under your prompt distribution, not just demo |
| Cerebras | CS-3 wafer-scale, high throughput for select models | Model availability for your use case |
| Fireworks AI | Strong throughput across hosted open models | Tail latency under burst load |
| Together AI | Wide model catalog, competitive throughput | Regional latency, sustained vs burst |
| DeepInfra | Cost-leaning, broader model selection | p95 latency, not just headline TPS |
| GMI Cloud | Inference Engine with 100+ models, per-request billing | Cross-check on Artificial Analysis when listed |
| OpenAI / Anthropic / Google | Proprietary models, vertically integrated stacks | Tier-dependent rate limits, region routing |
Reading the table: Groq and Cerebras lead headline throughput numbers on the models they host, per Artificial Analysis. Fireworks, Together, and DeepInfra serve broader open-model catalogs and compete on a TTFT + cost-per-token tradeoff. Proprietary-model APIs (OpenAI, Anthropic, Google) bundle latency with model quality, so you're rarely comparing speed alone.
Don't pick from this table. Use it to narrow the shortlist, then test.
Engineering Reality: What Marketing Benchmarks Don't Show
Public numbers come from clean conditions. Your traffic doesn't. Here's what bites in production.
- Load profile mismatch. Marketing TPS uses a single user, fixed prompt length, and warm instance. Your traffic has a Zipfian prompt distribution, concurrent sessions, and cold-start spikes. Throughput drops 30 to 60 percent under realistic mixes.
- Regional variance. A provider's "300 ms TTFT" is often us-east-1 to us-east-1. Add a Singapore client and you're at 600 to 900 ms before the model touches the prompt.
- Time-of-day variance. Peak hours (10am to 4pm PT for US providers) compress capacity. Off-peak benchmarks routinely beat peak by 20 to 40 percent on the same endpoint.
- Warm vs cold instances. First-token latency on a cold pod can be 5 to 10x a warm one. If your usage isn't constant, you'll see this every morning.
- Setting up your own load test. Use `k6`, `locust`, or `vegeta`. Send 50 to 200 concurrent users, sample real prompt lengths from your logs, sustain 10 minutes per cell, and run three times across the day. Record p50, p95, p99 TTFT, output TPS, and full request latency.
- What not to trust in marketing tables. Single-prompt TPS, "up to" framing, and any latency number without a region and concurrency disclosed.
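
Whichever load tool drives the concurrency, each request still has to be timed the same way. The sketch below shows one way to record TTFT, output TPS, and full request latency for a single streamed response. It is not the API of k6, locust, or vegeta: `measure_stream` is a hypothetical helper, and it assumes your client exposes the response as an iterable that yields tokens as they arrive.

```python
import time
from typing import Callable, Dict, Iterable


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Time one streamed response from request start to last token."""
    start = clock()
    first_at = None
    last_at = start
    count = 0
    for _ in tokens:
        now = clock()
        if first_at is None:
            first_at = now  # first generated token arrived
        last_at = now
        count += 1
    if first_at is None:
        raise ValueError("stream produced no tokens")
    stream_time = last_at - first_at
    # Output throughput counts tokens streamed after the first one.
    tps = (count - 1) / stream_time if stream_time > 0 else 0.0
    return {
        "ttft_s": first_at - start,
        "output_tps": tps,
        "total_s": last_at - start,
        "tokens": count,
    }


# Demo with a fake clock so the result is deterministic (assumed timings).
ticks = iter([0.0, 0.5, 0.75, 1.0, 1.25, 1.5])
demo = measure_stream(["a", "b", "c", "d", "e"], clock=lambda: next(ticks))
print(demo)
```

In a real run you would call this once per request from each simulated user, collect the dictionaries, and compute p50, p95, and p99 over each field.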
If a provider can't tell you their p95 under 100 concurrent users in your region, you don't have a benchmark. You have a brochure.
How to Run a Defensible Speed Comparison
Five steps move you from marketing pages to a decision.
- Define your traffic shape. Pull a week of production logs. Compute prompt-length p50/p95, concurrency p50/p95, and output-length distribution.
- Pick three to five candidates. Start with Artificial Analysis for headline numbers on your target model class (small-class GPT mini, Gemini Flash class, DeepSeek's reasoning models, or whatever you're shopping).
- Run identical tests. Same prompts, same concurrency, same region client, same 10-minute sustain. Vary nothing else.
- Score on the metric that matches your workload. Chat apps weight TTFT. Batch pipelines weight total latency. Cost-sensitive workloads weight $/M tokens at p95.
- Validate cost. Multiply observed input and output token counts by published per-token pricing. The cheapest sticker price often loses on $/successful-request when you factor in retries.
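
The cost step above can be sketched as a short calculation. This version rests on two loud assumptions: failed attempts are billed in full, and you retry until one attempt succeeds, so the expected number of attempts is 1 / success_rate. Real billing and retry policies vary, and the prices and success rates below are invented for illustration.

```python
def cost_per_successful_request(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
    success_rate: float,
) -> float:
    """Expected dollars spent per request that eventually succeeds."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000
    # Assumption: failures are billed in full and retried until success.
    return cost_per_attempt / success_rate


# Hypothetical providers serving an 8,000-token prompt with a 500-token answer.
cheap = cost_per_successful_request(8000, 500, 0.20, 0.80, success_rate=0.95)
pricey = cost_per_successful_request(8000, 500, 0.205, 0.82, success_rate=0.999)

print(f"cheaper sticker price: ${cheap:.6f} per successful request")
print(f"pricier sticker price: ${pricey:.6f} per successful request")
```

With these made-up inputs the provider with the lower sticker price ends up slightly more expensive per successful request, before counting the latency cost of the retries themselves.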
If GMI Cloud's Inference Engine is on your shortlist, you'd benchmark it the same way: pick a model from the 100+ catalog at gmicloud.ai, run the same load profile you use against other providers, and compare on TTFT, throughput, and per-request cost.
When to Trust a Benchmark, and When to Re-Run It
Independent benchmarks age fast. Providers ship optimizations, traffic patterns shift, and new hardware lands. A six-month-old Artificial Analysis chart is a starting point, not an endpoint.
Re-run your own tests when: you change models, you cross 10x your previous load, you expand into a new region, or a provider announces a major stack update. Otherwise, sample your production latency continuously through OpenTelemetry traces and alert on p95 drift.
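A p95 drift alert can be as simple as comparing a recent window against the p95 from your last load test. The sketch below is plain Python, not OpenTelemetry API code: it assumes you have already exported request latencies from your traces, and the 20 percent tolerance is an arbitrary placeholder you would tune.

```python
import math


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(95 * len(ordered) / 100))
    return ordered[rank - 1]


def p95_drifted(baseline_p95_s, recent_latencies_s, tolerance=0.20):
    """True when the recent window's p95 exceeds the baseline by more than
    the tolerance (0.20 means 20 percent; an assumed threshold)."""
    return p95(recent_latencies_s) > baseline_p95_s * (1 + tolerance)


# Hypothetical windows of 100 requests each, latencies in seconds.
healthy_window = [0.5] * 96 + [1.5] * 4
degraded_window = [0.5] * 90 + [1.5] * 10

healthy_alert = p95_drifted(1.0, healthy_window)
degraded_alert = p95_drifted(1.0, degraded_window)
print(healthy_alert, degraded_alert)
```

The healthy window has a few slow requests but stays under the threshold, while the degraded window pushes the p95 itself past it, which is the signal to re-run the full load test.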
For teams running open-model inference at scale, GMI Cloud's Inference Engine exposes per-request billing and pre-deployed model endpoints, which simplifies side-by-side benchmarking against other hosted APIs. You'd still want to validate with your own load test before committing serious volume.
FAQ
What's the difference between TTFT and total latency?
TTFT is the wait before the first token streams. Total latency is TTFT plus the time to stream every remaining token. Chat UX lives or dies on TTFT, while batch jobs and document processing care about total. Pick the metric that matches your workload.
Why do provider benchmarks always look better than my tests?
Marketing tests use a single concurrent user, short prompts, warm instances, and same-region clients. Your production hits multi-user load, long contexts, cold pods, and cross-region routing. The 30 to 60 percent gap is expected, not a bug.
Is Artificial Analysis a reliable source?
Artificial Analysis publishes continuous, methodology-disclosed tests across major LLM APIs and is the most cited public source in 2026. It's a strong starting point for cross-provider comparison. You should still re-test under your own conditions before picking a primary provider.
Should I optimize for cheapest cost per million tokens?
Not directly. Headline cost-per-token is a sticker price. The number that matters is dollars per successful request at your p95 latency target. A cheaper provider that fails 5 percent of requests under load costs more than a pricier one that holds at p99.
Source note: Speed claims should be cross-checked against Artificial Analysis (artificialanalysis.ai) and MLPerf Inference (mlcommons.org/benchmarks/inference-datacenter). For GMI Cloud Inference Engine model availability and per-request pricing, check gmicloud.ai/pricing for current rates.
Colin Mo
