TTFT vs Tokens Per Second: LLM Inference Speed Metrics Explained

May 28, 2026

A model that generates 278 tokens per second can still make users wait 15 seconds before seeing any response. A model that delivers the first token in under 200 milliseconds might generate the rest of the response at a pace that feels slow to read.

These two metrics, Time to First Token (TTFT) and tokens per second (TPS), capture different parts of the latency experience, and the gap between them can be wide enough to invalidate a benchmark comparison if you only read one of the numbers.

This piece defines what each metric measures, shows where they diverge on real models, and maps the right metric to the right application type.

What TTFT and TPS Actually Measure

Time to First Token (TTFT) is the elapsed time between sending a request and receiving the first token of the response. It measures responsiveness: how quickly the model starts producing output. For a user typing a question into a chat interface, TTFT is the pause before the first word appears. The prefill phase (processing the input prompt and building the KV cache) drives TTFT; longer input prompts increase it because more prefill work happens before generation begins.

Tokens per second (TPS) measures generation speed after the first token arrives. It captures throughput: how fast the remaining response streams. A model at 278 TPS generates roughly 208 words per second. A model at 50 TPS generates roughly 37 words per second, still faster than most people read.

These metrics are largely independent. A model can have low TTFT (fast to start) but low TPS (slow to finish). A model can have high TPS but high TTFT, generating quickly once it starts, but making users wait before it begins.
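As a rough illustration of the two definitions, the sketch below derives both metrics from the arrival timestamps of one streamed response. The names `StreamMetrics` and `stream_metrics` are hypothetical, and the timestamps are made-up values, not measurements of any model.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_s: float  # seconds from sending the request to the first token
    tps: float     # tokens per second after the first token arrived


def stream_metrics(request_sent_s: float, token_times_s: Sequence[float]) -> StreamMetrics:
    """Compute TTFT and TPS from one request's token arrival timestamps."""
    if not token_times_s:
        raise ValueError("no tokens received")
    ttft = token_times_s[0] - request_sent_s
    # TPS only covers generation after the first token, so the first
    # token itself is excluded from the count and from the time window.
    decode_window = token_times_s[-1] - token_times_s[0]
    tokens_after_first = len(token_times_s) - 1
    tps = tokens_after_first / decode_window if decode_window > 0 else 0.0
    return StreamMetrics(ttft_s=ttft, tps=tps)


# Made-up trace: request sent at t=0, first token after 0.5 s,
# then one token every 20 ms for 100 more tokens.
times = [0.5 + 0.02 * i for i in range(101)]
m = stream_metrics(0.0, times)
print(f"TTFT={m.ttft_s:.2f}s  TPS={m.tps:.1f}")
```

Keeping the first token out of the TPS window is what makes the two numbers independent: a long pause before the stream starts changes TTFT and leaves TPS untouched.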

The practical reading threshold is approximately 10 tokens per second, equivalent to 450 words per minute. Above that, streaming output is fast enough that the human perception of lag shifts from TPS to TTFT. Below 50 TPS, responses can feel sluggish in interactive applications even with low TTFT.
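The word-rate figures above follow from a simple conversion. The sketch below assumes about 0.75 words per token, a ratio implied by the article's own numbers (278 TPS to roughly 208 words per second) rather than stated in it; the real ratio varies by tokenizer and language.

```python
# Assumption: ~0.75 words per token, inferred from the figures in the text.
WORDS_PER_TOKEN = 0.75


def tps_to_words_per_minute(tps: float, words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert a generation rate in tokens/second to words/minute."""
    return tps * words_per_token * 60


for tps in (10, 50, 278):
    wpm = tps_to_words_per_minute(tps)
    print(f"{tps:>3} TPS is about {wpm:,.0f} words per minute")
```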

For TTFT, under 1 second feels instant in chat interfaces. A common SLO target for production chat endpoints is stricter: a p95 TTFT under 200 milliseconds, meaning 95% of responses meet that threshold.

Why the Same Model Can Look Fast and Feel Slow

The clearest real-world example of TTFT and TPS diverging comes from reasoning models.

Gemini 3.5 Flash, released May 19, 2026, generates output at 278 tokens per second on the first-party API at its standard thinking level, ranking second among all measured models by TPS. At its highest thinking level, TPS remains strong at 212 tokens per second.

The TTFT at high thinking level is 15.37 seconds. Users stare at a blank screen for more than 15 seconds before the first word appears, then see the response stream in quickly. At medium thinking level, TTFT drops below 5 seconds. At low thinking level, it drops further.

The cause is chain-of-thought processing. Reasoning models generate internal "thinking" tokens before producing the visible response. Those thinking tokens appear as TTFT from the user's perspective. The same model that leads on TPS looks unusably slow on TTFT in high-reasoning mode for any application where users are waiting at a screen.
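A back-of-the-envelope model makes the mechanism concrete. The sketch below assumes hidden thinking tokens are generated at the same rate as visible output and that nothing is streamed until thinking finishes; both are simplifying assumptions, and the function name and inputs are hypothetical rather than measured values for any model.

```python
def perceived_ttft_s(prefill_s: float, thinking_tokens: int, decode_tps: float) -> float:
    """Time until the first visible token when hidden reasoning tokens come first.

    Assumes thinking tokens are produced at `decode_tps` and none are shown
    to the user, so their generation time is added to the prefill time.
    """
    if decode_tps <= 0:
        raise ValueError("decode_tps must be positive")
    return prefill_s + thinking_tokens / decode_tps


# Hypothetical inputs: 0.5 s of prefill and a 200 tokens/second decode rate.
for thinking_tokens in (0, 500, 3000):
    wait = perceived_ttft_s(0.5, thinking_tokens, 200.0)
    print(f"{thinking_tokens:>4} thinking tokens: first visible token after {wait:.1f} s")
```

Under these assumptions, a few thousand hidden tokens are enough to turn a sub-second start into a wait of many seconds, even at a high decode rate.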

GPT-5.4-nano, released March 17, 2026, has the same architectural pattern. As a reasoning model priced at $0.20 per million input tokens, it handles multi-step tasks more reliably than non-reasoning models at comparable prices. The reasoning overhead shows in TTFT. For classification tasks, simple generation, or any workload where chain-of-thought adds overhead without adding value, the TTFT cost is real and can exceed what the application's latency budget allows.

MiMo-V2.5-Pro, released by Xiaomi on April 22, 2026, shows a different profile. It generates at 53.3 tokens per second across providers (median), with a TTFT of 3.76 seconds under a 10,000-token input test condition. Both metrics are worse than their respective category medians: TPS is slightly below the 59.9 t/s median for models of its size, while TTFT is above the 2.44s median. MiMo-V2.5-Pro uses Multi-Token Prediction (MTP) with three integrated modules, which Xiaomi reports as tripling output throughput during inference. The 1-trillion-parameter MoE architecture with 42 billion active parameters and hybrid attention (sliding window interleaved with global attention at a 6:1 ratio) produces a model with strong reasoning benchmark scores and a latency profile suited for non-interactive workloads.

Which Metric to Optimize By Application Type

Different applications have different bottlenecks:

  • Conversational chat and copilots: TTFT dominates. Users perceive lag before the first word, not between words once streaming starts. For these applications, a model with 1-second TTFT and 80 TPS usually feels faster than one with 10-second TTFT and 200 TPS, because the visible pause is the bottleneck.
  • Batch document processing and classification: TTFT matters far less. No human is waiting at a screen. Total throughput (aggregate tokens per second, or requests per second at scale) determines cost and turnaround time. A reasoning model with 15-second TTFT and 200 TPS processes documents faster than a non-reasoning model with 1-second TTFT and 50 TPS, given equivalent output length, once outputs exceed roughly 930 tokens per document. At shorter outputs, the 14-second TTFT gap outweighs the faster generation.
  • Code generation and agentic tasks: Both metrics matter, but differently per step. Agentic workflows make multiple sequential calls; TTFT accumulates across each step. For a 20-step agent workflow, a 3-second TTFT per call adds 60 seconds of total waiting across the chain. For single-shot code generation with long outputs, TPS drives the overall time.
  • Long-context document analysis: TPS dominates for the same reason as batch processing. But TTFT also increases with input length because prefill time scales with prompt tokens. Very long prompts can drive TTFT up on any model regardless of architecture.
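The arithmetic behind the batch and agentic cases can be sketched with a simple latency model: time for one request is TTFT plus output tokens divided by TPS. The function names below are hypothetical, and the model ignores concurrency, queueing, and network overhead, so treat the results as estimates for requests processed one at a time.

```python
def response_time_s(ttft_s: float, tps: float, output_tokens: int) -> float:
    """Estimated wall-clock time for one request: wait for the first token, then stream."""
    return ttft_s + output_tokens / tps


def break_even_tokens(ttft_a: float, tps_a: float, ttft_b: float, tps_b: float) -> float:
    """Output length at which slow-start, fast-stream model A catches up with model B.

    Solves ttft_a + n / tps_a == ttft_b + n / tps_b for n.
    Requires A to have the higher TPS and the higher TTFT.
    """
    if tps_a <= tps_b or ttft_a <= ttft_b:
        raise ValueError("model A must have both higher TPS and higher TTFT")
    return (ttft_a - ttft_b) / (1 / tps_b - 1 / tps_a)


def agent_chain_wait_s(steps: int, ttft_s: float) -> float:
    """Total time spent waiting for first tokens across sequential agent calls."""
    return steps * ttft_s


# Batch example from the list: 15 s TTFT at 200 TPS versus 1 s TTFT at 50 TPS.
n = break_even_tokens(15, 200, 1, 50)
print(f"break-even output length: about {n:.0f} tokens")
for tokens in (500, 2000):
    a = response_time_s(15, 200, tokens)
    b = response_time_s(1, 50, tokens)
    print(f"{tokens} tokens: reasoning model {a:.1f} s, non-reasoning model {b:.1f} s")

# Agentic example from the list: 20 sequential calls at 3 s TTFT each.
print(f"agent chain TTFT total: {agent_chain_wait_s(20, 3):.0f} s")
```

The break-even point is why output length belongs in any batch comparison: below it, the lower-TTFT model finishes each document sooner despite its lower TPS.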

How to Read Benchmark Numbers Without Being Misled

Four practices reduce the risk of choosing a model based on misleading benchmark data:

Check the thinking level when reading reasoning model benchmarks. Gemini 3.5 Flash at "low" thinking has a fundamentally different TTFT than at "high" thinking. Benchmarks that do not specify the thinking level can be comparing different product configurations.

Look at p95, not just median. A median TTFT of 597ms with a p95 of 612ms (as measured on Claude Haiku 4.5 in recent benchmarks) indicates infrastructure that handles load predictably. A median TTFT of 500ms with a p95 of 4,000ms indicates a system where tail latency is 8x worse than typical: one request in twenty takes at least 4 seconds to start. For user-facing applications, p95 is the number that determines how often users notice a problem.
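For teams computing these numbers from their own logs, the sketch below uses the nearest-rank percentile method, one of several common definitions (others interpolate between samples and give slightly different values). The function name and the sample data are hypothetical.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: int) -> float:
    """Smallest sample such that at least pct% of samples are <= it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up TTFT samples in seconds: most requests start fast, two are slow.
ttft_samples = [0.5] * 18 + [4.0] * 2
median = percentile_nearest_rank(ttft_samples, 50)
p95 = percentile_nearest_rank(ttft_samples, 95)
print(f"median={median:.1f}s  p95={p95:.1f}s  ratio={p95 / median:.0f}x")
```

In this made-up sample the median hides the slow requests completely, which is the pattern the p95 figure is meant to expose.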

Note the input token length of the test. Artificial Analysis updated its default benchmark workload to 10,000 input tokens to better reflect production use cases. A TTFT measured on 100-token prompts will be significantly lower than the same model tested on 10,000-token prompts, because prefill time scales with input length. MiMo-V2.5-Pro's 3.76s TTFT is measured at the 10,000-token workload, not at a short conversational prompt.

Separate provider performance from model performance. The same model served by different providers produces different TTFT and TPS numbers. For MiMo-V2.5-Pro, TPS ranges from 51.4 tokens per second on GMI Cloud to 66.5 on DeepInfra. Choosing a provider for a specific model matters as much as choosing the model.

Accessing These Models Through GMI Cloud

GPT-5.4-nano, Gemini 3.5 Flash, and MiMo-V2.5-Pro are accessible through GMI Cloud's MaaS layer under a single API key. GMI Cloud's MiMo-V2.5-Pro performance is tracked on Artificial Analysis, which records 51.4 tokens per second and $0.51 per million blended tokens as of May 2026.

For teams evaluating latency profiles before committing to a model for production, running comparison benchmarks across these models on the same prompt sets is more reliable than reading published benchmarks with different test conditions. The unified API surface on GMI Cloud makes that comparison straightforward: change the model identifier, keep everything else constant, and measure TTFT and TPS on prompts that match actual production use.
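One way to structure such a comparison is sketched below. It is not GMI Cloud's API: `stream_fn` is a hypothetical adapter you would write around whatever streaming client you use, and `fake_stream` with its model names and delays is a made-up stand-in so the sketch runs without network access. It also counts streamed chunks as tokens, which is only an approximation when a provider batches several tokens per chunk.

```python
import time
from typing import Callable, Dict, Iterable

# Hypothetical adapter signature: (model_id, prompt) -> iterable of streamed chunks.
StreamFn = Callable[[str, str], Iterable[str]]


def measure(stream_fn: StreamFn, model: str, prompt: str) -> Dict[str, float]:
    """Time one streamed response and return TTFT, TPS, and chunk count."""
    start = time.perf_counter()
    first = last = None
    count = 0
    for _chunk in stream_fn(model, prompt):
        last = time.perf_counter()
        if first is None:
            first = last
        count += 1
    if first is None:
        raise RuntimeError("stream produced no tokens")
    window = last - first
    tps = (count - 1) / window if window > 0 else 0.0
    return {"ttft_s": first - start, "tps": tps, "tokens": count}


def fake_stream(model: str, prompt: str) -> Iterable[str]:
    """Stand-in for a real client; the delays are invented for illustration."""
    first_delay, gap = {"model-a": (0.05, 0.002), "model-b": (0.01, 0.005)}[model]
    time.sleep(first_delay)
    for i in range(20):
        if i:
            time.sleep(gap)
        yield "tok"


# Same prompt, same harness; only the model identifier changes.
prompt = "Summarize the attached report in three bullet points."
results = {model: measure(fake_stream, model, prompt) for model in ("model-a", "model-b")}
for model, r in results.items():
    print(f"{model}: TTFT={r['ttft_s']:.3f}s  TPS={r['tps']:.0f}  tokens={r['tokens']}")
```

In practice each model would be measured over many prompts and repeated runs, with the per-request results fed into median and p95 calculations rather than read from a single call.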

Model access and documentation are at console.gmicloud.ai and docs.gmicloud.ai.

Benchmark the Metric That Matches Your Bottleneck

TTFT and TPS do not trade off against each other the way quality and cost do. They measure different things. A high-TPS model with high TTFT can be the right choice for batch workloads and, at the same time, the wrong choice for chat interfaces.

The productive starting point is identifying which metric determines the user experience for the specific deployment. For chat, that is almost always TTFT. For batch processing, it is almost always TPS or total throughput. For agentic workflows, TTFT accumulates across sequential steps. For long-document analysis, both matter at different stages of the same request.

Running a benchmark that measures the metric that matches the actual bottleneck produces actionable data. Running the other metric produces a number that looks informative but does not predict the production experience.

Colin Mo
