Ultra-Low Latency LLM Inference Is Three Different Numbers, and Optimizing the Wrong One Hides the Problem

April 13, 2026

A team reports an average response time of 800 milliseconds and calls the service fast. Then a customer complains that the app freezes for three seconds, sometimes. Average latency hid the truth. LLM serving latency is not one number. It is at least three, and they move independently: time to first token, the gap between tokens, and the slow tail that only shows up at the 99th percentile. A chat that feels instant and one that feels broken can share the same average while differing entirely on the tail. This article defines the three metrics that matter, shows how to target each, and gives a practical optimization order.

The Three Latency Numbers That Decide Perceived Speed

Streaming changed how latency is felt. Because tokens arrive one at a time, users react to when text starts and how smoothly it flows, not to when the full answer is done. Three metrics capture that experience.

Time to First Token (TTFT)

TTFT is the delay between sending a prompt and seeing the first token. It is dominated by the prefill phase, where the model processes the entire input prompt before generating anything. Long prompts and large models push TTFT up. For interactive chat, TTFT is what users perceive as responsiveness, and targets under a few hundred milliseconds feel immediate.
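A minimal sketch of measuring TTFT on the client side, assuming the response arrives as an iterable of tokens. The name `measure_stream` and the injectable `clock` parameter are illustrative and not part of any particular SDK. A scripted clock stands in for a live model so the numbers are reproducible.

```python
import time
from typing import Callable, Iterable, List, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, List[float]]:
    """Consume a token stream and return (ttft_seconds, arrival_offsets).

    The clock starts just before iteration begins. With a lazy stream that
    sends the request on first read, the first offset therefore covers
    queueing, prefill, and network time together.
    """
    start = clock()
    arrivals: List[float] = []
    for _ in tokens:
        arrivals.append(clock() - start)
    if not arrivals:
        raise ValueError("stream produced no tokens")
    return arrivals[0], arrivals


# Hypothetical example: scripted timestamps instead of a live model.
scripted_times = iter([0.0, 0.25, 0.27, 0.29])
ttft, arrivals = measure_stream(
    ["The", " answer", " is"], clock=lambda: next(scripted_times)
)
print(f"TTFT: {ttft * 1000:.0f} ms over {len(arrivals)} tokens")
```

In real use, pass the streaming iterator from your client library and keep the default `time.perf_counter` clock.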

Inter-Token Latency (ITL)

ITL is the gap between successive tokens during generation. It sets the streaming speed, often reported as its inverse, tokens per second. A model generating at 50 t/s emits a token roughly every 20 milliseconds, which is faster than most people read. ITL is governed by memory bandwidth during the decode phase, because generating each token requires reading the model weights from GPU memory.
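The arithmetic behind ITL can be sketched in a few lines. The conversion between ITL and tokens per second is exact. The bandwidth floor is only a rough estimate, assuming a single sequence, every weight read once per token, and no KV cache traffic or kernel overhead. The 8-billion-parameter model, 2 bytes per parameter, and 2,000 GB/s figure are hypothetical inputs, not measurements.

```python
from typing import List, Sequence


def inter_token_gaps_ms(arrivals_s: Sequence[float]) -> List[float]:
    """Gaps between successive token arrival times, in milliseconds."""
    return [(b - a) * 1000.0 for a, b in zip(arrivals_s, arrivals_s[1:])]


def itl_ms_from_tps(tokens_per_second: float) -> float:
    """Convert streaming speed into the average gap between tokens."""
    return 1000.0 / tokens_per_second


def bandwidth_floor_itl_ms(
    n_params: float, bytes_per_param: float, bandwidth_gb_s: float
) -> float:
    """Rough lower bound on ITL for one sequence.

    Assumes all weights are read from memory once per generated token and
    ignores KV cache reads, compute time, and launch overhead.
    """
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / (bandwidth_gb_s * 1e9) * 1000.0


print(f"50 t/s is one token every {itl_ms_from_tps(50):.0f} ms")
# Hypothetical hardware and model size, for illustration only.
floor = bandwidth_floor_itl_ms(n_params=8e9, bytes_per_param=2, bandwidth_gb_s=2000)
print(f"Bandwidth floor: about {floor:.1f} ms per token")
```

Measured ITL will sit above this floor. The gap between the two shows how much of the delay comes from something other than weight reads.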

p99 Tail Latency

p99 is the latency that 99% of requests stay under. It exposes what averages hide: the occasional request that stalls from queueing, a cold start, a noisy neighbor, or an unusually long generation. A service with a great average and a terrible p99 feels unreliable, because users remember the freezes. Tail latency is the metric that production SLAs are written against.
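To see how an average hides the tail, here is a small sketch using the nearest-rank definition of a percentile. Monitoring tools may interpolate differently. Both latency lists are made up for illustration, chosen so they share the 800 ms average from the opening example.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Smallest value that at least pct% of the samples do not exceed."""
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds.
steady = [800] * 100                # every request takes 800 ms
spiky = [750] * 98 + [3250] * 2     # two requests stall for over 3 s

for name, latencies in (("steady", steady), ("spiky", spiky)):
    mean = sum(latencies) / len(latencies)
    p50 = percentile_nearest_rank(latencies, 50)
    p99 = percentile_nearest_rank(latencies, 99)
    print(f"{name}: mean={mean:.0f} ms  p50={p50} ms  p99={p99} ms")
```

Both services report the same mean, and the spiky one even has the better median. Only the p99 shows the freezes.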

Targets and the Architecture Behind Each

The three metrics respond to different fixes, which is why a single optimization rarely moves all of them.

Metric | What it measures | Typical good target | Primary lever
TTFT | Prompt to first token | Under 300 ms | Prefill speed, prompt length, compute
ITL | Gap between tokens | Under 25 ms (40+ t/s) | Memory bandwidth, batching
p99 | Worst-case under load | Within 2x of median | Capacity headroom, queueing, warm pools

The lever column is the practical guide. TTFT is largely a prefill and compute problem, so it improves with faster cards and shorter or cached prompts. ITL is a bandwidth problem, so it improves with higher-bandwidth GPUs and efficient batching. p99 is a systems problem, so it improves with headroom, smart queueing, and avoiding cold starts.
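The table can be turned into a simple check. The sketch below uses the thresholds and levers listed above. The `LatencyReport` type and `missed_targets` function are hypothetical names, and the p50 and p99 fields are assumed to be end-to-end request latencies.

```python
from dataclasses import dataclass
from typing import List

# Targets and levers copied from the table above.
TTFT_TARGET_MS = 300.0
ITL_TARGET_MS = 25.0
P99_TO_MEDIAN_MAX = 2.0

LEVERS = {
    "TTFT": "prefill speed, prompt length, compute",
    "ITL": "memory bandwidth, batching",
    "p99": "capacity headroom, queueing, warm pools",
}


@dataclass
class LatencyReport:
    ttft_ms: float   # time to first token
    itl_ms: float    # average gap between tokens
    p50_ms: float    # median request latency
    p99_ms: float    # 99th-percentile request latency


def missed_targets(report: LatencyReport) -> List[str]:
    """Return the metrics that miss their target, in TTFT, ITL, p99 order."""
    missed = []
    if report.ttft_ms >= TTFT_TARGET_MS:
        missed.append("TTFT")
    if report.itl_ms >= ITL_TARGET_MS:
        missed.append("ITL")
    if report.p99_ms > P99_TO_MEDIAN_MAX * report.p50_ms:
        missed.append("p99")
    return missed


# Hypothetical measurements: fast start, smooth stream, bad tail.
report = LatencyReport(ttft_ms=250, itl_ms=20, p50_ms=800, p99_ms=3250)
for metric in missed_targets(report):
    print(f"{metric} misses its target; look at {LEVERS[metric]}")
```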

An Optimization Order That Works

A worked sequence keeps effort focused. First, measure all three under realistic load, not in isolation, because batching trades ITL against throughput. Second, fix TTFT with prompt caching and adequate prefill compute, since first-token delay is the most visible. Third, lift ITL by moving to a higher-bandwidth GPU or tuning batch size. Fourth, attack p99 by adding capacity headroom and keeping instances warm so cold starts do not spike the tail. Optimizing p99 before TTFT is a common mistake; the tail does not matter if the median already feels slow.

A concrete trace shows how the three numbers compose into perceived speed. Imagine a chat request with a 2,000-token prompt and a 300-token answer. If prefill takes 250 milliseconds, that is your TTFT, the pause before any text appears. If the model then streams at 50 t/s, a token lands every 20 milliseconds, so the full 300-token answer finishes about six seconds after the first token. To the user, the response felt quick because text started in a quarter second, even though total completion took over six seconds. Now suppose slightly more than one request in a hundred hits a cold worker and waits two extra seconds before prefill begins. Those requests barely move the average, but they set the p99 the SLA is judged on. Reading the trace this way makes the optimization order obvious: shorten the visible TTFT first, keep ITL smooth so the stream does not stutter, and protect the tail so the unlucky request does not define the service.
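The trace can be reproduced as plain arithmetic. The prefill time, streaming speed, answer length, and two-second cold-start penalty come from the example above. The sketch assumes two cold requests per hundred, because under the nearest-rank definition the p99 lands on a cold request only once cold starts exceed 1% of traffic.

```python
import math

# Figures from the worked example.
prefill_ms = 250            # TTFT on a warm worker
tokens_per_second = 50
answer_tokens = 300
cold_start_penalty_ms = 2000

itl_ms = 1000 / tokens_per_second       # gap between tokens
stream_ms = answer_tokens * itl_ms      # time spent streaming the answer
total_ms = prefill_ms + stream_ms       # what a stopwatch would show

print(f"TTFT {prefill_ms} ms, ITL {itl_ms:.0f} ms, total {total_ms / 1000:.2f} s")

# Assumption: 2 of every 100 requests hit a cold worker.
ttfts = sorted([prefill_ms] * 98 + [prefill_ms + cold_start_penalty_ms] * 2)
mean_ttft = sum(ttfts) / len(ttfts)
p99_ttft = ttfts[math.ceil(99 * len(ttfts) / 100) - 1]  # nearest-rank p99

print(f"Mean TTFT {mean_ttft:.0f} ms, p99 TTFT {p99_ttft} ms")
```

The mean TTFT moves by a few tens of milliseconds, while the p99 jumps by the full cold-start penalty.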

Matching Models to the Latency Goal

Model choice sets the latency floor before any infrastructure tuning. Faster and lighter models generate tokens sooner and finish quicker.

GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. Its model library spans the speed-cost range that latency-sensitive teams compare: Gemini 3.5 Flash at 278 t/s for the fastest streaming, GPT-5.4-mini at $0.40/M input for a lighter mid-tier, and DeepSeek-V4-Pro at 55 to 60 t/s for a low-cost high-throughput option. GMI Cloud's bare metal instances run with no hypervisor, delivering 100% of the advertised memory bandwidth that inter-token latency depends on, which removes a source of ITL variance that virtualized instances introduce. GMI Cloud is best suited for teams whose latency target is bound by inter-token speed or tail latency rather than by raw model size, since its model range and full-bandwidth hardware address both. The platform reports under 200 ms average cross-region latency and a 99.99% availability SLA, both of which bear directly on the p99 tail. You can review models and integration details at console.gmicloud.ai and docs.gmicloud.ai.

One Distinction Worth Making Explicit

Throughput and latency are not the same goal, and tuning for one can hurt the other. Larger batch sizes raise total tokens per second across all users, which is throughput, but they can increase the latency any single user sees, because their request waits to be batched. A serving setup optimized for maximum throughput on a benchmark can deliver worse interactive latency than a setup tuned for small batches. Decide first whether you are serving a few latency-critical sessions or maximizing aggregate volume, because the batching choice flows from that.
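A toy model makes the trade-off concrete. It assumes each decode step costs a fixed base time plus a small increment per sequence in the batch. The constants are invented for illustration and do not describe any real GPU or serving stack, and the model ignores the time a request spends waiting to join a batch.

```python
def step_ms(batch_size: int, base_ms: float = 16.0, per_seq_ms: float = 0.5) -> float:
    """Toy decode step time: every user in the batch sees this as their ITL."""
    return base_ms + per_seq_ms * batch_size


def aggregate_tps(batch_size: int, base_ms: float = 16.0, per_seq_ms: float = 0.5) -> float:
    """Total tokens per second across all sequences in the batch."""
    return batch_size * 1000.0 / step_ms(batch_size, base_ms, per_seq_ms)


def largest_batch_under(
    itl_target_ms: float, base_ms: float = 16.0, per_seq_ms: float = 0.5
) -> int:
    """Largest batch size whose step time stays strictly under the ITL target."""
    if per_seq_ms <= 0:
        raise ValueError("per_seq_ms must be positive")
    batch = 0
    while step_ms(batch + 1, base_ms, per_seq_ms) < itl_target_ms:
        batch += 1
    return batch


for batch in (1, 8, 32):
    print(
        f"batch={batch:>2}  per-user ITL={step_ms(batch):.1f} ms  "
        f"aggregate={aggregate_tps(batch):.0f} t/s"
    )
print("Largest batch under a 25 ms ITL target:", largest_batch_under(25.0))
```

In this toy model, aggregate throughput keeps rising with batch size while each user's ITL gets worse. That is why the batching choice depends on which of the two you are serving.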

Choosing for the Latency That Binds

The right setup depends on which of the three numbers your users actually feel.

  • Best for interactive chat: optimize TTFT first, with a fast model like Gemini 3.5 Flash and prompt caching.
  • Best for long streaming responses: optimize ITL, with high memory bandwidth and bare metal to avoid hypervisor overhead.
  • Best for SLA-bound production: optimize p99, with capacity headroom and warm pools.
  • Not ideal for latency-critical paths: maximum-batch throughput tuning, which trades single-request speed for aggregate volume.

Pick the Metric Your Users Feel, Then Tune for It

Averages are comforting and misleading. Before tuning anything, instrument TTFT, ITL, and p99 under real load, decide which of the three your users actually feel, and optimize that one first. A service that is honest about its tail latency and fast where it counts will feel quicker than one with a better average and a worse worst case.

Colin Mo
