
AI Inference Latency: What Providers Report vs What You Actually Get

April 27, 2026

A provider claims 50ms inference latency. After signing up and deploying a model, the actual measurement reads 300ms under real traffic. What happened? Providers report latency under ideal conditions: light load, small batch, short prompts. Production latency includes queue wait, cold start, batch contention, and network overhead. For any team building latency-sensitive AI features, understanding this gap is essential to choosing the right infrastructure. This article covers:

  • TTFT: the metric that determines whether your interactive app feels responsive
  • Throughput: how fast tokens flow after the first one
  • p95 latency: where provider marketing falls apart under real load

Three Metrics Define the Full Latency Picture

Inference latency isn't one number. It decomposes into three independent metrics that reveal different performance characteristics. Optimizing for one can worsen another. Providers who report only one metric are showing their best angle. All three are needed to evaluate honestly.

TTFT: The Metric Interactive Applications Live or Die By

Time-to-first-token (TTFT) measures the delay before the first output token arrives. Here's what drives it, with a measurement sketch after the list:

  • Model loading and prefill dominate TTFT. The model processes your entire input prompt before generating the first output token. Longer prompts mean longer prefill. A 2,000-token input on a 70B model takes significantly longer to prefill than a 100-token input.

  • Batch queue depth adds invisible delay. If 50 requests are queued ahead of a new one, its TTFT includes queue wait time that no benchmark captures. Providers running at 80%+ GPU utilization have deeper queues.

  • Target ranges: Interactive chat and voice AI need TTFT under 100ms. A voice AI customer saw TTFT drop from 300ms to 40ms by migrating to H200-based infrastructure with optimized serving (continuous batching + speculative decoding). Search needs under 200ms. Batch processing doesn't care about TTFT at all.
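
A quick way to sanity-check a provider's TTFT claim is to time a streaming request yourself. The sketch below assumes an OpenAI-compatible /v1/chat/completions endpoint that streams server-sent events; the URL, API key, and model id are placeholders to swap for your provider's values:

```python
import json
import time

import requests

# Placeholder endpoint, key, and model id -- substitute your provider's values.
URL = "https://api.example.com/v1/chat/completions"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
MODEL = "llama-3.1-70b"

def measure_ttft(prompt: str) -> float:
    """Seconds from sending the request to the first streamed content token."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    start = time.perf_counter()
    with requests.post(URL, headers=HEADERS, json=payload, stream=True, timeout=60) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            # SSE frames look like: data: {"choices":[{"delta":{"content":"..."}}]}
            if not line or not line.startswith(b"data: "):
                continue
            chunk = line[len(b"data: "):]
            if chunk == b"[DONE]":
                break
            choices = json.loads(chunk).get("choices") or []
            if choices and choices[0].get("delta", {}).get("content"):
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any content token arrived")

print(f"TTFT: {measure_ttft('Summarize the attached report.') * 1000:.0f} ms")
```

Because it runs from the client, this measurement includes network overhead and queue wait, exactly the components benchmark pages tend to omit. Run it with your real prompt lengths, since prefill scales with input size.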

Throughput: How Fast Tokens Flow After the First One

Tokens per second measures generation speed after TTFT. This metric determines perceived response speed for long outputs; a throughput sketch follows the list:

  • Memory bandwidth is the bottleneck during decode. Each token requires reading KV-cache from VRAM. H200's 4.8 TB/s delivers more tokens per second than H100's 3.35 TB/s on the same model. The bandwidth gap translates directly to throughput difference.

  • Continuous batching is the biggest throughput lever. Instead of waiting for the longest request in a batch to finish before starting new requests, continuous batching interleaves new requests as slots open. This alone provides 2-4x throughput improvement on the same hardware.

  • Speculative decoding adds another 2-3x. A small draft model (8B parameters) predicts tokens, and the main model (70B) verifies them in parallel. When the draft model predicts correctly (which it does 70-85% of the time for standard text), multiple tokens get confirmed per step.
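
Decode throughput can be estimated from the same kind of stream by counting content chunks after the first one arrives. Here is a minimal sketch with the same placeholder endpoint as the TTFT example; note that one SSE chunk is roughly, not exactly, one token, so precise counts need the model's tokenizer:

```python
import json
import time

import requests

URL = "https://api.example.com/v1/chat/completions"  # placeholder, as above
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def measure_decode_tps(prompt: str, max_tokens: int = 512) -> float:
    """Tokens per second after the first token, i.e. steady-state decode speed."""
    payload = {
        "model": "llama-3.1-70b",  # placeholder model id
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "max_tokens": max_tokens,
    }
    first_token_at, n_tokens = None, 0
    with requests.post(URL, headers=HEADERS, json=payload, stream=True, timeout=120) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            chunk = line[len(b"data: "):]
            if chunk == b"[DONE]":
                break
            choices = json.loads(chunk).get("choices") or []
            if choices and choices[0].get("delta", {}).get("content"):
                n_tokens += 1  # ~1 token per content chunk
                if first_token_at is None:
                    first_token_at = time.perf_counter()
    if first_token_at is None or n_tokens < 2:
        raise RuntimeError("not enough streamed tokens to measure")
    return (n_tokens - 1) / (time.perf_counter() - first_token_at)

print(f"decode throughput: {measure_decode_tps('Write a 400-word brief.'):.1f} tok/s")
```

As a rough cross-check: single-sequence decode is bounded by reading the weights once per token, so a 70B FP8 model (~70 GB of weights) on H200's 4.8 TB/s tops out near 4.8e12 / 7e10 ≈ 68 tokens/sec per sequence, and less once KV-cache reads are counted.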

p95 Latency: Where Provider Marketing Falls Apart

p95 (95th percentile) latency shows what 1 in 20 requests actually experiences. This is where the gap between marketing and reality gets widest (a percentile sketch follows the list):

  • Median vs p95 gap reveals load-handling quality. Some platforms show a median of 50ms but p95 of 500ms. That means 5% of your users experience 10x worse latency. For a product serving 100,000 requests/day, that's 5,000 slow responses daily.

  • Causes of tail latency: garbage collection pauses in the serving stack, GPU memory reallocation between models, thermal throttling under sustained load, and network congestion on shared infrastructure. These are infrastructure problems, not model problems.

  • What to demand from providers: p95 latency under specified load conditions (e.g., "p95 under 100ms at 500 concurrent requests on Llama 70B FP8"). If a provider won't share p95 numbers, their tail latency is probably bad.
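
Computing percentiles from your own samples takes only a few lines. Here's a minimal sketch using the nearest-rank method; the latency list is illustrative, not measured data:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value at the ceil(p * n)-th sorted sample."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p * len(ranked)) - 1)
    return ranked[k]

# End-to-end latencies in ms from a load test (illustrative values).
latencies_ms = [52, 49, 61, 55, 58, 50, 47, 63, 488, 51,
                54, 59, 49, 56, 502, 53, 48, 60, 57, 46]

p50 = percentile(latencies_ms, 0.50)
p95 = percentile(latencies_ms, 0.95)
print(f"p50={p50} ms  p95={p95} ms  gap={p95 / p50:.1f}x")
```

Two slow outliers in twenty samples are enough to push p95 to roughly 9x the median here, the same shape as the 50ms-median, 500ms-p95 platform described above.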

How to Actually Compare Providers

Don't trust benchmark pages. Run independent tests, then crunch the results with a harness like the one sketched after this list:

  • Use the actual workload: Synthetic benchmarks (short prompts, empty queues) produce numbers that never appear in production. Send the real prompt distribution, at the expected concurrency level, for at least 7 days.

  • Measure all three metrics: Track TTFT (p50 and p95), throughput (tokens/sec at load), and p95 end-to-end latency. Some providers optimize for TTFT at the expense of throughput. Others batch aggressively for throughput but tank TTFT.

  • Test during peak hours: Run load tests at peak internet traffic times (not 3 AM). GPU contention on shared infrastructure is worst during business hours in the provider's primary region.

  • Ask about infrastructure: What GPU class? What serving framework (vLLM, TensorRT-LLM, Triton)? What quantization? Do they use speculative decoding? Providers who answer openly tend to perform better.
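
Putting the pieces together, a minimal load-test harness fires requests at your target concurrency and reports percentiles. The endpoint, model id, and prompts file below are hypothetical placeholders; a real test should also record TTFT per request and repeat across peak hours:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://api.example.com/v1/chat/completions"  # placeholder endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
CONCURRENCY = 50  # your expected production load

def timed_request(prompt: str) -> float:
    """End-to-end latency in milliseconds for one non-streaming request."""
    payload = {
        "model": "llama-3.1-70b",  # placeholder model id
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    start = time.perf_counter()
    requests.post(URL, headers=HEADERS, json=payload, timeout=120).raise_for_status()
    return (time.perf_counter() - start) * 1000

# Prompts sampled from production logs (hypothetical file name).
prompts = open("production_prompts.txt").read().splitlines()

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = sorted(pool.map(timed_request, prompts))

print(f"p50={latencies[len(latencies) // 2]:.0f} ms  "
      f"p95={latencies[int(len(latencies) * 0.95)]:.0f} ms")
```

Keep the raw samples from each run so you can compare providers on the same prompt set rather than on their published medians.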

Latency-Optimized Inference Infrastructure

GMI Cloud runs inference on H200 SXM (4.8 TB/s memory bandwidth) with pre-configured TensorRT-LLM and vLLM stacks. A production voice AI customer measured TTFT improvement from 300ms to 40ms after migrating from A100-based infrastructure to GMI Cloud's H200 platform, achieved through continuous batching and speculative decoding. As an NVIDIA Preferred Partner built on NVIDIA Reference Platform Cloud Architecture, the platform offers H100 from $2.00/GPU-hour and H200 from $2.60/GPU-hour for self-hosted inference, or 100+ pre-deployed models via per-request MaaS with no GPU management. Verify current rates on the pricing page.

Colin Mo
