Compare Inference Latency Across AI Inference Providers
April 08, 2026
Inference latency is time-to-first-token (TTFT) plus decode time (output length divided by decode speed), and your hardware determines both more than anything else you'll configure in software. If you're benchmarking providers and getting inconsistent numbers, you're probably conflating three distinct metrics (TTFT, decode speed, and cold start) that don't move together.
GMI Cloud runs H100 SXM and H200 SXM clusters with NVLink 4.0 and 3.2 Tbps InfiniBand, which is the infrastructure configuration that actually drives the numbers below.
What Inference Latency Actually Measures
Latency in LLM inference has three components, and conflating them is how teams end up optimizing the wrong thing. TTFT (time-to-first-token) is how long a user waits before seeing any output. This is dominated by prefill compute, which scales with input sequence length and model size.
Decode speed (tokens per second) determines how fast the rest of the response streams out after that first token. This is almost entirely a function of GPU memory bandwidth, not compute.
Cold start is the time to load a model into GPU VRAM before any inference can begin, and it's the latency component that's invisible in steady-state benchmarks but brutal in bursty production traffic.
Understanding which component is your bottleneck changes every optimization decision you'll make downstream.
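The split above is easy to instrument yourself if you record a wall-clock timestamp for each streamed token. A minimal sketch (the function name and timestamp convention are illustrative, not from any particular client library):

```python
def latency_metrics(request_start, token_times):
    """Derive TTFT and decode speed from per-token arrival timestamps.

    request_start: wall-clock time the request was sent (seconds).
    token_times: arrival time of each streamed token, in order.
    """
    ttft = token_times[0] - request_start
    decode_time = token_times[-1] - token_times[0]
    # Decode speed counts only tokens after the first, since those are the
    # ones produced during the decode phase rather than prefill.
    decode_tps = (len(token_times) - 1) / decode_time if decode_time > 0 else 0.0
    return ttft, decode_tps
```

Cold start has to be measured separately, by comparing the first request against a warm endpoint: the difference shows up entirely inside TTFT.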
Provider and Hardware Comparison
Latency performance differences between providers come down to hardware class, software stack, and cluster configuration. Here's how the major GPU options compare on the metrics that drive real production decisions.
| GPU | VRAM | Memory BW | FP16 TFLOPS | FP8 TFLOPS | NVLink BW | Relative Inference Speed |
|---|---|---|---|---|---|---|
| H200 SXM | 141 GB HBM3e | 4.8 TB/s | 989 | 1,979 | 900 GB/s bidirectional | Fastest (up to 1.9x vs H100 on Llama 2 70B) |
| H100 SXM | 80 GB HBM3 | 3.35 TB/s | 989 | 1,979 | 900 GB/s bidirectional | Baseline reference |
| A100 80GB | 80 GB HBM2e | 2.0 TB/s | 312 | N/A (no FP8) | 600 GB/s | ~1.7x slower than H100 on decode |
| L4 | 24 GB GDDR6 | 300 GB/s | 121 | 242 | PCIe only | Best for small models, batch-1 |
Sources: NVIDIA H100 Tensor Core GPU Datasheet (2023); NVIDIA H200 Tensor Core GPU Product Brief (2024); NVIDIA A100 Tensor Core GPU Datasheet; NVIDIA L4 Tensor Core GPU Datasheet.
The H200 benchmark cited above (up to 1.9x inference speedup on Llama 2 70B vs. H100) was measured by NVIDIA using TensorRT-LLM at FP8 precision, batch size 64, with 128 input / 2,048 output tokens. That gap widens as model size and context length increase, because both are memory-bandwidth-bound problems.
How GPU Memory Bandwidth Drives Decode Latency
Here's the thing most latency guides skip: during the decode phase, the GPU spends most of its time reading model weights from VRAM, not doing math. A 70B parameter model at FP16 precision occupies roughly 140 GB. On an H100 with 3.35 TB/s bandwidth, reading those weights takes about 42 milliseconds per token.
On an H200 at 4.8 TB/s, that drops to about 29 milliseconds per token.
That's the primary source of the speedup. More compute doesn't help if the weights can't get to the compute units fast enough. This is why the A100's lower memory bandwidth (2.0 TB/s) produces meaningfully slower decode rates than the H100 at the same model precision, even though both GPUs fit 80 GB of VRAM.
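That bandwidth arithmetic is simple enough to sanity-check in a few lines. A back-of-envelope roofline for decode (it ignores KV-cache reads and kernel launch overhead, so real per-token latencies will be somewhat higher):

```python
def decode_ms_per_token(weight_bytes, mem_bw_bytes_per_s):
    """Memory-bandwidth-bound decode floor: every generated token requires
    streaming the full set of model weights from VRAM once."""
    return weight_bytes / mem_bw_bytes_per_s * 1000

# 70B params at FP16 = ~140 GB of weights
weights = 140e9
h100 = decode_ms_per_token(weights, 3.35e12)  # ~42 ms/token
h200 = decode_ms_per_token(weights, 4.8e12)   # ~29 ms/token
a100 = decode_ms_per_token(weights, 2.0e12)   # ~70 ms/token
```

The same function also shows why FP8 helps: halving `weight_bytes` halves the floor, independent of compute throughput.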
The implication for provider selection: any provider claiming fast decode speeds on large models must be running high-bandwidth GPU memory. Ask which GPU SKU backs their inference endpoints before trusting latency benchmarks.
KV-Cache and TTFT at Long Context
TTFT grows with context length, and if you're running agentic workflows or retrieval-augmented generation with large context windows, this becomes your primary latency problem. The KV-cache grows as follows:
KV-cache bytes per request ≈ 2 (K and V) × n_layers × n_kv_heads × head_dim × seq_len × bytes_per_element.
For Llama 2 70B (80 layers, 8 KV heads, 128 head_dim) at FP16 precision with a 4K context window, that's roughly 1.3 GB per concurrent request. At 32K context, you're at roughly 10.7 GB per request. Since the FP16 weights alone (~140 GB) exceed a single H100's 80 GB, a 70B model is typically served tensor-parallel across an 8-GPU node: 640 GB of total VRAM minus 140 GB of weights leaves roughly 500 GB for KV-cache, or about 45 concurrent 32K-context sessions before you're memory-constrained.
An 8x H200 node (1,128 GB total) expands that to roughly 90 sessions on the same parameters.
More VRAM doesn't just mean bigger models. It means more concurrent context and higher throughput at long sequence lengths.
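The formula above wraps neatly into a small capacity-planning calculator (the shapes are Llama 2 70B's published configuration; this ignores activation memory and framework overhead, so treat the output as an upper bound):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-request KV-cache size: the leading 2 covers the K and V tensors."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Llama 2 70B: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 cache
per_req_4k = kv_cache_bytes(80, 8, 128, 4096)    # ~1.34 GB
per_req_32k = kv_cache_bytes(80, 8, 128, 32768)  # ~10.7 GB

def max_sessions(total_vram_bytes, weight_bytes, per_request_bytes):
    """Concurrent sessions that fit after model weights are resident."""
    return int((total_vram_bytes - weight_bytes) // per_request_bytes)
```

Swapping `bytes_per_elem=1` models an FP8 KV-cache, which doubles session capacity at the same context length.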
Optimization Techniques That Actually Move the Needle
Once you've picked the right hardware, there are three software-level techniques that materially reduce latency.
Quantization to FP8 reduces model weight size by half compared to FP16, which directly reduces the amount of data the memory subsystem has to move per token. H100 and H200 both support native FP8 compute.
For most models, FP8 quantization with TensorRT-LLM produces no perceptible quality degradation at batch sizes used in production.
Speculative decoding uses a smaller draft model to predict several tokens ahead, then verifies them in parallel on the main model. When the draft model is accurate (which it usually is for common token sequences), you get multiple tokens from a single large model forward pass.
Effective speedups of 2x to 3x have been reported on code generation tasks, though results vary heavily by domain and model pair.
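Under the standard independence assumption from the speculative decoding literature (each draft token is accepted with some fixed probability), the expected number of tokens emitted per target-model forward pass follows a truncated geometric series. A sketch for estimating the upside before you commit to a draft model:

```python
def spec_decode_expected_tokens(accept_prob, draft_len):
    """Expected tokens emitted per target-model forward pass, assuming each
    draft token is accepted independently with probability accept_prob.
    Equals sum(accept_prob**i for i in range(draft_len + 1)): an accepted
    run of drafts plus the token the verification pass produces itself."""
    if accept_prob == 1.0:
        return float(draft_len + 1)
    return (1 - accept_prob ** (draft_len + 1)) / (1 - accept_prob)

# A well-matched draft model on code generation might accept ~80% of drafts:
tokens_per_pass = spec_decode_expected_tokens(0.8, 4)  # ~3.36 tokens/pass
```

The gross speedup is this value discounted by the draft model's own cost, which is why domain-matched small draft models matter so much.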
Continuous batching keeps GPU utilization high by dynamically grouping requests regardless of where they are in the decode sequence. Without it, a GPU waits for every request in a batch to finish before starting the next batch. With it, new requests slot into the decode loop as soon as a slot opens.
This is now standard in vLLM and TensorRT-LLM, but it requires the serving stack to support it.
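A toy discrete simulation makes the difference concrete. It models decode as fixed-cost steps and ignores prefill, so it's an idealization rather than a serving benchmark, but it shows why slot reuse beats waiting for the slowest request in each batch:

```python
import heapq

def static_batch_time(output_lens, batch_size, step_ms=1.0):
    """Static batching: each batch runs until its longest request finishes."""
    total = 0.0
    for i in range(0, len(output_lens), batch_size):
        total += max(output_lens[i:i + batch_size]) * step_ms
    return total

def continuous_batch_time(output_lens, batch_size, step_ms=1.0):
    """Continuous batching: a finished request's slot is refilled
    immediately, so short requests never wait on long ones."""
    slots = [0.0] * batch_size  # time at which each slot frees up
    heapq.heapify(slots)
    finish = 0.0
    for n in output_lens:
        start = heapq.heappop(slots)
        end = start + n * step_ms
        finish = max(finish, end)
        heapq.heappush(slots, end)
    return finish

# One long request batched with short ones: the classic pathological case.
lens = [100, 10, 10, 10]
static = static_batch_time(lens, batch_size=2)        # 110 ms
continuous = continuous_batch_time(lens, batch_size=2)  # 100 ms
```

The gap widens as output-length variance grows, which is exactly the regime chat and agent traffic lives in.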
These three techniques stack. FP8 plus continuous batching plus speculative decoding can push an H100 cluster to 2 to 4 times the throughput of a naive HuggingFace Transformers serving setup on the same hardware.
Cold Start Latency and Why It's Underrated
Cold start is the latency component that doesn't show up in steady-state benchmarks but dominates the P99 experience for bursty traffic. Loading 80 GB of model weights from NVMe storage into GPU VRAM over a PCIe 4.0 x16 link (roughly 32 GB/s per direction theoretical) takes about 2.5 to 3 seconds under ideal conditions.
In practice, with OS page cache pressure and concurrent I/O, it can take 10 to 30 seconds.
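A rough transfer-time estimate makes the ideal-versus-real gap concrete. The effective bandwidth figure is the assumption doing all the work here; in production it's often a small fraction of the link's theoretical rate:

```python
def cold_start_seconds(model_gb, effective_bw_gb_s):
    """Host-to-device copy time only; real cold starts add checkpoint
    deserialization, CUDA context init, and I/O contention on top."""
    return model_gb / effective_bw_gb_s

ideal = cold_start_seconds(80, 32)  # ~2.5 s at full PCIe 4.0 x16 rate
loaded = cold_start_seconds(80, 4)  # ~20 s under contention
```

If your traffic is bursty enough that cold starts land on user requests, these seconds go straight into P99 TTFT.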
Providers that pre-warm popular models or keep them resident in VRAM eliminate this cost entirely for standard model endpoints. For custom models, you'll need to manage warm instance pools yourself or accept cold-start penalties at low traffic.
This is a concrete advantage of managed inference APIs for standard model workloads: they absorb the warm-pool engineering cost so you don't have to.
GMI Cloud Cluster Configuration for Low Latency
GMI Cloud H100 and H200 nodes ship pre-configured with the software stack that makes these optimization techniques practical. Each node runs 8 GPUs with NVLink 4.0 at 900 GB/s bidirectional aggregate per GPU on HGX/DGX platforms.
Inter-node connectivity runs at 3.2 Tbps InfiniBand, which matters for tensor-parallel serving across multiple nodes.
The pre-installed stack includes TensorRT-LLM, vLLM, and Triton Inference Server, so you can enable continuous batching and FP8 inference without configuring the serving layer from scratch. H100 runs at approximately $2.00/GPU-hour and H200 at approximately $2.60/GPU-hour. Check gmicloud.ai/pricing for current rates.
For teams that don't need full GPU instance control, the Inference Engine provides no-provisioning access to 100+ pre-deployed models, with pricing from $0.000001 to $0.50 per request.
Cold start is handled on the platform side (GMI Cloud Inference Engine page, snapshot 2026-03-03; check gmicloud.ai for current availability and pricing).
Conclusion
Inference latency breaks into TTFT, decode speed, and cold start. Hardware determines the ceiling on all three, and GPU memory bandwidth is the dominant driver of decode performance. The H200 leads on large model inference by a significant margin. The H100 is the right baseline for everything else.
Software optimizations (FP8, continuous batching, speculative decoding) stack on top of hardware selection and can compound to 2x to 4x throughput improvements.
When comparing providers, ask for hardware SKU, whether they run continuous batching, and what their warm-pool strategy is. Those three answers will tell you more than any marketing benchmark.
FAQ
Q: What's the single most impactful thing I can do to reduce LLM inference latency? Move to a higher memory bandwidth GPU. The jump from A100 (2.0 TB/s) to H100 (3.35 TB/s) reduces decode latency by roughly 40% on large models without changing anything else.
Software optimizations help, but they can't overcome a hardware bandwidth constraint.
Q: Does FP8 quantization hurt model quality? For most production models, the quality difference between FP16 and FP8 is negligible at typical batch sizes and sequence lengths. NVIDIA's TensorRT-LLM FP8 implementation uses calibration data to minimize accuracy loss.
Run your own evals on your specific task before committing, but don't assume FP8 means worse output.
Q: How does speculative decoding affect latency vs. throughput? Speculative decoding reduces per-request latency (particularly TTFT and time-to-last-token) but may reduce maximum throughput if the draft model adds overhead at very high batch sizes.
It's most beneficial for interactive, low-batch-size serving where latency is prioritized over throughput.
Q: What's a realistic TTFT target for production LLM APIs? Under 500ms TTFT is a common target for interactive applications. Sub-200ms is achievable with H100/H200 on models up to 13B parameters at moderate context lengths.
For 70B models at 8K+ context, 500ms to 1,500ms is more realistic depending on batch conditions and hardware configuration.
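Those targets line up with a compute-bound prefill estimate: prefill costs roughly 2 FLOPs per parameter per prompt token. A sketch (the MFU value, the achieved fraction of peak FLOPS, is an assumption; tensor parallelism across multiple GPUs divides the result further):

```python
def prefill_ttft_seconds(n_params, prompt_tokens, gpu_flops, mfu=0.5):
    """Lower-bound TTFT from prefill compute alone, assuming ~2 FLOPs per
    parameter per prompt token and a given model FLOPS utilization."""
    return 2 * n_params * prompt_tokens / (gpu_flops * mfu)

# 70B model, 8K prompt, single H100 at FP8 peak (1,979 TFLOPS), 50% MFU:
t_70b = prefill_ttft_seconds(70e9, 8192, 1.979e15)   # ~1.2 s
# 13B model, 2K prompt, same GPU: comfortably under 200 ms.
t_13b = prefill_ttft_seconds(13e9, 2048, 1.979e15)   # ~54 ms
```

The single-GPU 70B figure sits inside the 500 to 1,500 ms band quoted above, and splitting prefill across an 8-GPU node brings it down proportionally.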
Colin Mo
Build AI Without Limits
GMI Cloud helps you architect, deploy, optimize, and scale your AI strategies
