How to Benchmark AI Inference Platforms for Peak Performance in 2026
April 20, 2026
Why "Best Performance" Isn't One Number
You've seen the marketing claims: "Our platform is 40% faster." But faster at what? TTFT under light load? Throughput at peak concurrency? p95 latency during traffic spikes? Cold starts? A platform that dominates TTFT might choke on throughput. Another wins on cost-per-token but suffers on p95 tail latency. "Best performing" depends entirely on what you measure. This article walks through a four-metric framework that separates legitimate performance claims from marketing noise.
Four Metrics That Tell the Real Story
Professional inference benchmarking requires four independent metrics working together. TTFT measures responsiveness for user-facing applications. Tokens per second reveals throughput ceiling and how efficiently the platform handles batch concurrency. P95 latency exposes tail behavior when systems are under stress. Cold start time matters for serverless and spot-instance scenarios. Together, these four metrics prevent you from optimizing for the wrong variable. A platform can excel at TTFT while tanking on p95 latency, or show beautiful average throughput while cold starts destroy user experience. You'll learn how to test each metric rigorously, then interpret results in the context of your actual workload.
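One lightweight way to keep these four numbers together during a benchmark run is a simple record type. A minimal sketch; the field names and summary format here are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    """One benchmark run at a fixed concurrency and sequence-length profile."""
    ttft_ms: float          # time to first token (responsiveness)
    tokens_per_sec: float   # sustained decode throughput
    p95_latency_ms: float   # tail latency under load
    cold_start_s: float     # first-request latency after a platform restart

    def summary(self) -> str:
        return (f"TTFT {self.ttft_ms:.0f} ms | {self.tokens_per_sec:.0f} tok/s | "
                f"p95 {self.p95_latency_ms:.0f} ms | cold start {self.cold_start_s:.1f} s")

# Example: record one (hypothetical) run for later comparison.
run = BenchmarkResult(ttft_ms=200, tokens_per_sec=150, p95_latency_ms=800, cold_start_s=12)
print(run.summary())
```

Recording all four fields per run, rather than one headline number, is what makes later trade-off analysis possible.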
Test Methodology: Load Profiles, Concurrency, Sequence Lengths
How you test matters as much as what you test. Here's the methodology that separates signal from noise:
- Load profile setup defines the test environment. Start with a "ramp-up" phase: gradually increase concurrency from 1 to your target (e.g., 50 concurrent requests). Run steady-state for at least 5-10 minutes at target concurrency. Then ramp down. Never test only single-request scenarios; they hide contention and batching behavior.
- Sequence length variation exposes model behavior across input types. Test three scenarios: short input (50 tokens) with short output (20 tokens), medium input (500 tokens) with medium output (100 tokens), long input (2000 tokens) with long output (500 tokens). A platform might excel on short sequences but collapse on long ones due to memory fragmentation.
- Concurrency levels should match your actual traffic. If you serve 50 concurrent users, test at 50. If you peak at 200, test there. Testing at concurrency=1 is worse than useless; it's deceptive. Most platforms show "warm batch" metrics that disappear when real traffic arrives.
- Warm-up phase prevents JIT compilation and cache misses from skewing results. Run 100-500 test requests through the platform before collecting metrics. Cold-start metrics should be measured separately after platform restart, not mixed with warm-cache metrics.
The unintuitive rule: platforms often hide poor concurrency behavior. Always test at the concurrency level you'll actually use. A service that shows great single-request latency might catastrophically degrade at 50 concurrent users.
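The ramp-up, steady-state, measure-at-real-concurrency pattern above can be sketched as an async harness. This is a sketch, not a production tool: `send_request` is a stub that simulates a streaming call, and you would swap in your platform's actual client.

```python
import asyncio
import random
import time

async def send_request(prompt: str) -> tuple[float, int]:
    """Stub for one streaming inference call; replace with your platform's API.
    Returns (ttft_seconds, tokens_generated)."""
    start = time.perf_counter()
    await asyncio.sleep(random.uniform(0.05, 0.15))   # simulated time to first token
    ttft = time.perf_counter() - start
    await asyncio.sleep(random.uniform(0.10, 0.30))   # simulated remaining decode time
    return ttft, 64                                   # pretend 64 output tokens

async def run_load_test(target_concurrency: int, steady_state_s: float) -> dict:
    ttfts: list[float] = []
    total_tokens = 0

    async def worker() -> None:
        nonlocal total_tokens
        deadline = time.perf_counter() + steady_state_s
        while time.perf_counter() < deadline:
            ttft, tokens = await send_request("benchmark prompt")
            ttfts.append(ttft)
            total_tokens += tokens

    # Ramp up: stagger worker starts instead of firing all at once.
    start = time.perf_counter()
    tasks = []
    for _ in range(target_concurrency):
        tasks.append(asyncio.create_task(worker()))
        await asyncio.sleep(0.01)
    await asyncio.gather(*tasks)

    elapsed = time.perf_counter() - start
    ttfts.sort()
    return {
        "mean_ttft_s": sum(ttfts) / len(ttfts),
        "p95_ttft_s": ttfts[int(0.95 * (len(ttfts) - 1))],
        "tokens_per_sec": total_tokens / elapsed,
        "requests": len(ttfts),
    }

stats = asyncio.run(run_load_test(target_concurrency=10, steady_state_s=2.0))
print(stats)
```

Run the same harness at each concurrency level you care about (10, 25, 50, 100+) and compare the resulting dictionaries side by side; the degradation curve matters more than any single point.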
Hardware Ceiling and Runtime Impact
Hardware and software optimizations stack on top of each other; you need both. Here's what changes the game:
- H100 baseline (80 GB HBM3, 3.35 TB/s bandwidth) serves as the reference standard. A 70B model served at FP16 precision generates roughly 50-80 tokens/sec. With FP8 quantization and continuous batching, that jumps to 120-160 tokens/sec. The hardware ceiling is real, but software optimization unlocks 60-80% of theoretical maximum.
- H200 advantage (141 GB HBM3e, 4.8 TB/s bandwidth) doesn't just run larger models; it accelerates all models. The same 70B model on H200 hits 150-200 tokens/sec with continuous batching and FP8. That's 1.4-1.6x improvement over H100 in real tests, not the NVIDIA marketing number of 1.9x.
- GB200 performance (next-generation Blackwell architecture, available at $8.00/GPU-hour). Independent benchmarks for GB200 are still emerging. Early indicators suggest significant throughput gains over H200. This is where larger models become more practical. Cold start latency may increase slightly due to larger memory footprint.
- Runtime stack matters as much as hardware. TensorRT-LLM (NVIDIA) and vLLM (open source) show within 10% performance on the same hardware when both are fully optimized. Older serving stacks (e.g., custom Python implementations) lose 30-50% of theoretical hardware capacity. Ask your provider which serving runtime they use and what version.
| GPU | HBM Capacity | Memory Bandwidth | Approx. 70B Tokens/Sec (FP8 + batching) | Cost/Hour |
|---|---|---|---|---|
| H100 | 80 GB | 3.35 TB/s | 120-160 | $2.00 |
| H200 | 141 GB | 4.8 TB/s | 150-200 | $2.60 |
| GB200 | Next-gen Blackwell | Not yet published | Contact for benchmarks | $8.00 |
| B200 | ~192 GB | ~8.0 TB/s (est.) | 250-320 (est.) | $4.00 |
The unintuitive takeaway: GB200 costs 4x per hour, but if you're running 200B models, you might run 5x fewer GPUs, ending up 20% cheaper overall. Run the math for your model size.
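The arithmetic behind that takeaway is worth making explicit. A quick sketch using the table's numbers (mid-range throughput values; the 200B cluster sizes are the hypothetical 5x-fewer-GPUs scenario from the text, not measured configurations):

```python
def cost_per_million_tokens(price_per_gpu_hour: float, tokens_per_sec: float,
                            gpus_required: int = 1) -> float:
    """Cost in USD to generate 1M output tokens at sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpus_required * price_per_gpu_hour / tokens_per_hour * 1_000_000

# 70B model, FP8 + continuous batching; mid-range throughput from the table.
h100 = cost_per_million_tokens(2.00, 140)
h200 = cost_per_million_tokens(2.60, 175)
print(f"H100: ${h100:.2f}/M tokens, H200: ${h200:.2f}/M tokens")

# The 200B scenario from the text: 4x the hourly price, 5x fewer GPUs.
h100_cluster = 10 * 2.00   # hypothetical 10-GPU H100 deployment
gb200_cluster = 2 * 8.00   # 5x fewer GPUs at $8/hr
print(gb200_cluster / h100_cluster)  # 0.8 -> 20% cheaper per hour
```

Note that by the table's own numbers, H100 is slightly cheaper per 70B token than H200; H200's advantage shows up in latency, context length, and models that don't fit on H100. Always run this calculation for your model size before picking hardware.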
External Validation: Third-Party Benchmarks and Methodology
Marketing benchmarks lie. Third-party benchmarks tell the truth. Here's which sources to trust and why:
- Artificial Analysis publishes real-world provider benchmarks by querying actual APIs. They measure TTFT and tokens/sec under consistent load. No vendor influence, no cherry-picked results. Their data provides regular snapshots of provider performance.
- MLPerf Inference is the gold standard for reproducibility. Their benchmarks use standardized models, input sets, and measurement protocols. If a provider doesn't run MLPerf, ask why. MLPerf results can't be gamed because the test suite is published and reproducible.
- GenAI-Perf (NVIDIA's benchmark framework) is transparent and reproducible despite NVIDIA authorship. It measures TTFT, tokens/sec, and p95 latency under configurable load. The tool is open source, so you can run it yourself against competing platforms and compare apples-to-apples.
- Run your own benchmarks on your own workload. No third-party benchmark perfectly matches your use case. Once you've narrowed platforms using Artificial Analysis and MLPerf, request trial access and run GenAI-Perf against your actual models, sequence lengths, and concurrency levels. This is where you find the winner.
The rule: always demand reproducible methodology. If a provider claims "40% faster" but won't share the test setup, concurrency level, or hardware configuration, they're marketing, not benchmarking.
Complete Benchmarking Protocol Checklist
Use this checklist before committing to a platform:
- Load testing at actual concurrency - Test at your peak concurrent request count, not at concurrency=1. Measure TTFT, tokens/sec, and p95 latency at each concurrency level (10, 25, 50, 100+).
- Sequence length variety - Test short, medium, and long sequences. Measure separately so you know where each platform struggles. Some platforms collapse on long sequences due to memory fragmentation.
- Cold start measurement - Restart the platform and measure TTFT for the first request. Compare cold start (first request) vs. warm start (after 100 warm-up requests). If cold start is 10x worse, serverless workloads will suffer.
- P95 and p99 latency reporting - Don't accept median latency alone. Insist on p95 and p99 percentile latency. If p95 is 3x median, the platform has concurrency issues.
- Third-party validation - Cross-check results against Artificial Analysis, MLPerf, or GenAI-Perf. If your results wildly diverge from public benchmarks, investigate why.
- Cost-per-inference calculation - Convert hardware cost and throughput into actual cost per inference. Don't optimize for one metric; optimize for cost per inference under your specific concurrency and sequence-length profile.
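The percentile checks in this checklist are easy to automate. A minimal sketch: compute median, p95, and p99 from a latency sample and flag the "p95 more than 3x median" red flag (the sample data below is synthetic, purely for illustration).

```python
import statistics

def latency_report(latencies_ms: list[float]) -> dict:
    """Summarize a latency sample per the checklist: median, p95, p99."""
    xs = sorted(latencies_ms)
    def pct(p: float) -> float:
        # Nearest-rank style percentile, clamped to the sample.
        return xs[min(len(xs) - 1, int(p * (len(xs) - 1)))]
    median = statistics.median(xs)
    report = {"median_ms": median, "p95_ms": pct(0.95), "p99_ms": pct(0.99)}
    # Checklist red flag: a tail more than 3x the median suggests contention.
    report["tail_warning"] = report["p95_ms"] > 3 * median
    return report

# A healthy-looking sample vs. one with a heavy tail (synthetic data).
healthy = latency_report([100 + i for i in range(100)])
tailed = latency_report([100] * 90 + [900] * 10)
print(healthy["tail_warning"], tailed["tail_warning"])  # prints: False True
```

Feed it the raw per-request latencies from your load test rather than a provider's pre-aggregated dashboard numbers, so the tail can't be averaged away.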
Real-World Benchmark Performance on Modern Infrastructure
GMI Cloud published GenAI-Perf results for Llama 3 70B FP8 on H200 infrastructure: in internal testing, measured TTFT was 40% faster than comparable AWS instances running A100. On the unified MaaS model library, Llama 3 70B sustains competitive throughput under production load. GMI Cloud also tested DeepSeek V3 on H200, achieving strong throughput with no quantization loss. Because the platform is an NVIDIA Preferred Partner built on NVIDIA Reference Platform Cloud Architecture, its results are reproducible and consistently optimized across deployments.
Colin Mo
