Choosing an NVIDIA GPU Instance for Low-Latency LLM Serving Is a Memory Decision Before It Is a Compute Decision
April 13, 2026
A team chasing lower latency reaches for the GPU with the highest TFLOPS, deploys a mid-sized model, and sees the per-token speed barely move. The instinct to buy raw compute misreads the workload. Low-latency LLM serving lives and dies on whether the model fits in memory and how fast that memory feeds the chip during generation. For interactive serving, memory bandwidth sets your token speed and VRAM sets which models you can serve at all, while peak compute is rarely the binding limit. This article shows how NVIDIA instance specs map to latency, compares the four cards most teams evaluate, and gives a selection rule based on the constraint you actually have.
Why Memory, Not Compute, Sets the Latency Floor
LLM inference has two phases with different bottlenecks. Prefill, which processes the prompt, leans on compute. Decode, which generates tokens one at a time, leans on memory bandwidth, because every token requires streaming the model's weights from memory through the compute units. Most interactive serving spends the user-visible time in decode, which is why bandwidth tends to correlate with tokens per second more directly than peak FLOPS does.
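The bandwidth argument can be turned into a rough ceiling. The sketch below assumes the simplest possible model of decode: each generated token streams every weight byte from memory exactly once, and nothing else (KV cache reads, kernel overhead, batching) is counted. The function name and the 70 GB example are illustrative, not part of any library.

```python
def decode_floor_ms_per_token(weight_gb: float, bandwidth_tb_s: float) -> float:
    """Lower bound on per-token decode time for one sequence.

    Assumption: every token reads all weights once, so the floor is
    weight bytes divided by memory bandwidth. Real latency is higher.
    """
    weight_bytes = weight_gb * 1e9
    bandwidth_bytes_per_s = bandwidth_tb_s * 1e12
    return weight_bytes / bandwidth_bytes_per_s * 1e3  # seconds -> milliseconds


# Example: 70 GB of weights on a card with 3.35 TB/s of memory bandwidth.
floor_ms = decode_floor_ms_per_token(70, 3.35)
print(f"{floor_ms:.1f} ms/token, at most {1000 / floor_ms:.0f} tokens/s per sequence")
# prints: 20.9 ms/token, at most 48 tokens/s per sequence
```

Peak FLOPS does not appear anywhere in this estimate, which is the point: under this assumption, doubling compute leaves the floor unchanged while doubling bandwidth halves it.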
VRAM sets a harder limit. The model's weights, plus the key-value (KV) cache that grows with context length and concurrency, must fit in GPU memory before any latency tuning is possible. Run out of memory and you are forced to shard across cards or evict cache, both of which add latency. So the selection order is capacity first, bandwidth second, compute third.
A Worked Sizing Example
Consider serving a 70B model. In FP16, weights alone need roughly 140GB, which exceeds a single 80GB H100 and forces multi-GPU sharding. Quantized to FP8, the same weights fit in about 70GB, which a single 141GB H200 holds comfortably with room left for a large KV cache from long prompts or many concurrent users. The extra capacity does not by itself make the model generate faster, but it lets you serve it on one card without the cross-GPU communication that sharding adds to latency. That is a memory decision producing a latency outcome.
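The weight arithmetic above is just parameter count times bytes per parameter. A minimal sketch, assuming 2 bytes per parameter for FP16 and 1 byte for FP8, and counting weights only (no KV cache or runtime overhead); the helper names are hypothetical:

```python
# Assumed storage cost per parameter for each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0}


def weight_gb(params_billion: float, precision: str) -> float:
    """Weight footprint in GB: 1e9 params * bytes per param / 1e9 bytes per GB."""
    return params_billion * BYTES_PER_PARAM[precision]


def weights_fit(params_billion: float, precision: str, vram_gb: float) -> bool:
    """True if the weights alone fit in VRAM. Cache and overhead are not counted."""
    return weight_gb(params_billion, precision) <= vram_gb


for precision, vram in [("fp16", 80), ("fp8", 80), ("fp8", 141)]:
    size = weight_gb(70, precision)
    print(f"70B {precision}: {size:.0f} GB, fits in {vram} GB: {weights_fit(70, precision, vram)}")
```

Note that 70 GB of FP8 weights does pass the check on an 80 GB card, but with only about 10 GB left over, which is why the cache has to be sized before the card is chosen.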
The KV cache is the part teams forget to size. Each concurrent request holds a slice of cache proportional to its context length, and that slice grows token by token as the conversation runs. A rough rule: a 70B model at FP8 might spend a few hundred kilobytes of cache per token, so a 4,000-token context costs on the order of a gigabyte per active session, and 32 concurrent long-context sessions can claim tens of gigabytes on top of the 70GB of weights. On an 80GB H100 that headroom vanishes quickly and forces either eviction, which adds latency, or sharding, which adds interconnect cost. On a 141GB H200 the same workload stays on one card. This is why the right question is not which GPU is fastest in isolation but which one holds your weights and your peak concurrent cache without spilling.
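The rough rule above can be checked with a small calculation. The sketch assumes a Llama-style 70B configuration with grouped-query attention (80 layers, 8 KV heads, head dimension 128) and a 16-bit cache; these values are assumptions for illustration, so substitute the numbers from your own model's config.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Cache bytes one token adds: a key and a value vector per KV head, per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Assumed model shape (hypothetical 70B with grouped-query attention, 16-bit cache).
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)

context_tokens = 4000
sessions = 32
weights_gb = 70  # FP8 weights from the worked example

per_session_gb = per_token * context_tokens / 1e9
cache_gb = per_session_gb * sessions
total_gb = weights_gb + cache_gb

print(f"{per_token / 1024:.0f} KiB per token")           # 320 KiB per token
print(f"{per_session_gb:.2f} GB per session")            # 1.31 GB per session
print(f"{cache_gb:.1f} GB of cache, {total_gb:.1f} GB total")  # 41.9 GB of cache, 111.9 GB total
```

Under these assumptions the total lands well above 80 GB and below 141 GB, which matches the single-card outcome described above. A model without grouped-query attention, or a longer context, would push the cache figure up considerably.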
How the Four Common Instances Compare for Serving
These NVIDIA cards cover the range from single-card mid-model serving to rack-scale frontier models. Read the table by your binding constraint: capacity if your model is large, bandwidth if your priority is token speed.
| GPU | VRAM | Memory bandwidth | Low-latency serving fit | GMI Cloud price |
|---|---|---|---|---|
| NVIDIA H100 SXM5 | 80GB HBM3 | 3.35 TB/s | 7B to 70B, balanced latency and cost | $2.00/GPU-hour |
| NVIDIA H200 SXM5 | 141GB HBM3e | 4.80 TB/s | Long context, large batch, single-card 70B+ | $2.60/GPU-hour |
| NVIDIA B200 | 180GB HBM3e | 8.0 TB/s | Very large models, highest single-card throughput | $4.00/GPU-hour |
| NVIDIA GB200 NVL72 | 13.5TB pooled (72 GPUs) | 130 TB/s NVLink | Rack-scale frontier models | $8.00/GPU-hour |
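The bandwidth column is the one that moves inter-token latency. The jump from H100's 3.35 TB/s to H200's 4.80 TB/s, roughly 43% more bandwidth, is the difference most directly felt as faster streaming on memory-bound decoding. The GB200 NVL72 figure is aggregate NVLink interconnect bandwidth across the rack rather than per-GPU memory bandwidth, so it is not directly comparable to the other three rows.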
Reading the Table for a Latency Goal
A few interpretations are worth stating directly:
- H100 is the balanced low-latency default. At 80GB and 3.35 TB/s, it serves 7B to 70B models with good token speed without overspending on memory.
- H200 is the single-card upgrade for large models and long context. The extra VRAM absorbs a big KV cache so you avoid latency-adding sharding, and the higher bandwidth lifts token speed.
- B200 is the throughput tier. Its 8.0 TB/s of bandwidth, together with the lower-precision formats its newer architecture supports, serves very large models at the highest single-card token rates.
- GB200 NVL72 is a different category. It pools 72 GPUs over NVLink for frontier models that no single card can hold.
Where These Instances Run for Production Serving
Knowing the card is half the decision; the other half is running it without losing bandwidth to overhead. GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. All four cards above are available at the listed prices, validated against NVIDIA Reference Architecture and backed by a 99.99% platform availability SLA.
GMI Cloud's bare metal instances run with no hypervisor, delivering 100% of the advertised memory bandwidth that token generation depends on, which matters precisely because bandwidth is the latency lever. GMI Cloud is best suited for teams that have sized their model to a specific NVIDIA card and want full-bandwidth serving without losing throughput to virtualization. The bare metal images come preconfigured with CUDA 12.x, TensorRT-LLM, and vLLM, so teams optimizing low-latency serving start from a tuned stack rather than building one. You can confirm specs and current pricing at gmicloud.ai/en/pricing and review setup at docs.gmicloud.ai.
One Boundary That Changes the Hardware Choice
Latency-optimized serving and throughput-optimized serving are not the same target, and they favor different instances. A latency-first setup keeps batches small to minimize the wait any single request sees, which suits H100 or H200 with headroom. A throughput-first setup uses large batches to maximize aggregate tokens per second. That rewards B200's higher bandwidth, but the same card is wasted on a workload of a few latency-critical sessions. Decide which target you have before reading the price column.
Matching the Instance to the Constraint
The recommendation follows the binding limit, not a ranking.
- Best for balanced 7B to 70B serving: H100, where cost and latency meet.
- Best for long context or single-card 70B+: H200, where VRAM avoids sharding latency.
- Best for very large models at high throughput: B200, for top single-card token rates.
- Not ideal for small models on a budget: GB200 NVL72, whose pooled scale is wasted below frontier sizes.
Size the Memory, Then Read the Price
The reliable path starts from the model, not the spec sheet. Measure the model size, the context length, and the concurrency you need to serve, find the smallest card whose VRAM holds it, then pick the bandwidth that hits your token-speed target. Buying the highest-compute instance and backing into a use case is the surest way to pay for headroom your latency never benefits from.
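That procedure can be sketched as a small selection function. The card list uses the VRAM and bandwidth figures from the table in this article; the `Card` type, the `pick_card` name, and the bandwidth-bound speed ceiling (bandwidth divided by weight bytes, ignoring cache reads and overhead) are assumptions made for illustration, not a vendor tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Card:
    name: str
    vram_gb: float
    bandwidth_tb_s: float


# Single-card options, using the figures from the comparison table.
CARDS = [
    Card("H100 SXM5", 80, 3.35),
    Card("H200 SXM5", 141, 4.80),
    Card("B200", 180, 8.0),
]


def pick_card(weights_gb: float, kv_cache_gb: float, target_tokens_per_s: float) -> Optional[Card]:
    """Smallest card that holds weights plus peak cache and can reach the target.

    The speed check is an optimistic ceiling for one sequence, so a card that
    passes still needs to be benchmarked against the real workload.
    """
    needed_gb = weights_gb + kv_cache_gb
    for card in sorted(CARDS, key=lambda c: c.vram_gb):
        if card.vram_gb < needed_gb:
            continue  # capacity first: the model must fit
        ceiling = card.bandwidth_tb_s * 1e12 / (weights_gb * 1e9)
        if ceiling >= target_tokens_per_s:
            return card  # bandwidth second: the first fit that is fast enough
    return None  # nothing fits on one card: shard or shrink the model


print(pick_card(70, 5, 40))    # small cache: the 80 GB card is enough
print(pick_card(70, 42, 50))   # large cache: needs more VRAM
print(pick_card(70, 42, 100))  # higher speed target: needs more bandwidth
```

Peak compute never enters the function, which mirrors the selection order argued for throughout: capacity, then bandwidth, with FLOPS considered only after both are satisfied.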
Colin Mo
