
How Much GPU Memory and Bandwidth LLM Inference Actually Needs, Calculated From the KV Cache Up

April 13, 2026

Most GPU-sizing advice starts with the model's parameter count and stops there. That number tells you whether the weights fit, but it says nothing about the part of memory that grows while the model runs. The key-value (KV) cache expands with every token processed, every concurrent request, and every increase in context length, and it is what turns a comfortable deployment into an out-of-memory crash at peak traffic. The honest way to size an inference GPU is to add the weight footprint and the worst-case KV cache, then check whether bandwidth can feed that total fast enough. This article walks the calculation from the cache up and shows where the 80GB and 141GB tiers actually land.

The Two Things Sitting in GPU Memory

During inference, GPU memory holds two distinct things, and only one of them is fixed. The first is the model weights, a constant set by parameter count and precision. The second is the KV cache, which is dynamic and grows with the work you ask the model to do. Sizing a GPU by weights alone ignores the part of memory that actually moves.

Weights Are the Fixed Floor

Weight footprint follows a simple rule: parameters times bytes per parameter. A 70B model in FP16 uses two bytes per parameter, roughly 140GB for weights alone. Quantize the same model to FP8 and that footprint roughly halves to around 70GB. Quantize to FP4 and it halves again. Precision is the first lever, because it sets the floor before any request arrives.
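The rule above can be sketched in a few lines of Python. This is a minimal illustration, not a vendor tool: the function name `weight_footprint_gb` and the `BYTES_PER_PARAM` table are hypothetical, gigabytes are decimal (1 GB = 1e9 bytes), and the small overhead that real quantization formats add for scale factors is ignored.

```python
# Bytes per parameter at each precision (FP4 packs two parameters per byte).
# Assumption: quantization metadata such as scale factors is ignored.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_footprint_gb(num_params: float, precision: str) -> float:
    """Return the weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in ("fp16", "fp8", "fp4"):
        gb = weight_footprint_gb(70e9, precision)
        print(f"70B at {precision}: about {gb:.0f} GB of weights")
```

For a 70B model this gives roughly 140GB, 70GB, and 35GB, matching the halving described above.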

The KV Cache Is the Variable That Surprises Teams

The KV cache stores the attention keys and values for every token already processed, so the model does not recompute them on each new token. Its size scales with four things at once: the number of layers, the width of the keys and values in each layer (the hidden dimension in standard multi-head attention, smaller when the model uses grouped-query attention), the context length, and the number of concurrent sequences. Double the context and the cache roughly doubles. Double the concurrency and it doubles again. This is why a deployment that runs fine in testing at one user and short prompts falls over at fifty users and long prompts. The weights never changed. The cache did.

The practical takeaway is that KV cache, not weights, usually decides how much headroom you need above the model floor. A team serving long documents to many users at once can see the cache rival or exceed the weight footprint.
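A back-of-the-envelope version of the cache calculation might look like the sketch below. The function name and parameters are hypothetical, and the example configuration (80 layers, 8 KV heads of dimension 128, an FP16 cache) is an assumed 70B-class shape with grouped-query attention, not a figure from any particular deployment. It also assumes every sequence is filled to the full context length, which is the worst case you should size for.

```python
def kv_cache_gb(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_len: int,
    concurrent_seqs: int,
    bytes_per_value: float = 2.0,  # assumption: FP16 cache entries
) -> float:
    """Worst-case KV cache size in decimal gigabytes (1 GB = 1e9 bytes)."""
    # Factor of 2: each token stores one key vector and one value vector per layer.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_len * concurrent_seqs / 1e9


if __name__ == "__main__":
    # Assumed 70B-class shape: 80 layers, 8 KV heads, head dimension 128.
    for users in (1, 50):
        gb = kv_cache_gb(80, 8, 128, context_len=8192, concurrent_seqs=users)
        print(f"{users} sequence(s) at 8192 tokens: about {gb:.1f} GB of KV cache")
```

Under these assumptions a single 8K-token sequence needs about 2.7GB of cache, while fifty of them need about 134GB, which is on the order of the FP16 weight footprint itself.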

Reading the Memory Tiers Against Real Workloads

The table below maps the two common single-card tiers to the work each absorbs, using VRAM and bandwidth as the quantifiable columns, because those two numbers set capacity and speed respectively.

GPU | VRAM | Memory bandwidth | GMI Cloud price | Headroom profile
NVIDIA H100 SXM5 | 80GB HBM3 | 3.35 TB/s | $2.00/GPU-hour | 70B quantized, or smaller models at high concurrency
NVIDIA H200 SXM5 | 141GB HBM3e | 4.80 TB/s | $2.60/GPU-hour | 70B in higher precision, long context, large KV cache

A few readings are worth making explicit.

  • 80GB is enough only after quantization for 70B-class models. An FP16 70B does not fit; an FP8 version at around 70GB fits but leaves little room for a large cache. The H100 is the right floor for quantized mid-size serving.
  • The jump to 141GB is bought for KV cache, not just weights. The H200's extra 61GB is what absorbs long prompts and high concurrency, which is exactly where the cache balloons.
  • Bandwidth sets the speed of whatever fits. Once the model and cache fit, the H200's 4.80 TB/s streams weights and cache to the compute units faster than the H100's 3.35 TB/s, roughly 43% more bandwidth, which matters most for latency-sensitive serving.
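To put a rough number on the bandwidth reading, one common simplification is to treat token generation as memory-bandwidth bound: each decoding step has to read the weights and the KV cache once, so bandwidth divided by that byte count gives a ceiling on steps per second. The sketch below applies that simplification. The function name is hypothetical, the 70GB of weights and 5GB of cache are illustrative inputs, and real throughput will sit below this ceiling.

```python
def decode_steps_per_second(
    bandwidth_tb_s: float, weights_gb: float, kv_cache_gb: float
) -> float:
    """Upper bound on decoding steps per second.

    Assumption: decode is memory-bandwidth bound and every step reads
    the full weights and the full KV cache exactly once.
    """
    bytes_per_step = (weights_gb + kv_cache_gb) * 1e9
    return bandwidth_tb_s * 1e12 / bytes_per_step


if __name__ == "__main__":
    # Bandwidth figures from the table; memory figures are illustrative.
    for name, bandwidth in (("H100", 3.35), ("H200", 4.80)):
        ceiling = decode_steps_per_second(bandwidth, weights_gb=70, kv_cache_gb=5)
        print(f"{name}: at most about {ceiling:.0f} decoding steps per second")
```

Because both cards read the same bytes in this example, the ratio between the two ceilings is simply the ratio of their bandwidths.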

A Rough Sizing Procedure You Can Run in Your Head

You do not need a profiler to get a first estimate. The order of operations is what matters.

  • Start with weights: parameters times bytes per parameter at your target precision.
  • Add a KV cache estimate: scale it up with your maximum context length and your peak concurrent requests, not your average.
  • Compare the total to VRAM, then leave a margin for fragmentation and framework overhead.
  • Only after capacity clears, check that bandwidth meets your tokens-per-second target.

Capacity is a hard wall and bandwidth is a speed dial. If the total exceeds VRAM, the deployment fails outright. If bandwidth is low, it merely runs slower. Sizing in that order keeps you from buying speed you cannot use on a model that does not fit.
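The ordering of those checks can be written down as a short sketch. Everything here is illustrative: the names `SizingResult` and `size_deployment` are hypothetical, the 10% margin is an assumed placeholder for fragmentation and framework overhead, and the speed check uses the simplification that each decoding step reads the weights and the cache once.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SizingResult:
    total_gb: float
    usable_gb: float
    fits: bool
    meets_speed: Optional[bool]  # None when the capacity check already failed


def size_deployment(
    weights_gb: float,
    worst_case_kv_gb: float,
    vram_gb: float,
    bandwidth_tb_s: float,
    target_steps_per_s: float,
    margin: float = 0.10,  # assumption: reserve 10% of VRAM for overhead
) -> SizingResult:
    # Steps 1 and 2: fixed weights plus worst-case (not average) KV cache.
    total_gb = weights_gb + worst_case_kv_gb
    # Step 3: capacity is the hard wall, so check it first.
    usable_gb = vram_gb * (1 - margin)
    if total_gb > usable_gb:
        return SizingResult(total_gb, usable_gb, False, None)
    # Step 4: only now look at bandwidth, using a bandwidth-bound ceiling.
    ceiling = bandwidth_tb_s * 1e12 / (total_gb * 1e9)
    return SizingResult(total_gb, usable_gb, True, ceiling >= target_steps_per_s)


if __name__ == "__main__":
    print(size_deployment(70, 20, vram_gb=80, bandwidth_tb_s=3.35, target_steps_per_s=30))
    print(size_deployment(70, 20, vram_gb=141, bandwidth_tb_s=4.80, target_steps_per_s=30))
```

With 70GB of weights and an assumed 20GB worst-case cache, the 80GB card fails at the capacity step and the speed question is never asked, while the 141GB card clears capacity and then passes the bandwidth check.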

The Boundary Between Capacity and Bandwidth

These two specs fail in different ways and should not be traded against each other casually. A capacity shortfall produces an out-of-memory error and stops inference completely. A bandwidth shortfall produces slower token generation while everything still runs. Teams that optimize for bandwidth while ignoring peak KV cache ship a fast deployment that crashes under load. Teams that optimize for capacity while ignoring bandwidth ship a stable deployment that feels sluggish. The correct sequence is capacity first, bandwidth second, every time.

Where the Memory Tiers Are Available for Inference

Once your calculation points to an 80GB or 141GB card, the next question is where to run it without paying for memory your sizing showed you do not need.

GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. The H100 at $2.00/GPU-hour and H200 at $2.60/GPU-hour are available now, validated against NVIDIA Reference Architecture and backed by a 99.99% platform availability SLA. GMI Cloud's bare metal instances run with no hypervisor, so your workload receives the advertised memory bandwidth that feeding the KV cache depends on, rather than losing part of it to virtualization overhead.

For workloads whose concurrency swings unpredictably, and concurrency is the variable that drives KV cache size, GMI Cloud's serverless inference scales to zero so you are not paying for idle headroom between traffic peaks. You can confirm current pricing and the model library at gmicloud.ai/en/pricing and console.gmicloud.ai before locking a tier.

Matching the Tier to the Cache, Not the Parameter Count

The right memory tier follows from your worst-case cache, not your model's headline size.

  • Best for quantized 70B at moderate concurrency: H100, where 80GB holds an FP8 model with a modest cache.
  • Best for long context or high concurrency: H200, where 141GB absorbs a large KV cache the H100 cannot.
  • Best for latency-sensitive serving of whatever fits: H200, for the higher bandwidth.
  • Not ideal on either single-card tier: very large models in full precision. That is the signal to quantize or move to B200-class capacity.
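The capacity side of this matching can be sketched as a lookup over the two tiers in the table. The names `TIERS` and `pick_tier` are hypothetical, the 10% margin is an assumed allowance for fragmentation and framework overhead, and the sketch deliberately ignores the latency case, where the higher-bandwidth card may be worth choosing even when the cheaper one fits.

```python
from typing import Optional

# (name, VRAM in GB, USD per GPU-hour), taken from the article's table.
TIERS = [
    ("H100 SXM5", 80, 2.00),
    ("H200 SXM5", 141, 2.60),
]


def pick_tier(
    weights_gb: float, worst_case_kv_gb: float, margin: float = 0.10
) -> Optional[str]:
    """Return the cheapest tier that holds weights plus worst-case cache.

    Returns None when neither single card is large enough, which is the
    signal to quantize or look at larger-capacity hardware.
    """
    total_gb = weights_gb + worst_case_kv_gb
    for name, vram_gb, _price in sorted(TIERS, key=lambda tier: tier[2]):
        if total_gb <= vram_gb * (1 - margin):
            return name
    return None


if __name__ == "__main__":
    print(pick_tier(70, 1))    # FP8 70B with a small cache
    print(pick_tier(70, 40))   # same model, large cache
    print(pick_tier(140, 20))  # FP16 70B
```

The same 70GB model lands on different cards depending only on the cache figure, which is the point of sizing from the cache rather than the parameter count.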

Size the Cache Before You Size the Card

The parameter count tells you the floor, and the floor is the easy part. The reliable estimate comes from the cache: pick your maximum context and your peak concurrency, add that to the weight footprint, and read the VRAM column against the total rather than the model name. The deployments that fail in production almost always failed the cache math, not the weight math. Run that calculation first and the GPU choice falls out of it.

Colin Mo
