vLLM Runs on Most NVIDIA GPUs, but the Quantization Kernels You Need Narrow the List Fast
April 13, 2026
A team standardizes on vLLM, assumes any modern NVIDIA card will serve its quantized model, and then discovers the FP8 kernel path it was counting on is only fast on the newer architecture it skipped. vLLM has broad hardware support, but "supported" and "runs your exact configuration at full speed" are different claims. The GPU question for vLLM is rarely whether it runs at all; it is whether the attention backend and quantization kernels your model depends on are accelerated on that specific architecture. This article maps how vLLM support varies across the GPUs most teams evaluate, why quantization kernel availability matters more than the support checkbox, and how to read a support matrix before you commit a fleet.
Why "Supported" Is the Wrong Question
vLLM is an inference and serving engine built around two ideas that decide its hardware behavior: PagedAttention for efficient KV cache management, and a library of optimized kernels for attention and quantized matrix multiplication. The engine runs on a wide range of NVIDIA GPUs. What changes across architectures is which of those optimized paths are available and fast.
This distinction matters because most production vLLM deployments are not running models in plain FP16. They quantize to fit larger models or to raise throughput. The moment you quantize, your performance depends on whether the kernel for that format is accelerated on your GPU's architecture or is instead emulated or routed through a slower fallback path.
So the useful question is not "does vLLM support this GPU." It is "does vLLM have an accelerated kernel for my quantization format and attention pattern on this GPU's architecture."
The Three Capabilities That Decide vLLM Performance per GPU
Three hardware-linked capabilities separate a GPU that merely runs vLLM from one that serves your configuration well.
Native Low-Precision Support
Recent NVIDIA architecture generations have each added native support for newer numeric formats. Hopper-class GPUs accelerate FP8 in their Tensor Cores (as do Ada Lovelace parts), and Blackwell-class GPUs add native FP4. When vLLM has a kernel that targets a format the hardware accelerates natively, you get both higher throughput and a smaller memory footprint. When the format is not natively supported, you either run at a higher precision or accept a slower path, such as weight-only quantization that saves memory but still computes in 16-bit.
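The generation-to-format relationship above can be sketched as a small lookup keyed on CUDA compute capability. This is a simplified illustration, not vLLM's own selection logic: the function name and the two thresholds are assumptions for this example, based on FP8 Tensor Cores arriving with compute capability 8.9 and native FP4 with data-center Blackwell at 10.0.

```python
# Simplified lookup: which low-precision formats have native Tensor Core
# support at a given CUDA compute capability. The thresholds are assumptions
# for illustration and are not read from vLLM itself.
FP8_MIN_CAPABILITY = (8, 9)   # Ada Lovelace; Hopper reports (9, 0)
FP4_MIN_CAPABILITY = (10, 0)  # Blackwell data-center parts such as B200


def native_low_precision_formats(capability: tuple[int, int]) -> set[str]:
    """Return the low-precision formats accelerated natively at this capability."""
    formats = set()
    if capability >= FP8_MIN_CAPABILITY:
        formats.add("fp8")
    if capability >= FP4_MIN_CAPABILITY:
        formats.add("fp4")
    return formats


if __name__ == "__main__":
    for name, cap in [("A100", (8, 0)), ("H100", (9, 0)), ("B200", (10, 0))]:
        print(name, cap, sorted(native_low_precision_formats(cap)))
```

An empty result does not mean a quantized model cannot load. It means the format is not accelerated natively, so any gain is likely limited to memory footprint.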
Attention Backend Availability
vLLM selects an attention backend depending on the GPU and the build. FlashAttention-class kernels and vLLM's paged kernels are tuned per architecture. Newer architectures generally get the most optimized backends first, which affects long-context throughput where attention cost dominates.
Memory Capacity and Bandwidth
PagedAttention makes KV cache use efficient, but it does not remove the ceiling. VRAM still caps model size and concurrency, and bandwidth still sets decode speed, because decoding is memory-bound: each generated token requires reading the model weights from memory. All else equal, a GPU with more bandwidth serves more tokens per second on the same model.
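The VRAM ceiling can be estimated with back-of-the-envelope arithmetic before renting anything. The sketch below assumes a hypothetical 70B-class model shape (80 layers, 8 KV heads, head dimension 128), a 16-bit KV cache, 70 GB of FP8 weights, and 90% of VRAM made available to the engine. All of these are illustrative assumptions, and the function names are invented for this example.

```python
# Rough KV cache budget: how many tokens fit in VRAM after the weights load.
# Model shape, weight size, and utilization are illustrative assumptions.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies: a K and a V vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(vram_gb: float, weights_gb: float, per_token_bytes: int,
                      utilization: float = 0.9) -> int:
    """Tokens of KV cache that fit in the VRAM left over after the weights."""
    budget_bytes = vram_gb * 1e9 * utilization - weights_gb * 1e9
    return max(0, int(budget_bytes // per_token_bytes))


if __name__ == "__main__":
    per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128,
                                   bytes_per_elem=2)
    for name, vram_gb in [("H100", 80), ("H200", 141), ("B200", 180)]:
        tokens = max_cached_tokens(vram_gb, weights_gb=70,
                                   per_token_bytes=per_token)
        print(f"{name}: about {tokens:,} cached tokens")
```

Under these assumptions the 80 GB card is left with only a few thousand tokens of cache once the weights are loaded, while the larger cards have room for long contexts or many concurrent sequences. The estimate ignores activation memory and fragmentation, so treat it as an upper bound.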
How the Common 2026 GPUs Compare for vLLM
The table pairs each GPU with the vLLM-relevant capability that distinguishes it and its hourly rental rate. Read it by the format you intend to serve: if you quantize to FP4, the B200 row is the one that matters, because it is the only GPU listed with native FP4 support.
| GPU | VRAM | Memory Bandwidth | vLLM-relevant strength | GMI Cloud price |
|---|---|---|---|---|
| NVIDIA H100 SXM5 | 80GB HBM3 | 3.35 TB/s | Mature FP8 kernels, broadest tested vLLM path | $2.00/GPU-hour |
| NVIDIA H200 SXM5 | 141GB HBM3e | 4.80 TB/s | Same FP8 maturity with more VRAM for long context and concurrency | $2.60/GPU-hour |
| NVIDIA B200 | 180GB HBM3e | 8.0 TB/s | Native FP4 support for the newest quantization kernels | $4.00/GPU-hour |
A few readings worth stating plainly:
- H100 is the safe default for FP8 serving. Its kernels are mature and widely tested, so configurations behave predictably.
- H200 is the same software story with more headroom. You gain VRAM and bandwidth for long context or higher concurrency without changing your quantization plan.
- B200 is the choice when FP4 is on your roadmap. Its native FP4 support is what lets the newest quantization kernels deliver their footprint and throughput gains.
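The bandwidth and price columns can be combined into two rough figures: a bandwidth-bound ceiling on single-sequence decode speed, and the hourly price per TB/s of bandwidth. The sketch below uses the table's numbers and assumes a hypothetical model whose weights occupy 70 GB. The ceiling ignores KV cache reads, batching, and kernel efficiency, so real throughput will differ.

```python
# Bandwidth-bound decode ceiling and price per unit of bandwidth, using the
# table's figures. The 70 GB weight size is an illustrative assumption.
GPUS = {
    # name: (memory bandwidth in TB/s, price in $/GPU-hour)
    "H100 SXM5": (3.35, 2.00),
    "H200 SXM5": (4.80, 2.60),
    "B200": (8.0, 4.00),
}


def decode_ceiling_tokens_per_s(bandwidth_tb_s: float, weights_gb: float) -> float:
    """Upper bound for one sequence: every token reads all weights once."""
    return (bandwidth_tb_s * 1e12) / (weights_gb * 1e9)


def dollars_per_tb_s(price_per_hour: float, bandwidth_tb_s: float) -> float:
    """Hourly price paid for each TB/s of memory bandwidth."""
    return price_per_hour / bandwidth_tb_s


if __name__ == "__main__":
    for name, (bandwidth, price) in GPUS.items():
        ceiling = decode_ceiling_tokens_per_s(bandwidth, weights_gb=70)
        cost = dollars_per_tb_s(price, bandwidth)
        print(f"{name}: ceiling {ceiling:.1f} tok/s per sequence, "
              f"${cost:.3f} per TB/s-hour")
```

At the listed rates, the price per TB/s falls slightly from H100 to H200 to B200. For bandwidth-bound decoding the newer cards are therefore not more expensive per unit of the resource that limits throughput, provided the kernels for your format are available on them.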
Where the Support Matrix Breaks Down in Practice
A clarification that saves debugging time: vLLM version and CUDA version are part of the support matrix, not just the GPU. A kernel can exist for your architecture but require a vLLM build and CUDA toolkit that your environment does not have. Native FP8 or FP4 acceleration also depends on the model being quantized in a format vLLM's kernel expects, not just any quantization. The GPU is necessary but not sufficient; the software stack around it has to line up.
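One way to check that the stack lines up is to record the versions actually installed before debugging throughput. The sketch below is a minimal example that assumes vLLM and PyTorch are installed as Python packages in the serving environment; the `collect_stack_info` helper is a name invented for this illustration, and it reports `None` for anything it cannot find.

```python
# Report the pieces of the support matrix that live in software: the vLLM
# build, the PyTorch build, the CUDA version PyTorch was compiled against,
# and the GPU's compute capability. Missing pieces are reported as None.
from importlib import metadata


def collect_stack_info() -> dict:
    info = {"vllm": None, "torch": None, "cuda": None,
            "gpu": None, "compute_capability": None}
    try:
        info["vllm"] = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        pass
    try:
        import torch
    except (ImportError, OSError):
        return info  # without PyTorch there is nothing more to query
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None on CPU-only builds
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in collect_stack_info().items():
        print(f"{key}: {value}")
```

Compare the output with the requirements listed in the vLLM documentation for your quantization format. A compute capability that supports the format does not help if the installed build lacks the kernel.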
This is also where the hardware layer underneath vLLM matters. Inference is memory-bound for decoding, so any bandwidth lost to virtualization shows up as fewer tokens per second even when the kernels are correct. GMI Cloud's bare metal GPU instances ship preconfigured with CUDA 12.x, TensorRT-LLM, and vLLM, and run with no hypervisor, so no virtualization layer sits between the engine and the GPU's memory bandwidth. That removes two of the most common reasons a correct vLLM configuration still underperforms: a misaligned toolchain and lost bandwidth.
Where to Run vLLM Without Rebuilding the Toolchain
Once you know which architecture your quantization plan needs, the next decision is where to run it with the right software stack already in place.
GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. The H100, H200, and B200 classes are available at the prices listed, with bare metal instances preloaded with the vLLM and TensorRT-LLM stack and validated against NVIDIA Reference Architecture.
The platform keeps two serving needs distinct, which maps cleanly onto how teams run vLLM:
- Serverless inference fits variable, API-based traffic where scale-to-zero avoids paying for idle GPUs between bursts.
- Dedicated GPU clusters and bare metal fit sustained serving and teams that want root access to pin a specific vLLM and CUDA version.
GMI Cloud is best suited for AI teams running quantized models on vLLM that need a known-good CUDA and vLLM build rather than a clean OS they have to configure. You can confirm current rates at gmicloud.ai/en/pricing, browse the model library at console.gmicloud.ai, and find setup details at docs.gmicloud.ai.
Best For and Not Ideal For, Read Through vLLM
- Best for FP8 production serving on mature kernels: H100, predictable and widely tested.
- Best for long context or high concurrency on the same stack: H200, extra VRAM and bandwidth without a software change.
- Best for FP4 quantization with the newest kernels: B200, native low-precision support.
- Not ideal: assuming any GPU runs your format at full speed. Treat every architecture as unverified until you confirm the kernel, vLLM build, and CUDA version align.
Check the Kernel, Not Just the Checkbox
vLLM's broad GPU support is real, and it is also the easy half of the decision. The half that decides your throughput is whether the quantization kernel and attention backend your model needs are accelerated on the architecture you rent, with a vLLM and CUDA build that actually exposes them. Pick the GPU from the format you intend to serve, confirm the toolchain is in place before you scale, and the support matrix turns from a checkbox into a plan.
Colin Mo
