When AI teams self-host LLMs, the GPU they choose determines the cost-performance ceiling of their entire inference stack. Pick a GPU with too little VRAM and the model won't load. Pick one with excessive compute but slow memory bandwidth and you'll bottleneck on token generation.
From GMI Cloud's experience provisioning H100 and H200 clusters for enterprise inference workloads, the most common mistake isn't choosing the "wrong" GPU. It's relying on synthetic benchmarks instead of matching hardware to your specific model size, concurrency requirements, and latency targets.
This guide covers GPU terminology, why GPU selection matters for inference, the three tiers of GPU hardware (consumer, workstation, data center), the five key selection factors, a model-to-GPU matching table, and GMI Cloud's tiered GPU product matrix for production inference.
GPU, Graphics Card, Accelerator: What's the Difference?
These terms get used interchangeably, but they mean different things. A GPU (graphics processing unit) is the chip itself, the silicon that runs parallel compute operations. A graphics card is the full board: GPU chip plus VRAM, power delivery, cooling, and PCIe or SXM connector.
An accelerator is a broader category that includes GPUs but also covers custom AI chips (Google TPUs, AWS Inferentia) designed specifically for tensor operations.
For inference, the practical distinction matters. When GMI Cloud provisions "H100 SXM" instances, that's the full accelerator module (GPU + 80 GB HBM3 + NVLink interface) installed in an SXM-format baseboard, not a standalone PCIe card.
SXM modules deliver higher memory bandwidth and GPU-to-GPU interconnect speeds than their PCIe counterparts, which directly impacts multi-GPU inference performance.
Why GPU Selection Makes or Breaks Inference Performance
A 7B parameter model in FP16 needs roughly 14 GB of VRAM just to load weights. A 70B model needs around 140 GB for weights alone, at or beyond the capacity of any single GPU once KV-cache is included, so you need FP8 quantization or tensor parallelism across multiple GPUs. A 400B+ model requires an entire multi-GPU node.
If your GPU doesn't have enough VRAM, you're forced into aggressive quantization (which may degrade output quality) or offloading to CPU memory (which destroys latency).
But VRAM isn't the only factor. Inference is memory-bandwidth-bound during token generation: the GPU reads model weights from VRAM for every output token.
An H100 SXM with 3.35 TB/s memory bandwidth (source: NVIDIA H100 Datasheet, 2023) generates tokens significantly faster than an RTX 4090 at 1.01 TB/s, even though their raw compute specs aren't as far apart.
GMI Cloud's operational data consistently shows that teams who size GPUs on VRAM alone miss the bandwidth bottleneck and end up with higher-than-expected latency.
Three Tiers of GPU Hardware for Inference
Consumer GPUs (GeForce RTX Series)
NVIDIA's RTX 4090 (24 GB GDDR6X, 1.01 TB/s bandwidth) and RTX 5090 (32 GB GDDR7, ~1.79 TB/s bandwidth) are the top consumer options. They're affordable relative to data-center GPUs and good for prototyping, fine-tuning small models, and running 7B-13B models locally.
The limitations: VRAM caps at 24-32 GB, no NVLink support for multi-GPU scaling, and NVIDIA's EULA restricts data-center deployment of consumer GPUs.
Workstation GPUs (RTX Pro / A-Series)
The RTX 6000 Ada (48 GB GDDR6, ~960 GB/s) and A6000 (48 GB GDDR6) bridge the gap. They offer more VRAM (48 GB fits ~20B models in FP16, or ~30B models in FP8/INT8) and carry no data-center deployment restrictions; the A6000 additionally supports NVLink for 2-GPU configurations (the Ada generation dropped NVLink). They're a reasonable choice for small-team inference servers running mid-size models.
The trade-off: lower memory bandwidth than HBM-based data-center GPUs, which limits token generation speed under high concurrency.
Data-Center GPUs (H100, H200, L4)
This is where production inference lives. The key specs:
H100 SXM
- VRAM: 80 GB HBM3
- Memory BW: 3.35 TB/s
- FP8 Compute: 1,979 TFLOPS
- TDP: 700W
- Source: NVIDIA H100 Datasheet (2023)
H200 SXM
- VRAM: 141 GB HBM3e
- Memory BW: 4.8 TB/s
- FP8 Compute: 1,979 TFLOPS
- TDP: 700W
- Source: NVIDIA H200 Product Brief (2024)
L4
- VRAM: 24 GB GDDR6
- Memory BW: 300 GB/s
- FP8 Compute: 242 TFLOPS
- TDP: 72W
- Source: NVIDIA L4 Datasheet (2023)
H200's 141 GB of VRAM can hold a 70B model's FP16 weights (~140 GB) on a single GPU, though that leaves almost no headroom for KV-cache; in practice, FP8 makes single-GPU 70B serving comfortable. Its 4.8 TB/s bandwidth delivers up to 1.9x inference speedup on Llama 2 70B versus H100 (NVIDIA official figures: TensorRT-LLM, FP8, batch 64, 128 input / 2048 output tokens). L4 is the energy-efficient option for smaller models and cost-sensitive deployments at just 72W TDP.
GMI Cloud's recommendation: H100 for general production inference, H200 for large-model or memory-bandwidth-critical workloads.
Five Key Factors for GPU Selection
1. VRAM (Top Priority)
VRAM determines which models you can load. Rule of thumb: model parameters x 2 bytes (FP16) = minimum VRAM. A 70B model needs ~140 GB in FP16, or ~70 GB in FP8/INT8. Always account for KV-cache overhead, which grows with batch size and sequence length.
For multi-tenant serving, budget 20-40% extra VRAM beyond model weights for KV-cache.
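As a quick sanity check, the rule of thumb above can be wrapped in a few lines of Python. The 30% KV-cache buffer here is an illustrative midpoint of the 20-40% range, not a fixed constant:

```python
def vram_gb(params_b: float, bytes_per_param: float, kv_overhead: float = 0.3) -> float:
    """Estimate serving VRAM: weights plus a KV-cache buffer (default 30%)."""
    weights_gb = params_b * bytes_per_param  # billions of params -> GB
    return weights_gb * (1 + kv_overhead)

# Llama 3.3 70B: ~182 GB in FP16, ~91 GB in FP8 (both with headroom)
print(round(vram_gb(70, 2), 1), round(vram_gb(70, 1), 1))  # 182.0 91.0
```

The FP16 figure explains why 2x H100 (160 GB) is the floor for 70B serving, while FP8 brings the same model within reach of a single H200.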
2. Memory Bandwidth (Critical for Token Generation)
Once the model is loaded, token generation speed is bounded by how fast the GPU reads weights from VRAM. H200's 4.8 TB/s versus L4's 300 GB/s means a 16x difference in theoretical memory read speed. For latency-sensitive applications, memory bandwidth matters more than raw compute TFLOPS.
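A back-of-envelope model of this bound: at batch size 1, every generated token requires streaming all weights from VRAM once, so bandwidth divided by weight size gives a hard ceiling on per-request tokens per second (real engines batch requests, so aggregate throughput is higher, but single-request decode speed follows this curve):

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, params_b: float, bytes_per_param: float) -> float:
    """Batch-1 upper bound: tokens/sec = bandwidth / bytes of weights read per token."""
    return bandwidth_gb_s / (params_b * bytes_per_param)

# 70B model in FP8 (70 GB of weights):
print(round(decode_ceiling_tok_s(4800, 70, 1)))  # H200 (4.8 TB/s): ~69 tok/s ceiling
print(round(decode_ceiling_tok_s(300, 70, 1)))   # L4 (300 GB/s):   ~4 tok/s ceiling
```

The same 16x bandwidth ratio shows up directly in the ceilings, which is why compute TFLOPS alone is a poor predictor of generation latency.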
3. Compute Throughput (FP8/FP16)
FP8 support on H100/H200 effectively doubles inference throughput versus FP16 with minimal quality loss for most LLMs. If your serving engine supports FP8 (vLLM and TensorRT-LLM both do), this is a free performance gain on compatible hardware.
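As a sketch of what enabling this looks like with vLLM's offline API (the model name is illustrative, and this requires an FP8-capable GPU such as H100/H200 plus a recent vLLM build):

```python
from vllm import LLM, SamplingParams

# quantization="fp8" on Hopper-class GPUs halves weight memory and roughly
# doubles throughput versus FP16; model name here is an illustrative choice.
llm = LLM(model="Qwen/Qwen3-32B", quantization="fp8", max_model_len=8192)

outputs = llm.generate(["Explain KV-cache in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```

The same flag-level change applies when launching vLLM's OpenAI-compatible server, so no application code needs to change to capture the gain.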
4. Cost and Availability
An RTX 4090 costs ~$1,600 retail. An H100 SXM runs ~$25,000-30,000 per unit, or ~$2.10/GPU-hour via GMI Cloud. H200 runs ~$2.50/GPU-hour. For many teams, cloud GPU rental makes more financial sense than purchasing, especially when demand is variable. Check gmicloud.ai/pricing for current rates.
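The buy-versus-rent break-even is easy to estimate. The purchase price below is an assumed midpoint of the range above, and the calculation ignores power, hosting, and staffing, all of which push the break-even further out:

```python
purchase_usd = 27_500   # assumed midpoint of the $25k-30k H100 SXM range
rate_usd_hr = 2.10      # on-demand $/GPU-hour

breakeven_hours = purchase_usd / rate_usd_hr
print(round(breakeven_hours))       # ~13095 GPU-hours
print(round(breakeven_hours / 24))  # ~546 days of 24/7 utilization
```

At anything less than continuous utilization for roughly a year and a half, renting wins on hardware cost alone.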
5. Software Ecosystem and Framework Support
CUDA dominance means NVIDIA GPUs have the broadest framework support: vLLM, TensorRT-LLM, Triton, PyTorch, and every major inference engine runs natively on CUDA. Alternative accelerators (AMD MI300X, Intel Gaudi) are improving but still have narrower ecosystem coverage.
If you're choosing non-NVIDIA hardware, verify that your target inference engine supports it before committing.
Matching Open-Source LLMs to GPUs
Llama 3.1 8B
- Params: 8B
- FP16 VRAM: ~16 GB
- FP8 VRAM: ~8 GB
- Minimum GPU: RTX 4090 (24 GB)
- Recommended (Production): L4 or 1x H100
Qwen3 32B
- Params: 32B
- FP16 VRAM: ~64 GB
- FP8 VRAM: ~32 GB
- Minimum GPU: 2x RTX 6000 (96 GB)
- Recommended (Production): 1x H100 (80 GB, FP8)
Llama 3.3 70B
- Params: 70B
- FP16 VRAM: ~140 GB
- FP8 VRAM: ~70 GB
- Minimum GPU: 2x H100 (160 GB)
- Recommended (Production): 1x H200 (141 GB) or 2x H100
DeepSeek R1 671B
- Params: 671B
- FP16 VRAM: ~1.3 TB
- FP8 VRAM: ~670 GB
- Minimum GPU: 8x H200 (1.1 TB)
- Recommended (Production): Full 8-GPU H200 node
GLM-5 (via API)
- Params: N/A
- FP16 VRAM: N/A
- FP8 VRAM: N/A
- Minimum GPU: No GPU needed
- Recommended (Production): GMI Cloud API: $1.00/M in, $3.20/M out
Notice the last row. If you don't want to manage GPU hardware at all, GMI Cloud's Model Library offers 100+ models (including GLM-5 by Zhipu AI, GPT-5, Claude, DeepSeek, Qwen) via API. GLM-5 output at $3.20/M is 68% cheaper than GPT-5 ($10.00/M).
For teams that need self-hosted inference, the table above maps models to the minimum and recommended GPU configurations, all available on GMI Cloud.
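The table's "minimum GPU" column follows directly from the VRAM arithmetic. A toy helper makes the mapping explicit (VRAM figures from the spec sheets above; KV-cache headroom is deliberately ignored, so treat results as a floor, not a recommendation):

```python
import math

GPU_VRAM_GB = {"L4": 24, "RTX 4090": 24, "H100": 80, "H200": 141}

def min_gpu_count(params_b: float, bytes_per_param: float, gpu: str) -> int:
    """Minimum GPUs of one type whose combined VRAM holds the weights."""
    return math.ceil(params_b * bytes_per_param / GPU_VRAM_GB[gpu])

print(min_gpu_count(70, 2, "H100"))   # 2 -> matches the Llama 3.3 70B row
print(min_gpu_count(70, 2, "H200"))   # 1
print(min_gpu_count(671, 1, "H200"))  # 5 -> table says 8x: KV-cache + node granularity
```

The DeepSeek R1 row shows why the floor isn't the whole story: weights alone need 5 H200s, but KV-cache at production batch sizes and the 8-GPU node granularity make the full node the practical minimum.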
GMI Cloud's GPU Product Matrix for Inference
Mid-Tier: Cost-Efficient Inference
For teams running 7B-32B models at moderate concurrency, GMI Cloud offers single and multi-GPU H100 SXM configurations at ~$2.10/GPU-hour. The stack comes pre-configured with CUDA 12.x, vLLM, TensorRT-LLM, and Triton.
You can serve a Qwen3 32B model in FP8 on a single H100 with production-grade latency, or run Llama 3.1 8B on an L4 instance for even lower cost.
High-Tier: Large-Scale, High-Concurrency Inference
For 70B+ models and high-throughput workloads, GMI Cloud provides 8-GPU H100 and H200 nodes with NVLink 4.0 (900 GB/s bidirectional aggregate per GPU on HGX/DGX platforms) and 3.2 Tbps InfiniBand inter-node networking.
H200 nodes at ~$2.50/GPU-hour offer 141 GB VRAM per GPU, enough to hold a 70B model's FP16 weights on a single GPU (with FP8 leaving comfortable KV-cache headroom). Multi-node clusters handle 400B+ models with tensor parallelism and pipeline parallelism across nodes.
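To see why a 70B FP16 model that won't fit on one H100 fits across two, note that tensor parallelism splits the weights roughly evenly across the group:

```python
def per_gpu_weight_gb(params_b: float, bytes_per_param: float, tp_degree: int) -> float:
    """Weight bytes per GPU under tensor parallelism (weights split evenly)."""
    return params_b * bytes_per_param / tp_degree

# 70B in FP16 across 2x H100 (80 GB each):
print(per_gpu_weight_gb(70, 2, 2))  # 70.0 GB/GPU -> ~10 GB left for KV-cache
```

That ~10 GB residual per GPU is why 2x H100 works for 70B FP16 but runs tight at high concurrency, and why FP8 or an H200 is the more comfortable configuration.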
Cost Controls and Performance Guarantees
GMI Cloud supports reserved instances for predictable baselines (lower per-hour cost with commitment) and on-demand instances for burst traffic. The serving stack is pre-tuned for each GPU type, so you don't lose performance to misconfigured NCCL settings or suboptimal NVLink topology.
For teams that prefer API access over GPU management, the Model Library offers 100+ models with per-token pricing: GLM-5 at $1.00/M input and $3.20/M output, GLM-4.7-Flash at $0.07/M input and $0.40/M output. Check console.gmicloud.ai for current pricing.
FAQ
Q: What tools should I use to benchmark LLM inference on my GPU?
Start with Hugging Face's optimum-benchmark for standardized throughput and latency metrics across model sizes and backends. For vLLM-specific numbers, use vLLM's built-in benchmark scripts with your actual prompt distribution. For TensorRT-LLM, NVIDIA's trtllm-bench measures compiled-model performance.
GMI Cloud recommends testing with your production prompt mix (not synthetic inputs) at your expected concurrency level, since benchmark scores with batch-1 synthetic prompts rarely reflect real-world throughput.
Q: Should I buy or rent GPU servers for inference?
Purchasing makes sense if you have stable, predictable demand and a 2+ year horizon (total cost of ownership becomes favorable over cloud pricing). Renting makes sense for variable demand, rapid scaling, or if you don't want to manage hardware refresh cycles.
GMI Cloud offers both on-demand (pay by the hour) and reserved instances (lower rate with commitment) at gmicloud.ai. For teams that don't want to manage GPUs at all, the Model Library API eliminates hardware management entirely.
Q: How do I check what GPU I have and whether it's suitable for inference?
On Linux, run nvidia-smi to see your GPU model, VRAM, driver version, and current utilization. On Windows, use Task Manager (Performance tab) or nvidia-smi from the command prompt.
Key things to check: total VRAM (determines which models fit), driver version (must match your CUDA toolkit), and current memory usage (available VRAM for inference). GMI Cloud's Deploy dashboard shows real-time GPU utilization, VRAM allocation, and inference metrics for all provisioned instances.
Q: How important are CUDA and driver versions for inference?
Critical. FP8 inference requires CUDA 12.0+ and compatible drivers. TensorRT-LLM requires specific CUDA/cuDNN combinations for compiled model compatibility. Version mismatches are one of the most common causes of inference failures and performance degradation.
GMI Cloud's instances come pre-configured with CUDA 12.x, cuDNN, and NCCL tuned for the specific GPU topology, eliminating version compatibility issues entirely.