DeepSeek, Llama, Qwen: Which GPU Runs Each One Best
April 30, 2026
Different models have different hardware bottlenecks. A GPU that's optimal for Llama 70B might be overkill for Qwen 7B and insufficient for DeepSeek V3. The challenge isn't just VRAM: it's memory bandwidth, compute density, and whether the model uses mixture-of-experts. GMI Cloud provides GPU instances sized for each model class, letting teams match hardware to architecture rather than guessing.
This article covers: why model architecture dictates GPU choice, deployment recipes for Llama/DeepSeek/Qwen families, a master lookup table, and infrastructure specs for self-hosted deployments.
Why Model Architecture Dictates GPU Choice
Model size alone doesn't determine GPU fit. Parameter count sets a VRAM floor: Llama 70B in FP8 needs roughly 70GB just for weights. But the attention mechanism matters just as much. Standard multi-head attention (MHA) is bandwidth-hungry; grouped query attention (GQA) and multi-query attention (MQA) shrink the KV-cache and reduce that load significantly.
Mixture-of-experts models like DeepSeek V3 add another layer. The model declares 671B total parameters, but only ~37B activate per token. That sounds efficient until you realize the GPU still has to hold all of the weights in memory to route tokens across experts. This inverts the typical tradeoff: VRAM capacity becomes the bottleneck, not compute.
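As a quick back-of-envelope check on that point, the weight footprint depends only on total parameter count and precision, not on how many parameters are active per token. A minimal sketch (the bytes-per-parameter figures are the standard ones for each precision):

```python
# Back-of-envelope VRAM floor: the weights must fit in memory regardless of
# how many parameters are active per token (the mixture-of-experts case).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_vram_gb(total_params_billions: float, precision: str) -> float:
    """VRAM needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return total_params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9

# Dense Llama 3 70B: ~70 GB of FP8 weights.
print(weight_vram_gb(70, "fp8"))    # 70.0

# MoE DeepSeek V3: only ~37B parameters are active per token, but all 671B
# must stay resident for expert routing -> ~671 GB of FP8 weights.
print(weight_vram_gb(671, "fp8"))   # 671.0
```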
These architectural differences mean a single GPU size rarely fits all models in a family. The next sections break down each popular model class and suggest hardware pairings based on real deployment constraints.
Llama Family: 8B, 70B, 405B
Llama models use a standard dense architecture; Llama 3 pairs it with grouped query attention (GQA) to keep the KV-cache manageable. They're well-understood and widely deployed, and within the family VRAM requirements scale roughly linearly with parameter count at a fixed precision.
Llama 3 8B typically runs on a single L4 (24GB GDDR6). Loading weights takes roughly 8GB (FP8), leaving 16GB for batch processing and KV-cache. A common approach is to run FP8 quantization with vLLM, which achieves similar quality to FP16 while halving memory footprint. Single-card deployments work well for latency-sensitive applications under moderate load.
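A minimal sketch of that setup, assuming a recent vLLM release with FP8 weight quantization; the model ID and memory settings below are illustrative rather than prescribed:

```python
# Sketch: Llama 3 8B in FP8 on a single 24GB L4 with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model ID
    quantization="fp8",           # roughly halves the weight footprint vs FP16
    gpu_memory_utilization=0.90,  # leave headroom for the CUDA context
    max_model_len=8192,           # cap context length to bound KV-cache growth
)

out = llm.generate(
    ["Explain what a KV-cache is in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.2),
)
print(out[0].outputs[0].text)
```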
Llama 3 70B requires more bandwidth and memory. An H100 SXM (80GB HBM3, 3.35 TB/s) can hold the FP8 weights plus an active KV-cache on a single card: weights in FP8 occupy roughly 70GB, leaving about 10GB for KV-cache and batch state, which is workable for moderate batch sizes and context lengths (heavier workloads move to 2×H100 or tighter quantization). Most teams find that TensorRT-LLM with FP8 quantization roughly doubles throughput versus unquantized FP16 on the same hardware. This is the "sweet spot" for high-throughput inference with acceptable latency.
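A rough way to see what that ~10GB of headroom buys, using the published Llama 3 70B configuration (80 layers, 8 GQA key-value heads, head dimension 128) and an FP8 KV-cache; treat this as a back-of-envelope sketch:

```python
# Rough KV-cache budget for Llama 3 70B in FP8 on a single 80GB H100.
layers, kv_heads, head_dim = 80, 8, 128   # Llama 3 70B config (GQA)
bytes_per_elem = 1                        # FP8 KV-cache

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
headroom_gb = 80 - 70                     # 80GB card minus ~70GB of FP8 weights

tokens_in_cache = headroom_gb * 1e9 / kv_bytes_per_token
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token, "
      f"~{tokens_in_cache:,.0f} tokens of KV-cache")
# ≈ 160 KiB per token -> roughly 60K cached tokens shared across the batch
```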
Llama 3 405B is the jump most teams haven't prepared for. A single H200 (141GB HBM3e, 4.8 TB/s) won't come close to fitting the full model in FP8 (roughly 405GB of weights alone). Tensor parallelism across 4 H200s (564GB aggregate) provides the needed VRAM and bandwidth. The high bidirectional bandwidth of NVLink (900 GB/s per GPU in an 8-GPU node) keeps communication overhead under 5% when using efficient collective kernels like those in NCCL.
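In practice that means pointing the inference framework at a sharded deployment. A hedged sketch with vLLM, where the checkpoint name and parallel settings are illustrative assumptions:

```python
# Sketch: Llama 3.1 405B sharded across 4 H200s via tensor parallelism in vLLM.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct-FP8",  # illustrative FP8 checkpoint
    tensor_parallel_size=4,        # shard the ~405GB of weights across 4 GPUs over NVLink
    gpu_memory_utilization=0.92,   # remaining aggregate VRAM goes to KV-cache
    max_model_len=16384,
)
```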
DeepSeek Family: V2, V3, R1
DeepSeek models are mixture-of-experts architectures; V2 and V3 pair the MoE feed-forward layers with multi-head latent attention (MLA) to compress the KV-cache. V2 has 236B total parameters with 21B active. V3 scales to 671B total with 37B active per token. The crucial insight: the GPU must hold all 671B parameters to route tokens across experts, even though only 37B participate in the computation for any given token.
This flips the usual priority: VRAM capacity, not compute, is the binding constraint. In FP8 the weights alone are roughly 671GB, so a single H200 (141GB) cannot fit V3, and even 4 H200s (564GB) fall short; a full 8×H200 node (1,128GB aggregate) holds the FP8 weights with headroom for KV-cache and batch state, while 4-bit quantization (~336GB of weights) brings a 4×H200 configuration into reach. At that scale, expert parallelism strategies matter: some experts live on GPU 0, others on GPU 1, and so on. NVLink's 900 GB/s per-GPU bandwidth becomes essential to avoid communication bottlenecks when a token routes to an expert on a different GPU.
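A sketch of what an FP8 V3 deployment looks like on such a node, assuming a vLLM release with DeepSeek V3 support; the model ID and settings are illustrative:

```python
# Sketch: DeepSeek V3 in FP8 on an 8×H200 node, sharding experts and
# attention layers across GPUs via tensor parallelism in vLLM.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # FP8-native checkpoint
    tensor_parallel_size=8,           # all 8 GPUs share the ~671GB of weights
    trust_remote_code=True,           # DeepSeek ships custom model code
    gpu_memory_utilization=0.90,
    max_model_len=8192,               # bound KV-cache on top of the weight footprint
)
```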
V2 is more forgiving. Its roughly 236GB of FP8 weights fit on 2 H200s (282GB) with room for inference. One option is to prototype with 4-bit quantization (~118GB of weights) on a single H200, accepting lower throughput, then move to 2 H200s in FP8 once traffic demands it. The transition is relatively smooth because the expert routing logic stays the same.
Qwen Family: 7B, 72B, 110B
Qwen models use a dense architecture similar to Llama and follow the same VRAM scaling pattern: 7B on an L4, 72B on an H100, and 110B on an H200. There is a practical difference, though: Qwen's tokenizer is far more efficient on Chinese text, packing more characters into each token, which translates to roughly 15-20% higher effective throughput on Chinese input than Llama on the same hardware.
For teams serving primarily English, Qwen 72B and Llama 70B perform similarly on H100. For multilingual or Chinese-heavy workloads, Qwen's efficiency gain justifies the hardware choice. It's worth considering if your user base spans Asia-Pacific regions.
Qwen 110B pushes into H200 territory: its FP8 weights alone are roughly 110GB, more than an 80GB H100 can hold but a comfortable fit on a 141GB H200. The 4.8 TB/s of HBM3e bandwidth combined with FP8 precision keeps batch sizes of 32-64 within the 50-150ms latency targets of interactive applications.
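A back-of-envelope check on that claim: decode is usually memory-bandwidth-bound, so streaming the full weight set from HBM once per step puts a floor on per-token latency regardless of batch size:

```python
# Lower-bound decode latency when generation is memory-bandwidth-bound:
# each decode step streams the full FP8 weight set from HBM once,
# regardless of batch size (KV-cache traffic is ignored here).
weights_gb = 110           # Qwen 110B in FP8 ≈ 110 GB of weights
hbm_bandwidth_gbs = 4800   # H200 HBM3e ≈ 4.8 TB/s

min_step_latency_ms = weights_gb / hbm_bandwidth_gbs * 1000
print(f"{min_step_latency_ms:.1f} ms per decode step")  # ≈ 22.9 ms
```

That ~23ms floor leaves room for attention, scheduling, and batching overhead while still landing inside the 50-150ms window.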
Master Lookup Table
This table combines model size, recommended precision, VRAM floor, GPU recommendation, and estimated cost for self-hosted inference at scale.
| Model | Size | Precision | Min VRAM | Recommended GPU | Approx Cost/1K Tokens | Deployment Option |
|---|---|---|---|---|---|---|
| Llama 3 8B | 8B | FP8 | 8GB | L4 | $0.05-0.12 | Single L4, $0.30/hr |
| Llama 3 70B | 70B | FP8 | 70GB | H100 SXM | $0.30-0.50 | Single H100, $2.10/hr |
| Llama 3 405B | 405B | FP8 | 405GB | 4×H200 | $2.00-3.50 | 4×H200 node, $10.00/hr |
| DeepSeek V2 | 236B | FP8 | 236GB | 2×H200 | $1.20-2.00 | 2×H200, $5.00/hr |
| DeepSeek V3 | 671B | FP8 | 671GB | 8×H200 | $3.50-5.50 | 8×H200 node, $20.00/hr |
| Qwen 7B | 7B | FP8 | 7GB | L4 | $0.05-0.12 | Single L4, $0.30/hr |
| Qwen 72B | 72B | FP8 | 72GB | H100 SXM | $0.30-0.50 | Single H100, $2.10/hr |
| Qwen 110B | 110B | FP8 | 110GB | H200 SXM | $0.50-0.80 | Single H200, $2.50/hr |
Prices reflect self-hosted GPU hourly rates (check gmicloud.ai/pricing for current rates). For teams without GPU infrastructure, a managed option such as GMI Cloud's Inference Engine serves many of these models with no cluster management. Many organizations find that the engineering overhead of running GPUs outweighs the per-token savings once operations costs are factored in.
GMI Cloud Infrastructure for Model-GPU Pairing
GMI Cloud is one option for teams looking to match these models to specific GPU hardware. At the time of writing, listed infrastructure includes:
Single-GPU options: L4 (24GB GDDR6, 300 GB/s), H100 SXM (80GB HBM3, 3.35 TB/s, ~$2.10/GPU-hour), H200 SXM (141GB HBM3e, 4.8 TB/s, ~$2.50/GPU-hour).
Multi-GPU nodes: 8×H100 or 8×H200 with NVLink 4.0 (900 GB/s bidirectional per GPU) and 3.2 Tbps InfiniBand for distributed workloads. Pre-installed runtime includes TensorRT-LLM, vLLM, Triton, CUDA 12.x, cuDNN, and NCCL.
Inference Engine alternative: 100+ pre-deployed models available at $0.000001-$0.50/request. No GPU management required, useful for prototyping or uneven traffic patterns.
Colin Mo
