
DeepSeek, Llama, Qwen: Which GPU Runs Each One Best

April 30, 2026

Different models have different hardware bottlenecks. A GPU that's optimal for Llama 70B might be overkill for Qwen 7B and insufficient for DeepSeek V3. The challenge isn't just VRAM: it's memory bandwidth, compute density, and whether the model uses mixture-of-experts. GMI Cloud provides GPU instances sized for each model class, letting teams match hardware to architecture rather than guessing.

This article covers: why model architecture dictates GPU choice, deployment recipes for Llama/DeepSeek/Qwen families, a master lookup table, and infrastructure specs for self-hosted deployments.

Why Model Architecture Dictates GPU Choice

Model size alone doesn't determine GPU fit. Parameter count sets a VRAM floor: FP8 stores one byte per parameter, so Llama 70B needs roughly 70GB for weights alone. But the attention mechanism matters just as much. Standard multi-head attention (MHA) keeps a full set of KV-cache heads and is bandwidth-hungry at decode time; grouped query attention (GQA) and multi-query attention (MQA) shrink the KV-cache and reduce that load significantly.
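
As a back-of-the-envelope rule, the weight floor is just parameter count times bytes per parameter. A minimal sketch of that arithmetic (the precision table and model sizes here are illustrative, not exhaustive):

```python
# Approximate bytes per parameter at common serving precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_footprint_gb(params_billion: float, precision: str) -> float:
    """VRAM floor for weights alone: parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]

print(weight_footprint_gb(70, "fp8"))   # 70.0 GB for Llama 70B
print(weight_footprint_gb(405, "fp8"))  # 405.0 GB for Llama 405B
```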

Mixture-of-experts models like DeepSeek V3 add another layer. The model has 671B total parameters, but only ~37B activate per token. That sounds efficient until you realize all expert weights must stay resident in memory, because any token may be routed to any expert. This inverts the typical tradeoff: VRAM capacity, not compute, becomes the bottleneck.
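
The asymmetry is easy to quantify: resident memory scales with total parameters, while per-token compute scales with active parameters. A rough sketch under that assumption (using ~2 FLOPs per active parameter per token, a standard estimate):

```python
def moe_profile(total_params_b: float, active_params_b: float) -> tuple[float, float]:
    """FP8 MoE serving: memory tracks TOTAL params, compute tracks ACTIVE params."""
    resident_gb = total_params_b * 1.0             # all experts stay loaded (1 byte/param)
    tflops_per_token = 2 * active_params_b / 1000  # ~2 FLOPs per active param per token
    return resident_gb, tflops_per_token

# DeepSeek V3: ~671 GB must be resident, yet each token costs only ~0.074 TFLOPs.
print(moe_profile(671, 37))  # (671.0, 0.074)
```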

These architectural differences mean a single GPU size rarely fits all models in a family. The next sections break down each popular model class and suggest hardware pairings based on real deployment constraints.

Llama Family: 8B, 70B, 405B

Llama models are standard dense transformers; Llama 3 uses grouped query attention (GQA) at every size. They're well-understood and widely deployed, and their VRAM requirements scale roughly linearly with parameter count at a fixed precision.

Llama 3 8B typically runs on a single L4 (24GB GDDR6). Loading weights takes roughly 8GB (FP8), leaving 16GB for batch processing and KV-cache. A common approach is to run FP8 quantization with vLLM, which achieves similar quality to FP16 while halving memory footprint. Single-card deployments work well for latency-sensitive applications under moderate load.
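
A minimal vLLM sketch of that setup; the model ID and settings are illustrative, and quantization="fp8" assumes a vLLM build and GPU generation with FP8 support:

```python
from vllm import LLM, SamplingParams

# Llama 3 8B with FP8 weight quantization on a single 24GB L4.
# gpu_memory_utilization leaves headroom for the CUDA context and runtime buffers.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    quantization="fp8",
    gpu_memory_utilization=0.90,
    max_model_len=8192,
)

outputs = llm.generate(
    ["Explain the KV-cache in one sentence."],
    SamplingParams(temperature=0.2, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```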

Llama 3 70B requires more bandwidth and memory. Weights in FP8 occupy roughly 70GB, so an H100 SXM (80GB HBM3, 3.35 TB/s) holds them with roughly 10GB left for KV-cache and batch state: workable for moderate batch sizes and context lengths, with an H200 offering more headroom. Most teams find that TensorRT-LLM with FP8 quantization achieves roughly 2x throughput versus unquantized FP16 on the same hardware. This is the "sweet spot" for high-throughput inference with acceptable latency.
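
How far 10GB of KV-cache goes depends on the attention layout. A back-of-the-envelope sketch, assuming Llama 3 70B's published configuration (80 layers, 8 KV heads of dimension 128) and a 1-byte FP8 KV-cache:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_el: int = 1) -> int:
    """KV-cache cost per token: key + value, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_el

per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)  # 163,840 B (~160 KB)
budget_gb = 10
total_tokens = budget_gb * 1e9 / per_token
print(f"{per_token} bytes/token -> ~{total_tokens:,.0f} tokens across all sequences")
# ~61,000 tokens: e.g. 8 concurrent requests at ~7.6K context each.
```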

Llama 3 405B is the jump most teams haven't prepared for. A single H200 (141GB HBM3e, 4.8 TB/s) comes nowhere near fitting the model in FP8 (roughly 405GB of weights alone). Tensor parallelism across 4 H200s (564GB aggregate) provides the needed VRAM and bandwidth, with about 101GB of weights per GPU and roughly 40GB each left for KV-cache. The high bidirectional bandwidth of NVLink (900 GB/s per GPU in an 8-GPU node) keeps communication overhead under 5% when using efficient collective kernels like those in NCCL.
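
A hedged vLLM sketch of that tensor-parallel layout; the FP8 checkpoint ID is Meta's published repo, but treat the settings as a starting point rather than a tuned configuration:

```python
from vllm import LLM

# Llama 3.1 405B FP8 sharded across 4 H200s. Tensor parallelism splits each
# weight matrix 4 ways (~101GB per GPU), and NVLink carries the all-reduces.
llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct-FP8",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.92,
    max_model_len=16384,
)
```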

DeepSeek Family: V2, V3, R1

DeepSeek models pair mixture-of-experts feed-forward layers with multi-head latent attention (MLA), which compresses the KV-cache. V2 has 236B total parameters with 21B active. V3 scales to 671B total with 37B active per token. The crucial insight: all 671B parameters must be resident in GPU memory, because any token can route to any expert, even though only 37B participate in each token's computation.

This flips VRAM priority upside down. A single H200 (141GB) cannot fit V3; even four H200s (564GB) fall short of the ~671GB of FP8 weights. An 8×H200 node (1,128GB aggregate) loads all weights in FP8 with headroom for KV-cache and batch state. At that scale, expert parallelism strategies matter: some experts live on GPU 0, others on GPU 1, and so on. NVLink's 900 GB/s of bidirectional bandwidth per GPU becomes essential to avoid communication bottlenecks when a token must route to an expert on a different GPU.
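
A toy sketch of expert placement across an 8-GPU node; the layer and expert counts follow DeepSeek V3's published configuration, but round-robin is a simplification of the load-balanced placement real expert-parallel runtimes use:

```python
# DeepSeek V3: 61 layers, of which 58 are MoE layers with 256 routed experts each.
NUM_GPUS = 8
MOE_LAYERS = 58
EXPERTS_PER_LAYER = 256

# placement[(layer, expert)] -> GPU index; round-robin spreads experts evenly.
placement = {
    (layer, expert): expert % NUM_GPUS
    for layer in range(MOE_LAYERS)
    for expert in range(EXPERTS_PER_LAYER)
}

# Every GPU hosts 32 experts per MoE layer; a token routed to an expert on
# another GPU triggers an NVLink transfer of its hidden state.
experts_on_gpu0 = sum(1 for gpu in placement.values() if gpu == 0)
print(experts_on_gpu0)  # 58 layers x 32 experts = 1856
```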

V2 is more forgiving. Its 236GB FP8 footprint fits on 2 H200s (282GB) with room for inference. One option is to prototype on 2 H100s (160GB) with 4-bit quantization (~118GB of weights), accepting lower throughput and some quality loss, then move to 2 H200s at FP8 once traffic demands it. The transition is relatively smooth because the expert routing logic remains the same.

Qwen Family: 7B, 72B, 110B

Qwen models are dense transformers similar to Llama, with recent generations also using grouped query attention. They follow the same VRAM scaling patterns: 7B on an L4, 72B on an H100, and 110B on an H200. There's a critical difference in practice, though: Qwen's tokenizer is far more efficient on Chinese text, encoding it in fewer tokens, which translates to roughly 15-20% higher effective throughput on Chinese input than Llama on the same hardware.

For teams serving primarily English, Qwen 72B and Llama 70B perform similarly on H100. For multilingual or Chinese-heavy workloads, Qwen's efficiency gain justifies the hardware choice. It's worth considering if your user base spans Asia-Pacific regions.
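
One way to check this against your own traffic is to compare token counts directly. A hedged sketch using Hugging Face tokenizers (the model IDs are illustrative, and the Llama repo is gated, requiring access approval):

```python
from transformers import AutoTokenizer

qwen = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")
llama = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = "大型语言模型的推理成本主要取决于显存带宽。"  # sample Chinese sentence
print("Qwen tokens: ", len(qwen(text)["input_ids"]))
print("Llama tokens:", len(llama(text)["input_ids"]))
# Fewer tokens for the same text means fewer decode steps, i.e. higher
# effective characters-per-second on identical hardware.
```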

Qwen 110B pushes into H200 territory: ~110GB of FP8 weights fit in 141GB of HBM3e with roughly 31GB left for KV-cache. The 4.8 TB/s memory bandwidth keeps per-step decode latency low enough that batch sizes of 32-64 stay within the 50-150ms range interactive applications target.
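
Decode is typically bandwidth-bound: every step must stream the full weight set from HBM once, regardless of batch size, so weights divided by bandwidth gives a lower bound on step latency. A rough sketch under that assumption:

```python
def decode_step_ms(weight_gb: float, bandwidth_tb_s: float) -> float:
    """Lower bound on one decode step: stream all weights from HBM once."""
    return weight_gb / (bandwidth_tb_s * 1000) * 1000  # GB / (GB per ms)

step = decode_step_ms(weight_gb=110, bandwidth_tb_s=4.8)  # ~22.9 ms
print(f"~{step:.1f} ms per step; batch 32 -> ~{32 * 1000 / step:,.0f} tokens/s")
# The whole batch shares one weight pass, so throughput scales with batch
# size until compute or KV-cache bandwidth takes over.
```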

Master Lookup Table

This table combines model size, precision, FP8 weight footprint, GPU recommendation, and estimated cost for self-hosted inference at scale.

| Model | Size | Precision | FP8 Weights | Recommended GPU | Approx Cost/1K Tokens | Deployment Option |
|---|---|---|---|---|---|---|
| Llama 3 8B | 8B | FP8 | ~8GB | L4 | $0.05-0.12 | Single L4, $0.30/hr |
| Llama 3 70B | 70B | FP8 | ~70GB | H100 SXM | $0.30-0.50 | Single H100, $2.10/hr |
| Llama 3 405B | 405B | FP8 | ~405GB | 4×H200 | $2.00-3.50 | 4×H200 node, $10.00/hr |
| DeepSeek V2 | 236B | FP8 | ~236GB | 2×H200 | $1.20-2.00 | 2×H200, $5.00/hr |
| DeepSeek V3 | 671B | FP8 | ~671GB | 8×H200 | $3.50-5.50 | 8×H200 node, $20.00/hr |
| Qwen 7B | 7B | FP8 | ~7GB | L4 | $0.05-0.12 | Single L4, $0.30/hr |
| Qwen 72B | 72B | FP8 | ~72GB | H100 SXM | $0.30-0.50 | Single H100, $2.10/hr |
| Qwen 110B | 110B | FP8 | ~110GB | H200 SXM | $0.50-0.80 | Single H200, $2.50/hr |

Prices reflect self-hosted GPU hourly rates (check gmicloud.ai/pricing for current rates). For teams without GPU infrastructure, a managed platform such as GMI Cloud's Inference Engine offers many of these models as hosted endpoints. Many organizations find that the engineering overhead of running GPUs outweighs the per-token savings once operations costs are factored in.

GMI Cloud Infrastructure for Model-GPU Pairing

GMI Cloud is one option for teams looking to match these models to specific GPU hardware. At the time of writing, listed infrastructure includes:

Single-GPU options: L4 (24GB GDDR6, 300 GB/s), H100 SXM (80GB HBM3, 3.35 TB/s, ~$2.10/GPU-hour), H200 SXM (141GB HBM3e, 4.8 TB/s, ~$2.50/GPU-hour).

Multi-GPU nodes: 8×H100 or 8×H200 with NVLink 4.0 (900 GB/s bidirectional per GPU) and 3.2 Tbps InfiniBand for distributed workloads. Pre-installed runtime includes TensorRT-LLM, vLLM, Triton, CUDA 12.x, cuDNN, and NCCL.
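
A quick hedged sanity check that a multi-GPU node exposes what a distributed runtime expects; this assumes PyTorch is also available in the environment:

```python
import torch

# Confirm GPU count, per-device memory, and the NCCL build PyTorch links against.
print("CUDA available:", torch.cuda.is_available())
print("GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"  GPU {i}: {props.name}, {props.total_memory / 1e9:.0f} GB")
print("NCCL:", torch.cuda.nccl.version())
```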

Inference Engine alternative: 100+ pre-deployed models available at $0.000001-$0.50/request. No GPU management required, useful for prototyping or uneven traffic patterns.

Colin Mo
