Compare GPU Cloud Pricing for LLM Inference Workloads
April 08, 2026
LLM inference cost comes down to three variables: model size, throughput, and GPU memory efficiency. If you're trying to figure out which GPU cloud tier makes sense for your chatbot, RAG pipeline, or API backend, those three factors tell you more than any hourly rate table.
If you're currently overpaying, it's usually because you're sizing to worst-case VRAM needs instead of optimizing batch utilization.
GMI Cloud offers H100 SXM and H200 SXM with pre-configured TensorRT-LLM and vLLM environments, so you can benchmark your specific workload before committing.
The Cost Structure of LLM Serving
Serving LLMs efficiently isn't like running a standard web API. Three resource constraints shape your cost profile: VRAM capacity, decode throughput, and request concurrency.
VRAM capacity determines which models you can run at all. Llama 3 70B in FP16 needs roughly 140 GB of VRAM — that's the H200's full 141 GB. In FP8 or INT8 quantization, you can fit it on two H100s or one H200. Model weights are just the starting point; KV-cache memory grows with sequence length and concurrent requests.
KV-cache memory per request can be estimated with this formula: KV per request = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element. For Llama 2 70B (80 layers, 8 KV heads, 128 head_dim) in FP16 at a 4K context, that works out to roughly 1.3 GB per concurrent request.
If you're handling 20 simultaneous requests, you need roughly 27 GB just for KV-cache on top of the model weights.
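The formula above can be sketched as a quick sizing check. This is illustrative only: the architecture numbers are Llama 2 70B's published values, and real serving engines add overhead for activations and allocator fragmentation.

```python
def kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_element=2):
    """Per-request KV-cache: 2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element / 1e9

def weights_gb(params_billion, bytes_per_param=2):
    """Model weight memory: 70B params in FP16 (2 bytes each) is about 140 GB."""
    return params_billion * bytes_per_param

# Llama 2 70B: 80 layers, 8 KV heads (GQA), 128 head_dim, FP16, 4K context
per_request = kv_cache_gb(80, 8, 128, 4096)
print(f"KV-cache per request:  {per_request:.2f} GB")       # ~1.34 GB
print(f"KV-cache, 20 requests: {20 * per_request:.0f} GB")  # ~27 GB
print(f"Weights in FP16:       {weights_gb(70):.0f} GB")    # 140 GB
```

In practice, budget an extra 10-20% of VRAM beyond this estimate for activations and fragmentation.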
Decode throughput governs your tokens-per-second output rate. It's almost always memory-bandwidth-bound, not compute-bound.
This is why a GPU with higher memory bandwidth — like the H200 at 4.8 TB/s versus the H100 at 3.35 TB/s — delivers faster decode even though their FP16 compute specs are identical (Source: NVIDIA H200 Tensor Core GPU Product Brief, 2024; NVIDIA H100 Tensor Core GPU Datasheet, 2023).
That leads directly to how you should think about cost per output token.
The $/Output-Token Framework
Hourly GPU rates are a starting point, not a final answer. What you actually care about is cost per 1,000 output tokens under your real-world load. Here's how to build that number.
Start with your GPU's tokens-per-second for your specific model and batch size. Divide your hourly GPU cost by 3,600 to get cost per second. Divide that by tokens-per-second to get cost per token. Multiply by 1,000 to get $/1K tokens.
For example: an H100 at $2.00/hour serving Llama 2 70B at 50 tokens/second costs roughly $0.00056/second. That works out to $0.011 per 1,000 output tokens at 50 tokens/sec.
If you can push batch size higher and reach 120 tokens/second through better batching or FP8, your cost drops to $0.005 per 1,000 tokens — a 2.4x improvement with no hardware change. Check gmicloud.ai/pricing for current rates.
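The arithmetic above is easy to wrap in a helper. A minimal sketch; the rates and throughputs are the example figures from this section, not measured values.

```python
def cost_per_1k_output_tokens(hourly_rate_usd, tokens_per_second):
    """Convert an hourly GPU rate into $ per 1,000 output tokens at a given decode rate."""
    cost_per_second = hourly_rate_usd / 3600
    return cost_per_second / tokens_per_second * 1000

# H100 at $2.00/hr, Llama 2 70B
print(f"${cost_per_1k_output_tokens(2.00, 50):.4f} / 1K tokens")   # $0.0111 at 50 tok/s
print(f"${cost_per_1k_output_tokens(2.00, 120):.4f} / 1K tokens")  # $0.0046 at 120 tok/s
```

Run it against your own benchmarked tokens-per-second before comparing providers; published hourly rates alone can invert the real ranking.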
GPU Comparison for LLM Workloads
The table below compares GPU options for LLM inference, ranked by suitability for production serving. Specs sourced from official NVIDIA datasheets and product briefs.
| Rank | GPU | VRAM | Memory BW | Best For | Approx. $/GPU-hr |
|---|---|---|---|---|---|
| 1 | H200 SXM | 141 GB HBM3e | 4.8 TB/s | 70B+ models, long context, high concurrency | ~$2.60 |
| 2 | H100 SXM | 80 GB HBM3 | 3.35 TB/s | 7B–70B models, production serving, FP8 inference | ~$2.00 |
| 3 | A100 80GB | 80 GB HBM2e | 2.0 TB/s | Legacy workloads, cost-sensitive batched inference | ~$1.65 |
| 4 | L4 | 24 GB GDDR6 | 300 GB/s | Sub-7B models, low-traffic endpoints | ~$0.65 |
Check gmicloud.ai/pricing for current rates. Sources: NVIDIA H100 Tensor Core GPU Datasheet (2023); NVIDIA H200 Tensor Core GPU Product Brief (2024); NVIDIA A100 Tensor Core GPU Datasheet; NVIDIA L4 Tensor Core GPU Datasheet.
The H200's edge over the H100 is purely memory-driven. Both deliver 989 TFLOPS of dense FP16 Tensor compute and 3,958 INT8 TOPS with sparsity. But the H200 achieves up to 1.9x inference speedup on Llama 2 70B versus the H100 (NVIDIA official benchmark: TensorRT-LLM, FP8, batch 64, 128/2048 tokens; NVIDIA H200 Tensor Core GPU Product Brief, 2024).
For context windows longer than 4K tokens or batch sizes above 32, the bandwidth gap compounds fast. Both support NVLink 4.0 at 900 GB/s bidirectional aggregate per GPU on HGX/DGX platforms, which matters for tensor-parallel inference across multiple GPUs.
When Variable Traffic Makes GPU Instances Expensive
Here's the thing about renting GPU instances for inference: you pay for the GPU whether it's serving requests or sitting idle. That's fine at 70–80% utilization. It's painful at 20–30%, which is exactly what most teams face during off-peak hours or early product stages.
If your traffic follows a diurnal pattern — busy during business hours, quiet at night — you're paying for idle capacity roughly 40–50% of the time. At $2.00/hour for an H100, 12 hours of idle capacity per day costs over $700/month in pure waste.
On-demand instances help but still require you to manage provisioning and cold-start latency when traffic spikes.
This is why inference API platforms exist as a category. Instead of paying per GPU-hour, you pay per request — only when work is actually happening. For teams with unpredictable or moderate traffic, the economics shift significantly in favor of API-based inference.
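To see where that crossover sits for your own numbers, you can compare the effective per-request cost of a dedicated GPU against a flat per-request API price. A hypothetical sketch: the capacity and API price below are made-up placeholders, not GMI Cloud figures.

```python
def instance_cost_per_request(hourly_rate_usd, max_requests_per_hour, utilization):
    """Effective $/request on a dedicated GPU: the full hour is billed
    whether or not requests actually arrive."""
    served = max_requests_per_hour * utilization
    return hourly_rate_usd / served

# Hypothetical: H100 at $2.00/hr serving up to 1,000 requests/hr at full load,
# versus an API charging a flat $0.004/request.
api_price = 0.004
for util in (0.8, 0.3):
    per_req = instance_cost_per_request(2.00, 1000, util)
    winner = "instance" if per_req < api_price else "API"
    print(f"{util:.0%} utilization: ${per_req:.4f}/request -> {winner} wins")
```

The same function gives you a break-even utilization: below it, per-request pricing is cheaper; above it, the dedicated GPU wins.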
Inference Engine API as a Cost-Effective Alternative
Managed inference APIs eliminate the GPU idle problem entirely. You call an endpoint, pay for the request, and don't think about utilization. For early-stage products, prototypes, or workloads with high variance in traffic, this model often beats self-hosted GPU instances by a significant margin.
GMI Cloud's Inference Engine offers 100+ pre-deployed models via API with no GPU provisioning required.
Pricing ranges from $0.000001 to $0.50 per request depending on the model and modality (GMI Cloud Inference Engine page, snapshot 2026-03-03; check gmicloud.ai for current availability and pricing).
You're not managing CUDA drivers, scaling policies, or monitoring dashboards — you're calling an API and shipping product.
The tradeoff is control. With a managed API, you can't fine-tune serving parameters, adjust batching strategy, or run custom quantization. If you need those knobs, GPU instances are the right tool.
But for teams that are still validating a product or running variable-traffic APIs, the managed path removes weeks of infrastructure work.
How to Pick the Right Tier
Your decision should flow from your model size and traffic pattern, not from the cheapest available GPU.
If you're running models larger than 70B in full precision, or if you're serving 50+ concurrent users with sequences longer than 4K tokens, start with the H200. Its 141 GB HBM3e and 4.8 TB/s bandwidth are purpose-built for that regime.
The premium over H100 pays back through higher throughput and better KV-cache capacity.
If you're in the 7B–70B range with typical batch sizes and context windows under 8K, the H100 SXM is your best cost-performance option. Its 3.35 TB/s bandwidth and native FP8 support at 1,979 TFLOPS make it highly efficient for modern quantized models (Source: NVIDIA H100 Tensor Core GPU Datasheet, 2023).
Plus, if you're using TensorRT-LLM or vLLM with dynamic batching, you'll extract more throughput per dollar than you would on an A100.
For sub-7B models with low concurrency, or for internal tooling and demos, the L4 or a managed inference API is probably cheaper than running a full H100 instance at low utilization.
FAQ
What's the cheapest way to run Llama 3 70B inference? That depends on your traffic. At high utilization (60%+), a single H200 SXM typically delivers the best cost-per-token for 70B models. At low or unpredictable traffic, a managed inference API avoids GPU idle costs entirely.
Check gmicloud.ai/pricing for current rates on both paths.
How much VRAM does Llama 3 70B need? In FP16, approximately 140 GB — right at the H200's limit. In FP8 (using TensorRT-LLM or vLLM's FP8 mode), you can fit it in around 70–80 GB, which means a single H100 SXM with some headroom for KV-cache.
Does the A100 still make sense for LLM inference? For smaller models (7B–13B) where you're optimizing for cost over throughput, yes. For anything 30B and above, or workloads requiring FP8 for speed, the H100 typically delivers better cost-per-token despite the higher hourly rate.
The A100's 2.0 TB/s bandwidth bottlenecks decode on large models (Source: NVIDIA A100 Tensor Core GPU Datasheet).
Is the H200 worth the extra cost over H100? For 70B+ models or long-context workloads, the H200's 1.9x inference speedup on Llama 2 70B (NVIDIA H200 Tensor Core GPU Product Brief, 2024) means you serve roughly 1.9x more tokens per GPU-hour.
At the approximate rates above (~$2.60 vs ~$2.00), the roughly 30% price premium often still results in lower cost-per-token for these workloads.
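Using the approximate rates from the comparison table and NVIDIA's 1.9x benchmark figure, the per-token math works out like this. A sketch only: the realized speedup depends on model, batch size, and context length.

```python
def relative_cost_per_token(rate_a_usd_hr, speedup_a, rate_b_usd_hr):
    """Cost-per-token of GPU A relative to GPU B, given A's throughput multiple over B."""
    return (rate_a_usd_hr / speedup_a) / rate_b_usd_hr

# H200 (~$2.60/hr, 1.9x throughput) vs H100 (~$2.00/hr)
ratio = relative_cost_per_token(2.60, 1.9, 2.00)
print(f"H200 cost per token is {ratio:.2f}x the H100's ({1 - ratio:.0%} cheaper)")  # 0.68x, ~32% cheaper
```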
What's the difference between a GPU instance and an inference API? A GPU instance gives you a dedicated GPU to configure and serve however you want. An inference API gives you a pre-deployed model endpoint where you pay per request.
Instances win when you need customization and have high, steady utilization. APIs win when traffic is variable or you want to ship without infrastructure work.
Colin Mo
Build AI Without Limits
GMI Cloud helps you architect, deploy, optimize, and scale your AI strategies
