RunPod vs Lambda vs CoreWeave vs GMI Cloud: What GPU Rental Really Costs for LLM Inference
May 28, 2026
Sticker-price comparisons mislead. Two providers can list the same H100 at very different rates, and the gap can quietly close (or widen by 2-3x) once you pick a billing mode. Teams pick the cheapest hourly number, then watch their actual bill balloon from cold-start fees, idle premiums, and minimum commits.
That's how a "$2.49/hr H100" turns into $4.10/hr effective. The honest comparison is effective $/GPU-hour after billing mode, utilization, and availability. This article walks through on-demand, serverless, and dedicated pricing across RunPod, Lambda, CoreWeave, and GMI Cloud, plus how to back out the effective rate.
The Direct Answer
For steady inference traffic above 30% utilization, on-demand or dedicated rentals beat serverless. For spiky workloads under 20% utilization, serverless wins despite higher per-second rates. The provider matters less than matching billing mode to traffic shape. Verify current rates on each provider's pricing page before committing.
Scope: This piece covers H100 and H200 SXM rental pricing for LLM inference. It doesn't cover training-grade reserved clusters, spot pricing volatility, or egress charges (which can swing the math further).
Three Billing Modes, Three Different Cost Curves
Before comparing logos, get the modes straight. Each provider mixes them differently.
- On-demand: Pay per GPU-hour while the instance runs. You manage start, stop, and idle time. Billing granularity varies (per-second, per-minute, per-hour).
- Serverless: Pay per second of active inference, scaled to zero when idle. You don't manage instances, but cold starts and queue latency become real costs.
- Dedicated / reserved: Commit for a term (week, month, year). Lower hourly rate. You eat idle time regardless of utilization.
The same H100 runs on all three modes. Effective cost depends on which one you pick.
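As a minimal sketch, the three modes can be written as three cost functions. The function names are hypothetical, and the rates in the usage section are the approximate list prices quoted in this article, not live figures. Each function assumes billing is exactly proportional to time, with no rounding, minimum charge, or commit discount.

```python
HOURS_PER_MONTH = 730  # the month length this article uses throughout


def on_demand_cost(rate_per_hour: float, hours_running: float) -> float:
    """On-demand: billed for every hour the instance is up, busy or idle."""
    return rate_per_hour * hours_running


def serverless_cost(rate_per_second: float, active_seconds: float) -> float:
    """Serverless: billed only for seconds of active inference."""
    return rate_per_second * active_seconds


def dedicated_cost(rate_per_hour: float, committed_hours: float = HOURS_PER_MONTH) -> float:
    """Dedicated: billed for the whole committed term, whatever the utilization."""
    return rate_per_hour * committed_hours


if __name__ == "__main__":
    active_hours = 110  # illustrative month with roughly 15% utilization
    print(f"on-demand, never stopped:     ${on_demand_cost(2.00, HOURS_PER_MONTH):,.2f}")
    print(f"on-demand, stopped when idle: ${on_demand_cost(2.00, active_hours):,.2f}")
    print(f"serverless, active seconds:   ${serverless_cost(0.00116, active_hours * 3600):,.2f}")
    print(f"dedicated, one-month term:    ${dedicated_cost(4.25):,.2f}")
```

The only thing that differs between the functions is which clock the bill runs on: wall-clock uptime, active seconds, or the committed term.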
Sticker Price Snapshot
Here are the publicly listed H100 and H200 hourly rates across the four providers, as of recent pricing pages. Note that CoreWeave's public list figure is for the H100 PCIe, not the SXM part.
| Provider | H100 SXM On-Demand | H100 Serverless | H200 SXM On-Demand |
|---|---|---|---|
| GMI Cloud | ~$2.00/hr | not the core offer | ~$2.60/hr |
| RunPod | ~$2.69/hr (community); ~$3.35/hr (secure) | ~$0.00116/sec (~$4.18/hr active) | varies |
| Lambda Labs | ~$2.49-$3.29/hr | not offered | ~$3.29/hr |
| CoreWeave | ~$4.25/hr (H100 PCIe public list); SXM higher | not a primary tier | listed |
Rates above are pulled from each provider's public pricing pages and shift frequently. Always confirm before commitment.
Effective Cost: Steady Workload Worked Example
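Assume you run a 70B-class LLM at 65% GPU utilization, 24/7, on one H100 for a month (730 hours). Because the instance never stops, all 730 hours are billed regardless of utilization, so monthly cost is simply the hourly rate × 730.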
| Provider | Mode | Rate | Monthly Cost |
|---|---|---|---|
| GMI Cloud H100 SXM | On-demand | $2.00/hr | ~$1,460 |
| Lambda H100 | On-demand (mid-tier) | $2.89/hr | ~$2,110 |
| RunPod H100 SXM (secure) | On-demand | $3.35/hr | ~$2,445 |
| CoreWeave H100 PCIe | On-demand | $4.25/hr | ~$3,103 |
Three things to notice. First, the lowest sticker isn't always the cheapest provider in production (you'll see why in the Engineering Reality section). Second, the spread between cheapest and most expensive is ~2.1x for Hopper-class silicon, and the most expensive row is the PCIe card rather than SXM. Third, serverless isn't even in this race: with traffic arriving around the clock there is almost no idle time to scale away, so you'd pay the higher per-second rate for most of the month.
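The table can be reproduced with a few lines of arithmetic. This is only a sketch: the dictionary holds the approximate list rates from the table above, and the helper names are made up for illustration.

```python
HOURS_PER_MONTH = 730

# Approximate on-demand list rates from the table above, in $/GPU-hour.
H100_ON_DEMAND = {
    "GMI Cloud H100 SXM": 2.00,
    "Lambda H100 (mid-tier)": 2.89,
    "RunPod H100 SXM (secure)": 3.35,
    "CoreWeave H100 PCIe": 4.25,
}


def monthly_costs(rates: dict[str, float], hours: float = HOURS_PER_MONTH) -> dict[str, float]:
    """Always-on cost per provider: every hour of the month is billed."""
    return {name: rate * hours for name, rate in rates.items()}


def price_spread(costs: dict[str, float]) -> float:
    """Ratio of the most expensive option to the cheapest one."""
    return max(costs.values()) / min(costs.values())


if __name__ == "__main__":
    costs = monthly_costs(H100_ON_DEMAND)
    for name, cost in sorted(costs.items(), key=lambda item: item[1]):
        print(f"{name:26s} ${cost:>9,.2f}")
    print(f"spread: {price_spread(costs):.2f}x")
```

Swapping in current rates from each pricing page is the only change needed to refresh the comparison.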
Effective Cost: Bursty Workload Worked Example
Now assume the opposite. Same model, but 15% utilization, which works out to roughly 110 active hours out of 730. You only generate tokens during business hours, with quiet weekends.
| Mode | Provider | Effective Rate | Monthly Cost (15% util) |
|---|---|---|---|
| On-demand, always-on | GMI Cloud H100 | $2.00/hr always running | ~$1,460 |
| On-demand, scripted stop/start | GMI Cloud H100 | $2.00/hr × 110 active hours | ~$220 |
| Serverless | RunPod H100 active-only | ~$4.18/hr × 110 hours | ~$460 |
| Dedicated monthly commit | CoreWeave | $4.25/hr always running | ~$3,103 |
Two takeaways. First, scripted stop/start on cheap on-demand beats serverless if your team can automate provisioning; without that automation, serverless wins because the platform handles it for you. Second, dedicated commits are the worst choice for bursty workloads: even at a discounted hourly rate, you pay for every idle hour.
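One way to put these rows on a common footing is dollars per hour of useful inference, which is the "effective $/GPU-hour" this article keeps referring to. The sketch below assumes the table's figures (110 active hours, approximate list rates) and a hypothetical helper name; it ignores cold starts, warm pools, and egress.

```python
HOURS_PER_MONTH = 730
ACTIVE_HOURS = 110  # about 15% of 730, as in the table above

# (label, hourly rate in $, hours billed per month), taken from the table above.
OPTIONS = [
    ("on-demand, always-on", 2.00, HOURS_PER_MONTH),
    ("on-demand, scripted stop/start", 2.00, ACTIVE_HOURS),
    ("serverless, active-only", 4.18, ACTIVE_HOURS),
    ("dedicated monthly commit", 4.25, HOURS_PER_MONTH),
]


def effective_rate(rate_per_hour: float, billed_hours: float,
                   active_hours: float = ACTIVE_HOURS) -> float:
    """Dollars paid per hour of useful inference, idle time included."""
    return rate_per_hour * billed_hours / active_hours


if __name__ == "__main__":
    for label, rate, billed in OPTIONS:
        monthly = rate * billed
        print(f"{label:32s} ${monthly:>8,.0f}/mo  "
              f"${effective_rate(rate, billed):6.2f} per active hour")
```

Under these assumptions the always-on options cost several times their sticker rate per active hour, while the two pay-for-use options cost exactly their sticker rate.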
Engineering Reality: Where the Math Breaks
The tables above assume the platform behaves. In production, it doesn't always.
Cold-start latency on serverless. RunPod serverless cold starts on an H100 can run 5-30 seconds depending on container size and model checkpoint location. If your p95 latency budget is 2 seconds, cold starts will violate the SLO whenever traffic dips long enough for workers to scale to zero. Mitigation: keep a warm worker pool, which raises effective cost toward dedicated.
Billing granularity. GMI Cloud and Lambda bill per-minute on most instance types. RunPod bills per-second on serverless. CoreWeave's commits are hourly. If you spin up an instance for a 90-second benchmark, hourly billing charges the full hour. That's a 40x premium on small jobs.
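The 40x figure falls out of rounding a short job up to the billing increment. The sketch below assumes a provider rounds up to the next whole increment with a minimum of one increment; actual rounding rules vary by provider and are not taken from any provider's documentation. The function names are hypothetical.

```python
import math


def billed_seconds(runtime_seconds: float, granularity_seconds: int) -> int:
    """Round the runtime up to the next whole billing increment (minimum one)."""
    increments = max(math.ceil(runtime_seconds / granularity_seconds), 1)
    return increments * granularity_seconds


def premium(runtime_seconds: float, granularity_seconds: int) -> float:
    """Ratio of billed time to time actually used."""
    return billed_seconds(runtime_seconds, granularity_seconds) / runtime_seconds


if __name__ == "__main__":
    job = 90  # the 90-second benchmark from the paragraph above
    for label, granularity in [("per-second", 1), ("per-minute", 60), ("per-hour", 3600)]:
        print(f"{label:10s} billed {billed_seconds(job, granularity):>4d} s, "
              f"{premium(job, granularity):.2f}x the time used")
```

The premium shrinks toward 1x as jobs get longer, so granularity matters most for benchmarks, CI runs, and short batch jobs.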
Queue times under load. When H100 supply is tight (it usually is), on-demand requests can queue. Lambda has had publicly documented availability gaps. CoreWeave prioritizes reserved customers. GMI Cloud and RunPod allocation depends on region.
Premium GPU availability. H200 supply in 2026 is still thin across all providers. Listed prices don't help if the instance isn't available when you need it. Confirm regional availability before architecting around H200.
Network egress. None of the four bake egress into the GPU-hour rate. Heavy retrieval-augmented generation workloads can add 10-20% to the total bill. Read each provider's egress schedule.
Decision Framework
| Your traffic shape | Best billing mode | Provider considerations |
|---|---|---|
| 24/7 steady, predictable load | On-demand or dedicated commit | Lowest sticker H100/H200 rate wins. GMI Cloud and RunPod community lead here. |
| Bursty with engineering capacity to automate | On-demand + scripted stop/start | Per-minute or per-second billing matters. GMI Cloud, Lambda, RunPod. |
| Bursty without ops bandwidth | Serverless | RunPod serverless is the most mature option. |
| Long-term capacity planning, 6-12 month commit | Reserved | CoreWeave and Lambda offer the deepest commit discounts. |
| Multi-model API needs, not GPU management | Managed inference API | Skip GPU rental entirely. See GMI Cloud Inference Engine or comparable. |
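The table can be sketched as a small lookup function. Mapping "steady" and "bursty" onto the 30% and 20% utilization thresholds from the Direct Answer is an assumption made here for illustration, as are the function name, its parameters, and the "borderline" answer for the 20-30% band, which the article does not cover.

```python
def recommend_billing_mode(
    utilization: float,
    can_automate_start_stop: bool,
    long_term_commit: bool = False,
    wants_managed_api: bool = False,
) -> str:
    """Return the billing mode suggested by the decision table above."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be a fraction between 0 and 1")
    if wants_managed_api:
        return "managed inference API"
    if long_term_commit:
        return "reserved"
    if utilization > 0.30:  # treated as steady load (assumed threshold)
        return "on-demand or dedicated commit"
    if can_automate_start_stop:
        return "on-demand + scripted stop/start"
    if utilization < 0.20:  # treated as bursty load (assumed threshold)
        return "serverless"
    # 20-30% without automation: the article gives no rule for this band.
    return "borderline: benchmark serverless against on-demand"


if __name__ == "__main__":
    print(recommend_billing_mode(0.65, can_automate_start_stop=False))
    print(recommend_billing_mode(0.15, can_automate_start_stop=True))
    print(recommend_billing_mode(0.15, can_automate_start_stop=False))
```

A function like this is a starting point for a cost review, not a substitute for measuring your own traffic.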
Where GMI Cloud Fits in This Map
GMI Cloud (gmicloud.ai) sits in the lean on-demand H100/H200 lane with per-minute billing. Listed rates: $2.00/hr H100 SXM, $2.60/hr H200 SXM. Node config: 8 GPUs, NVLink 4.0 at 900 GB/s bidirectional aggregate per GPU on HGX, 3.2 Tbps InfiniBand between nodes. Pre-configured stacks include CUDA 12.x, TensorRT-LLM, vLLM, Triton.
Honest positioning: it doesn't currently lead in true scale-to-zero serverless, so RunPod is more mature for highly bursty workloads. CoreWeave's reserved contracts fit enterprise long-term commits. GMI Cloud's case rests on its lower H100/H200 sticker rates, which matter most for steady or scriptable workloads.
Check gmicloud.ai/pricing for current rates.
Frequently Asked Questions
Is the cheapest H100 always the best value? No. Sticker rate ignores billing granularity, cold-start cost, and availability. A $2.00/hr H100 that's available with per-minute billing usually beats a $1.80/hr H100 with hourly billing and tight regional supply. Calculate effective $/GPU-hour after utilization and ops overhead.
When does serverless GPU beat on-demand? Serverless beats on-demand when utilization is under ~20% and your team can't automate instance start/stop. Above 30% steady utilization, on-demand at a lower sticker rate almost always wins. The crossover depends on cold-start budget and idle policy.
Should I commit to a reserved GPU contract? Only if you've proven steady 24/7 demand for 6+ months and have headroom to grow into the commit. Reserved discounts run 30-50% below on-demand on most providers, but unused capacity is a sunk cost. Start on-demand, measure, then commit.
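A rough breakeven makes the commit decision concrete. The sketch assumes the alternative is on-demand at the same base rate with perfect stop/start, so you pay only for utilized hours; against an always-on on-demand instance, any discount wins. The function names are hypothetical and the 30-50% range is the one quoted above.

```python
HOURS_PER_MONTH = 730


def reserved_monthly(rate_per_hour: float, discount: float) -> float:
    """Reserved: the discounted rate is billed for every hour of the term."""
    return rate_per_hour * (1.0 - discount) * HOURS_PER_MONTH


def on_demand_monthly(rate_per_hour: float, utilization: float) -> float:
    """On-demand with ideal stop/start: only utilized hours are billed (assumption)."""
    return rate_per_hour * utilization * HOURS_PER_MONTH


def breakeven_utilization(discount: float) -> float:
    """Utilization at which the two costs above are equal: 1 - discount."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be a fraction in [0, 1)")
    return 1.0 - discount


if __name__ == "__main__":
    for discount in (0.30, 0.50):
        print(f"{discount:.0%} discount pays off above "
              f"{breakeven_utilization(discount):.0%} utilization")
```

Under these assumptions a 30% discount needs sustained utilization above 70% to pay off, which is why measuring on-demand first is the safer order.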
Does GPU rental pricing include networking and storage? Usually not. Egress, persistent storage, and inter-region traffic are billed separately on all four providers covered here. For RAG or multi-modal workloads, network and storage can add 10-20% to the GPU bill. Read each provider's full schedule before forecasting.
Colin Mo
