Cheapest H100 Cloud Providers for LLM Inference in 2026

June 11, 2026

The H100 GPU market in 2026 spans 45 or more providers, with on-demand rates ranging from $1.38/hr to $14.90/hr for the same 80GB hardware. The average across all providers is $3.35/hr. Paying average rates for LLM inference means spending two to three times what the lowest-cost reliable options charge. The gap between the cheapest and the most expensive on-demand H100 access is roughly an 11x spread ($14.90 versus $1.38), representing tens of thousands of dollars per month for teams running sustained production workloads.

  • The lowest published on-demand H100 rate is $1.38/hr from Thunder Compute, followed by GMI Cloud at $2.00/hr bare metal. Spot and marketplace pricing goes lower: Vast.ai marketplace hosts list H100 instances from $0.34/hr with variable reliability.
  • Hyperscalers are not competitive on H100 pricing for LLM inference. AWS H100 on P5 instances runs approximately $3.90/hr after the June 2025 price cut. Azure ND H100 v5 runs approximately $5.40/hr. GCP A3 runs approximately $3.00/hr. Specialized providers are 40 to 60 percent cheaper at equivalent hardware and reliability.
  • The cheapest provider for your workload depends on three factors: whether you need on-demand or can tolerate spot interruptions, whether you need managed inference or raw GPU access, and whether egress fees and storage costs are part of your workload profile.
  • GMI Cloud at $2.00/hr H100 bare metal provides the best combination of price, reliability, managed inference library (100-plus models, no Docker required), and zero egress fees. Bare metal delivery eliminates the 10 to 15 percent hypervisor overhead that inflates effective cost on virtual instances.
  • At production LLM inference throughput (batch size 32, continuous batching, 70 percent utilization), a $2.00/hr H100 produces approximately $0.19 to $0.28 per million output tokens. That is the effective cost benchmark against which per-token API rates should be compared, not the headline GPU rate.
  • Market median estimates range from $2.29 to $3.12/hr. Teams paying more than $2.50/hr for on-demand H100 inference on a specialized provider in 2026 are overpaying unless specific ecosystem requirements justify the premium.

The Full H100 Price Spectrum in 2026

The H100 market in June 2026 has settled into four distinct price tiers, each with different reliability characteristics, suitable workloads, and hidden cost profiles.

Tier 1: Marketplace and spot pricing ($0.34 to $1.87/hr)

Vast.ai's peer-to-peer marketplace lists H100 instances from third-party hosts starting at $0.34/hr on spot-equivalent pricing. Spheron's spot tier starts at $1.03/hr. RunPod Community Cloud H100, from vetted but independent hosts, sits just above this band at around $1.99/hr. These are the cheapest H100 rates available, but they come with meaningful caveats: host reliability varies, availability fluctuates with demand, and performance is not guaranteed to be consistent across hosts. For checkpoint-friendly training jobs that tolerate interruption, this tier provides genuine cost savings. For production inference with latency SLAs, this tier introduces unacceptable risk.

Tier 2: Budget on-demand specialized providers ($1.38 to $2.50/hr)

Thunder Compute lists H100 at $1.38/hr, currently the lowest published on-demand rate across tracked providers. GMI Cloud bare metal runs $2.00/hr. Spheron on-demand runs $2.01/hr. Nebius is at $2.10/hr. Lambda Labs charges $2.49/hr. RunPod on-demand sits between around $1.99/hr (Community Cloud) and $2.69/hr (Secure Cloud). This tier provides on-demand reliability without the variability of marketplaces, at 35 to 60 percent below hyperscaler rates.

Tier 3: Established specialized providers ($2.50 to $3.50/hr)

This tier includes Hyperstack at $2.40 to $2.50/hr and CoreWeave at negotiated rates of typically $2.75 to $4.00/hr. AWS P5, at approximately $3.90/hr after the June 2025 price cut, sits just above the range. The tier offers enterprise SLAs, Kubernetes-native infrastructure, and ecosystem integrations that justify the premium for specific workloads.

Tier 4: Hyperscaler on-demand ($3.00 to $14.90/hr)

This tier covers AWS P5 instances at $3.90/hr, GCP A3 at $3.00 to $5.00/hr, and Azure ND H100 v5 at $5.40/hr. The maximum Azure pricing for managed H100 configurations reaches $14.90/hr on some instance types. These rates include ecosystem integration, enterprise SLAs, and managed infrastructure services that pure GPU providers do not offer. The premium is justified only when those ecosystem benefits reduce other costs or accelerate development by more than the rate difference.

Provider-by-Provider Breakdown: The Cheapest H100 Options for LLM Inference

Thunder Compute: $1.38/hr (Lowest Published On-Demand Rate)

Thunder Compute advertises H100 at $1.38/hr, 2.8x cheaper than AWS by their own comparison. The platform targets ML teams with straightforward on-demand GPU access and no commitment. For teams that need access to H100 hardware at the absolute lowest on-demand rate without marketplace reliability concerns, Thunder Compute is worth evaluating directly.

The platform is smaller and less established than GMI Cloud or Lambda Labs, which means the community documentation, production case studies, and infrastructure track record are more limited. For development and training workloads where the primary concern is cost rather than operational history, this matters less. For production inference serving, the infrastructure maturity of the provider is a factor alongside rate.

GMI Cloud: $2.00/hr H100 Bare Metal

GMI Cloud operates as an NVIDIA Reference Platform Partner with H100 PCIe at $2.00/hr and H200 SXM at $2.60/hr on bare metal. The platform combines competitive per-hour pricing with infrastructure decisions that reduce total cost beyond the nominal rate:

Bare metal delivery eliminates hypervisor overhead. Virtual H100 instances on most providers run on a hypervisor that consumes 10 to 15 percent of GPU memory bandwidth. At $3.90/hr for a virtual AWS H100 delivering 87 percent of rated performance, the effective cost per unit of work is approximately $4.48/hr. GMI Cloud's bare metal $2.00/hr delivers 100 percent of rated H100 bandwidth, making the real cost advantage larger than the rate comparison alone suggests.
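The overhead adjustment above is a single division. The sketch below reproduces it under the article's assumption that a virtual instance delivers a fixed fraction of rated performance. The function name and the 0.87 fraction are illustrative, not measured values.

```python
def effective_hourly_rate(list_rate: float, delivered_fraction: float) -> float:
    """Cost per hour of full-rated work, given the fraction of rated
    performance the instance actually delivers (assumed, not measured)."""
    if not 0 < delivered_fraction <= 1:
        raise ValueError("delivered_fraction must be in (0, 1]")
    return list_rate / delivered_fraction


# Figures taken from the paragraph above.
virtual = effective_hourly_rate(3.90, 0.87)     # virtual instance at 87 percent
bare_metal = effective_hourly_rate(2.00, 1.00)  # bare metal at 100 percent

print(f"virtual:    ${virtual:.2f} per full-rated hour")
print(f"bare metal: ${bare_metal:.2f} per full-rated hour")
```

The delivered fraction depends on the workload and the virtualization stack, so it is worth benchmarking the target model rather than relying on a single assumed percentage.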

Zero egress fees. AWS, Azure, and GCP charge $0.08 to $0.12/GB for outbound data transfer. For a team serving large model outputs, downloading model checkpoints, or moving data between training and inference infrastructure, egress costs add 10 to 20 percent to hyperscaler bills. GMI Cloud charges no egress fees.

Managed inference library. For teams serving standard open-weight models, the Inference Engine provides a serverless API with over 100 models (Llama 3.3 70B, DeepSeek V3, Qwen3, Kimi K2.6) at per-token rates from $0.10 per million input tokens. No model deployment required. This eliminates the infrastructure management cost of maintaining a custom serving stack.

Pre-installed inference stack. Dedicated clusters ship with vLLM, TensorRT-LLM, and Triton Inference Server pre-configured on CUDA 12.x, reducing environment setup time from days to hours.

Production results validate the infrastructure efficiency: Higgsfield achieved 65 percent lower P99 inference latency and 45 percent lower compute cost versus their prior provider. Mirelo AI cut training costs 40 percent and reduced training time 20 percent.

Vast.ai: From $1.87/hr (Marketplace, Variable)

Vast.ai operates a peer-to-peer marketplace where independent GPU hosts list capacity. H100 instances appear from $1.87/hr on-demand equivalent pricing, with spot-equivalent listings as low as $0.34/hr. The market pricing model means hosts compete, driving rates below what any single managed provider can sustainably offer.

The tradeoff is reliability. Host performance, uptime, and network connectivity vary. Vast.ai's platform tools (host verification scores, uptime history, connectivity speed) help filter for quality, but a marketplace cannot guarantee the consistency of owned infrastructure. For checkpointed training experiments, Vast.ai is the most cost-efficient H100 access available. For production inference with latency SLAs, the host variability introduces unacceptable risk.

Lambda Labs: $2.49/hr

Lambda Labs offers H100 PCIe at $2.49/hr with a clean managed environment, no egress fees, and reliable inventory management. Lambda's reputation for stable H100 availability (better than many spot-heavy alternatives) and straightforward developer experience justifies the $0.49/hr premium over GMI Cloud for teams that prioritize operational simplicity over absolute cost minimization.

Lambda does not offer spot instances or serverless inference, which limits billing flexibility for variable workloads. For teams running continuous production inference at a fixed utilization target, Lambda's on-demand pricing and management quality make it a reliable option in the $2.50/hr range.

Nebius: $2.10/hr (EU Infrastructure)

Nebius provides H100 at $2.10/hr with European data center presence, making it the lowest-cost EU-jurisdiction H100 option for teams with GDPR data residency requirements. For workloads where the CLOUD Act exposure from US-headquartered providers is a compliance concern, Nebius offers competitive pricing alongside EU-native infrastructure.

NVIDIA Inception members access up to $150,000 in Nebius AI Lift credits, making it the most accessible large-credit EU GPU program available.

AWS P5 (After June 2025 Price Cut): ~$3.90/hr

AWS reduced P5 H100 instance pricing by 44 percent in June 2025, moving from approximately $7.00/hr to approximately $3.90/hr per GPU. This positions AWS as competitive with the upper end of specialized providers while delivering the full AWS ecosystem integration that teams with existing AWS infrastructure rely on.

Spot pricing on P5 instances (when available) reduces effective rates to approximately $1.60 to $2.00/hr, competitive with specialized provider on-demand rates. Reserved P5 instances with a 1-year commitment land at approximately $1.90 to $2.10/hr, the range where AWS becomes cost-competitive with GMI Cloud for teams that can commit to sustained utilization.

The ecosystem tradeoff: AWS at $3.90/hr on-demand plus $0.08 to $0.12/GB egress plus virtual instance overhead produces a total effective cost that specialized providers undercut significantly. For teams whose workflow is already deeply integrated with AWS services (VPC networking, IAM, S3, SageMaker), the premium may be justified. For teams running GPU inference as the primary workload, it typically is not.

The Real Cost Comparison: Effective Cost Per Million Output Tokens

The hourly GPU rate matters, but for LLM inference workloads the relevant metric is cost per million output tokens. This calculation accounts for throughput, utilization, and billing efficiency in ways that headline rates do not.

Methodology: H100 running Llama 3.3 70B at FP8 with vLLM continuous batching at batch size 32, 70 percent average GPU utilization, 730 hours per month.

| Provider | H100 Rate | Monthly GPU Cost | Throughput Est. | Effective Cost/M Output Tokens |
| --- | --- | --- | --- | --- |
| Thunder Compute | $1.38/hr | $1,007/month | ~2,500 tok/s | $0.13 |
| GMI Cloud (bare metal) | $2.00/hr | $1,460/month | ~3,000 tok/s | $0.15 |
| Nebius | $2.10/hr | $1,533/month | ~2,500 tok/s | $0.19 |
| Lambda Labs | $2.49/hr | $1,817/month | ~2,500 tok/s | $0.22 |
| RunPod Secure Cloud | $2.69/hr | $1,964/month | ~2,400 tok/s | $0.25 |
| AWS P5 (reserved 1yr) | $2.10/hr | $1,533/month | ~2,100 tok/s (VM) | $0.22 |
| AWS P5 (on-demand) | $3.90/hr | $2,847/month | ~2,100 tok/s (VM) | $0.41 |
| Azure ND H100 v5 | $5.40/hr | $3,942/month | ~2,100 tok/s (VM) | $0.57 |

GMI Cloud's bare metal throughput advantage (approximately 15 to 20 percent higher effective throughput due to hypervisor elimination) means the $2.00/hr rate competes closely with Thunder Compute's $1.38/hr despite the higher nominal rate. The bare metal advantage matters less for cache-bound workloads and more for latency-constrained single-request serving.

For comparison, per-token API options against self-hosted H100 capacity:

| Option | Cost/M Output Tokens | Suitable Volume |
| --- | --- | --- |
| Groq (free tier) | $0 | Under 14,400 req/day |
| GMI Cloud Inference Engine | $0.60 | Under 100M tokens/month |
| Together AI | $0.88 | Under 80M tokens/month |
| Self-hosted GMI Cloud H100 | $0.15 to $0.28 | Above 100M tokens/month |
| Self-hosted Thunder Compute | $0.13 to $0.19 | Above 100M tokens/month |

Hidden Costs That Change the Real Comparison

Egress fees. AWS, Azure, and GCP charge $0.08 to $0.12/GB for outbound data. A production LLM inference endpoint generating 100 GB of output daily incurs $8 to $12 in daily egress fees, or $240 to $360 per month, before any GPU costs. GMI Cloud, Lambda Labs, RunPod, and most specialized providers charge no egress fees. Over a 12-month production period at this egress volume, the hidden cost difference is $2,880 to $4,320 per endpoint compared to a zero-egress provider.
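The egress arithmetic can be checked with a few lines. This is a minimal sketch that assumes a flat per-GB price and a 30-day month. Real hyperscaler egress pricing is usually tiered by volume and region, which this ignores.

```python
def egress_cost(gb_per_day: float, price_per_gb: float, days: int = 30) -> float:
    """Flat-rate egress cost over a period (tiered pricing is not modeled)."""
    return gb_per_day * price_per_gb * days


DAILY_OUTPUT_GB = 100  # volume used in the paragraph above

for price in (0.08, 0.12):
    monthly = egress_cost(DAILY_OUTPUT_GB, price)
    yearly = monthly * 12
    print(f"${price:.2f}/GB: ${monthly:,.0f}/month, ${yearly:,.0f}/year")
```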

Hypervisor overhead. Virtual machine H100 instances deliver approximately 85 to 90 percent of rated GPU memory bandwidth through the hypervisor layer. For memory-bandwidth-bound LLM decode workloads, this translates directly into lower throughput and higher effective cost per token. A $3.90/hr virtual AWS H100 with 87 percent effective bandwidth costs approximately $4.48 per hour of full-rated compute ($3.90 / 0.87). GMI Cloud bare metal at $2.00/hr costs $2.00 per hour of full-rated compute.

Storage. Network-attached storage for model weights (roughly 70 GB for Llama 3.3 70B at FP8, up to roughly 700 GB for a 671B-parameter model such as DeepSeek V3 at FP8) at $0.10 to $0.20/GB/month adds $7 to $140/month per model depending on model size and provider. For teams running multiple models, storage costs become material.

Billing minimums. Providers that bill per hour (rather than per minute or per second) charge for full hours on every session. A training run that finishes in 47 minutes pays for 60 minutes. For teams running many short jobs daily, per-minute or per-second billing produces meaningful savings. GMI Cloud bills per minute; most hyperscalers bill per second; some smaller providers still use hourly minimums.
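The effect of billing granularity on the 47-minute job above can be sketched as follows. The $2.00/hr rate is used purely for illustration, and the function assumes the provider rounds each session up to the next full increment, which may not match every provider's policy.

```python
def billed_cost(runtime_seconds: int, hourly_rate: float, increment_seconds: int) -> float:
    """Cost of one session when runtime is rounded up to the billing increment."""
    # Integer ceiling division avoids floating-point rounding surprises.
    increments = -(-runtime_seconds // increment_seconds)
    billed_seconds = increments * increment_seconds
    return billed_seconds / 3600 * hourly_rate


RUNTIME = 47 * 60  # the 47-minute run from the paragraph above
RATE = 2.00        # illustrative hourly rate

for label, increment in (("per hour", 3600), ("per minute", 60), ("per second", 1)):
    print(f"{label:>10}: ${billed_cost(RUNTIME, RATE, increment):.2f}")
```

At this rate, hourly billing charges for 13 unused minutes per run, about $0.43, which adds up for teams launching many short jobs each day.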

Choosing the Cheapest H100 Provider for Your Inference Workload

Maximum cost minimization, interruption-tolerant workloads (training, batch inference): Vast.ai marketplace at $0.34 to $1.87/hr or Spheron spot at $1.03/hr deliver the lowest H100 rates available. Requires checkpointing every 15 to 30 minutes and acceptance of occasional interruption.

Lowest cost, on-demand, no interruption risk (development, fine-tuning): Thunder Compute at $1.38/hr is the lowest published on-demand rate. For teams prioritizing raw rate without production SLA requirements, this is the benchmark.

Best cost and performance combination for production LLM inference: GMI Cloud at $2.00/hr bare metal. The combination of bare metal throughput, zero egress fees, managed model library for common inference workloads, and production infrastructure track record makes it the most cost-efficient option for sustained production serving.

Variable traffic with zero idle cost: GMI Cloud's Inference Engine at $0.10 per million input tokens for Qwen3-32B FP8, scaling to zero between requests. More cost-efficient than any on-demand H100 instance at utilization below 30 percent.

EU data residency required: Nebius at $2.10/hr is the lowest-cost EU-hosted H100 option with GDPR-compliant infrastructure.

Deep ecosystem integration with existing AWS infrastructure: AWS P5 reserved at $1.90 to $2.10/hr (1-year commitment) is competitive with specialized providers when the ecosystem integration value justifies the reservation commitment.

Conclusion

The H100 price market in 2026 is competitive enough that teams paying hyperscaler on-demand rates ($3.90 to $5.40/hr) for LLM inference are typically leaving 40 to 60 percent cost savings on the table compared to purpose-built providers. The cheapest on-demand H100 in 2026 is Thunder Compute at $1.38/hr. The cheapest spot H100 is Vast.ai at $0.34/hr with interruption risk.

For production LLM inference specifically, where reliability and consistent throughput matter as much as rate, GMI Cloud at $2.00/hr bare metal delivers the best effective cost per million output tokens among providers with production-grade infrastructure and SLA commitments. The bare metal advantage, zero egress fees, and managed inference library produce total economics that beat every hyperscaler and most specialized providers in sustained production operation.

The right answer for most teams is to use GMI Cloud's Inference Engine for variable traffic workloads and dedicated H100 or H200 clusters for high-utilization sustained serving, with spot-priced Vast.ai instances for checkpoint-friendly training experiments where the lowest possible rate matters more than availability guarantees.

FAQs

What is the cheapest H100 cloud provider for LLM inference in 2026? The absolute lowest published on-demand H100 rate is Thunder Compute at $1.38/hr, approximately 2.8x cheaper than AWS. Spot and marketplace pricing goes lower: Vast.ai lists H100 instances from $0.34/hr with variable host reliability. For production LLM inference where reliability matters alongside cost, GMI Cloud at $2.00/hr bare metal represents the best price-to-performance combination: bare metal delivery (no hypervisor overhead), zero egress fees, and managed inference library for common open-weight models. The market average across 45 tracked providers is $3.35/hr; paying above $2.50/hr for on-demand inference from a specialized provider without specific ecosystem requirements is above the market rate for comparable hardware.

Why are hyperscaler H100 rates so much higher than specialized providers? Three structural factors explain the 2 to 3x premium hyperscalers charge over specialized providers. First, hyperscaler business models are optimized for breadth (hundreds of services) rather than GPU cost leadership. Second, virtual machine infrastructure adds 10 to 15 percent hypervisor overhead that reduces effective GPU performance, inflating the real cost per unit of work. Third, egress fees ($0.08 to $0.12/GB) add 10 to 20 percent to inference bills for workloads that move model outputs or checkpoints. Specialized providers purpose-built for GPU workloads (GMI Cloud, Lambda Labs, Nebius) eliminate the hypervisor overhead with bare metal options and charge no egress fees, so the effective cost gap is larger than the nominal rate difference suggests.

What does H100 spot pricing offer and when is it worth using for LLM inference? Spot and marketplace pricing on H100s can reduce costs by 60 to 90 percent versus on-demand rates. Vast.ai market hosts list H100 instances from $0.34/hr; Spheron spot pricing starts at $1.03/hr; AWS spot pricing runs approximately $1.60 to $2.00/hr. The fundamental constraint is interruption: spot instances can be reclaimed by the provider with minimal notice. For LLM inference production endpoints where users are waiting for responses, spot instances are not viable. For batch LLM inference (processing document queues overnight, generating synthetic data, running evaluation benchmarks) where jobs checkpoint regularly and interruption means resuming rather than failing, spot pricing delivers the lowest possible H100 cost. The rule is: spot for async batch workloads, on-demand for production serving.

How do I calculate the real cost per million tokens for H100 LLM inference? The formula is: (GPU hourly rate × hours per month) / (output tokens per month in millions). For Llama 3.3 70B at FP8 with continuous batching at batch size 32 on a single H100 at 70 percent utilization, average throughput runs approximately 2,000 to 3,000 tokens per second. Taking 2,500 tok/s as the average sustained over all 730 billed hours (2,628,000 seconds): 2,500 tok/s × 2,628,000 seconds = 6.57 billion tokens per month = 6,570 million output tokens. Monthly GPU cost at $2.00/hr: $1,460. Effective cost: $1,460 / 6,570 = $0.22 per million output tokens. This is the number to compare against managed API rates (Together AI $0.88/M, GMI Cloud Inference Engine $0.60/M) to identify the crossover point where dedicated infrastructure beats per-token pricing. At typical production utilization, self-hosted H100 inference on GMI Cloud produces effective output costs well below managed API rates once volume exceeds approximately 80 to 100 million output tokens per month.
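The formula in this answer can be written as a small function. This is a sketch that assumes the throughput figure is the average sustained across every billed hour, so idle time is already folded into it. The function name is illustrative.

```python
HOURS_PER_MONTH = 730


def cost_per_million_tokens(hourly_rate: float,
                            avg_tokens_per_second: float,
                            hours: float = HOURS_PER_MONTH) -> float:
    """(GPU hourly rate x hours) / (output tokens in millions).

    avg_tokens_per_second is assumed to be the average over all billed
    hours, including any idle periods.
    """
    monthly_cost = hourly_rate * hours
    tokens_millions = avg_tokens_per_second * hours * 3600 / 1e6
    return monthly_cost / tokens_millions


# Worked example from the answer above: $2.00/hr at 2,500 tok/s average.
print(f"${cost_per_million_tokens(2.00, 2500):.2f} per million output tokens")
```

Because hours appear in both the numerator and the denominator, the result depends only on the hourly rate and the average throughput. Any drop in sustained throughput raises the effective cost proportionally.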

Does the H200 ever make more sense than the H100 for LLM inference cost efficiency? For 70B to 140B parameter models, yes. The H200's 141 GB VRAM and 4.8 TB/s bandwidth versus the H100's 80 GB and 3.35 TB/s mean the H200 can serve Llama 3.3 70B on a single GPU at FP8 (versus requiring two H100s for some configurations), produce 1.4 to 1.9x higher throughput, and handle longer context windows without degradation. At GMI Cloud's pricing of $2.60/hr for H200 versus $2.00/hr for H100, the H200's throughput advantage for Llama 3.3 70B produces a lower effective cost per million output tokens at the same utilization. For smaller models (7B to 32B) that fit comfortably within H100's 80 GB, the H100 at $2.00/hr is the more cost-efficient choice. The H200 becomes the better buy specifically for models in the 70B to 140B range where memory bandwidth is the primary throughput constraint.
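Whether the H200 wins on cost per token comes down to whether its throughput gain exceeds its price premium. The sketch below uses the rates and the 1.4 to 1.9x throughput range quoted in this answer. The function name is illustrative, and actual speedups depend on model, batch size, and context length.

```python
def relative_cost_per_token(h200_rate: float, h100_rate: float, h200_speedup: float) -> float:
    """H200 cost per token divided by H100 cost per token.

    Values below 1.0 mean the H200 is cheaper per token.
    """
    return (h200_rate / h100_rate) / h200_speedup


H200_RATE, H100_RATE = 2.60, 2.00  # hourly rates quoted above

for speedup in (1.4, 1.9):
    ratio = relative_cost_per_token(H200_RATE, H100_RATE, speedup)
    print(f"{speedup}x throughput: H200 cost per token is {ratio:.2f}x the H100's")

# The H200 breaks even when its speedup equals the price ratio.
print(f"break-even speedup: {H200_RATE / H100_RATE:.2f}x")
```

At these rates the H200 needs more than a 1.3x throughput advantage to come out ahead, which is why it is the weaker choice for small models that see little speedup.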
