Renting NVIDIA H200 on Google Cloud in 2026 Is Less a Price Question Than a Quota and Commitment Question
April 13, 2026
A team prices an H200 instance on Google Cloud, sees a per-hour rate it can live with, and then discovers the rate was never the hard part. On hyperscalers, H200 capacity is gated by regional availability, quota approvals, and the reservation or committed-use discounts that make the headline number real. On Google Cloud, the question is rarely "what does an H200 cost per hour?" It is "can I get one in my region this quarter, and what do I have to commit to in order to keep it?" This article walks through how GCP exposes H200 capacity, what shapes the effective price, and how a single-card on-demand rate on a specialized provider changes that calculus.
How Google Cloud Exposes H200 Capacity
Google Cloud offers NVIDIA H200 GPUs through its A3 Ultra accelerator-optimized instances, sold as part of fixed machine shapes rather than as a single GPU you rent in isolation. That structure matters for inference teams in three ways.
The first is granularity. Hyperscaler GPU instances usually ship in multi-GPU shapes with attached vCPU and memory, so the smallest unit you can rent is often a full node, not one card. A team that needs a single H200 for a 70B model can end up paying for a configuration sized for distributed training.
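The granularity penalty is simple arithmetic. A minimal sketch, assuming a hypothetical 8-GPU shape and a placeholder per-GPU rate (neither figure is taken from a Google Cloud price list):

```python
def cost_per_needed_gpu_hour(rate_per_gpu_hour: float,
                             gpus_in_shape: int,
                             gpus_needed: int) -> float:
    """Hourly bill for the whole shape, divided by the GPUs the workload uses."""
    if not 1 <= gpus_needed <= gpus_in_shape:
        raise ValueError("gpus_needed must be between 1 and gpus_in_shape")
    return rate_per_gpu_hour * gpus_in_shape / gpus_needed


# Placeholder inputs, not quoted prices: an 8-GPU shape at 4.00 per GPU-hour.
full_node_for_one_card = cost_per_needed_gpu_hour(4.00, gpus_in_shape=8, gpus_needed=1)
print(f"{full_node_for_one_card:.2f} per hour for the one GPU actually needed")
```

Substitute the real shape size and rate from the provider's console; the multiplier is what matters, not the placeholder values.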
The second is quota. Access to high-demand accelerators is controlled by per-project, per-region quota that frequently starts at zero. Getting H200 capacity means filing a quota increase, waiting for approval, and accepting that approval is regional. Capacity in one region does not guarantee capacity in another.
The third is the gap between on-demand and committed pricing. The on-demand rate is the most expensive way to run H200 on a hyperscaler. The numbers that make a sustained workload affordable come from one-year or three-year committed-use discounts or from reservations, both of which trade flexibility for a lower effective rate.
The Effective Price Is Not the Sticker Price
For an inference workload that runs continuously, the relevant comparison is the committed rate, not the on-demand one. For a workload that runs in bursts, the relevant comparison is how much idle time you pay for. A team that signs a multi-year commitment to lower its rate, then runs the GPU at 40% utilization, pays 2.5 times the committed rate for every useful hour. Unless the discount exceeds 60% relative to an on-demand option it could have switched off, it has not saved money. It has prepaid for idle hardware.
Put the two failure modes side by side. Take a workload that genuinely needs one H200 around the clock. A committed-use discount that trades a three-year promise for a lower effective rate is rational here, because the card is busy and the commitment matches real demand. Now take a workload that spikes for a few hours a day and sits near zero otherwise. The same committed rate, applied to a card idle 60% of the time, means more than half of every prepaid hour is wasted, and the discount that looked attractive on the rate card quietly inflates the cost per useful hour. The hyperscaler on-demand rate avoids the commitment but sits at the top of the price range. A single-card specialized rate like $2.60/GPU-hour avoids both the commitment and the bundle, but it still bills a whole card by the hour, so it fits best when each session keeps the card busy for sustained stretches. The point is that no single number wins. The right comparison is your measured utilization against each pricing model's idle exposure.
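The comparison above reduces to one quantity: cost per useful hour. A minimal sketch, assuming a committed card is billed whether or not it is busy and an on-demand card is stopped whenever it is idle. The committed rate below is a hypothetical placeholder, not a quoted Google Cloud price; the on-demand rate is the $2.60 figure used in this article.

```python
def cost_per_useful_hour(rate: float, utilization: float, billed_when_idle: bool) -> float:
    """Price paid for each hour the GPU does real work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Committed capacity is paid for around the clock, so idle time inflates the rate.
    return rate / utilization if billed_when_idle else rate


def break_even_utilization(committed_rate: float, on_demand_rate: float) -> float:
    """Utilization above which the commitment beats a stoppable on-demand card."""
    return committed_rate / on_demand_rate


ON_DEMAND_RATE = 2.60  # single-card rate quoted in this article
COMMITTED_RATE = 2.00  # hypothetical placeholder, for illustration only

for utilization in (1.0, 0.4):
    committed = cost_per_useful_hour(COMMITTED_RATE, utilization, billed_when_idle=True)
    on_demand = cost_per_useful_hour(ON_DEMAND_RATE, utilization, billed_when_idle=False)
    print(f"utilization {utilization:.0%}: committed {committed:.2f}, on-demand {on_demand:.2f}")

print(f"break-even utilization: {break_even_utilization(COMMITTED_RATE, ON_DEMAND_RATE):.0%}")
```

With these placeholder inputs the commitment only wins above roughly 77% utilization. The same two functions work with whatever rates your own quotes show.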
What an H200 Actually Delivers for Inference
Before comparing platforms, it helps to fix what the card itself provides, because that is constant across providers.
| Spec | NVIDIA H200 SXM5 | Why it matters for inference |
|---|---|---|
| VRAM | 141GB HBM3e | Holds 70B weights on one card: about 140GB in FP16 (minimal KV cache headroom), about 70GB in FP8 (ample headroom) |
| Memory bandwidth | 4.80 TB/s | Sets token generation speed for memory-bound decoding |
| Availability (GMI Cloud SLA) | 99.99% | Determines how much idle reserved capacity you plan for |
| GMI Cloud on-demand price | $2.60/GPU-hour | Single-card baseline with no multi-year commitment |
| AWS p5e (H200) reference | ~$4.98/GPU-hour | Hyperscaler on-demand point of comparison |
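The H200's defining trait for inference is its 141GB of HBM3e at 4.80 TB/s. That capacity lets a single card hold the weights of a 70B model in FP16, about 140GB at 2 bytes per parameter, though that leaves only about 1GB for the key-value cache. Quantized to FP8, the same model takes about 70GB and leaves roughly half the card for cache. FP16 is the case where the older 80GB class forces you into tensor parallelism across two cards.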
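Whether a model fits on one card is worth checking with arithmetic before sizing a deployment. A rough sketch that counts weights only, assuming decimal gigabytes and ignoring activations, runtime buffers, and framework overhead, all of which reduce the real headroom further:

```python
H200_VRAM_GB = 141  # from the spec table above


def weight_footprint_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weights only: 1e9 parameters at N bytes each is N decimal gigabytes."""
    return params_billion * bytes_per_param


def kv_headroom_gb(params_billion: float, bytes_per_param: float,
                   vram_gb: float = H200_VRAM_GB) -> float:
    """Memory left for KV cache and everything else after loading weights."""
    return vram_gb - weight_footprint_gb(params_billion, bytes_per_param)


for label, bytes_per_param in (("FP16", 2), ("FP8", 1)):
    weights = weight_footprint_gb(70, bytes_per_param)
    headroom = kv_headroom_gb(70, bytes_per_param)
    print(f"70B in {label}: weights {weights:.0f} GB, headroom {headroom:.0f} GB")
```

The headroom figure sets the ceiling on context length and concurrent sequences, so the choice of precision matters as much as the card.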
GMI Cloud's bare metal H200 instances at $2.60/GPU-hour deliver 100% of the advertised 4.80 TB/s memory bandwidth with no hypervisor overhead, which is the bandwidth figure that token generation speed depends on. GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware.
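The link between bandwidth and token generation speed can be approximated with a roofline-style bound. This is a simplification, assuming batch size 1 and that each generated token requires reading every weight from memory once; it ignores KV cache reads, batching, and kernel efficiency, so real throughput for a single sequence will be lower.

```python
def decode_tokens_per_second_upper_bound(bandwidth_tb_s: float, weight_gb: float) -> float:
    """Memory-bound ceiling: bytes available per second / bytes read per token."""
    if weight_gb <= 0:
        raise ValueError("weight_gb must be positive")
    return bandwidth_tb_s * 1000 / weight_gb  # 1 TB = 1000 GB


H200_BANDWIDTH_TB_S = 4.80  # from the spec table above

for label, weight_gb in (("70B FP16", 140), ("70B FP8", 70)):
    bound = decode_tokens_per_second_upper_bound(H200_BANDWIDTH_TB_S, weight_gb)
    print(f"{label}: at most about {bound:.0f} tokens/s per sequence")
```

Batching raises aggregate throughput well above this per-sequence figure, because one pass over the weights serves every sequence in the batch.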
Reserved Hyperscaler Capacity and On-Demand Specialized Capacity Are Different Products
It is easy to read a GCP committed rate and a specialized provider's on-demand rate as competing prices for the same thing. They are not the same product.
A committed-use discount on Google Cloud buys a lower rate in exchange for a usage promise measured in years. It rewards steady, predictable load and integrates tightly with the rest of a GCP-native stack. On-demand H200 on a specialized inference cloud buys flexibility: you start and stop without a multi-year obligation, and the per-hour number is the number. The first is better when your workload is steady and already lives inside Google Cloud. The second is better when your demand is uncertain or you want to avoid a long commitment before your traffic pattern is proven.
Where Each Option Fits
The right choice tracks your traffic shape and your existing stack more than any single rate.
- Best for teams already standardized on Google Cloud with steady load: GCP A3 Ultra H200 under a committed-use discount, where the lower effective rate and tight integration outweigh the commitment.
- Best for teams that want a single H200 without a multi-year commitment: an on-demand specialized provider, where $2.60/GPU-hour is the rate and you can stop when traffic does.
- Best for variable, API-driven inference traffic: serverless inference with scale-to-zero, so idle hours are not billed at all.
- Not ideal for one-off experiments needing instant capacity: hyperscaler H200 behind a zero-start quota, where approval timing can block you for days.
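The list above can be read as a small decision rule. The sketch below is one possible encoding, not a definitive policy: the field names and the order in which conditions are checked are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # Hypothetical fields chosen to mirror the list above.
    on_gcp_stack: bool         # already standardized on Google Cloud
    steady_load: bool          # predictable, near-continuous demand
    api_driven_bursty: bool    # variable, request-driven traffic
    needs_capacity_now: bool   # cannot wait for a quota approval


def suggest_option(w: Workload) -> str:
    """Map a traffic shape to the option the article pairs it with."""
    if w.api_driven_bursty:
        return "serverless inference with scale-to-zero"
    if w.on_gcp_stack and w.steady_load and not w.needs_capacity_now:
        return "GCP A3 Ultra H200 under a committed-use discount"
    return "on-demand single H200 on a specialized provider"


print(suggest_option(Workload(True, True, False, False)))
print(suggest_option(Workload(False, False, False, True)))
```

Real decisions involve more inputs (data gravity, compliance, existing discounts), so treat the function as a checklist rather than an answer.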
GMI Cloud is best suited for AI teams that want H200-class inference capacity at a fixed on-demand rate without negotiating quota increases or committing to multi-year reservations first.
Where to Confirm Capacity and Rates
Quota and pricing on hyperscalers change by region and quarter, so the only reliable check is the live console for each provider. On the GMI Cloud side, current H200 availability and the $2.60/GPU-hour rate are listed at gmicloud.ai/en/pricing, the deployable model library and instance launch flow live at console.gmicloud.ai, and integration details are documented at docs.gmicloud.ai. Confirming the live numbers before you commit is the difference between a planned budget and a surprised one.
Check the Region and the Commitment Before You Check the Rate
The per-hour number on a GCP H200 instance is the easy part to read and the least likely part to decide your bill. What decides it is whether you can get the card in your region, whether you commit for years to lower the rate, and how much of that reserved capacity sits idle. Size your real utilization first, confirm regional availability second, and treat the headline rate as the last variable, not the first.
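Sizing real utilization can start from something as simple as a log of when the GPU was busy. A minimal sketch, assuming you can export busy periods as (start, end) pairs in hours; the interval format and the sample values are hypothetical.

```python
def measured_utilization(busy_intervals, window_start: float, window_end: float) -> float:
    """Fraction of the window during which the GPU was busy, merging overlaps."""
    if window_end <= window_start:
        raise ValueError("window_end must be after window_start")
    # Clip every interval to the window and drop those entirely outside it.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in busy_intervals
        if end > window_start and start < window_end
    )
    busy = 0.0
    current_start = current_end = None
    for start, end in clipped:
        if current_end is None or start > current_end:
            # Gap found: close the previous merged interval and open a new one.
            if current_end is not None:
                busy += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        busy += current_end - current_start
    return busy / (window_end - window_start)


# Hypothetical day: two overlapping daytime jobs and one evening job.
sample_day = [(9, 12), (11, 14), (20, 22)]
print(f"utilization: {measured_utilization(sample_day, 0, 24):.0%}")
```

Run it over several representative weeks rather than one day; the result is the utilization figure to hold against each pricing model.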
Colin Mo
