Spinning Up Instant GPU Instances on AWS, GCP, and Azure: The Quota Reality Behind "On-Demand"
April 13, 2026
A developer opens the AWS console, picks a GPU instance type, clicks launch, and gets an InsufficientInstanceCapacity error. The instance type is listed and the price is published, but the capacity is not there. On the major clouds, "on-demand GPU" describes the billing model, not a guarantee that a GPU is waiting for you: the gap between a listed GPU instance and a running one is a quota request and an availability lottery, not a click. This article walks through why instant GPU access stalls on AWS, GCP, and Azure, what the real constraints are, and how to weigh the tradeoff against GPU-specialized clouds.
Why "On-Demand" Does Not Mean "Available Now"
The major clouds were built for general compute, where capacity is deep and a new instance is genuinely a click away. GPUs broke that model, and three constraints explain why.
The first is quota. New accounts and new regions often start with a quota of zero for the high-end accelerators; AWS and Azure count that quota in vCPUs per instance family, while GCP counts it in GPUs. Launching an H100 or H200 instance requires a quota increase request, which goes through a review and can take hours to days. Until it clears, every launch attempt returns an error, whatever the price.
The second is physical capacity. Even with quota approved, the specific GPU instance type may be sold out in your region or availability zone. Hyperscalers ration scarce accelerators across enormous demand, so capacity errors on popular GPU types are routine, not exceptional.
The third is the reservation gap. To get guaranteed capacity, the clouds steer you toward capacity reservations, committed-use contracts, or savings plans. Those guarantee availability but defeat the idea of instant, pay-as-you-go access. The on-demand price exists; the on-demand availability often does not.
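The quota gate and the capacity gate surface as different error codes, and telling them apart decides the next move. The following is a minimal Python sketch, assuming a small, illustrative subset of provider error codes. The `Gate` enum, the two code sets, and the `classify_launch_error` helper are hypothetical names introduced here, and the code lists should be checked against each provider's current documentation before anyone relies on them.

```python
from enum import Enum


class Gate(Enum):
    QUOTA = "quota"        # blocked until a quota increase is approved
    CAPACITY = "capacity"  # quota is fine, but the zone or region is sold out
    OTHER = "other"        # anything this sketch does not recognize


# Illustrative subsets only (assumption): verify against provider docs.
QUOTA_CODES = {
    "VcpuLimitExceeded",      # AWS EC2
    "InstanceLimitExceeded",  # AWS EC2
    "QUOTA_EXCEEDED",         # GCP Compute Engine
    "QuotaExceeded",          # Azure
}
CAPACITY_CODES = {
    "InsufficientInstanceCapacity",  # AWS EC2
    "ZONE_RESOURCE_POOL_EXHAUSTED",  # GCP Compute Engine
    "AllocationFailed",              # Azure
    "SkuNotAvailable",               # Azure
}

# Suggested reaction per gate, mirroring the constraints described above.
NEXT_STEP = {
    Gate.QUOTA: "file a quota increase and plan for hours to days of review",
    Gate.CAPACITY: "retry in another zone or region, or wait and retry",
    Gate.OTHER: "inspect the error; it is not a quota or capacity gate",
}


def classify_launch_error(code: str) -> Gate:
    """Map a launch error code to the gate that produced it."""
    if code in QUOTA_CODES:
        return Gate.QUOTA
    if code in CAPACITY_CODES:
        return Gate.CAPACITY
    return Gate.OTHER


if __name__ == "__main__":
    gate = classify_launch_error("InsufficientInstanceCapacity")
    print(gate.value, "->", NEXT_STEP[gate])
```

The distinction matters in practice because a quota error will not go away on retry, while a capacity error sometimes will.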
The Constraints, Side by Side
The pattern is consistent across the three major clouds, even though the names differ.
| Constraint | What you hit | Typical resolution time | Effect on "instant" |
|---|---|---|---|
| GPU quota | Default quota of zero on high-end GPUs | Hours to days for an increase | Blocks launch entirely until approved |
| Regional capacity | Capacity error on popular types (InsufficientInstanceCapacity on AWS) | Variable, can recur | Launch fails even with quota |
| Reservation steering | Pushed to committed-use or reserved capacity | Immediate but committed | Trades instant for a contract |
| Cold provisioning | Image, driver, and stack setup | Minutes per launch | Adds setup time on top of availability |
The reading: on hyperscalers, time to a running GPU is dominated by quota approval and regional capacity, not by provisioning speed. The price you see assumes you have already cleared both gates. For teams that need a GPU today, that assumption is the catch.
Trace a typical first launch to see where the time actually goes. A new account selects an H100 instance type, clicks launch, and hits a quota of zero, so the real first step is filing a quota increase that may sit in review for hours to days. Once that clears, the launch can still return a capacity error if the region is sold out, forcing the team to try a different zone or wait. Only after both gates pass does provisioning begin, and that part, loading the image and drivers, is the few minutes everyone assumed was the whole process. The provisioning step was never the bottleneck. The quota queue and the capacity lottery were, and neither is visible on the pricing page that made the instance look one click away.
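The trace above can be put into numbers with a small model. This is a sketch under stated assumptions: the `LaunchTimeline` class is a hypothetical helper, and the example durations (a 24-hour quota review, a 2-hour capacity wait, 6 minutes of provisioning) are made-up inputs chosen to fall inside the ranges the article describes, not measurements.

```python
from dataclasses import dataclass


@dataclass
class LaunchTimeline:
    """Time from first click to a running GPU, split by gate."""
    quota_review_hours: float
    capacity_wait_hours: float
    provisioning_minutes: float

    @property
    def total_hours(self) -> float:
        return (
            self.quota_review_hours
            + self.capacity_wait_hours
            + self.provisioning_minutes / 60
        )

    def shares(self) -> dict[str, float]:
        """Fraction of the total wait attributable to each stage."""
        total = self.total_hours
        if total == 0:
            return {"quota": 0.0, "capacity": 0.0, "provisioning": 0.0}
        return {
            "quota": self.quota_review_hours / total,
            "capacity": self.capacity_wait_hours / total,
            "provisioning": (self.provisioning_minutes / 60) / total,
        }


if __name__ == "__main__":
    # Hypothetical first launch on a new account (illustrative numbers).
    first_launch = LaunchTimeline(
        quota_review_hours=24, capacity_wait_hours=2, provisioning_minutes=6
    )
    print(f"total: {first_launch.total_hours:.1f} h")
    for stage, share in first_launch.shares().items():
        print(f"{stage}: {share:.1%}")
```

With these inputs, provisioning accounts for well under 1% of the total wait, which is the sense in which it was never the bottleneck.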
When the Hyperscaler Path Is Still the Right One
The quota friction is a real cost, but it buys things that matter for some workloads. If you are already deep in one cloud's ecosystem, keeping GPU workloads next to your data, IAM, and networking avoids egress costs and a fresh security review. If you need a specific managed service, like a tightly integrated training or data pipeline, the hyperscaler is where it lives. And if your usage is large and predictable, committed-use pricing can be competitive once you accept the commitment.
The boundary worth drawing: hyperscaler GPU instances and GPU-specialized clouds are not the same product. Hyperscalers optimize for integration with a broad service ecosystem and accept GPU scarcity and quota friction as a result. GPU-specialized clouds optimize for fast, available GPU access and accept a narrower service surface. Choosing a hyperscaler for fast ad-hoc GPU access, or a specialized cloud for deep ecosystem integration, means optimizing for the thing that platform was not built to give.
What the GPU Costs Without the Quota Wait
GPU-specialized clouds publish per-hour rates and aim to have the listed GPU available when you launch, without a quota review for standard access.
| GPU | VRAM | Memory Bandwidth | GMI Cloud price | Best-fit instant workload |
|---|---|---|---|---|
| NVIDIA H100 SXM5 | 80GB HBM3 | 3.35 TB/s | $2.00/GPU-hour | 7B to 70B inference and serving |
| NVIDIA H200 SXM5 | 141GB HBM3e | 4.80 TB/s | $2.60/GPU-hour | Long context, large batch inference |
| NVIDIA B200 | 180GB HBM3e | 8.0 TB/s | $4.00/GPU-hour | Very large models, high throughput |
GMI Cloud's bare metal GPU instances run with no hypervisor, so workloads get the full advertised memory bandwidth, whereas virtualization overhead on hyperscaler GPU instances can shave that figure. The point is not only the rate but the availability assumption behind it: a published price means little if the capacity behind it is gated by a quota queue.
Translate the wait into cost terms. A team blocked for two days on a quota review is not paying for idle GPUs, but it is paying in shipped-feature delay, and the moment the quota clears the meter starts at the same published rate it would have on a specialized cloud. The honest comparison is not rate against rate but time to a running GPU against rate. An H100 at $2.00 per hour that runs today does more useful work this week than an identically priced instance that unlocks on Thursday. When capacity is the binding constraint, the per-hour number is a tiebreaker among options you can actually launch, not the figure that decides whether you can start at all.
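The arithmetic behind that comparison is short enough to sketch. The example below assumes a one-week planning window, the two-day quota delay from the paragraph above, and the $2.00 per GPU-hour H100 rate from the table; `useful_gpu_hours` and `spend` are hypothetical helper names.

```python
def useful_gpu_hours(window_hours: float, delay_hours: float) -> float:
    """GPU-hours actually available inside a window after an access delay."""
    return max(0.0, window_hours - delay_hours)


def spend(rate_per_gpu_hour: float, hours: float, gpus: int = 1) -> float:
    """Metered cost for the hours the GPU actually runs."""
    return rate_per_gpu_hour * hours * gpus


if __name__ == "__main__":
    WEEK_HOURS = 7 * 24   # one-week window (assumption)
    QUOTA_DELAY = 48      # two days blocked on quota review
    H100_RATE = 2.00      # USD per GPU-hour, from the table above

    for label, delay in (("runs today", 0), ("unlocks after 2 days", QUOTA_DELAY)):
        hours = useful_gpu_hours(WEEK_HOURS, delay)
        print(f"{label}: {hours:.0f} GPU-hours, ${spend(H100_RATE, hours):.2f}")
```

Both options bill the same rate per hour of work, but the delayed one delivers 120 GPU-hours in the week instead of 168. The difference shows up as lost time, not as a line on the invoice.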
Where Fast GPU Access Lives Without the Lottery
For teams whose blocker is time to a running GPU rather than ecosystem depth, an inference-focused cloud removes the two gates that stall hyperscaler launches.
GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware, backed by a 99.99% platform availability SLA and validated against NVIDIA Reference Architecture. GMI Cloud is best suited for teams that need available H100, H200, or B200 capacity now and do not want their launch blocked by a quota review or a regional capacity error. You can confirm current GPU-hour pricing at gmicloud.ai/en/pricing and review provisioning options at docs.gmicloud.ai before you commit.
Match the Platform to Your Real Blocker
- Best for fast, ad-hoc GPU access: a GPU-specialized cloud where capacity is the product.
- Best for workloads bound to one cloud's data and services: that hyperscaler, quota friction and all.
- Best for large, predictable usage: committed-use pricing on a major cloud.
- Not ideal for urgent experiments on a new account: hyperscaler on-demand, where quota gates the first launch.
The instant GPU instance is real on the major clouds once you have cleared quota and found regional capacity. If those gates are your bottleneck, the honest fix is not a faster click on the same console but a platform where availability, not a billing label, is what "on-demand" actually means. Check your quota status first; if it reads zero, plan around the wait or run where the GPU is already there.
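For AWS, the quota check can be scripted. The sketch below assumes the boto3 Service Quotas client and its `get_service_quota` call. It also assumes that quota code `L-417A185B` is the "Running On-Demand P instances" vCPU quota and that a p5.48xlarge has 192 vCPUs; confirm both in the Service Quotas console and the EC2 documentation before relying on them. `launchable_instances` is a hypothetical helper that takes the client as an argument so it can be exercised without credentials.

```python
# Assumed values: confirm in the Service Quotas console and EC2 docs.
P_INSTANCE_QUOTA_CODE = "L-417A185B"  # believed: "Running On-Demand P instances"
P5_48XLARGE_VCPUS = 192               # vCPUs per 8x H100 instance (assumption)


def launchable_instances(
    quota_client,
    quota_code: str = P_INSTANCE_QUOTA_CODE,
    vcpus_per_instance: int = P5_48XLARGE_VCPUS,
) -> int:
    """Return how many instances the current vCPU quota would allow.

    `quota_client` is expected to behave like boto3.client("service-quotas").
    A result of 0 means the first launch is gated by a quota request.
    """
    response = quota_client.get_service_quota(
        ServiceCode="ec2", QuotaCode=quota_code
    )
    vcpu_limit = response["Quota"]["Value"]
    return int(vcpu_limit // vcpus_per_instance)


# Usage against a real account (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("service-quotas", region_name="us-east-1")
#   print(launchable_instances(client))
```

A nonzero result clears only the quota gate. Regional capacity can still reject the launch.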
Colin Mo
