Google Cloud Accelerator Pricing Looks Like One Menu but Hides Three Different Buying Decisions
April 13, 2026
A team opens the Google Cloud accelerator catalog expecting to pick a GPU and finds A3 instances, A4 instances, and several TPU generations, each priced on its own logic. The instinct is to sort by hourly rate and pick the cheapest. That instinct often picks the wrong accelerator, because A3, A4, and TPU are not three price points on one product. They are three different products aimed at different workloads. The accelerator that costs less per hour can cost more per inference if it is the wrong architecture for your model. This article separates the three buying decisions inside GCP accelerator pricing and shows where a focused GPU cloud sets the cost reference.
The Three Families and What Each Is Priced For
Google Cloud groups its accelerators into families whose pricing reflects what they were built to do, not a simple performance ladder. Understanding the split is the prerequisite to reading any rate.
A3 Is the H100 Generation
A3 instances are built around NVIDIA H100 GPUs and represent the mainstream production-inference and training tier on GCP. They run the CUDA stack your team likely already uses, which means no porting work and broad framework support. A3 pricing reflects on-demand access to H100-class hardware inside a large general-purpose cloud, with the elasticity and compliance that implies, and the overhead that comes with it.
A4 Is the Blackwell Generation
A4 instances move to NVIDIA's newer Blackwell-generation GPUs, the B200 class. They carry more memory and bandwidth per card and native support for lower-precision formats like FP4, which raises effective throughput for large models. A4 is priced above A3 because the silicon is newer and faster, and it is the right tier when your model is large enough or your throughput target high enough to use the extra capacity.
TPU Is a Different Architecture Entirely
TPUs are Google's own accelerators, not NVIDIA GPUs. They can be cost-effective for workloads built on frameworks that target them well, particularly JAX and TensorFlow at scale. The catch is the ecosystem boundary: a serving stack written for CUDA does not run on a TPU without rework. TPU pricing can look attractive per unit of compute, but the migration cost is the hidden line item.
Reading the Selection by Constraint
The table below frames the three families by the decision each one answers. Per-card memory and ecosystem are the columns that actually separate them; the last column gives a focused-cloud reference rate for the same hardware class.
| Accelerator family | Hardware class | Per-card memory | Ecosystem | Reference GPU rate |
|---|---|---|---|---|
| GCP A3 | NVIDIA H100 | 80GB HBM3 | CUDA | GMI Cloud H100 $2.00/GPU-hour |
| GCP A4 | NVIDIA B200 | 180GB HBM3e | CUDA | GMI Cloud B200 $4.00/GPU-hour |
| GCP TPU | Google TPU | Varies by generation | JAX / TensorFlow | No direct GPU equivalent |
A few readings are worth making explicit.
- A3 versus A4 is a capacity-and-precision decision, not a cheap-versus-expensive one. Pick A3 for 70B-class CUDA serving; pick A4 when model size or FP4 throughput justifies the newer silicon.
- TPU versus the GPU families is an ecosystem decision. The question is not the rate; it is whether your stack runs on TPU at all without a rewrite.
- The hourly rate is one input, not the answer. The same H100-class card is priced very differently depending on whether you rent it inside a hyperscaler or a focused GPU cloud, which is why the reference column matters.
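The capacity side of the A3-versus-A4 reading can be roughed out with arithmetic. The sketch below is a minimal estimate, not a sizing tool: it takes the per-card memory from the table, assumes standard bytes-per-parameter for each weight format, and reserves a hypothetical 30% of each card for KV cache and runtime overhead. The function name `min_cards` and the `headroom` default are illustrative assumptions.

```python
import math

# Per-card memory in GB, taken from the table above.
CARD_MEMORY_GB = {"H100": 80, "B200": 180}

# Bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def min_cards(params_billion, precision, card, headroom=0.3):
    """Estimate the fewest cards whose combined memory holds the weights.

    headroom is an assumed fraction of each card reserved for KV cache,
    activations, and runtime overhead. It is a placeholder, not a measurement.
    """
    # Billions of parameters times bytes per parameter gives gigabytes.
    weights_gb = params_billion * BYTES_PER_PARAM[precision]
    usable_gb = CARD_MEMORY_GB[card] * (1.0 - headroom)
    return math.ceil(weights_gb / usable_gb)


for precision in ("fp16", "fp8"):
    for card in ("H100", "B200"):
        cards = min_cards(70, precision, card)
        print(f"70B at {precision} on {card}: {cards} card(s)")
```

Under these assumptions a 70B model needs fewer B200-class cards than H100-class cards at the same precision, which is the sense in which the choice is about capacity rather than price. Real deployments depend on context length, batch size, and the serving framework, so treat the output as a first pass.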
The Boundary Most Comparisons Miss
GCP accelerator pricing and total inference cost are not the same thing, and conflating them is the common mistake. The hourly rate is one component. The others are utilization, the elasticity of the platform, and the migration cost of an unfamiliar architecture. A TPU that is cheaper per hour but requires porting a CUDA serving stack can cost more in engineering time than it saves in compute. An A4 that is more expensive per hour but serves a large model at higher throughput can cost less per inference than an A3. The right comparison is cost per unit of work delivered, after accounting for utilization and porting, not the number on the rate card.
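As a rough illustration of cost per unit of work, the sketch below converts an hourly rate into dollars per million tokens. The hourly rates are the reference rates quoted in this article; the throughput and utilization figures are hypothetical placeholders rather than benchmarks, and `cost_per_million_tokens` is an illustrative name.

```python
def cost_per_million_tokens(hourly_rate, tokens_per_second, utilization):
    """Dollars per one million generated tokens for one accelerator.

    utilization is the fraction of paid hours spent serving real traffic.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_paid_hour = tokens_per_second * 3600 * utilization
    return hourly_rate / tokens_per_paid_hour * 1_000_000


# Hypothetical throughput: the B200-class card is assumed to serve 2.5x the
# tokens per second of the H100-class card on the same large model.
h100 = cost_per_million_tokens(hourly_rate=2.00, tokens_per_second=1000, utilization=0.5)
b200 = cost_per_million_tokens(hourly_rate=4.00, tokens_per_second=2500, utilization=0.5)

print(f"H100-class: ${h100:.2f} per 1M tokens")
print(f"B200-class: ${b200:.2f} per 1M tokens")
```

With these assumed inputs the card that costs twice as much per hour comes out cheaper per token. Swap in measured throughput for your own model before drawing any conclusion; the point is only that the ranking by hourly rate and the ranking by cost per token can differ.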
There is a second boundary inside the GPU families themselves. General-purpose cloud instances run virtualized, and hypervisor overhead can take a portion of the advertised memory bandwidth. For memory-bound inference, where tokens per second scale with bandwidth, that slice is throughput you paid for and do not receive.
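A back-of-the-envelope model shows why bandwidth matters so directly. The sketch below assumes the simplest memory-bound case: single-stream decoding where each generated token streams the full weights from memory once, ignoring KV cache traffic and batching. The 3350 GB/s input is an assumed peak figure in the range NVIDIA publishes for H100 SXM, the 70 GB weight size corresponds to a 70B model at one byte per parameter, and the 5% overhead is purely hypothetical.

```python
def decode_tokens_per_second_ceiling(bandwidth_gb_s, weights_gb, overhead_fraction=0.0):
    """Upper bound on single-stream decode speed when every token requires
    reading the full weights from memory once (batch size 1, no KV cache traffic).
    """
    if not 0 <= overhead_fraction < 1:
        raise ValueError("overhead_fraction must be in [0, 1)")
    effective_bandwidth = bandwidth_gb_s * (1.0 - overhead_fraction)
    return effective_bandwidth / weights_gb


# Assumed inputs; replace with your card's datasheet figure and a measured overhead.
bare_metal = decode_tokens_per_second_ceiling(3350, 70)
virtualized = decode_tokens_per_second_ceiling(3350, 70, overhead_fraction=0.05)

print(f"No overhead:  {bare_metal:.1f} tokens/s ceiling")
print(f"5% overhead: {virtualized:.1f} tokens/s ceiling")
```

In this simplified model the ceiling falls in proportion to the bandwidth lost, so any overhead fraction translates one-to-one into lost tokens per second at the same hourly price.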
Where a Focused GPU Cloud Sets the Reference
Once you know whether your workload wants H100-class, B200-class, or a non-CUDA path, the next question is what the same hardware should cost when inference is the only thing the platform is built for.
GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. The H100 at $2.00/GPU-hour and B200 at $4.00/GPU-hour map directly to the A3 and A4 hardware classes, validated against NVIDIA Reference Architecture and backed by a 99.99% platform availability SLA. GMI Cloud's bare metal instances run with no hypervisor, so memory-bound inference receives 100% of the advertised bandwidth rather than a virtualized fraction. Unlike a general-purpose cloud where accelerators are one service among hundreds, GMI Cloud is optimized specifically for AI inference, with CUDA 12.x, TensorRT-LLM, and vLLM preconfigured.
You can confirm current pricing for both hardware classes at gmicloud.ai/en/pricing and console.gmicloud.ai before mapping your GCP selection to a focused-cloud equivalent.
Matching the Family to the Workload
The right GCP accelerator follows from your model and your stack, not the cheapest line on the menu.
- Best for mainstream CUDA inference and training: A3 or an equivalent H100, the no-porting default.
- Best for large models or FP4 throughput: A4 or an equivalent B200, when newer silicon pays off.
- Best for JAX or TensorFlow workloads at scale: TPU, if your stack already targets it.
- Not ideal for a CUDA serving stack on a tight timeline: TPU, whose migration cost can exceed its per-hour savings.
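The TPU trade-off in the last bullet can be framed as a break-even calculation. The sketch below is a minimal version under stated assumptions: the migration cost is a one-time engineering figure, both hourly costs are for capacity that delivers the same throughput, and every number in the example is hypothetical. `breakeven_hours` is an illustrative helper, not part of any vendor tool.

```python
def breakeven_hours(migration_cost, gpu_hourly_cost, tpu_hourly_cost):
    """Accelerator-hours needed before per-hour savings repay a one-time port.

    Returns None when the TPU option is not cheaper per hour, because the
    port can never pay for itself on rate alone.
    """
    savings_per_hour = gpu_hourly_cost - tpu_hourly_cost
    if savings_per_hour <= 0:
        return None
    return migration_cost / savings_per_hour


# Hypothetical inputs: a $60,000 porting effort and a $0.50 hourly saving.
hours = breakeven_hours(migration_cost=60_000, gpu_hourly_cost=2.00, tpu_hourly_cost=1.50)
fleet_size = 8  # assumed number of accelerators running around the clock
days = hours / (fleet_size * 24)

print(f"Break-even after {hours:,.0f} accelerator-hours")
print(f"About {days:,.0f} days on a fleet of {fleet_size}")
```

If the break-even horizon is longer than the expected life of the deployment, the per-hour saving does not cover the port, which is the tight-timeline case the list warns about.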
Price the Workload, Not the Instance
The GCP accelerator catalog rewards teams that read it as three decisions rather than one price list. Identify whether you need H100-class capacity, B200-class throughput, or a TPU's architecture, account for utilization and any porting cost, and only then compare hourly rates across providers for the family you chose. The cheapest instance per hour is rarely the cheapest per inference. The reliable number is cost per unit of work, measured after the architecture decision is made.
Colin Mo
