
Reading an NVIDIA GPU Datasheet for Inference Starts With the FP4 and FP8 TFLOPS, Not the Headline Number

April 13, 2026

An NVIDIA GPU datasheet lists more than a dozen performance numbers, and the largest one on the page is usually the least useful for inference. The figure that predicts how a quantized model behaves is lower down, under FP8 and FP4, and it often carries a sparsity asterisk that doubles it on paper. For inference, the precision-specific TFLOPS line matters more than the headline compute number, because it tells you how the chip performs at the formats your model actually runs. This article walks through the datasheet rows that decide inference throughput, how to read them past the footnotes, and how the B200 and H200 numbers look once you do.

Why the Headline TFLOPS Number Misleads for Inference

Vendors lead with the biggest number on the sheet, which is usually a low-precision peak with sparsity enabled. That figure describes a best case that most inference workloads never hit. Dense token generation does not use structured sparsity, and many production stacks do not quantize all the way down to the format that produces the headline figure.

There are two traps to avoid when scanning the top line:

  • A peak TFLOPS value marked "with sparsity" is roughly double the dense number. For dense decoding, halve it mentally.
  • Peak compute assumes the chip is never waiting on memory. In memory-bound decoding, it is waiting most of the time.

The useful reading starts when you look at the precision your inference stack will use and find the matching row.
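As a quick sketch of the first trap, the helper below turns a headline figure into a dense estimate. The function name and the example value are illustrative assumptions, not numbers from any datasheet, and the default factor of 2 reflects the usual "roughly double" sparsity footnote, so check it against the sheet you are reading.

```python
def dense_tflops(headline_tflops: float, with_sparsity: bool,
                 sparsity_factor: float = 2.0) -> float:
    """Estimate the dense TFLOPS behind a headline datasheet value.

    sparsity_factor is an assumption: datasheets typically quote the
    sparse peak at about twice the dense figure.
    """
    if sparsity_factor <= 0:
        raise ValueError("sparsity_factor must be positive")
    if with_sparsity:
        return headline_tflops / sparsity_factor
    return headline_tflops


# Hypothetical headline value, chosen only to show the arithmetic.
print(dense_tflops(1000.0, with_sparsity=True))   # 500.0
print(dense_tflops(1000.0, with_sparsity=False))  # 1000.0
```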

The Datasheet Rows That Actually Predict Inference Throughput

Inference performance tracks two things on the datasheet: whether the chip natively accelerates the precision you serve in, and how fast it moves weights out of memory. The compute rows answer the first; the memory bandwidth row answers the second.
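One way to see how the two rows interact is a simple roofline-style estimate: a decode step takes at least as long as the slower of "do the math" and "stream the weights". The sketch below assumes about 2 FLOPs per parameter per token, ignores KV-cache traffic, and uses round hypothetical hardware numbers rather than values from a real datasheet.

```python
def decode_step(params: float, bytes_per_param: float, batch: int,
                peak_flops_per_s: float, bandwidth_bytes_per_s: float) -> dict:
    """Roofline-style lower bound on the time of one decode step."""
    flops = 2.0 * params * batch            # compute grows with batch size
    bytes_moved = params * bytes_per_param  # weights are read once per step
    compute_s = flops / peak_flops_per_s
    memory_s = bytes_moved / bandwidth_bytes_per_s
    return {
        "compute_s": compute_s,
        "memory_s": memory_s,
        "step_s": max(compute_s, memory_s),
        "bound": "memory" if memory_s >= compute_s else "compute",
    }


# Hypothetical chip: 1e15 dense FLOP/s, 4e12 bytes/s of memory bandwidth.
# Hypothetical model: 70e9 parameters stored at 1 byte each.
small = decode_step(70e9, 1.0, batch=1, peak_flops_per_s=1e15,
                    bandwidth_bytes_per_s=4e12)
large = decode_step(70e9, 1.0, batch=256, peak_flops_per_s=1e15,
                    bandwidth_bytes_per_s=4e12)
print(small["bound"])  # memory
print(large["bound"])  # compute
```

Under these assumptions the bandwidth row sets the limit at small batch sizes, and the TFLOPS row only takes over once enough requests are batched together.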

FP16 and BF16 Are the Baseline

Most models still serve at 16-bit unless you quantize them. The FP16 or BF16 TFLOPS row is the honest baseline for an un-quantized deployment. If your stack runs BF16, this is the row that predicts your compute ceiling, not the FP4 figure several lines below it.

FP8 Is the Production Default for Large Models

FP8 is where most large-model serving has landed. It roughly halves the weight footprint versus FP16 and, on hardware with native FP8 acceleration, raises effective throughput without the accuracy loss that deeper quantization can bring. When the datasheet shows a dense FP8 TFLOPS number with no sparsity caveat, that is the line to trust for FP8 serving.
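The footprint claim is simple arithmetic, sketched below for a hypothetical 70-billion-parameter model. The byte widths follow directly from the format names; the model size is an assumption, and the estimate ignores quantization scale factors, the KV cache, and activations.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_footprint_gb(num_params: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Hypothetical 70B-parameter model.
for precision in ("fp16", "fp8", "fp4"):
    print(precision, weight_footprint_gb(70e9, precision))
# fp16 140.0
# fp8 70.0
# fp4 35.0
```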

FP4 Is the Newest Lever, With Caveats

FP4 support is the headline feature of newer architectures. It can double effective throughput again and shrink memory use further, but only when three things line up: the hardware accelerates FP4 natively, your inference framework emits FP4 kernels, and your model tolerates the precision drop on its target tasks. A high FP4 number on a chip your stack cannot target in FP4 is a number you will not collect.
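The three conditions can be read as a set intersection: a precision is only usable if the hardware, the framework, and the model all accept it. The sketch below is one hypothetical way to encode that check; the function name and the example capability sets are assumptions, not a real framework API.

```python
# Candidate serving precisions, lowest bit width first.
PRECISION_ORDER = ["fp4", "fp8", "bf16"]


def pick_serving_precision(hardware: set, framework: set,
                           model_validated: set) -> str:
    """Return the lowest precision that all three layers support."""
    usable = hardware & framework & model_validated
    for precision in PRECISION_ORDER:
        if precision in usable:
            return precision
    raise ValueError("no precision is supported by hardware, framework, and model")


# Hypothetical stack: the chip and model accept FP4, the framework does not.
choice = pick_serving_precision(
    hardware={"fp4", "fp8", "bf16"},
    framework={"fp8", "bf16"},
    model_validated={"fp4", "fp8", "bf16"},
)
print(choice)  # fp8
```

In this example the FP4 row on the datasheet is irrelevant, and the FP8 row is the one that predicts the compute ceiling.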

Reading the B200 and H200 Datasheets Side by Side

The clearest way to read precision rows is to compare two chips that make different trade-offs. The H200 is a Hopper-generation card tuned for memory capacity and bandwidth; the B200 is a Blackwell-generation card whose advantage shows up most in low-precision formats. The table below pairs the precision-relevant specs with the memory line that gates them.

GPU | Native low-precision focus | VRAM | Memory bandwidth | GMI Cloud price
NVIDIA H200 SXM5 | FP8-class serving, long context | 141GB HBM3e | 4.80 TB/s | $2.60/GPU-hour
NVIDIA B200 | FP4 and FP8, newer-architecture acceleration | 180GB HBM3e | 8.0 TB/s | $4.00/GPU-hour

Three readings are worth making explicit:

  • The B200 advantage is largest where FP4 applies. If your stack quantizes to FP4 and your model holds accuracy, the newer-architecture acceleration is where the extra cost earns out.
  • The H200 advantage is memory, not low-precision compute. Its 141GB and 4.80 TB/s absorb long context and large KV caches at FP8, which is a different win than peak FP4 TFLOPS.
  • Bandwidth gates both. Decoding at small batch sizes is typically memory-bound, so the 8.0 TB/s line on the B200 often predicts token speed more reliably than any TFLOPS row.
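To put rough numbers on the bandwidth reading, the sketch below takes the bandwidth and price columns from the table and computes a bandwidth-only ceiling on single-stream decode speed. The 70 GB weight size is an assumption (for example, a 70B-parameter model at FP8), and the result is an upper bound that ignores KV-cache reads, kernel overheads, and batching.

```python
# Bandwidth and price taken from the table above.
GPUS = {
    "H200": {"bandwidth_tb_s": 4.80, "price_per_gpu_hour": 2.60},
    "B200": {"bandwidth_tb_s": 8.0, "price_per_gpu_hour": 4.00},
}


def decode_ceiling_tokens_per_s(bandwidth_tb_s: float, weight_gb: float) -> float:
    """Upper bound on batch-1 decode speed if every token streams all weights once."""
    return (bandwidth_tb_s * 1e12) / (weight_gb * 1e9)


WEIGHT_GB = 70.0  # assumption: hypothetical 70B-parameter model at FP8

for name, spec in GPUS.items():
    ceiling = decode_ceiling_tokens_per_s(spec["bandwidth_tb_s"], WEIGHT_GB)
    price_per_tb_s = spec["price_per_gpu_hour"] / spec["bandwidth_tb_s"]
    print(f"{name}: {ceiling:.1f} tokens/s ceiling, "
          f"${price_per_tb_s:.3f} per hour per TB/s")
# H200: 68.6 tokens/s ceiling, $0.542 per hour per TB/s
# B200: 114.3 tokens/s ceiling, $0.500 per hour per TB/s
```

On these listed figures the two cards cost about the same per unit of bandwidth, which is why the choice comes back to precision support and memory capacity.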

Where Datasheet Numbers Meet Real Deployment

A datasheet describes a chip in isolation. What you actually collect depends on the platform layer between the silicon and your model. GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. Both the H200 and B200 above are available at the listed prices, validated against NVIDIA Reference Architecture.

The platform layer matters because virtualization can quietly shave the bandwidth a datasheet promises. GMI Cloud's bare metal GPU instances run with no hypervisor, delivering 100% of the advertised memory bandwidth that inference throughput depends on. That is the difference between the 4.80 TB/s or 8.0 TB/s on the sheet and the number your decoding loop actually sees.

One boundary is worth drawing before you commit. A datasheet predicts per-chip peak under ideal load; it does not predict cost under real traffic. Serverless inference fits variable, API-based workloads where scale-to-zero avoids paying for idle silicon, while dedicated clusters and bare metal fit sustained jobs where the full datasheet bandwidth is the point. You can confirm current specs and pricing at gmicloud.ai/en/pricing and the model library at console.gmicloud.ai before choosing a tier.

Matching the Datasheet Row to Your Stack

The right row to read depends on how you serve, not on which number is largest:

  • Best for FP4-quantized large-model serving: B200, where native FP4 acceleration turns the datasheet number into real throughput.
  • Best for long-context FP8 serving: H200, where 141GB and 4.80 TB/s carry a large KV cache.
  • Poor yardstick for any inference workload: headline TFLOPS with sparsity, since dense decoding will not see that peak.
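For the long-context case, a rough KV-cache estimate shows why capacity matters. The sketch below uses the standard per-token cache size (keys plus values, per layer, per KV head) with a hypothetical model configuration; the layer count, head count, head dimension, and 70 GB weight size are assumptions, and activations and framework overhead are ignored.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: float, batch: int = 1) -> float:
    """Approximate KV-cache size in GB (1 GB = 1e9 bytes)."""
    # Factor of 2 covers the key tensor and the value tensor.
    elems = 2 * layers * kv_heads * head_dim * seq_len * batch
    return elems * bytes_per_elem / 1e9


# Hypothetical model: 80 layers, 8 KV heads of dimension 128, FP8 cache.
per_sequence = kv_cache_gb(layers=80, kv_heads=8, head_dim=128,
                           seq_len=131072, bytes_per_elem=1.0)

VRAM_GB = 141.0    # H200 capacity from the table
WEIGHTS_GB = 70.0  # assumption: 70B-parameter model at FP8

# Whole 128k-token sequences that fit beside the weights.
concurrent = int((VRAM_GB - WEIGHTS_GB) // per_sequence)
print(round(per_sequence, 2))  # 21.47
print(concurrent)              # 3
```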

GMI Cloud is best suited for AI teams that size hardware from the precision their model actually runs, then scale from serverless APIs to dedicated GPUs without re-architecting the stack.

Read the Sheet Through the Format You Will Ship

A datasheet is only useful once you read it through your own quantization plan. Decide the precision your model holds accuracy at, find the dense TFLOPS row for that format, then check the bandwidth line that feeds it. The headline number with the sparsity asterisk is the last thing to look at, not the first. Size the chip to the format you will ship, and the rest of the sheet falls into place.

Colin Mo
