Question 1

What is a Prime Inference endpoint for AI inference?

Accepted Answer

A Prime Inference endpoint is a single-tenant inference endpoint where your model runs on GPU capacity reserved exclusively for your workload. Unlike shared serverless inference, traffic from other tenants cannot affect your throughput or latency. Performance is bounded by the hardware you provision, and rate limits are configured to your dedicated capacity rather than a shared pool — making Prime Inference the standard choice for production AI workloads that need predictable latency, sustained throughput, and workload isolation.

Question 2

What's the difference between serverless inference and Prime Inference?

Accepted Answer

Serverless inference is multi-tenant and pay-per-token, with shared rate limits and a fixed model catalog — best for prototyping, low-volume use, and elastic traffic. Prime Inference is single-tenant, billed per-GPU-hour, and lets you deploy any open-source or custom model on reserved NVIDIA GPUs with tuned runtimes. Choose Prime Inference when you need predictable latency at the P95/P99 tail, sustained high throughput, custom or fine-tuned model weights, or workload isolation for compliance.

Question 3

What's the difference between H100, H200, and Blackwell GPUs?

Accepted Answer

NVIDIA H100 (80 GB HBM3) is the standard workhorse for most LLM and multimodal inference workloads — the baseline for general production traffic. H200 (141 GB HBM3e) offers approximately 1.4× the memory and bandwidth of H100, ideal for long-context models, large KV-cache workloads, and memory-bound serving. Blackwell B200 (192 GB HBM3e) targets frontier models and FP4 inference, delivering up to ~2.5× higher throughput. Most production workloads on GMI Cloud run on H100 or H200, with Blackwell reserved for performance-critical frontier use cases.

Question 4

Can I deploy a custom or fine-tuned model on a Prime Inference endpoint?

Accepted Answer

Yes. GMI Cloud Prime Inference supports any Hugging Face model, custom fine-tuned weights, and proprietary models loaded from Hugging Face, S3, or your own storage. Models load on the GMI inference stack — vLLM, TensorRT-LLM, SGLang — without re-engineering the serving layer. Per-model runtime tuning means even custom weights benefit from optimized kernels and routing on reserved NVIDIA GPUs.

Question 5

How does pricing work and what's the minimum commitment?

Accepted Answer

There is no minimum contract. On-demand billing is hourly per GPU, with no per-token markup and no shared-pool surge pricing. For sustained production workloads, reserved capacity is available on a seasonal basis or annually at lower per-hour rates. Qualified prospects also receive free GPU-hour trial credits to validate performance against their own workload. For current GPU rates, a tailored quote based on your model and traffic profile, and more details, contact sales.

Ship faster. Scale further.One Prime Inference endpoint.

Three reasons to run inference here.

Higher throughput on the models that matter

Capacity where your users live

Scale with your traffic, not your forecast

Lease the GPU. Own the throughput.

Tuned runtimes

Warm by default

Single-tenant isolation

Bring your own model

Optimized for the models you use

Production-grade engines

Deploy close to your users.

Asia-Pacific

North America

Europe

Scale with your traffic.

Burstable capacity

Pay-as-you-rest

One global pool

From idea to live endpoint, in four steps.

Pick a model

Choose your setup

Deploy

Operate & scale

Access the model you want.

DeepSeek V4

Kimi K2.6

GLM 5.1

Llama 4

Nemotron Omni

DeepSeek V4

Kimi K2.6

GLM 5.1

Llama 4

Nemotron Omni

Workloads where shared inference falls short.

Coding agents & developer tools

TTS, transcription, conversation

RAG & chat at scale

Private & compliant deployments

Pick the right GPU for the job.

H100

H200

B200

FAQ

What is a Prime Inference endpoint for AI inference?

What's the difference between serverless inference and Prime Inference?

What's the difference between H100, H200, and Blackwell GPUs?

Can I deploy a custom or fine-tuned model on a Prime Inference endpoint?

How does pricing work and what's the minimum commitment?

Ready when you are.