What architecture does it use?

A hybrid Mamba-Transformer Mixture of Experts (MoE) architecture, combined with LatentMoE for expert routing efficiency, Multi-Token Prediction for faster long-form generation, and 1M-token context for sustained long-running agent sessions.

What makes it different from Nemotron 3 Super or Nano?

The Nemotron 3 family spans three sizes: Nano (30B / 3B active) for targeted efficient tasks; Super (120B / 12B active) for collaborative agents and high-volume workloads; and Ultra (550B / 55B active) for the hardest reasoning and orchestration tasks where accuracy cannot be compromised.
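All three sizes quoted above activate the same share of their parameters per token. The sketch below checks that from the figures in this answer; `ModelSize` is a hypothetical helper written for illustration, not part of any NVIDIA or GMI Cloud SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSize:
    """Total and active parameter counts, in billions (hypothetical helper)."""
    name: str
    total_b: float
    active_b: float

    @property
    def active_fraction(self) -> float:
        # Share of the model's parameters used for each generated token.
        return self.active_b / self.total_b


# Figures taken from the family description above.
FAMILY = [
    ModelSize("Nano", 30, 3),
    ModelSize("Super", 120, 12),
    ModelSize("Ultra", 550, 55),
]

for model in FAMILY:
    print(f"{model.name}: {model.active_fraction:.0%} of parameters active per token")
```

With these numbers the active share is one tenth for every size, so the tiers differ in total capacity rather than in sparsity ratio.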

What context length does it support?

Up to 1 million tokens, making it practical to sustain full plan state and conversation history across long-running agent sessions.

How do I deploy it on GMI Cloud?

Two paths: (1) use the GMI Cloud serverless inference API via an OpenAI-compatible call, or (2) reserve a dedicated GPU cluster (H200 or Blackwell) for production-scale workloads. See docs.gmicloud.ai for setup instructions.

Do I need to manage GPUs?

No. GMI Cloud handles all GPU provisioning, cluster management, and infrastructure operations.

What GPU configurations support Ultra?

BF16: 16x NVIDIA H100, 8x H200, or 8x NVIDIA GB200/B200/B300. FP8: 8x NVIDIA H100, 4x H200, or 4x NVIDIA GB200/B200. NVFP4: 2x NVIDIA GB200/B200. All configurations are available on GMI Cloud.
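A rough way to sanity-check these configurations is to compare weight-only memory against aggregate GPU memory. The sketch below assumes 2, 1, and 0.5 bytes per parameter for BF16, FP8, and NVFP4, and assumes 80 GB per H100 and 141 GB per H200; it ignores KV cache, activations, and quantization scale factors, so treat it as a lower bound rather than a sizing guide. Blackwell parts are left out because their per-GPU memory varies by product.

```python
# Bytes needed to store one parameter at each precision (weights only).
# NVFP4 is treated as 4 bits per value; per-block scale factors are ignored.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

TOTAL_PARAMS = 550e9  # Ultra's total parameter count, from the text

# Per-GPU memory in GB. Assumed values for illustration.
GPU_MEMORY_GB = {"H100": 80, "H200": 141}


def weight_memory_gb(precision: str, total_params: float = TOTAL_PARAMS) -> float:
    """Approximate memory needed for the weights alone, in GB."""
    return total_params * BYTES_PER_PARAM[precision] / 1e9


def weights_fit(precision: str, gpu: str, gpu_count: int) -> bool:
    """True if the weights alone fit in the combined memory of the GPUs."""
    return gpu_count * GPU_MEMORY_GB[gpu] >= weight_memory_gb(precision)


# Hopper configurations listed in the answer above.
for precision, gpu, count in [
    ("BF16", "H100", 16),
    ("BF16", "H200", 8),
    ("FP8", "H100", 8),
    ("FP8", "H200", 4),
]:
    needed = weight_memory_gb(precision)
    available = count * GPU_MEMORY_GB[gpu]
    print(f"{precision} on {count}x {gpu}: {needed:.0f} GB weights, {available} GB total")
```

Under these assumptions every listed Hopper configuration holds the weights, though the H200 cases leave little headroom, which is one reason long-context workloads may want the larger options.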

Is it open-source?

Yes. NVIDIA releases open weights, the full training dataset, and open post-training recipes, giving enterprises the flexibility to fine-tune for any domain and deploy on any infrastructure.

GMI Cloud Brings NVIDIA Nemotron 3 Ultra to Developers on Day 0

June 04, 2026

NVIDIA Nemotron 3 Ultra is now available on GMI Cloud. For teams building production-grade agentic AI systems, GMI Cloud's H200 and Blackwell infrastructure provides Day-0 access to Nemotron 3 Ultra at scale.

Why Nemotron 3 Ultra Is Different

Most frontier reasoning models force a trade-off: more accuracy means slower inference, and faster throughput means hitting an intelligence ceiling. Nemotron 3 Ultra is engineered to break that trade-off.

At 550B total parameters with only 55B active per token, Nemotron 3 Ultra is built for long-running autonomous agents, orchestration, and complex reasoning across coding, deep research, and enterprise workflows. Ultra achieves 5x higher throughput and up to 30% lower cost compared to other open models in its class, enabling more reasoning cycles within a given time budget so agents can finish tasks faster.

Architecture Highlights

Feature	Detail
Architecture	Hybrid Mamba-Transformer Mixture of Experts (MoE)
Total parameters	550B
Active parameters per token	55B
Context length	Up to 1M tokens
Model I/O	Text in, text out
Precision support	BF16, FP8, NVFP4
Supported GPUs (BF16)	16x H100, 8x H200, 8x GB200/B200/B300
Supported GPUs (FP8)	8x H100, 4x H200, 4x GB200/B200
Supported GPUs (NVFP4)	2x GB200/B200
License	NVIDIA open-model license (open weights, open data, open recipes)

LatentMoE enables calling 4 experts for the inference cost of only one, improving the intelligence and generalization of the model at no added compute cost.

Multi-Token Prediction (MTP) predicts multiple future tokens in a single forward pass. For long-form agentic sessions involving multi-file code refactors or cross-document synthesis, this reduces generation time for long sequences.
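To get a feel for why this shortens generation, here is a back-of-the-envelope sketch. It assumes the extra predicted tokens are checked speculative-decoding style and that each one is accepted independently with the same probability; both are modelling assumptions made for illustration, since the text does not describe how Ultra verifies its predictions. `expected_tokens_per_pass` is a hypothetical helper.

```python
def expected_tokens_per_pass(extra_tokens: int, acceptance_rate: float) -> float:
    """Expected tokens emitted per forward pass under a simple acceptance model.

    One token is always produced. Each of the `extra_tokens` predicted tokens
    is kept only if all earlier ones were kept, each with probability
    `acceptance_rate`. The expectation is the geometric sum 1 + a + ... + a^k.
    """
    if extra_tokens < 0:
        raise ValueError("extra_tokens must be non-negative")
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if acceptance_rate == 1.0:
        return float(extra_tokens + 1)
    return (1 - acceptance_rate ** (extra_tokens + 1)) / (1 - acceptance_rate)


def forward_passes_needed(total_tokens: int, tokens_per_pass: float) -> float:
    """Approximate number of forward passes to generate `total_tokens`."""
    return total_tokens / tokens_per_pass


# Illustrative inputs only; these are not measured figures for Ultra.
rate = expected_tokens_per_pass(extra_tokens=2, acceptance_rate=0.5)
print(f"Expected tokens per pass: {rate}")
print(f"Passes for 7000 tokens: {forward_passes_needed(7000, rate):.0f}")
```

The takeaway is qualitative: the more predicted tokens survive verification, the fewer forward passes a long output needs.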

1M-token context retains conversation history and plan states across long-running agent sessions, enabling cross-document reasoning.

Token Budget Control caps how many tokens the model spends on reasoning, so you can target the best accuracy for the smallest reasoning budget.

Why Run It on GMI Cloud

Blackwell and H200 Infrastructure Built for Ultra's Architecture

Nemotron 3 Ultra is optimized to run on GMI Cloud's GB200/B200/B300 and H200 clusters across BF16, FP8, and NVFP4 precisions. At NVFP4 precision, Ultra runs on just 2x GB200/B200. GMI Cloud operates NVIDIA Reference Architecture-validated infrastructure.

Serverless to Dedicated, No Infrastructure Overhead

Deploy Nemotron 3 Ultra via GMI Cloud's serverless inference API in minutes, or reserve a dedicated GPU cluster for production-scale throughput. No GPU provisioning or cluster management required. Scale from a single API call to a multi-node production workload without changing your code.

OpenAI-Compatible API, Enterprise Compliance Ready

GMI Cloud's inference endpoints are fully OpenAI-compatible. Integrate Ultra into any existing LLM pipeline with a single base_url change.
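To make the "single base_url change" concrete, the sketch below builds the same chat request for two different base URLs using only the standard library. It assumes the endpoint follows the usual OpenAI-style `/chat/completions` route with a bearer token; `build_chat_request` is a hypothetical helper, and nothing is sent over the network here.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, messages: list) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


messages = [{"role": "user", "content": "Summarize this design document: ..."}]

# Same payload, two providers: only the base URL differs.
for base_url in ("https://api.openai.com/v1", "https://api.gmicloud.ai/v1"):
    request = build_chat_request(
        base_url, "YOUR_API_KEY", "nvidia/NVIDIA-Nemotron-3-Ultra", messages
    )
    print(request.get_method(), request.full_url)

# To actually send a request, pass it to urllib.request.urlopen(request).
```

In a real migration the model name and API key would change along with the base URL; the request shape is what stays the same.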

Watching Nemotron 3 Ultra Think

In this demo, we built a browser-based interface that lets Nemotron 3 Ultra run as an autonomous, tool-using agent: you give it a task, and then watch it plan each step, call tools like web search, summarization, code execution, and verification, and stream back the final answer in real time.

As the agent works, every tool call instantly appears as a new node on a live graph, so you can literally see the decision tree forming as Ultra thinks. That visual timeline makes it obvious how Ultra’s high token throughput turns into more reasoning cycles and more completed steps per unit of time, turning what’s usually a black-box answer into a transparent walkthrough of the model’s reasoning process.
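The demo's source is not shown here, but the idea of turning tool calls into graph nodes can be sketched with a small tree structure. Everything below (`ToolCallNode`, `AgentTrace`, the tool names) is hypothetical and only illustrates one way such a trace could be recorded and rendered.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToolCallNode:
    """One tool call in the agent's run (hypothetical structure)."""
    node_id: int
    tool: str
    parent_id: Optional[int]
    children: List[int] = field(default_factory=list)


class AgentTrace:
    """Records tool calls as a tree so a UI could draw them as they arrive."""

    def __init__(self) -> None:
        self.nodes: Dict[int, ToolCallNode] = {}

    def record(self, tool: str, parent_id: Optional[int] = None) -> int:
        """Add a node for a tool call and link it to its parent, if any."""
        if parent_id is not None and parent_id not in self.nodes:
            raise KeyError(f"unknown parent node {parent_id}")
        node_id = len(self.nodes)
        self.nodes[node_id] = ToolCallNode(node_id, tool, parent_id)
        if parent_id is not None:
            self.nodes[parent_id].children.append(node_id)
        return node_id

    def render(self, node_id: int = 0, depth: int = 0) -> List[str]:
        """Return the subtree under `node_id` as indented text lines."""
        node = self.nodes[node_id]
        lines = ["  " * depth + node.tool]
        for child_id in node.children:
            lines.extend(self.render(child_id, depth + 1))
        return lines


trace = AgentTrace()
plan = trace.record("plan")
search = trace.record("web_search", plan)
trace.record("summarize", search)
trace.record("code_execution", plan)
print("\n".join(trace.render()))
```

A browser front end would push each new node to the page as `record` is called, rather than rendering the whole tree at the end.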

What Else You Can Build

Autonomous Coding Agents

Coding agents that plan, code, test, debug, and iterate can resolve issues end-to-end across large codebases. Ultra handles the hard reasoning calls: architectural planning, complex multi-file refactors, and error recovery. On GMI Cloud's Blackwell infrastructure, sustained throughput means agents finish reasoning loops faster, with more iterations per time budget.

Deep Research and Synthesis

Research agents that search, evaluate, cross-reference, and synthesize across hundreds of sources need a model that can keep track of contradictory evidence across an extremely long context without degrading. Ultra's 1M-token context window makes sustained parallel reasoning loops viable in production.

Enterprise Workflow Automation

Agents automating workflows in security operations, regulatory compliance, or clinical trial management run in persistent, tool-using loops across complex multi-step task graphs. Ultra handles triage logic across thousands of alerts, synthesis of regulatory filings, and orchestration of interdependent operations. GMI Cloud's dedicated GPU clusters provide the isolation and performance consistency that enterprise SLAs require.

EDA and Chip Design Agents

Chip design agents that autonomously generate RTL from specifications, verify designs across thousands of constraints, and orchestrate workflows from design to sign-off place the heaviest demands on reasoning models. Ultra handles verification, failure analysis, and cross-block dependency resolution. GMI Cloud's Blackwell bare metal provides the hardware substrate these workflows require.

Get Started

python

from openai import OpenAI

# Point the standard OpenAI client at GMI Cloud's OpenAI-compatible endpoint.
client = OpenAI(
    api_key="YOUR_GMI_API_KEY",
    base_url="https://api.gmicloud.ai/v1",
)

# Request a streamed chat completion from Nemotron 3 Ultra.
response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Ultra",
    messages=[
        {
            "role": "system",
            "content": "You are an expert AI agent. Reason step by step.",
        },
        {
            "role": "user",
            "content": "Analyze the following codebase and propose a refactor plan: ...",
        },
    ],
    stream=True,
)

# Print each text fragment as it arrives; skip chunks that carry no content.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
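If you need the full answer as a string rather than printed fragments, the streamed chunks can be joined as they arrive. The sketch below is a hypothetical helper, `collect_stream`, exercised with stand-in chunk objects so it runs without an API key; it assumes chunks expose `choices[0].delta.content` as in the snippet above.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def collect_stream(chunks: Iterable) -> str:
    """Join the text fragments of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        # Some chunks may carry no choices, and a delta may have no text.
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
    return "".join(parts)


def fake_chunk(text: Optional[str]) -> SimpleNamespace:
    """Build a stand-in object shaped like a streamed chunk (for testing only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo_chunks = [
    fake_chunk("Step 1: "),
    fake_chunk(None),               # delta without text
    SimpleNamespace(choices=[]),    # chunk without choices
    fake_chunk("map the modules."),
]
print(collect_stream(demo_chunks))
```

With a live client you would pass the `response` iterator from the snippet above to `collect_stream` instead of `demo_chunks`.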


Check out the full documentation at docs.gmicloud.ai

Nemotron 3 Ultra is built for long-horizon, tool-using agents, and GMI Cloud gives you the GPU infrastructure to run them at production scale. If you are pushing the limits of reasoning-heavy AI systems, this is your new default stack.


Roan Weigert

DevRel @ GMI Cloud


FAQ

What is NVIDIA Nemotron 3 Ultra?

NVIDIA Nemotron 3 Ultra is a 550B-parameter open frontier-reasoning model with 55B active parameters per token, purpose-built for long-running autonomous agents handling complex orchestration tasks across coding, deep research, and enterprise workflows.
