GMI Cloud Brings NVIDIA Nemotron 3 Ultra to Developers on Day 0

June 04, 2026

NVIDIA Nemotron 3 Ultra is now available on GMI Cloud. For teams building production-grade agentic AI systems, GMI Cloud's H200 and Blackwell infrastructure provides Day-0 access to Nemotron 3 Ultra at scale.

Why Nemotron 3 Ultra Is Different

Most frontier reasoning models force a trade-off: more accuracy means slower inference, and faster throughput means hitting an intelligence ceiling. Nemotron 3 Ultra is engineered to break that trade-off.

At 550B total parameters with only 55B active per token, Nemotron 3 Ultra is built for long-running autonomous agents, orchestration, and complex reasoning across coding, deep research, and enterprise workflows. Ultra achieves 5x higher throughput and up to 30% lower cost compared to other open models in its class, enabling more reasoning cycles within a given time budget so agents can finish tasks faster.
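To make the throughput claim concrete, here is a back-of-the-envelope sketch of how tokens per second translate into reasoning cycles inside a fixed time budget. The baseline throughput, the budget, and the tokens-per-cycle figure are hypothetical placeholders, not measurements of Nemotron 3 Ultra or any other model.

```python
def reasoning_cycles(budget_s: float, tokens_per_s: float, tokens_per_cycle: int) -> int:
    """Whole reasoning cycles that fit in a time budget (generation time only)."""
    return int(budget_s * tokens_per_s // tokens_per_cycle)


# Hypothetical numbers, for illustration only.
BUDGET_S = 600            # 10-minute task budget
TOKENS_PER_CYCLE = 2_000  # tokens generated per plan/act/verify cycle
BASELINE_TPS = 50.0       # assumed baseline throughput, tokens per second

baseline = reasoning_cycles(BUDGET_S, BASELINE_TPS, TOKENS_PER_CYCLE)
faster = reasoning_cycles(BUDGET_S, 5 * BASELINE_TPS, TOKENS_PER_CYCLE)
print(f"baseline: {baseline} cycles, 5x throughput: {faster} cycles")
# prints: baseline: 15 cycles, 5x throughput: 75 cycles
```

The sketch ignores tool latency and prompt processing, so real agents will see a smaller gain than the raw throughput ratio suggests.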

Architecture Highlights

| Feature | Detail |
|---|---|
| Architecture | Hybrid Mamba-Transformer Mixture of Experts (MoE) |
| Total parameters | 550B |
| Active parameters per token | 55B |
| Context length | Up to 1M tokens |
| Model I/O | Text in, text out |
| Precision support | BF16, FP8, NVFP4 |
| Supported GPUs (BF16) | 16x H100, 8x H200, 8x GB200/B200/B300 |
| Supported GPUs (FP8) | 8x H100, 4x H200, 4x GB200/B200 |
| Supported GPUs (NVFP4) | 2x GB200/B200 |
| License | NVIDIA open-model license (open weights, open data, open recipes) |
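As a rough sanity check on the GPU counts above, the sketch below estimates the memory needed for the weights alone at each precision and compares it with the combined memory of a few of the listed configurations. The per-GPU memory figures are nominal values assumed for this example, not taken from the article, and the estimate ignores KV cache, recurrent state, activations, and NVFP4 scale factors.

```python
# Bits per weight for each precision listed in the table.
# NVFP4 also stores per-block scale factors; that overhead is ignored here.
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4}

# Nominal per-GPU memory in GB. These are assumptions, not from the article;
# check the exact SKU you are deploying on.
GPU_MEMORY_GB = {"H100": 80, "H200": 141, "B200": 180}

TOTAL_PARAMS = 550e9  # total parameters, from the table


def weight_memory_gb(params: float, bits: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9


def weights_fit(precision: str, gpu: str, count: int) -> bool:
    """True if the weights alone fit in the combined memory of `count` GPUs."""
    needed = weight_memory_gb(TOTAL_PARAMS, BITS_PER_WEIGHT[precision])
    return needed <= GPU_MEMORY_GB[gpu] * count


CONFIGS = [
    ("BF16", "H100", 16), ("BF16", "H200", 8),
    ("FP8", "H100", 8), ("FP8", "H200", 4),
    ("NVFP4", "B200", 2),
]

for precision, gpu, count in CONFIGS:
    needed = weight_memory_gb(TOTAL_PARAMS, BITS_PER_WEIGHT[precision])
    fits = weights_fit(precision, gpu, count)
    print(f"{precision}: ~{needed:.0f} GB of weights, {count}x {gpu} fits: {fits}")
```

Under these assumptions every listed configuration holds the weights, though some leave little headroom, so long-context workloads may need more GPUs than the minimum.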

LatentMoE enables calling 4 experts for the inference cost of only one, improving the intelligence and generalization of the model at no added compute cost.

Multi-Token Prediction (MTP) predicts multiple future tokens in a single forward pass. For long-form agentic sessions involving multi-file code refactors or cross-document synthesis, this reduces generation time for long sequences.

1M-token context retains conversation history and plan states across long-running agent sessions, enabling cross-document reasoning.

Token Budget Control lets you bound the number of reasoning tokens the model generates, targeting the best accuracy for the least reasoning.
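To see why predicting several tokens per forward pass shortens long generations, consider a simplified model in which each extra predicted token is accepted with a fixed probability and a rejection discards the tokens after it. This is a generic approximation of draft-and-verify decoding, not a description of Nemotron 3 Ultra's internals, and the acceptance probability below is a hypothetical value.

```python
def expected_tokens_per_pass(accept_prob: float, extra_tokens: int) -> float:
    """Expected tokens emitted per forward pass.

    One token is always produced; each of the `extra_tokens` predicted tokens
    counts only if every earlier one was accepted.
    """
    return sum(accept_prob ** i for i in range(extra_tokens + 1))


def forward_passes(total_tokens: int, accept_prob: float, extra_tokens: int) -> float:
    """Approximate number of forward passes needed to emit `total_tokens`."""
    return total_tokens / expected_tokens_per_pass(accept_prob, extra_tokens)


# Hypothetical setting: 3 extra tokens per pass, 80% acceptance each.
per_pass = expected_tokens_per_pass(0.8, 3)
print(f"tokens per pass: {per_pass:.2f}")
print(f"passes for 10,000 tokens: {forward_passes(10_000, 0.8, 3):.0f}")
```

With no accepted extras the model falls back to one token per pass, so the saving depends entirely on how predictable the continuation is.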

Why Run It on GMI Cloud

Blackwell and H200 Infrastructure Built for Ultra's Architecture

Nemotron 3 Ultra is optimized to run on GMI Cloud's GB200/B200/B300 and H200 clusters across BF16, FP8, and NVFP4 precisions. At NVFP4 precision, Ultra runs on just 2x GB200/B200. GMI Cloud operates NVIDIA Reference Architecture-validated infrastructure.

Serverless to Dedicated, No Infrastructure Overhead

Deploy Nemotron 3 Ultra via GMI Cloud's serverless inference API in minutes, or reserve a dedicated GPU cluster for production-scale throughput. No GPU provisioning or cluster management required. Scale from a single API call to a multi-node production workload without changing your code.

OpenAI-Compatible API, Enterprise Compliance Ready

GMI Cloud's inference endpoints are fully OpenAI-compatible. Integrate Ultra into any existing LLM pipeline with a single base_url change.
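One way to keep that change to a single setting is to read the endpoint from configuration rather than hard-coding it. The sketch below builds client arguments from environment variables; `LLM_BASE_URL` and `LLM_API_KEY` are hypothetical names chosen for this example, and the default URL is the one used in the Get Started snippet in this post.

```python
import os

# Base URL taken from the Get Started snippet in this post.
GMI_BASE_URL = "https://api.gmicloud.ai/v1"


def client_kwargs(env=None) -> dict:
    """Build keyword arguments for an OpenAI-compatible client.

    LLM_BASE_URL and LLM_API_KEY are hypothetical variable names chosen for
    this sketch; use whatever your deployment already defines.
    """
    env = os.environ if env is None else env
    return {
        "base_url": env.get("LLM_BASE_URL", GMI_BASE_URL),
        "api_key": env.get("LLM_API_KEY", ""),
    }


# Usage with the openai package (not imported here, to keep the sketch
# dependency-free):
#   from openai import OpenAI
#   client = OpenAI(**client_kwargs())
print(client_kwargs({"LLM_API_KEY": "YOUR_GMI_API_KEY"})["base_url"])
```

Switching an existing pipeline between providers then becomes a deployment setting rather than a code change.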

Watching Nemotron 3 Ultra Think

In this demo, we built a browser-based interface that lets Nemotron 3 Ultra run as an autonomous, tool-using agent: you give it a task, and then watch it plan each step, call tools like web search, summarization, code execution, and verification, and stream back the final answer in real time.

As the agent works, every tool call instantly appears as a new node on a live graph, so you can literally see the decision tree forming as Ultra thinks. That visual timeline makes it obvious how Ultra’s high token throughput turns into more reasoning cycles and more completed steps per unit of time, turning what’s usually a black-box answer into a transparent walkthrough of the model’s reasoning process.
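The demo's source is not reproduced here, but the underlying idea can be sketched with a small data structure that records each tool call as a node and links it to the step that triggered it. All class and method names below are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCallNode:
    node_id: int
    tool: str               # e.g. "web_search", "code_execution"
    parent: Optional[int]   # node that led to this call; None for the root
    summary: str = ""


@dataclass
class TraceGraph:
    nodes: list = field(default_factory=list)

    def add_call(self, tool: str, parent: Optional[int] = None, summary: str = "") -> int:
        """Append a node for a tool call and return its id."""
        node_id = len(self.nodes)
        self.nodes.append(ToolCallNode(node_id, tool, parent, summary))
        return node_id

    def edges(self) -> list:
        """(parent, child) pairs, ready to hand to a graph renderer."""
        return [(n.parent, n.node_id) for n in self.nodes if n.parent is not None]


graph = TraceGraph()
root = graph.add_call("plan", summary="break the task into steps")
search = graph.add_call("web_search", parent=root)
graph.add_call("summarization", parent=search)
graph.add_call("code_execution", parent=root)
print(graph.edges())  # [(0, 1), (1, 2), (0, 3)]
```

A front end could poll or subscribe to this structure and draw each new node as the agent emits it.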

What Else You Can Build

Autonomous Coding Agents

Coding agents that plan, code, test, debug, and iterate can resolve issues end-to-end across large codebases. Ultra handles the hard reasoning calls: architectural planning, complex multi-file refactors, and error recovery. On GMI Cloud's Blackwell infrastructure, sustained throughput means agents finish reasoning loops faster and fit more iterations into each time budget.
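The plan, code, test, debug loop described above can be sketched as a bounded retry loop. In a real agent `propose_patch` would call the model and `run_tests` would run the project's test suite; both are hypothetical callables here, stubbed so the example runs offline.

```python
from typing import Callable, Optional


def iterate_until_passing(
    propose_patch: Callable[[str], str],
    run_tests: Callable[[str], Optional[str]],
    max_iters: int = 5,
) -> tuple:
    """Plan/code/test/debug loop with an iteration cap.

    propose_patch(feedback) returns a candidate patch; run_tests(patch) returns
    None on success or an error message to feed into the next attempt.
    Returns (patch or None, iterations used).
    """
    feedback = ""
    for i in range(1, max_iters + 1):
        patch = propose_patch(feedback)
        error = run_tests(patch)
        if error is None:
            return patch, i
        feedback = error  # the failure becomes context for the next attempt
    return None, max_iters


# Stubs standing in for the model call and the test runner.
def fake_propose(feedback: str) -> str:
    return "fix-v2" if feedback else "fix-v1"


def fake_tests(patch: str) -> Optional[str]:
    return None if patch == "fix-v2" else "test_parser failed"


print(iterate_until_passing(fake_propose, fake_tests))  # ('fix-v2', 2)
```

Higher model throughput shortens each pass through this loop, which is where the extra iterations per time budget come from.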

Deep Research and Synthesis

Research agents that search, evaluate, cross-reference, and synthesize across hundreds of sources require a model that can keep contradictory evidence in view over an extremely long context without degrading. Ultra's 1M-token context window makes sustained parallel reasoning loops viable in production.
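Even with a 1M-token window, a research agent still has to decide which sources fit. A minimal sketch of greedy packing follows; the four-characters-per-token estimate and the reserve size are assumptions made for this example, and a real pipeline should count tokens with the model's tokenizer.

```python
CONTEXT_LIMIT = 1_000_000  # tokens, from the model's stated context length


def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def pack_sources(sources, limit=CONTEXT_LIMIT, reserve=50_000):
    """Greedily keep sources, in order, while they fit under limit - reserve.

    `reserve` leaves room for the prompt, the plan state, and the model's
    output. Returns (kept, skipped).
    """
    budget = limit - reserve
    kept, skipped, used = [], [], 0
    for src in sources:
        cost = estimate_tokens(src)
        if used + cost <= budget:
            kept.append(src)
            used += cost
        else:
            skipped.append(src)
    return kept, skipped


# Tiny limits so the behaviour is visible without megabytes of text.
docs = ["a" * 400, "b" * 400, "c" * 40]
kept, skipped = pack_sources(docs, limit=150, reserve=40)
print(len(kept), "kept,", len(skipped), "skipped")  # 2 kept, 1 skipped
```

Skipped sources can be summarized first or deferred to a later reasoning pass instead of being dropped.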

Enterprise Workflow Automation

Agents automating workflows in security operations, regulatory compliance, or clinical trial management run in persistent, tool-using loops across complex multi-step task graphs. Ultra handles triage logic across thousands of alerts, synthesis of regulatory filings, and orchestration of interdependent operations. GMI Cloud's dedicated GPU clusters provide the isolation and performance consistency that enterprise SLAs require.

EDA and Chip Design Agents

Chip design agents that autonomously generate RTL from specifications, verify designs across thousands of constraints, and orchestrate workflows from design to sign-off place the heaviest demands on reasoning models. Ultra handles verification, failure analysis, and cross-block dependency resolution. GMI Cloud's Blackwell bare metal provides the hardware substrate these workflows require.

Get Started

```python
from openai import OpenAI

# Point the standard OpenAI client at GMI Cloud's OpenAI-compatible endpoint.
client = OpenAI(
    api_key="YOUR_GMI_API_KEY",
    base_url="https://api.gmicloud.ai/v1",
)

# Request a streamed chat completion from Nemotron 3 Ultra.
response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Ultra",
    messages=[
        {
            "role": "system",
            "content": "You are an expert AI agent. Reason step by step.",
        },
        {
            "role": "user",
            "content": "Analyze the following codebase and propose a refactor plan: ...",
        },
    ],
    stream=True,
)

# Print text as it arrives; skip chunks that carry no choices or no text.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
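If you need the full answer as a string rather than printed output, the same streaming loop can be wrapped in a small helper. The sketch below is a hypothetical utility, not part of any SDK; it is exercised with stand-in objects that mimic the attribute shape of streamed chunks so it runs without network access or an API key.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Concatenate the text deltas from a stream of chat-completion chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks carry no choices at all
            continue
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in with the same attribute shape as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


stream = [
    fake_chunk("Refactor "),
    fake_chunk(None),               # chunk with no text
    SimpleNamespace(choices=[]),    # chunk with no choices
    fake_chunk("plan"),
]
print(collect_stream(stream))  # Refactor plan
```

With a live client, pass the `response` object from a streamed request to `collect_stream` in place of the stand-in list.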

Check out the full documentation at docs.gmicloud.ai.

Nemotron 3 Ultra is built for long-horizon, tool-using agents, and GMI Cloud gives you the GPU infrastructure to run them at production scale. If you are pushing the limits of reasoning-heavy AI systems, this is your new default stack.

Deploy Nemotron 3 Ultra on GMI Cloud | Read the Docs | Join the Community on Discord

Roan Weigert

DevRel @ GMI Cloud

FAQ

What is NVIDIA Nemotron 3 Ultra?

NVIDIA Nemotron 3 Ultra is a 550B-parameter open frontier-reasoning model with 55B active parameters per token, purpose-built for long-running autonomous agents handling complex orchestration tasks across coding, deep research, and enterprise workflows.
