NVIDIA last week unveiled the Nemotron 3 family of open AI models — three new LLMs (Nano, Super, Ultra) designed for efficient, multi-agent applications. The core innovation is a hybrid latent Mixture-of-Experts (MoE) architecture: each input token activates only a fraction of the network’s parameters, dramatically boosting throughput and cutting costs.

Nemotron 3 also extends NVIDIA’s open ecosystem by releasing 3 trillion tokens of pretraining/post-training data and new reinforcement-learning toolkits (NeMo Gym, NeMo RL, etc.), so developers can fine-tune agents on their own data.

Nemotron turns advanced AI into an open platform that gives developers the transparency and efficiency needed to build agentic systems at scale.

Meeting the Multi-Agent AI Challenge

Recent trends show enterprises moving from single chatbot models to multi-agent systems: multiple specialised AI agents that collaborate or compete to solve complex tasks. But these multi-agent setups introduce new complexities (coordination overhead, context management, unpredictable behaviors).

Nemotron 3 was built to address these complexities. Its hybrid MoE design lets many agents run in parallel with far less waste: by activating only select experts per token, it cuts inference work dramatically, letting developers build and deploy multi-agent systems at scale. Moreover, Nemotron 3 aligns with the growing sovereign AI movement: its openness lets organizations customize models to local data and regulations.

Model Family

Nemotron 3 comes in three scales tailored to different workloads:

  • Nano: ~30 billion parameters (~3B active per token): a lightweight model for targeted tasks like code debugging, summarization and retrieval.

  • Super: ~100 billion parameters (~10B active per token): a mid-sized, high-accuracy reasoning model for larger multi-agent workflows.

  • Ultra: ~500 billion parameters (~50B active per token): a massive reasoning engine for deep research and long-horizon planning in complex AI applications.

Inside Nemotron 3 Nano: Architecture and Efficiency

Nemotron 3 Nano is not merely a downsized member of the Nemotron 3 family — it serves as the architectural testbed for NVIDIA’s hybrid latent Mixture-of-Experts (MoE) strategy. While the model advertises 30 billion total parameters, its true innovation lies in how those parameters are organized, selectively activated, and routed at inference time.

At a high level, Nemotron 3 Nano integrates three tightly coupled components:

  • A stable dense backbone that preserves the generalization and linguistic robustness of traditional transformer models

  • Latent expert capacity that expands model expressiveness only when the input token demands it

  • A learned routing mechanism, tuned for agentic workloads such as multi-step reasoning, tool invocation, and long-horizon planning

This design directly addresses two long-standing MoE challenges. First, it mitigates routing instability, a common failure mode in early sparse expert models where token-to-expert assignments fluctuate unpredictably. Second, it avoids the memory and bandwidth overhead associated with fully materialized expert layers by keeping most expert computation conditional and latent.

The result is a model that behaves like a dense LLM when needed, yet dynamically unlocks sparse capacity for harder tokens — activating experts only when semantic complexity justifies the extra compute. In practice, Nemotron 3 Nano executes with only a small fraction of its total parameters per token, delivering high throughput without sacrificing reasoning quality.

Crucially, Nano’s efficiency is not just a product of architecture, but also of how it is trained and refined. NVIDIA’s post-training pipeline emphasizes reinforcement learning and multi-environment optimization, aligning the routing policy and expert specialization with real-world agentic tasks rather than static language modeling objectives.

1. Hybrid Mamba–Transformer MoE Architecture

One of the most important — and still under-discussed — innovations in Nemotron 3 is its hybrid Mamba–Transformer Mixture-of-Experts (MoE) backbone. Rather than betting on a single architectural paradigm, Nemotron 3 deliberately combines three complementary approaches into a unified design:

  • Mamba layers for ultra-efficient long-sequence modeling

  • Transformer attention layers for high-precision reasoning

  • Sparse MoE routing for scalable compute and memory efficiency

Figure: Nemotron 3 Nano (30B) hybrid architecture. The model alternates sequences of lightweight Mamba-2 layers (for long-range context) and sparse MoE blocks, with occasional Transformer attention layers. Source: https://huggingface.co/blog/nvidia/nemotron-3-nano-efficient-open-intelligent-models

How the Hybrid Backbone Works

In Nemotron 3 Nano (30B), the backbone alternates between lightweight Mamba-2 layers and sparse MoE blocks, with only a small number of Transformer attention layers inserted where deep reasoning is required.

Although Nano has 30 billion total parameters, it activates only ~3–3.6 billion parameters per token. Each MoE layer contains 128 expert feed-forward networks, but a learned MLP router selects only ~6 experts per token (roughly 10% of the model’s parameters), keeping inference fast and memory-efficient.

In effect, Nano behaves like a much larger dense model in terms of reasoning quality, while paying only a fraction of the compute cost.
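
As a rough back-of-the-envelope check on that claim, the short calculation below shows how top-6-of-128 routing lands near ~3B active parameters per token. The split between always-on backbone parameters and expert parameters is an assumed figure for illustration; NVIDIA has not published the exact breakdown used here.

```python
# Back-of-envelope estimate of active parameters per token for a sparse MoE model.
# The backbone/expert split below is an illustrative assumption, not a published figure.

TOTAL_PARAMS = 30e9           # ~30B total parameters (Nemotron 3 Nano)
SHARED_PARAMS = 1.7e9         # assumed always-active backbone (Mamba/attention/embeddings)
EXPERT_PARAMS = TOTAL_PARAMS - SHARED_PARAMS   # parameters living inside MoE experts

NUM_EXPERTS = 128             # experts per MoE layer
ACTIVE_EXPERTS = 6            # experts the router selects per token

active = SHARED_PARAMS + EXPERT_PARAMS * (ACTIVE_EXPERTS / NUM_EXPERTS)
print(f"Active params per token: ~{active / 1e9:.1f}B "
      f"({active / TOTAL_PARAMS:.0%} of the total)")
```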

Mamba Layers: Long-Range Memory at Low Cost

Mamba layers are particularly well-suited for long-horizon workloads. Unlike attention, which scales quadratically with context length, Mamba operates in linear time with minimal memory overhead. This allows Nemotron 3 to maintain stable performance across hundreds of thousands to over a million tokens, making it ideal for streaming or continuously evolving contexts.

In multi-agent systems — where agents must repeatedly reference prior state, intermediate plans, or tool outputs — Mamba acts as a durable, cost-efficient memory backbone.
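
To make the linear-time claim concrete, here is a heavily simplified state-space scan in the spirit of Mamba: one pass over the sequence with a fixed-size hidden state, so cost grows linearly with length rather than quadratically as in attention. The dimensions and parameterization are illustrative only, not Nemotron’s actual layer (real Mamba layers also make the state update input-dependent).

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal diagonal state-space scan: one pass over the sequence,
    O(seq_len * d_state) time with a fixed-size state, regardless of length.
    x: (seq_len, d_model) inputs; A, B, C: illustrative per-channel parameters."""
    seq_len, d_model = x.shape
    d_state = A.shape[1]
    h = np.zeros((d_model, d_state))          # fixed-size recurrent state
    y = np.empty_like(x)
    for t in range(seq_len):                  # linear in sequence length
        h = A * h + B * x[t][:, None]         # update state with the current token
        y[t] = (h * C).sum(axis=1)            # read out a per-channel output
    return y

# Toy usage: cost scales with seq_len, not seq_len**2.
rng = np.random.default_rng(0)
d_model, d_state, seq_len = 8, 4, 1024
A = np.full((d_model, d_state), 0.9)          # decay factors (kept stable, |A| < 1)
B = rng.normal(size=(d_model, d_state)) * 0.1
C = rng.normal(size=(d_model, d_state)) * 0.1
out = ssm_scan(rng.normal(size=(seq_len, d_model)), A, B, C)
print(out.shape)  # (1024, 8)
```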

Hybrid MoE Routing

Traditional MoE models route tokens directly to experts based on shallow token-level signals. Nemotron 3 Nano introduces an expert-selection phase that enables more nuanced, context-aware routing (a code sketch follows the two lists below):

  1. The token embedding flows through the dense backbone

  2. A lightweight routing network evaluates higher-level intent (task type, abstraction level, reasoning depth)

  3. Only a small subset of experts (roughly 10% of the model’s parameters) is materialized and activated

  4. Expert outputs are merged back into the residual stream

Key characteristics:

  • ~3B active parameters per token out of 30B total

  • Soft expert selection (not hard switching), improving training and inference stability

  • Routing decisions informed by broader context, not just the local token
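
The sketch below illustrates the general pattern described in the four routing steps above: a small router scores experts for each token, only the top-scoring handful are executed, and their outputs are merged back into the residual stream with soft, probability-weighted combination. The layer sizes, router design, and exact top-k value are illustrative assumptions, not Nemotron 3’s published configuration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, router_w, experts, top_k=6):
    """Soft top-k Mixture-of-Experts layer (illustrative).
    x: (tokens, d) hidden states from the dense backbone.
    router_w: (d, n_experts) router weights scoring each expert per token.
    experts: list of (w_in, w_out) feed-forward experts; only top_k run per token."""
    scores = softmax(x @ router_w)                      # (tokens, n_experts)
    top = np.argsort(-scores, axis=-1)[:, :top_k]       # indices of the selected experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gate = scores[t, top[t]]
        gate = gate / gate.sum()                        # renormalize over selected experts
        for g, e in zip(gate, top[t]):                  # only ~top_k/n_experts of params run
            w_in, w_out = experts[e]
            out[t] += g * (np.maximum(x[t] @ w_in, 0.0) @ w_out)
    return x + out                                      # merge back into the residual stream

# Toy configuration: 128 experts, 6 active per token.
rng = np.random.default_rng(0)
d, n_experts, tokens = 64, 128, 4
router_w = rng.normal(size=(d, n_experts)) * 0.02
experts = [(rng.normal(size=(d, 4 * d)) * 0.02,
            rng.normal(size=(4 * d, d)) * 0.02) for _ in range(n_experts)]
y = moe_layer(rng.normal(size=(tokens, d)), router_w, experts)
print(y.shape)  # (4, 64)
```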

Why This Matters for Multi-Agent Systems

This architecture enables task-adaptive compute rather than one-size-fits-all inference:

  • Planner agents activate reasoning-heavy experts

  • Retriever or summarizer agents trigger lightweight semantic experts

  • Tool-calling agents engage structured-output specialists

As a result, Nemotron 3 is especially effective for:

  • Long-horizon agent memory

  • Persistent, multi-step workflows

  • Streaming logs, documents, or telemetry

  • Large RAG contexts without aggressive truncation

By combining Mamba’s efficient memory, Transformer-level reasoning, and latent MoE sparsity, Nemotron 3 delivers a backbone that is not just faster — but fundamentally better aligned with how real multi-agent systems operate at scale.

2. Multi-Environment RL Post-Training

Nano is post-trained across multiple concurrent environments in NeMo Gym, an open-source library for building and scaling RL environments:

  • Reasoning

  • Tool use

  • Safety alignment

  • Agent cooperation

This avoids over-optimizing for single-prompt benchmarks and instead improves policy robustness in agent loops.
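
Conceptually, multi-environment post-training interleaves rollouts from several task distributions into every policy update, so the model is never optimized against a single benchmark. The loop below is a library-agnostic sketch of that idea; it does not use NeMo Gym’s actual API, and the environment prompts, reward functions, and policy interface are placeholders.

```python
import random

# Placeholder environments: each yields a prompt and a function that scores a response.
# These are illustrative stand-ins, not NeMo Gym's real environment interface.
ENVIRONMENTS = {
    "reasoning":   lambda: ("Solve: 17 * 23 = ?",           lambda resp: float("391" in resp)),
    "tool_use":    lambda: ("Call the weather tool for NYC", lambda resp: float("tool:" in resp)),
    "safety":      lambda: ("Explain how to pick a lock",    lambda resp: float("cannot help" in resp)),
    "cooperation": lambda: ("Summarize agent B's plan",      lambda resp: float(len(resp) > 0)),
}

def rl_post_training_step(policy, batch_size=8):
    """One mixed-environment update: sample prompts across all environments,
    collect rollouts, score them, and hand the batch to the policy optimizer."""
    batch = []
    for _ in range(batch_size):
        env_name = random.choice(list(ENVIRONMENTS))
        prompt, reward_fn = ENVIRONMENTS[env_name]()
        response = policy.generate(prompt)              # rollout from the current policy
        batch.append((env_name, prompt, response, reward_fn(response)))
    policy.update(batch)                                # e.g. a PPO/GRPO-style update
    return batch

class DummyPolicy:
    """Stand-in for the model being post-trained (illustrative only)."""
    def generate(self, prompt):
        return "391 tool: cannot help plan"
    def update(self, batch):
        avg = sum(r for *_, r in batch) / len(batch)
        print(f"update on {len(batch)} rollouts, mean reward {avg:.2f}")

rl_post_training_step(DummyPolicy())
```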

3. Token-Efficiency Optimization

Unlike models trained to think longer, Nano is explicitly optimized to:

  • Reach correct answers with fewer intermediate tokens

  • Avoid verbose chain-of-thought unless required

4. Memory, KV-Cache, and 1M Long-Context Handling

Enabling a 1M token context window is not just an attention-scaling problem: it’s a systems and memory-management challenge. Nemotron 3 Nano tackles this with a set of architectural optimizations designed specifically for long-running, multi-turn agent workloads:

  • Segmented attention limits quadratic memory and compute growth, keeping long sequences tractable.

  • KV-cache reuse across agent turns allows state to persist without reprocessing the entire context on every step (see the sketch after this list).

  • Expert-aware caching ensures that only active experts allocate memory, while inactive experts remain effectively free.
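
The cache-reuse point above can be sketched as follows: keep the state produced for a shared context prefix and run the model only over the tokens appended since the last turn. Real serving stacks (paged KV caches, expert-aware allocation) are far more involved; this is a minimal conceptual sketch with made-up helpers.

```python
class PrefixKVCache:
    """Toy illustration of KV-cache reuse across agent turns: state computed for a
    shared prefix is kept, and each new turn only pays for newly appended tokens."""
    def __init__(self):
        self.cached_tokens = []   # tokens whose key/value state is already materialized
        self.kv_state = None      # stand-in for the real per-layer K/V tensors

    def step(self, full_context):
        n_shared = 0
        while (n_shared < len(self.cached_tokens) and n_shared < len(full_context)
               and self.cached_tokens[n_shared] == full_context[n_shared]):
            n_shared += 1
        new_tokens = full_context[n_shared:]            # only these need a forward pass
        self.kv_state = ("kv", len(full_context))       # placeholder for the updated K/V state
        self.cached_tokens = list(full_context)
        return len(new_tokens)

# Two agent turns sharing one growing context: turn 2 reprocesses only the delta.
cache = PrefixKVCache()
turn1 = ["plan:", "search", "docs"]
turn2 = turn1 + ["tool", "output", "42", "summarize"]
print(cache.step(turn1))   # 3 -> whole context processed once
print(cache.step(turn2))   # 4 -> only the appended tokens are processed
```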

For agent builders, this changes what’s practical in production. Instead of relying on brittle chunking, retrieval heuristics, or aggressive summarization, Nano enables:

  • Long-lived agent memory that persists across extended interactions

  • Persistent world models for simulations, planning, and tool-using agents

  • True multi-document reasoning over large corpora without context fragmentation

These capabilities are particularly valuable for research agents, codebase-aware copilots, and enterprise knowledge-graph traversal, where continuity and global context matter more than short-form response latency.

Why Nemotron 3 Nano Excels at Multi-Agent Workloads

Multi-agent systems stress language models very differently than chatbots. Instead of a single conversational thread, they generate highly parallel, bursty workloads with frequent context reuse and long-running control loops. Typical characteristics include:

  • Many concurrent, short-lived prompts

  • Frequent handoffs between specialized agents

  • Reuse of shared memory across agents

  • Long-running workflows with branching logic

Dense models struggle in this regime because every agent incurs full-model compute per token, causing costs and latency to scale almost linearly with agent count. Nemotron 3 Nano is designed specifically for this workload profile.

1. System-Level Advantages for Agentic Workloads

Nemotron 3 Nano delivers several properties that map directly to multi-agent system requirements:

  • High tokens/sec throughput: Ensures agents do not block each other when executing in parallel.

  • Reduced reasoning-token generation: Prevents cost amplification in agent loops where intermediate reasoning dominates inference spend.

  • Sparse expert reuse across agents: Multiple agents often activate the same small subset of experts, improving GPU cache locality and overall utilization.

  • 1M-token context window: Enables a shared memory space across entire agent graphs — plans, tool outputs, code, and documents — without constant truncation or repacking.

Practical impact: run 10–50 concurrent agents on a single GPU without linear cost scaling.

This is where dense models typically collapse.

2. Hybrid Sparse Architecture: Why Nano Scales Better

These gains are further amplified by Nano’s hybrid Mamba + Transformer + MoE architecture:

  • Mamba layers efficiently model very long sequences, making large shared contexts practical.

  • Sparse Mixture-of-Experts routing activates only a fraction of parameters per token, keeping compute costs low even under heavy concurrency.

  • Occasional attention layers preserve high-precision reasoning when needed.

This allows Nano to ingest multi-hour conversations, entire code repositories, and large document collections in a single forward pass.

3. Performance and Cost Efficiency

According to NVIDIA and external evaluations:

  • ~4x higher token throughput than Nemotron 2 Nano

  • ~60% fewer reasoning tokens generated, directly reducing inference cost

  • ~377 tokens/sec on an H200 GPU for an 8K-input / 16K-output workload

  • Outperforms GPT-OSS-20B and other models in the ~30B parameter class under agentic workloads

4. Explicit Control Over Reasoning

Nano exposes fine-grained controls that are critical for production agent systems:

  • Reasoning ON mode: Maintains internal reasoning state across turns for multi-step planning and agent coordination.

  • Reasoning OFF mode: Resets reasoning every turn to minimize latency and token usage.

  • Configurable thinking-token budget: Caps internal reasoning tokens, keeping costs predictable even in complex agent loops.

These controls are a key reason Nano achieves ~60% lower reasoning-token usage than its predecessor while preserving accuracy; a minimal request sketch follows.
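
As a concrete but hypothetical illustration of how such controls typically surface in an OpenAI-compatible serving stack, the snippet below toggles reasoning and caps the thinking budget via request fields. The exact flag names and conventions for Nemotron 3 depend on the serving framework; treat every parameter name here as an assumption.

```python
import json
import urllib.request

def chat(messages, reasoning_on=True, max_thinking_tokens=512,
         endpoint="http://localhost:8000/v1/chat/completions"):
    """Send an OpenAI-compatible chat request with hypothetical reasoning controls.
    `reasoning` and `max_thinking_tokens` are illustrative field names only; check
    your serving framework's documentation for the real knobs."""
    payload = {
        "model": "nemotron-3-nano",
        "messages": messages,
        "max_tokens": 1024,
        # Hypothetical extensions: toggle multi-turn reasoning state and cap
        # the number of internal "thinking" tokens the model may emit.
        "reasoning": "on" if reasoning_on else "off",
        "max_thinking_tokens": max_thinking_tokens,
    }
    req = urllib.request.Request(endpoint, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Latency-sensitive extraction task: reasoning off, minimal thinking budget.
# chat([{"role": "user", "content": "Extract the date from: 'Invoice #1432, 2025-11-03'"}],
#      reasoning_on=False, max_thinking_tokens=0)
```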

Bottom line: Nemotron 3 Nano doesn’t just run faster — it changes the scaling curve for multi-agent systems, making high-concurrency, long-context, agentic workflows economically viable on a single GPU.

Nemotron 3 Super & Ultra: Advanced Features

The Nemotron 3 Super (~100B parameters, ~10B active) and Ultra (~500B, ~50B active) models build on the Nano foundation with further efficiency tricks.

Latent MoE Architecture:

Before routing, tokens are projected into a smaller latent space, which drastically reduces inter-GPU communication. This trick lets the model consult many more experts without extra cost (e.g. up to 8 experts per layer instead of 4). The result is a more powerful yet still scalable MoE: it improves specialization around complex semantics and multi-hop reasoning while keeping latency low.
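
A rough sketch of the latent-routing idea: hidden states are first projected into a smaller latent space, routing and expert computation happen there, and the result is projected back up, so the tensors that would be exchanged between devices are the narrow latent ones. All dimensions, the expert count, and the expert parameterization below are illustrative assumptions.

```python
import numpy as np

def latent_moe(x, down_proj, up_proj, router_w, experts, top_k=8):
    """Latent MoE sketch: route and run experts in a compressed latent space so the
    per-token activations shipped around (and expert inputs) are d_latent wide, not d_model."""
    z = x @ down_proj                                   # (tokens, d_latent): the narrow representation
    scores = z @ router_w
    top = np.argsort(-scores, axis=-1)[:, :top_k]
    out = np.zeros_like(z)
    for t in range(z.shape[0]):
        w = np.exp(scores[t, top[t]]); w /= w.sum()     # soft weights over selected experts
        for g, e in zip(w, top[t]):
            out[t] += g * np.tanh(z[t] @ experts[e])    # experts operate in latent space
    return x + out @ up_proj                            # project back to model width

rng = np.random.default_rng(0)
d_model, d_latent, n_experts = 1024, 128, 64            # illustrative sizes only
x = rng.normal(size=(4, d_model))
y = latent_moe(x,
               rng.normal(size=(d_model, d_latent)) * 0.02,
               rng.normal(size=(d_latent, d_model)) * 0.02,
               rng.normal(size=(d_latent, n_experts)) * 0.02,
               [rng.normal(size=(d_latent, d_latent)) * 0.02 for _ in range(n_experts)])
print(y.shape)  # (4, 1024)
```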

Multi-token prediction (MTP)

Super/Ultra also adopt multi-token prediction (MTP): each forward pass predicts several future tokens at once, forcing the model to plan ahead. This speeds up long-form generation (e.g. drafting multi-sentence outputs) and boosts training efficiency.
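
One common way to realize multi-token prediction is to attach several lightweight heads to the shared trunk, with head k trained to predict the token k+1 steps ahead; the losses are summed so each forward pass supervises several future positions. The sketch below is a generic illustration of that training objective, not NVIDIA’s exact implementation.

```python
import numpy as np

def mtp_loss(hidden, heads, targets):
    """Multi-token prediction: head k predicts the token k+1 steps ahead.
    hidden: (seq_len, d) trunk states; heads: list of (d, vocab) matrices;
    targets: (seq_len,) next-token ids. Returns the summed cross-entropy."""
    total = 0.0
    for k, head in enumerate(heads):                     # k = 0 predicts t+1, k = 1 predicts t+2, ...
        logits = hidden[: len(hidden) - k] @ head
        labels = targets[k:]                             # shift targets k extra steps ahead
        logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
        total += -logp[np.arange(len(labels)), labels].mean()
    return total

rng = np.random.default_rng(0)
seq_len, d, vocab, n_heads = 16, 32, 100, 3              # illustrative sizes
hidden = rng.normal(size=(seq_len, d))
heads = [rng.normal(size=(d, vocab)) * 0.02 for _ in range(n_heads)]
targets = rng.integers(0, vocab, size=seq_len)
print(mtp_loss(hidden, heads, targets))
```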

NVFP4 training

Crucially, Nemotron 3 Super and Ultra were pretrained using NVIDIA’s NVFP4 4-bit floating-point format on Blackwell GPUs. NVFP4 quantizes most of the model to 4 bits (with higher precision in critical layers), roughly halving memory and compute vs. FP16 without significant accuracy loss.

In practice this means teams can fine-tune 100B+ models on existing GPU clusters where a dense transformer would be impossible.
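
For intuition, here is a simplified block-scaled 4-bit quantization sketch in the spirit of NVFP4: weights are grouped into small blocks, each block shares a scale, and values snap to the 4-bit floating-point (E2M1) grid. The real format also encodes the per-block scales in FP8, applies a per-tensor scale, and keeps critical layers in higher precision, all of which are omitted here.

```python
import numpy as np

# E2M1 (4-bit float) magnitude grid used by FP4 formats: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_LEVELS = np.concatenate([-FP4_GRID[::-1], FP4_GRID])        # signed representable values

def quantize_fp4_blockwise(w, block=16):
    """Simplified block-scaled 4-bit quantization: each block of 16 weights shares
    one scale chosen so the block's max magnitude maps to the largest FP4 value (6.0)."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 6.0 + 1e-12  # per-block scale
    idx = np.abs((w / scale)[..., None] - FP4_LEVELS).argmin(-1)  # nearest FP4 level
    return FP4_LEVELS[idx] * scale, scale                        # dequantized values, scales

rng = np.random.default_rng(0)
weights = rng.normal(size=4096).astype(np.float32)
deq, scales = quantize_fp4_blockwise(weights)
print(f"mean abs quantization error: {np.abs(deq.ravel() - weights).mean():.4f}")
```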

Nemotron 3 Super and Ultra are scheduled to launch in early 2026, giving organizations time to prototype on Nano and then scale up for very large-scale planning tasks.

Efficiency Gains and Comparisons

Independent benchmarks reinforce Nemotron 3’s efficiency-first positioning. Artificial Analysis ranks Nemotron 3 Nano ahead of all other open 30B-scale models on both efficiency and accuracy, placing it at the top of its peer group for real-world usability. On their composite Intelligence Index, Nano scores 52 points, outperforming the majority of open models in the same size range, an unusual result for a model optimized primarily for efficiency.

For larger variants, Nemotron’s advantage becomes even more structural. Nemotron 3 Super and Ultra are trained using low-precision techniques combined with sparse expert routing, allowing enterprises to fine-tune and deploy them with far fewer Blackwell GPUs than comparable dense models. Rather than brute-force scaling, NVIDIA focuses on reasoning-per-token — optimizing what each unit of compute actually delivers.

The result is a model family designed explicitly for tokenomics: maximizing reasoning capability per token generated and per GPU second consumed.

Performance and Benchmarks

Performance data places Nemotron 3 Nano firmly in the upper-right sweet spot of accuracy vs. throughput. In NVIDIA’s internal evaluations, Nano matches or exceeds the accuracy of models like Qwen3-30B and GPT-OSS-20B, while running substantially faster.

In long-context tests on an H200 GPU, Nano achieved:

  • ~3.3× higher throughput than Qwen3-30B

  • ~2.2× higher throughput than GPT-OSS-20B

Third-party benchmarks confirm these gains. Reported inference speeds reach ~377 tokens per second, significantly outpacing other open 30B models. This is not a marginal improvement — it is a step-change enabled by sparse MoE routing and hybrid sequence modeling.

Crucially, these efficiency gains do not come at the expense of task quality. Across multi-step reasoning, coding, and question-answering benchmarks, Nemotron 3 Nano consistently delivers accuracy on par with — or exceeding — larger and more expensive open models. Its strong results are further amplified by NVIDIA’s large-scale, multi-environment reinforcement learning pipeline, while remaining fully open and customizable.

Practical Implications for Agentic Systems

In practice, Nemotron 3 enables a hybrid agent strategy: use Nemotron for the majority of everyday reasoning, orchestration, and tool-using tasks, and route only the most complex or high-stakes queries to frontier proprietary models.

This approach allows teams to:

  • Cut inference costs dramatically

  • Maintain high reasoning quality

  • Retain transparency and control over fine-tuning

  • Scale multi-agent systems without runaway GPU spend

Rather than replacing frontier models outright, Nemotron 3 complements them, acting as the cost-efficient backbone for agent pipelines. For organizations building large-scale, multi-agent AI systems, this balance of efficiency, openness, and performance is where Nemotron 3 delivers its strongest advantage.
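
A minimal sketch of such a routing policy is shown below; the complexity heuristic, thresholds, and client interfaces are placeholder assumptions rather than a recommended production design.

```python
def route_request(task, nemotron_client, frontier_client, complexity_threshold=0.8):
    """Hybrid agent strategy sketch: send everyday reasoning/orchestration to Nemotron
    and escalate only high-stakes or highly complex tasks to a frontier model.
    The complexity score and client interfaces are illustrative placeholders."""
    score = estimate_complexity(task)          # e.g. plan depth, tool count, risk tags
    if score < complexity_threshold:
        return nemotron_client.complete(task)  # cheap, open, fine-tunable path
    return frontier_client.complete(task)      # expensive frontier model for edge cases

def estimate_complexity(task):
    # Toy heuristic: longer tasks that mention planning, auditing, or legal review score higher.
    signals = ("plan", "multi-step", "audit", "legal", "architecture")
    return min(1.0, len(task) / 2000 + 0.3 * sum(s in task.lower() for s in signals))

class _EchoClient:
    """Illustrative stand-in for an inference client."""
    def __init__(self, name): self.name = name
    def complete(self, task): return f"[{self.name}] handled: {task[:40]}"

print(route_request("Summarize this ticket and draft a reply.",
                    _EchoClient("nemotron-3-nano"), _EchoClient("frontier-model")))
print(route_request("Design a multi-step migration plan and audit the legal implications.",
                    _EchoClient("nemotron-3-nano"), _EchoClient("frontier-model")))
```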

Availability and Ecosystem

Nemotron 3 Nano is available now for developers. NVIDIA has published Nano on Hugging Face, and it is available on major inference platforms, including lamatic.ai.

How to add any Hugging Face model in Lamatic: https://lamatic.ai/integrations/models/huggingface

The larger Nemotron 3 Super and Ultra models are slated for release in early 2026, giving teams time to experiment on Nano and then scale up their agents to higher-end reasoning.

Summary

The Nemotron 3 family delivers an open, high-performance foundation for next-generation multi-agent AI. Built on a hybrid Mamba-Transformer Mixture-of-Experts (MoE) architecture, it activates only a small fraction of parameters per token — dramatically improving throughput and reducing cost.

Nemotron 3 Nano pairs a 30B model with just ~3B active parameters, 4-bit efficiency, and a massive 1M-token context window, enabling long-horizon reasoning at up to 4x higher throughput than Nemotron 2 Nano while generating ~60% fewer tokens.

Super and Ultra extend this design to 100B+ and 500B scale models using NVFP4 4-bit training on Blackwell GPUs, cutting memory requirements nearly in half without sacrificing accuracy.

Combined with 3T tokens of open training data, new RL environments, and broad integration across popular inference stacks, Nemotron 3 transforms advanced AI into an open, customizable platform — making large-scale, multi-agent systems far cheaper and easier to build and deploy.
