NVIDIA releases Nemotron-Labs-TwoTower: an open-weight diffusion language model built on a frozen autoregressive Nemotron-3-Nano-30B-A3B backbone

by root July 1, 2026


NVIDIA has released Nemotron-Labs-TwoTower, a diffusion language model built on a pre-trained autoregressive backbone. It ships as open weights under the NVIDIA Nemotron Open Model License. The release targets throughput bottlenecks in text generation.

Autoregressive (AR) models decode one token at a time. This sequential process limits generation throughput. Discrete diffusion language models take a different route: tokens are generated in parallel and refined iteratively.

Most such models use one network for two jobs: representing clean tokens and denoising corrupted tokens at every step. TwoTower separates these jobs into two towers. It retains 98.7% of the AR baseline's aggregate benchmark quality, and NVIDIA also reports 2.42 times higher wall-clock generation throughput.

TL;DR

TwoTower splits the diffusion model into a frozen AR context tower and a trained denoiser tower.
It retains 98.7% of AR quality at 2.42× throughput (γ=0.8, S=16, 2×H100).
The denoiser was trained on roughly 2.1T tokens; the backbone used 25T.
One checkpoint runs diffusion, mock-AR, and AR decoding modes.

Nemotron-Labs-TwoTower

TwoTower is a block-wise autoregressive diffusion model. It is instantiated on an open-weight hybrid backbone, Nemotron-3-Nano-30B-A3B. This backbone interleaves Mamba-2, self-attention, and Mixture-of-Experts (MoE) layers.

Each tower has 52 layers: 23 Mamba-2, 6 self-attention, and 23 MoE. The released checkpoint ships with both towers, for a total of around 60B parameters. Active parameters per token are roughly 3B per tower. Each MoE layer uses 128 routed experts, of which 6 are active per token, plus 2 shared experts.
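To make the routing numbers concrete, here is a minimal sketch of top-k expert selection for a single token, with 128 routed experts, 6 active, and 2 always-on shared experts. The helper name route_token, the random logits, and the softmax-over-selected-logits weighting are illustrative assumptions; the actual Nemotron router may score and normalize experts differently.

```python
import math
import random

NUM_ROUTED = 128   # routed experts per MoE layer (from the article)
TOP_K = 6          # routed experts active per token (from the article)
NUM_SHARED = 2     # shared experts, always active (from the article)


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k routed experts for one token and weight them.

    Hypothetical helper: weights are a softmax over the selected logits,
    which is one common choice and not necessarily what Nemotron uses.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[e] for e in chosen)
    exps = [math.exp(router_logits[e] - peak) for e in chosen]
    total = sum(exps)
    return chosen, [x / total for x in exps]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]  # made-up router scores
experts, weights = route_token(logits)

active_experts = len(experts) + NUM_SHARED
print("routed experts chosen:", experts)
print("experts run for this token:", active_experts, "of", NUM_ROUTED + NUM_SHARED)
```

Only 8 of the 130 experts in a layer run for any given token, which is why the active parameter count per tower is so much smaller than the total.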

Both towers start as copies of the same backbone checkpoint. Only the denoiser tower is trained; the AR context tower stays frozen. The denoiser was trained on roughly 2.1T tokens, a fraction of the backbone's 25T-token pre-training.

How the two towers work

The AR context tower runs causally over the prompt and committed tokens. It produces a per-layer KV cache and the final Mamba-2 state. The backbone's autoregressive capability is preserved.

The diffusion denoiser tower refines noisy blocks. Within a block, it uses bidirectional intra-block attention. It remains causal with respect to the preceding clean blocks.
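The visibility pattern described above can be sketched as a boolean mask. This is a simplified single-sequence view that assumes fixed-size blocks; the function name block_attention_mask is hypothetical, and in TwoTower the clean preceding blocks are reached through the context tower rather than through one shared mask.

```python
def block_attention_mask(seq_len, block_size):
    """Return mask[q][k] = True when query position q may attend to key k.

    Positions in the same block see each other in both directions;
    positions in earlier blocks are visible; later blocks are hidden.
    """
    mask = []
    for q in range(seq_len):
        q_block = q // block_size
        mask.append([(k // block_size) <= q_block for k in range(seq_len)])
    return mask


mask = block_attention_mask(seq_len=8, block_size=4)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# Rows 0-3 print "xxxx....", rows 4-7 print "xxxxxxxx".
```

A purely causal mask would be lower-triangular; here the first block's rows are fully filled within the block, which is the bidirectional part.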

The towers are connected layer by layer. Denoiser layer i cross-attends to context tower layer i. This layer-aligned cross-attention gives multi-scale access to the backbone's representations. Earlier approaches broadcast only the last hidden state.
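A toy illustration of the layer alignment, assuming a single attention head and tiny hand-written caches. The cross_attend helper, the context_cache layout, and all numbers are made up for the example; the point is only that denoiser layer i reads the keys and values of context layer i rather than one shared final hidden state.

```python
import math


def cross_attend(queries, keys, values):
    """Single-head scaled dot-product attention over plain Python lists."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        outputs.append([sum(p * v[d] for p, v in zip(probs, values))
                        for d in range(len(values[0]))])
    return outputs


# Hypothetical toy caches: one (keys, values) pair per context-tower layer.
context_cache = {
    0: ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [3.0, 3.0]]),
    1: ([[1.0, 0.0], [0.0, 1.0]], [[10.0, 10.0], [30.0, 30.0]]),
}
denoiser_queries = [[0.0, 0.0]]  # one noisy-block position with a zero query

layer_outputs = {}
for layer in sorted(context_cache):
    keys, values = context_cache[layer]   # denoiser layer i reads context layer i
    layer_outputs[layer] = cross_attend(denoiser_queries, keys, values)[0]
    print(f"denoiser layer {layer} ->", layer_outputs[layer])
```

With a zero query the attention weights are uniform, so each layer returns the average of its own values, and the two layers see different information.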

Two further denoiser changes matter. Its Mamba-2 layers seed their initial state from the Mamba state in the context tower. The diffusion timestep modulates each layer through adaLN-single timestep conditioning. The adaLN module adds only about 1.5 million parameters.

Generation proceeds block by block. Each block starts as S [MASK] tokens. The denoiser refines it over T steps and then commits it. The context tower then processes the committed tokens and updates its cache.
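The loop above can be sketched with a stand-in denoiser. Everything here is a toy under stated assumptions: toy_denoiser returns random tokens and confidences instead of running a model, and the rule of always unmasking the single best guess when nothing clears the threshold is an assumption added so that every step makes progress.

```python
import random

MASK = None  # stand-in for the [MASK] token


def toy_denoiser(block, rng):
    """Hypothetical stand-in for the denoiser tower.

    Returns a (token, confidence) guess for every still-masked position.
    """
    return {i: (rng.randrange(1000), rng.random())
            for i, tok in enumerate(block) if tok is MASK}


def decode_block(block_size, max_steps, threshold, rng):
    """Refine one block of [MASK] tokens; return (tokens, steps_used)."""
    block = [MASK] * block_size
    for step in range(1, max_steps + 1):
        guesses = toy_denoiser(block, rng)
        confident = [i for i, (_, c) in guesses.items() if c >= threshold]
        if not confident:
            # Assumption: unmask the single best guess so each step progresses.
            confident = [max(guesses, key=lambda i: guesses[i][1])]
        for i in confident:
            block[i] = guesses[i][0]
        if MASK not in block:
            return block, step
    # Out of steps: fill whatever is still masked with a final guess.
    for i, (tok, _) in toy_denoiser(block, rng).items():
        block[i] = tok
    return block, max_steps


rng = random.Random(0)
committed = []            # what the context tower has seen so far
steps_per_block = []
for _ in range(4):        # generate 4 blocks of S = 16 tokens
    tokens, steps = decode_block(block_size=16, max_steps=16, threshold=0.8, rng=rng)
    committed.extend(tokens)   # a real context tower would update its cache here
    steps_per_block.append(steps)

print("tokens generated:", len(committed))
print("denoiser steps per block:", steps_per_block)
```

Whenever a block finishes in fewer than 16 steps, more than one token was committed per denoiser step on average.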

This explains why multiple denoising steps can still beat single-token decoding. Autoregressive decoding commits one token per step. TwoTower commits several tokens per step during the early stages of refinement.
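A back-of-the-envelope count makes the argument concrete. The cost model is an assumption, not a measurement: each block is charged its denoiser steps plus one context-tower call, and all calls are treated as equally expensive, which they are not in practice.

```python
def forward_passes(new_tokens, block_size, avg_denoise_steps):
    """Count sequential model calls under a simplified cost model.

    Assumption: each block costs avg_denoise_steps denoiser calls plus
    one context-tower call to absorb the committed block.
    """
    blocks = -(-new_tokens // block_size)  # ceiling division
    return blocks * (avg_denoise_steps + 1)


NEW_TOKENS = 128
ar_passes = NEW_TOKENS  # autoregressive decoding: one call per token

for steps in (16, 8, 4):
    tt_passes = forward_passes(NEW_TOKENS, block_size=16, avg_denoise_steps=steps)
    print(f"avg {steps:>2} denoise steps/block: {tt_passes} calls vs {ar_passes} AR calls")
# 16 steps/block -> 136 calls, 8 -> 72, 4 -> 40
```

At 16 steps per block (one token per step) the two-tower path is no better than AR; the gain appears only when several tokens are committed per step.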

Benchmarks

The evaluation uses BF16 on 2×H100 GPUs. The default operating point is confidence-based unmasking with threshold γ = 0.8 and block size S = 16. The table compares the AR baseline with TwoTower diffusion decoding.

Task	Nemotron-3-Nano-30B-A3B (AR)	Nemotron-Labs-TwoTower (diffusion)
MMLU (5-shot, acc)	78.56	78.24
MMLU-Pro (5-shot, CoT EM)	62.59	60.93
ARC-Challenge (25-shot, acc_norm)	91.72	92.66
WinoGrande (5-shot, acc)	76.09	76.09
RACE (0-shot, acc)	88.90	88.90
HumanEval (0-shot)	79.27	75.58
MBPP Sanitized (3-shot)	74.71	74.28
GSM8K (8-shot, acc)	92.49	90.14
MATH-500 (4-shot)	84.40	80.60
MMLU Global Lite (5-shot)	73.97	73.94
MGSM (8-shot, average acc)	80.80	80.40
Quality retained	100%	98.7%
Generation throughput (× AR)	1.0×	2.42×

General knowledge stays within about 1 point of the AR baseline. Code and math degrade somewhat. Commonsense and multilingual scores are matched or slightly improved. Lowering γ commits more tokens per step, raising throughput but reducing quality.
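The article does not say how the 98.7% figure is aggregated. As a sanity check, the sketch below recomputes it from the table in two plausible ways; both are assumptions about the method, and both land within a tenth of a point of the reported value.

```python
# Scores copied from the table above: task -> (AR baseline, TwoTower diffusion).
scores = {
    "MMLU": (78.56, 78.24),
    "MMLU-Pro": (62.59, 60.93),
    "ARC-Challenge": (91.72, 92.66),
    "WinoGrande": (76.09, 76.09),
    "RACE": (88.90, 88.90),
    "HumanEval": (79.27, 75.58),
    "MBPP Sanitized": (74.71, 74.28),
    "GSM8K": (92.49, 90.14),
    "MATH-500": (84.40, 80.60),
    "MMLU Global Lite": (73.97, 73.94),
    "MGSM": (80.80, 80.40),
}

# Per-task retention, then two candidate aggregates (both are guesses at the method).
ratios = {task: tt / ar for task, (ar, tt) in scores.items()}
mean_of_ratios = 100 * sum(ratios.values()) / len(ratios)
ratio_of_means = (100 * sum(tt for _, tt in scores.values())
                  / sum(ar for ar, _ in scores.values()))

worst = min(ratios, key=ratios.get)
print(f"mean of per-task ratios: {mean_of_ratios:.2f}%")
print(f"ratio of average scores: {ratio_of_means:.2f}%")
print("largest relative drop:", worst)
```

The per-task ratios also show where the loss concentrates: the largest relative drops are on HumanEval and MATH-500.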

Running it: three generation modes

The checkpoint exposes three inference paths. Full two-tower diffusion uses two GPUs, at roughly 59 GB per GPU in BF16. AR-only mode runs on a single 80 GB GPU.
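A rough weights-only estimate shows where these memory figures come from. It assumes about 30B parameters per tower (half of the roughly 60B total) and ignores the KV cache, Mamba state, and activations, so it is a ballpark check rather than a sizing guide.

```python
def weight_memory_gb(params_billion, bytes_per_param=2):
    """Weights-only footprint in decimal GB (BF16 = 2 bytes per parameter)."""
    return params_billion * 1e9 * bytes_per_param / 1e9


per_tower_gb = weight_memory_gb(30)   # assumption: ~30B parameters per tower
both_towers_gb = weight_memory_gb(60)

print(f"one tower:   ~{per_tower_gb:.0f} GB")    # one tower per GPU in diffusion mode
print(f"both towers: ~{both_towers_gb:.0f} GB")  # too large for a single 80 GB GPU
print("AR-only fits on one 80 GB GPU:", per_tower_gb < 80)
```

The estimate of about 60 GB per tower is in the same range as the roughly 59 GB per GPU quoted above, and explains why the two-tower path is split across two devices.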

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, trust_remote_code=True,
)
# context tower -> GPU 0, denoiser tower -> GPU 1
model.place_towers_on_devices("cuda:0", "cuda:1")
model.eval()

prompt = "France is a country "
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

# block_size is S and confidence_threshold is γ (the default operating point)
outputs = model.generate_mask_diffusion(
    inputs["input_ids"], max_new_tokens=128,
    block_size=16, steps_per_block=16, mask_token_id=3,
    temperature=0.1, confidence_threshold=0.8,
    eos_token_id=tokenizer.eos_token_id,
)
# decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

The three modes are generate_mask_diffusion(), generate_mock_ar(), and generate_ar(). Mask diffusion commits up to block_size tokens per step. Mock-AR and AR commit one token per step.

Where it fits: use cases

The most direct use case is speeding up batch generation. Data teams generating synthetic text can trade a small quality loss for throughput. At γ=0.8, the trade is 2.42 times the throughput for a 1.3% quality loss.

A second use case is tuning the tradeoff between quality and throughput. According to the NVIDIA paper, raising γ preserves more quality. Lowering γ commits more tokens per step to increase speed.
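The effect of γ on a single denoising step can be seen with a made-up set of confidences. The values below are hypothetical, and the rule of committing at least one token per step is an assumption; the sketch only shows that a lower threshold lets more positions through at once.

```python
def committed_this_step(confidences, threshold):
    """Count positions whose confidence clears the threshold (at least one)."""
    return max(1, sum(c >= threshold for c in confidences))


# Hypothetical per-position confidences for one 16-token block.
confidences = [0.99, 0.97, 0.95, 0.91, 0.88, 0.85, 0.82, 0.79,
               0.74, 0.70, 0.66, 0.61, 0.55, 0.48, 0.40, 0.31]

for gamma in (0.9, 0.8, 0.6):
    count = committed_this_step(confidences, gamma)
    print(f"gamma={gamma}: commit {count} of {len(confidences)} tokens this step")
# gamma=0.9 -> 4, gamma=0.8 -> 7, gamma=0.6 -> 12
```

The tokens admitted by a lower γ are exactly the ones the model was less sure about, which is where the quality loss comes from.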

The third use case is drop-in adaptation. The context tower keeps its LM head, so it can serve speculative decoding, verification, or AR scoring. Teams can run both AR and diffusion decoding from one checkpoint.

Advantages and disadvantages

Strengths:

Open weights under the NVIDIA Nemotron Open Model License, ready for commercial use
Retains 98.7% of AR quality at 2.42× throughput at the default operating point
Supports diffusion, mock-AR, and AR decoding from a single checkpoint
Denoiser was trained on roughly 2.1T tokens instead of a full re-pretraining
Cache memory scales with sequence length the same way as the AR baseline

Weaknesses:

Full two-tower diffusion requires 2 GPUs and up to 59 GB per GPU in BF16
Code and math drop more than general knowledge (HumanEval 79.27 → 75.58)
Keeping both towers resident increases the fixed memory footprint for model weights
The released checkpoint is a base model, before any instruction tuning or alignment
Pushing throughput above 3× incurs a larger quality loss
