NVIDIA has released Nemotron-Labs-TwoTower, a diffusion language model built on a pre-trained autoregressive backbone. It ships as open weights under the NVIDIA Nemotron Open Model License. The release targets throughput bottlenecks in text generation.
Autoregressive (AR) models decode one token at a time, and this sequential process limits generation throughput. Discrete diffusion language models take a different route: tokens are generated in parallel and refined iteratively.
Most existing diffusion language models use one network for two jobs: representing clean tokens and denoising corrupted tokens at every step. TwoTower separates these jobs into two towers. It retains 98.7% of the AR baseline's aggregate benchmark quality, and NVIDIA also reports 2.42× higher wall-clock generation throughput.
TL;DR
- TwoTower splits diffusion into a frozen AR context tower and a trained denoiser tower.
- It retains 98.7% of AR quality at 2.42× throughput (γ=0.8, S=16, 2×H100).
- The denoiser was trained on roughly 2.1T tokens, compared with 25T for the backbone.
- One checkpoint runs diffusion, mock-AR, and AR decoding modes.
Nemotron-Labs-TwoTower
TwoTower is a block-wise autoregressive diffusion model. It is instantiated on an open-weight hybrid backbone, Nemotron-3-Nano-30B-A3B, which interleaves Mamba-2, self-attention, and Mixture-of-Experts (MoE) layers.
Each tower has 52 layers: 23 Mamba-2, 6 self-attention, and 23 MoE. The released checkpoint ships with both towers, for a total of around 60B parameters. Active parameters per token are roughly 3B per tower. Each MoE layer uses 128 routable experts, of which 6 are active, plus 2 additional shared experts.
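The layer and expert counts above can be sanity-checked with a few lines of bookkeeping. This is only arithmetic on the numbers quoted in this article; the class and field names are hypothetical and do not mirror the checkpoint's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TowerSpec:
    """Hypothetical container for the per-tower counts quoted in the article."""
    mamba2_layers: int = 23
    attention_layers: int = 6
    moe_layers: int = 23
    routed_experts: int = 128
    active_routed_experts: int = 6
    shared_experts: int = 2

    @property
    def total_layers(self) -> int:
        return self.mamba2_layers + self.attention_layers + self.moe_layers

    @property
    def experts_used_per_token(self) -> int:
        # Routed experts selected for a token plus the shared experts.
        return self.active_routed_experts + self.shared_experts

    @property
    def routed_fraction(self) -> float:
        # Share of the routable experts that is active for a token.
        return self.active_routed_experts / self.routed_experts


spec = TowerSpec()
print(spec.total_layers)              # 52
print(spec.experts_used_per_token)    # 8
print(f"{spec.routed_fraction:.1%}")  # 4.7%
```

Only a small slice of the routable experts runs per token, which is consistent with the gap between the nominal 30B parameters per tower and the roughly 3B active parameters.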
Both towers start as copies of the same backbone checkpoint. Only the denoiser tower is trained; the AR context tower stays frozen. The denoiser was trained on roughly 2.1T tokens, a small fraction of the backbone's 25T-token pre-training.
How the two towers work
The AR context tower runs causally over the prompt and the committed tokens. It produces a per-layer KV cache and the final Mamba-2 states. The backbone's autoregressive capability is preserved.
The diffusion denoiser tower refines noisy blocks. Within a block it uses bidirectional intra-block attention, while staying causal with respect to the preceding clean blocks.
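The attention pattern just described, bidirectional inside a block and causal across blocks, can be written as a small mask. This is a minimal sketch and not NVIDIA's implementation: it flattens everything into one sequence for illustration, whereas in TwoTower the earlier clean blocks are reached through the context tower. The function name is hypothetical.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    Positions in the same block see each other in both directions; positions
    in earlier blocks are visible, positions in later blocks are not.
    """
    mask = []
    for q in range(seq_len):
        q_block = q // block_size
        mask.append([(k // block_size) <= q_block for k in range(seq_len)])
    return mask


mask = block_causal_mask(seq_len=6, block_size=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xxx...
# xxx...
# xxx...
# xxxxxx
# xxxxxx
# xxxxxx
```

Each row is one query position. The first block never sees the second, while the second block sees all of the first and all of itself.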
The towers are connected layer by layer: denoiser layer i cross-attends to context tower layer i. This layer-aligned cross-attention gives the denoiser multi-scale access to the backbone's representations, whereas earlier approaches broadcast only the final hidden state.
Two further denoiser changes matter. Each Mamba-2 layer seeds its initial state from the corresponding Mamba state in the context tower. The diffusion timestep modulates each layer through adaLN-single timestep conditioning. The adaLN module adds only around 1.5 million parameters.
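To make the timestep conditioning concrete, here is a minimal sketch of adaptive LayerNorm modulation in plain Python. It assumes the common form "normalize, then scale and shift by timestep-dependent values". The `toy_timestep_params` function is a hypothetical stand-in for the learned projection, and the sharing of one timestep projection across layers, which is what typically keeps adaLN-single designs small, is not modeled.

```python
import math


def layer_norm(x: list[float], eps: float = 1e-5) -> list[float]:
    """Plain LayerNorm without learned affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln_modulate(x: list[float], shift: list[float], scale: list[float]) -> list[float]:
    """Adaptive LayerNorm: normalize, then apply a timestep-dependent affine map."""
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]


def toy_timestep_params(t: float, dim: int) -> tuple[list[float], list[float]]:
    """Hypothetical stand-in for the learned timestep projection.

    Returns (shift, scale); both are zero at t = 0 and grow linearly with t.
    """
    return [0.1 * t] * dim, [0.5 * t] * dim


hidden = [1.0, 2.0, 3.0, 4.0]
shift0, scale0 = toy_timestep_params(0.0, len(hidden))
shift1, scale1 = toy_timestep_params(1.0, len(hidden))

clean = adaln_modulate(hidden, shift0, scale0)  # identical to plain LayerNorm
noisy = adaln_modulate(hidden, shift1, scale1)  # same activations, rescaled and shifted
print(clean)
print(noisy)
```

With zero shift and scale the layer behaves like an ordinary LayerNorm; a non-zero timestep changes the same activations without adding a separate network per noise level.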
Generation proceeds block by block. Each block starts as S [MASK] tokens. The denoiser refines the block over up to T steps and then commits it. The context tower then processes the committed tokens and updates its cache.
This is why running several denoising steps can still beat single-token decoding. Autoregressive decoding commits one token per step, whereas TwoTower commits multiple tokens per step during the early stages of refinement.
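The block loop can be simulated with a toy denoiser to show how a confidence threshold turns into tokens committed per step. Everything here is an assumption made for illustration: the random `toy_denoiser`, the rule of always committing the single most confident slot, and the forced commit on the last step are not taken from NVIDIA's code.

```python
import random

MASK = None  # stand-in for the [MASK] token


def toy_denoiser(block, rng, step):
    """Hypothetical denoiser: proposes (token_id, confidence) for each masked slot.

    Confidence is random but drifts upward with each step, mimicking a block
    that gets easier to fill as more of it is revealed.
    """
    proposals = {}
    for pos, tok in enumerate(block):
        if tok is MASK:
            confidence = min(1.0, rng.random() + 0.1 * step)
            proposals[pos] = (rng.randrange(1000), confidence)
    return proposals


def decode_block(block_size, max_steps, threshold, rng):
    """Refine one block; return the filled block and the number of steps used."""
    block = [MASK] * block_size
    steps = 0
    while MASK in block and steps < max_steps:
        proposals = toy_denoiser(block, rng, steps)
        last_step = steps == max_steps - 1
        # Commit slots above the threshold; on the last step commit everything.
        accepted = [p for p, (_, c) in proposals.items() if c >= threshold or last_step]
        if not accepted:
            # Assumption: always commit the most confident slot so the loop progresses.
            accepted = [max(proposals, key=lambda p: proposals[p][1])]
        for pos in accepted:
            block[pos] = proposals[pos][0]
        steps += 1
    return block, steps


rng = random.Random(0)
tokens, steps = decode_block(block_size=16, max_steps=16, threshold=0.8, rng=rng)
print(f"{len(tokens)} tokens in {steps} denoiser steps "
      f"({len(tokens) / steps:.2f} tokens per step)")
```

Lowering `threshold` makes the loop accept more slots per step and finish in fewer steps, which is the role γ plays. The toy says nothing about the quality cost, and real speedups also depend on the cost of each denoiser step and of the context tower pass.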
Benchmarks
The evaluation uses BF16 on 2×H100 GPUs. The default operating point is confidence-based unmasking with threshold γ = 0.8 and block size S = 16. The table compares the AR baseline with TwoTower diffusion decoding.
| Task | Nemotron-3-Nano-30B-A3B (AR) | Nemotron-Labs-TwoTower (diffusion) |
|---|---|---|
| MMLU (5-shot, acc) | 78.56 | 78.24 |
| MMLU-Pro (5-shot, CoT EM) | 62.59 | 60.93 |
| ARC-Challenge (25-shot, acc_norm) | 91.72 | 92.66 |
| WinoGrande (5-shot, acc) | 76.09 | 76.09 |
| RACE (0-shot, acc) | 88.90 | 88.90 |
| HumanEval (0-shot) | 79.27 | 75.58 |
| MBPP Sanitized (3-shot) | 74.71 | 74.28 |
| GSM8K (8-shot, acc) | 92.49 | 90.14 |
| MATH-500 (4-shot) | 84.40 | 80.60 |
| MMLU Global Lite (5-shot) | 73.97 | 73.94 |
| MGSM (8-shot, average acc) | 80.80 | 80.40 |
| Quality retained | 100% | 98.7% |
| Generation throughput (× AR) | 1.0× | 2.42× |
General-knowledge scores stay close to the AR baseline (MMLU drops 0.32 points, MMLU-Pro 1.66). Code and math degrade more, with HumanEval and MATH-500 each losing close to 4 points. Commonsense and multilingual scores are largely retained, and ARC-Challenge improves slightly. Lowering γ commits more tokens per step, which raises throughput but reduces quality.
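The per-task gaps and the headline retention figure can be recomputed from the table. How NVIDIA aggregates the 98.7% is not stated here, so the unweighted mean of per-task ratios below is an assumption. It does land close to the reported number.

```python
# (task, AR baseline, TwoTower diffusion), copied from the benchmark table.
SCORES = [
    ("MMLU", 78.56, 78.24),
    ("MMLU-Pro", 62.59, 60.93),
    ("ARC-Challenge", 91.72, 92.66),
    ("WinoGrande", 76.09, 76.09),
    ("RACE", 88.90, 88.90),
    ("HumanEval", 79.27, 75.58),
    ("MBPP Sanitized", 74.71, 74.28),
    ("GSM8K", 92.49, 90.14),
    ("MATH-500", 84.40, 80.60),
    ("MMLU Global Lite", 73.97, 73.94),
    ("MGSM", 80.80, 80.40),
]

# Point difference per task (negative means diffusion decoding scored lower).
deltas = {task: round(tt - ar, 2) for task, ar, tt in SCORES}

# Assumed aggregation: unweighted mean of the per-task score ratios.
mean_retention = sum(tt / ar for _, ar, tt in SCORES) / len(SCORES)

worst_task = min(deltas, key=deltas.get)
print(f"mean per-task retention: {mean_retention:.1%}")
print(f"largest drop: {worst_task} ({deltas[worst_task]:+.2f} points)")
```

The same dictionary makes it easy to see that the losses concentrate in the code and math rows, while most other tasks move by well under a point.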
Running it: three generation modes
The checkpoint exposes three inference paths. Full two-tower diffusion uses two GPUs, at roughly 59 GB per GPU in BF16. AR-only mode runs on a single 80 GB GPU.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, trust_remote_code=True,
)

# Context tower -> GPU 0, denoiser tower -> GPU 1
model.place_towers_on_devices("cuda:0", "cuda:1")
model.eval()

prompt = "France is a country "
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

# Masked-diffusion decoding at the default operating point (S=16, γ=0.8)
outputs = model.generate_mask_diffusion(
    inputs["input_ids"], max_new_tokens=128,
    block_size=16, steps_per_block=16, mask_token_id=3,
    temperature=0.1, confidence_threshold=0.8,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
```
The three modes are `generate_mask_diffusion()`, `generate_mock_ar()`, and `generate_ar()`. Mask diffusion commits up to `block_size` tokens per step. Mock-AR and AR commit one token per step.
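The memory figures quoted for this section can be sanity-checked from parameter counts alone. The sketch below assumes BF16 weights at 2 bytes per parameter and two towers of equal size, and it ignores KV cache, Mamba states, and activations. The helper name is hypothetical.

```python
BYTES_PER_BF16 = 2


def weight_gb(params_billion: float, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9


total_params_b = 60.0               # "around 60B" for both towers, per the article
per_tower_b = total_params_b / 2    # assumption: the two towers are the same size

per_tower_gb = weight_gb(per_tower_b)
headroom_gb = 80 - per_tower_gb

print(f"weights per tower: ~{per_tower_gb:.0f} GB")
print(f"headroom on an 80 GB GPU: ~{headroom_gb:.0f} GB for caches and activations")
```

The estimate of about 60 GB per tower is in the same range as the roughly 59 GB per GPU quoted above. The exact figure depends on the true parameter count and on whether gigabytes are counted in decimal or binary units.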
Where it fits: use cases
The most direct use case is speeding up batch generation. Data teams producing synthetic text can accept a slight loss in quality in exchange for throughput. At γ=0.8, the trade is 2.42× faster generation for a 1.3% drop in aggregate quality.
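For a batch job, the throughput figure translates directly into wall-clock time. The sketch below assumes the reported 2.42× speedup carries over unchanged to the workload and hardware in question, which may not hold in practice. The 100-hour job is a made-up example.

```python
def diffusion_job_hours(ar_hours: float, speedup: float = 2.42) -> float:
    """Wall-clock hours for the same job at `speedup` times the AR throughput."""
    return ar_hours / speedup


ar_hours = 100.0  # hypothetical length of the job with AR decoding
tt_hours = diffusion_job_hours(ar_hours)
saved_fraction = 1 - tt_hours / ar_hours

print(f"{tt_hours:.1f} h instead of {ar_hours:.0f} h "
      f"({saved_fraction:.0%} less wall-clock time)")
# 41.3 h instead of 100 h (59% less wall-clock time)
```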
A second use case is tuning the tradeoff between quality and throughput. According to the NVIDIA paper, raising γ preserves more quality, while lowering γ commits more tokens per step to increase speed.
The third use case is drop-in adaptation. The context tower keeps its LM head, so it remains usable for speculative decoding, verification, or AR scoring. Teams can run both AR and diffusion decoding from one checkpoint.
Strengths and weaknesses
Strengths:
- Open weights under the NVIDIA Nemotron Open Model License, ready for commercial use
- Retains 98.7% of AR quality at 2.42× throughput at the default operating point
- Supports diffusion, mock-AR, and AR decoding from a single checkpoint
- Denoiser was trained on roughly 2.1T tokens instead of a full re-pretraining run
- Cache memory scales with sequence length the same way as the AR baseline
Weaknesses:
- Full two-tower diffusion requires 2 GPUs at roughly 59 GB per GPU in BF16
- Code and math drop more than general knowledge (HumanEval 79.27 → 75.58)
- Keeping both towers resident increases the fixed memory footprint of the model weights
- The released checkpoint is a base model, with no instruction tuning or alignment
- Pushing throughput beyond 3× comes with a larger quality loss

