Pretraining frontier-scale LLMs in FP8 is now customary observe, however shifting to 4-bit floating level has remained an open analysis downside as a result of narrower codecs compress dynamic vary and amplify quantization error at lengthy token horizons. A brand new analysis from NVIDIA describes a pretraining methodology constructed round NVFP4, a 4-bit microscaling format supported natively by Blackwell Tensor Cores, and validates it by pretraining a 12-billion-parameter hybrid Mamba-Transformer on 10 trillion tokens. The analysis workforce state that is the longest publicly documented coaching run in 4-bit precision up to now. The ensuing mannequin attains 62.58% on MMLU-Professional 5-shot versus 62.62% for the FP8 baseline, and is supported in NVIDIA’s Transformer Engine.
What NVFP4 Really is
To grasp why NVFP4 is necessary, it helps to revisit how microscaling codecs work. In a microscaling (MX) format, a contiguous block of low-precision parts shares a single scale issue, which is used to map the block again right into a wider numerical vary in the course of the matrix multiply. MXFP4 makes use of 32-element blocks the place every aspect is saved as E2M1 — 1 signal bit, 2 exponent bits, 1 mantissa bit — encoding solely the values ±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, and ±6. Block scale components are saved in UE8M0, which restricts them to powers of two.
NVFP4 adjustments three issues. First, the block measurement drops from 32 to 16 parts, narrowing the dynamic vary every scale has to cowl. Second, block scale components are saved in E4M3 quite than UE8M0, buying and selling exponent vary for mantissa precision so the per-block amax (absolute most) could be mapped a lot nearer to the FP4 most representable. Third, NVFP4 provides a second scaling stage: an FP32 per-tensor scale that remaps values so the E4M3 block scales themselves keep in vary. The result’s that not less than 6.25% of values in every block — the per-block amax — are represented at near-FP8 precision, whereas the rest sit in FP4.
On NVIDIA Blackwell, FP4 GEMMs run at 4× BF16 throughput on GB200 and 6× on GB300, which interprets to roughly 2× and three× speedups over FP8. Operand reminiscence footprint is roughly halved in comparison with FP8.

What’s Quantized — and What Isn’t
Solely the GEMMs inside linear (fully-connected) layers Fprop, Dgrad, and Wgrad truly run in NVFP4. Embeddings, the output projection head, normalization layers, non-linearities, and all consideration parts (softmax and the query-key and a spotlight score-value batched GEMMs) keep in BF16 or FP32. Mannequin weights, weight gradients used for accumulation throughout microbatches and data-parallel replicas, and optimizer states are stored in FP32. Tensor parallel reductions run in BF16.
The 4-Half Coaching Methodology
Quantizing each linear-layer GEMM to NVFP4 with default settings (1×16 block scaling in every single place, round-to-nearest-even on each tensor, no transforms) diverges early in coaching. NVIDIA’s method stabilizes it with 4 parts, and ablation research on the 12B mannequin present every is important.
Selective excessive precision: Linear layers within the first two and the ultimate eight of the 62 blocks (about 16% of all linear layers) are stored in BF16. Ablations indicated that the ultimate blocks are the delicate ones as a result of they require extra dynamic vary than FP4 supplies; maintaining solely the ultimate 4 blocks in BF16 was additionally sufficient for steady convergence.
Random Hadamard Transforms (RHT): Outliers in weight gradients are unfold into an roughly Gaussian distribution by multiplying the enter tiles with a 16×16 Hadamard matrix mixed with a random ±1 signal vector. As a result of the orthogonal transforms cancel contained in the dot-product, no math correction is required within the GEMM. The d=16 measurement was chosen empirically: d=4 damage convergence, d=128 gave related outcomes. RHT is utilized solely to the inputs of the weight-gradient (Wgrad) GEMM, and a single random signal vector is shared throughout all linear layers. Randomization itself was a no-op on the 1.2B scale however measurably improved the 12B run.
Two-dimensional (2D) block scaling for weights: Commonplace NVFP4 scales 1×16 blocks alongside the dot-product dimension. As a result of the backward cross transposes the load tensor, the ahead and backward passes find yourself with totally different quantized weights, breaking the chain rule. NVIDIA’s repair is to scale weights in 16×16 blocks so the identical quantized illustration is utilized in each passes. Activations and gradients preserve 1×16 scaling, since they’re much less delicate to this inconsistency.
Stochastic rounding on gradients: Spherical-to-nearest-even introduces systematic bias when utilized to gradient tensors. Stochastic rounding rounds probabilistically primarily based on distance to the 2 nearest representable values, eradicating that bias. The analysis workforce explicitly notes in analysis paper that stochastic rounding is detrimental when utilized to forward-pass tensors, so it’s restricted to gradients.
Outcomes on the 12B Hybrid Mamba-Transformer
The 12B mannequin makes use of the Nemotron-Nano-12B-v2-Base structure — 62 blocks (6 Self-Consideration, 28 FFN, 28 Mamba-2), hidden dimension 5120, FFN dimension 20480 — skilled with a Warmup-Steady-Decay schedule (fixed LR by means of 80% of coaching, decay over the ultimate 20%), batch measurement 736, sequence size 8192. The FP8 reference baseline follows the DeepSeek-V3 methodology: E4M3 parts, 128×128 weight blocks, 1×128 activation and gradient blocks, with the primary block and final two blocks stored in BF16.
NVFP4 validation loss stays inside 1% of the FP8 baseline in the course of the steady part and widens to barely above 1.5% throughout decay. Downstream accuracy is comparable throughout most benchmarks: MMLU 76.57% vs 77.36%, GSM8K CoT 92.27% vs 89.08%, MATH 81.48% vs 83.32%, AGIEval English CoT 70.31% vs 67.01%. Coding reveals the biggest hole — HumanEval+ 57.43% vs 59.93%, MBPP+ 55.91% vs 59.11% — which the analysis workforce attributes partly to noisy final-checkpoint analysis. The analysis workforce additionally paperwork a precision-switching method: transitioning the ahead cross from NVFP4 to BF16 beginning at 8.2T tokens (about 18% of the schedule) diminished relative loss error from 1.5% to 0.5%.
NVFP4 vs MXFP4
On a separate 8B hybrid Mamba-Transformer skilled on 1T tokens, NVFP4 reached a relative loss error of about 1.5% versus BF16, whereas MXFP4 stayed close to 2.5%. To shut the hole, MXFP4 required 1.36T tokens to match the NVFP4 1T-token loss — a 36% token overhead. The analysis workforce attributes the distinction to NVFP4’s smaller block measurement and E4M3 scales, which protect extra of the FP4 dynamic vary than MXFP4’s power-of-two UE8M0 scales (which might waste as much as one binade and the ±4, ±6 samples within the worst case).
Marktechpost’s Visible Explainer
MARKTECHPOST · AI analysis, deeply defined.
Key Takeaways
- NVIDIA’s analysis workforce pretrained a 12B hybrid Mamba-Transformer on 10T tokens in NVFP4 — the longest publicly documented 4-bit coaching run — matching FP8 on MMLU-Professional at 62.58% vs 62.62%.
- NVFP4 makes use of 16-element blocks with E4M3 scales plus an FP32 per-tensor scale, preserving the ±4 and ±6 samples that MXFP4’s 32-element UE8M0 design can lose to power-of-two rounding.
- 4 methods are required for convergence — none are elective: ~16% of linear layers in BF16, 16×16 Random Hadamard Transforms on Wgrad inputs, 2D 16×16 weight scaling, and stochastic rounding on gradients solely.
- Solely linear-layer GEMMs run in NVFP4 — consideration, embeddings, normalization, non-linearities, grasp weights, gradients, and optimizer states all keep in BF16 or FP32.
- On an 8B mannequin, MXFP4 wanted 1.36T tokens (36% extra) to match NVFP4’s loss at 1T tokens, whereas FP4 GEMMs ship 2× FP8 throughput on GB200 and three× on GB300.
Take a look at the Paper here. Additionally, be happy to observe us on Twitter and don’t overlook to affix our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
Must accomplice with us for selling your GitHub Repo OR Hugging Face Web page OR Product Launch OR Webinar and so on.? Connect with us

