Speculative decoding is a method for rushing up the inference of enormous language fashions. A small, quick draft mannequin proposes a number of tokens. Massive goal fashions validate them in parallel. As soon as accepted, reasoning turns into sooner. If rejected, the system gracefully falls again.
The EAGLE, vLLM, and TorchSpec groups launched the EAGLE collection, which incorporates EAGLE 1, EAGLE 2, and EAGLE 3, and has grow to be some of the broadly adopted and deployed speculative decoding algorithm households in each analysis and manufacturing methods. At present, that household is present process focused reliability upgrades with the introduction of the next options: eagle 3.1.
what went improper
Though speculative decoding performs properly in managed settings, it usually performs poorly below completely different chat templates, lengthy context inputs, or system prompts exterior the distribution.
The EAGLE workforce decided that this vulnerability is because of a phenomenon referred to as . deviation of attention Because the depth of hypothesis will increase, drafters step by step shift their consideration away from sink tokens and in direction of independently generated tokens.
Merely put, a drafter is a small mannequin that predicts future tokens. Because the inference deepens, it begins to concentrate to its personal earlier output fairly than the unique context. This reduces acceptance size and output stability.
Two basic issues had been recognized. First, the fused enter illustration turns into more and more unbalanced because the hidden states of the higher layers dominate the drafter enter. Second, the unnormalized residual path will increase the magnitude of hidden states throughout inference steps. These results step by step scale back drafter stability because the depth of hypothesis will increase.
Two structure fixes for EAGLE 3.1
To handle consideration drift, EAGLE 3.1 consists of two vital architectural enhancements. One is FC normalization after every goal hidden state and earlier than the FC layer, and the opposite is to feed the post-norm hidden state to the following decoding step.
FC normalization stabilizes the hidden states that the drafter receives from the goal mannequin. With out this, the magnitude of the hidden state would develop with every step, making the drafter much less and fewer dependable. Making use of normalization at every step retains the enter constrained.
As a result of post-norm design, the strategy acts to name the drafter recursively all through the decoding steps, fairly than merely including extra layers to the goal mannequin.

What these fixes deliver
In comparison with EAGLE 3, EAGLE 3.1 reveals lowered coaching to inference time extrapolation, enhanced long-context robustness, larger resilience to talk template and system immediate variations, and extra secure acceptance durations throughout numerous service environments.
For lengthy context workloads, EAGLE 3.1 achieves as much as 2x longer acceptance lengths in comparison with EAGLE 3.
Coaching Infrastructure: TorchSpec
TorchSpec now gives environment friendly coaching assist for EAGLE 3.1 and future speculative decoding algorithms. By decreasing coaching overhead and simplifying experimentation workflows, TorchSpec helps speed up iteration and exploration for next-generation speculative decoding analysis and deployment.
The analysis workforce additionally skilled and open sourced an EAGLE 3.1 draft mannequin for Kimi K2.6 primarily based on TorchSpec and vLLM. hug face. This mannequin serves for example of deploying EAGLE 3.1 with TorchSpec coaching and vLLM serving assist into an actual serving mannequin.
vLLM integration: configuration-driven and backwards suitable
EAGLE 3.1 is launched to vLLM as a configuration-driven extension of the present EAGLE 3 implementation. This integration consists of FC regularization assist, post-norm hidden state suggestions, and removing of hard-coded assumptions about goal hidden states.
Full backwards compatibility with current EAGLE 3 checkpoints is maintained. EAGLE 3.1 draft fashions may be related immediately by the identical speculative decoding code path.
vllm serve nvidia/Kimi-K2.6-NVFP4
--trust-remote-code
--tensor-parallel-size 4
--tool-call-parser kimi_k2
--enable-auto-tool-choice
--reasoning-parser kimi_k2
--attention-backend tokenspeed_mla
--speculative-config '{"mannequin":"lightseekorg/kimi-k2.6-eagle3.1-mla","methodology":"eagle3","num_speculative_tokens":3}'
--language-model-only
Kim K2.6 benchmark outcomes
The analysis workforce benchmarked the Kim K2.6 EAGLE 3.1 draft mannequin on Kimi-K2.6-NVFP4 utilizing vLLM (TP=4, GB200, non-disag) on the SPEED-Bench coding dataset. EAGLE 3.1 improves output throughput per consumer by 2.03x at concurrency of 1. The speedup stays vital as concurrency will increase (1.71x for C=4 and 1.66x for C=16).
Visible rationalization of Marktechpost
Essential factors
- EAGLE 3.1 fixes deviation of consideration — A newly recognized instability that forestalls drafters from specializing in sink tokens with better speculative depth.
- Two architectural adjustments — FC normalization and Publish-norm hidden state suggestions — Stabilizes the drafter all through the guessing step.
- For long-context workloads, EAGLE 3.1 delivers: As much as twice as lengthy permissible size Examine with EAGLE3.
- Kim-K2.6-NVFP4 Present Benchmarks 2.03× output throughput per consumer At concurrency 1, it drops to 1.66x for C=16.
- eagle 3.1 Backward compatibility with EAGLE 3 checkpoints It has already been merged into vLLM fundamental and can ship in v0.22.0.
Please examine technical details. Additionally, be happy to comply with us Twitter Do not forget to affix us 150,000+ ML subreddits and subscribe our newsletter. grasp on! Are you on telegram? You can now also participate by telegram.
Must accomplice with us to advertise your GitHub repository, Hug Face Web page, product launch, webinar, and many others.? connect with us

Michal Sutter is a knowledge science professional with a grasp’s diploma in knowledge science from the College of Padova. With a powerful basis in statistical evaluation, machine studying, and knowledge engineering, Michal excels at reworking complicated datasets into actionable insights.

