Nous Research Releases NousCoder-14B: A Competitive Olympiad Programming Model Post-Trained on Qwen3-14B via Reinforcement Learning

by root January 19, 2026


Nous Research has released NousCoder-14B, a competitive olympiad programming model post-trained on Qwen3-14B using reinforcement learning (RL) with verifiable rewards. On the LiveCodeBench v6 benchmark, which covers problems from August 1, 2024 to May 1, 2025, the model reaches a Pass@1 accuracy of 67.87%. That is 7.08 percentage points higher than the Qwen3-14B baseline of 60.79% on the same benchmark. The research team trained the model on 24,000 verifiable coding problems for 4 days using 48 B200 GPUs and published the weights on Hugging Face under the Apache 2.0 license.

https://nousresearch.com/nouscoder-14b-a-competitive-olympiad-programming-model/

Benchmark focus and what Pass@1 means

LiveCodeBench v6 is designed for evaluating competitive programming. The test split used here contains 454 problems. The training set follows the same recipe as the DeepCoder-14B project from Agentica and Together AI. It combines problems from TACO Verified, PrimeIntellect SYNTHETIC-1, and LiveCodeBench problems created before July 31, 2024.

The benchmark includes only competitive programming style tasks. For each problem, a solution must respect strict time and memory limits and pass an extensive set of hidden input/output tests. Pass@1 is the fraction of problems for which the first generated program passes all tests, including the time and memory constraints.
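The Pass@1 definition above reduces to a simple ratio. The following is a minimal sketch; the function name and the sample outcomes are illustrative assumptions, not benchmark data or the team's evaluation code.

```python
from typing import Sequence


def pass_at_1(first_attempt_passed: Sequence[bool]) -> float:
    """Fraction of problems whose first generated program passed every test."""
    if not first_attempt_passed:
        raise ValueError("need at least one problem")
    return sum(first_attempt_passed) / len(first_attempt_passed)


# Hypothetical outcomes for five problems (illustrative only).
outcomes = [True, False, True, True, False]
score = pass_at_1(outcomes)  # 3 of 5 problems solved on the first attempt
```

A problem counts as solved only if its single sampled program passes every hidden test within the limits, so one failing test case makes the entry False.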

Dataset construction for execution-based RL

All datasets used for training consist of verifiable code generation problems. Each problem has a reference implementation and many test cases. The training set contains 24,000 problems drawn from:

TACO Verified
PrimeIntellect SYNTHETIC-1
LiveCodeBench problems created before July 31, 2024

The test set is LiveCodeBench v6, which contains 454 problems from August 1, 2024 to May 1, 2025.

Every problem is a complete competitive programming task with a description, input format, output format, and test cases. This setup matters for RL because it yields a binary reward signal that is cheap to compute once the code has been executed.

RL environment using Atropos and Modal

The RL environment is built with the Atropos framework. NousCoder-14B is prompted with the standard LiveCodeBench prompt format and generates Python code for each problem. Each rollout receives a scalar reward that depends on the test case results:

Reward 1 if the generated code passes all test cases for that problem.
Reward -1 if the code outputs a wrong answer, exceeds the 15 second time limit, or exceeds the 4 GB memory limit on any test case.
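The reward rule above can be sketched as a small function over per-test outcomes. The `Outcome` enum and `rollout_reward` name are hypothetical, chosen for illustration; the team's actual Atropos environment code may be organized quite differently.

```python
from enum import Enum
from typing import Iterable


class Outcome(Enum):
    PASSED = "passed"
    WRONG_ANSWER = "wrong_answer"
    TIME_LIMIT_EXCEEDED = "time_limit_exceeded"      # over the 15 second limit
    MEMORY_LIMIT_EXCEEDED = "memory_limit_exceeded"  # over the 4 GB limit


def rollout_reward(outcomes: Iterable[Outcome]) -> int:
    """Return +1 only if every test case passed, otherwise -1."""
    results = list(outcomes)
    if not results:
        raise ValueError("a rollout must be checked against at least one test")
    return 1 if all(o is Outcome.PASSED for o in results) else -1


solved = rollout_reward([Outcome.PASSED, Outcome.PASSED])
too_slow = rollout_reward([Outcome.PASSED, Outcome.TIME_LIMIT_EXCEEDED])
```

Because the reward is all-or-nothing, a single wrong answer or limit violation gives the same -1 as failing every test.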

To run untrusted code safely and at scale, the team uses Modal as an autoscaled sandbox. In the main design, which the research team describes as the setting actually used, the system launches one Modal container per rollout. Each container runs all test cases for its rollout. This keeps training compute separate from verification compute and keeps the RL loop stable.

The research team also pipelines inference and verification. When an inference worker finishes a generation, it sends the completion to a Modal verifier and immediately starts a new generation. With many inference workers and a fixed pool of Modal containers, this design keeps the training loop bound by inference compute rather than by verification.
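The hand-off between inference and verification resembles a producer/consumer pipeline. The sketch below uses `asyncio` queues to illustrate the idea; `generate` and `verify` are hypothetical stand-ins for model inference and sandboxed execution, and nothing here uses the real Atropos or Modal APIs.

```python
import asyncio


async def generate(prompt: str) -> str:
    """Stand-in for model inference (hypothetical)."""
    await asyncio.sleep(0)  # yield control, as a real GPU or network call would
    return f"solution for {prompt}"


async def verify(completion: str) -> int:
    """Stand-in for sandboxed verification (hypothetical); returns a reward."""
    await asyncio.sleep(0)
    return 1 if completion.startswith("solution") else -1


async def inference_worker(prompts: asyncio.Queue, pending: asyncio.Queue) -> None:
    while True:
        try:
            prompt = prompts.get_nowait()
        except asyncio.QueueEmpty:
            return
        completion = await generate(prompt)
        # Hand off and move straight to the next prompt without waiting for a verdict.
        await pending.put((prompt, completion))


async def verifier(pending: asyncio.Queue, rewards: dict) -> None:
    while True:
        item = await pending.get()
        if item is None:  # sentinel: no more completions will arrive
            return
        prompt, completion = item
        rewards[prompt] = await verify(completion)


async def run_pipeline(prompt_list, n_inference: int = 4, n_verifiers: int = 2) -> dict:
    prompts: asyncio.Queue = asyncio.Queue()
    for p in prompt_list:
        prompts.put_nowait(p)
    pending: asyncio.Queue = asyncio.Queue()
    rewards: dict = {}
    verifiers = [asyncio.create_task(verifier(pending, rewards)) for _ in range(n_verifiers)]
    await asyncio.gather(*(inference_worker(prompts, pending) for _ in range(n_inference)))
    for _ in range(n_verifiers):
        await pending.put(None)  # one sentinel per verifier, queued after all real work
    await asyncio.gather(*verifiers)
    return rewards


rewards = asyncio.run(run_pipeline([f"problem-{i}" for i in range(8)]))
```

The verifier pool has a fixed size while inference workers never block on a verdict, which is the property the article attributes to the pipelined design.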

The team discusses three strategies for parallelizing verification: one container per problem, one container per rollout, and one container per test case. They ultimately avoid the per test case setting because of container startup overhead. Instead, each container evaluates many test cases and runs a small set of the hardest test cases first. If any of these fail, the system can stop verification early.
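The early-stopping idea can be sketched as follows. This is an assumption-laden illustration: the `Case` type, the `difficulty` score, and the choice to sort every test by difficulty are hypothetical, since the article only says that a small set of the hardest tests runs first.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Case:
    stdin: str
    expected_stdout: str
    difficulty: float  # hypothetical score; higher means more likely to fail


def verify_hardest_first(
    run_program: Callable[[str], str],
    tests: Sequence[Case],
) -> Tuple[bool, int]:
    """Run tests in descending difficulty and stop at the first failure.

    Returns (all_passed, number_of_tests_executed).
    """
    executed = 0
    for case in sorted(tests, key=lambda t: t.difficulty, reverse=True):
        executed += 1
        if run_program(case.stdin).strip() != case.expected_stdout.strip():
            return False, executed
    return True, executed


cases = [
    Case("1", "2", difficulty=0.1),
    Case("2", "4", difficulty=0.2),
    Case("1000000", "2000000", difficulty=0.9),
]


def buggy_double(stdin: str) -> str:
    n = int(stdin)
    return str(n * 2 if n < 1000 else 0)  # deliberately wrong for large inputs


result = verify_hardest_first(buggy_double, cases)
```

Here the buggy program fails on the hardest case, so only one of the three tests is executed before the rollout is rejected.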

GRPO objectives: DAPO, GSPO, and GSPO+

NousCoder-14B uses Group Relative Policy Optimization (GRPO), which does not require a separate value model. On top of GRPO, the research team tests three objectives: Dynamic sAmpling Policy Optimization (DAPO), Group Sequence Policy Optimization (GSPO), and a modified GSPO variant called GSPO+.

All three objectives share the same definition of advantage. The advantage of each rollout is its reward normalized by the mean and standard deviation of the rewards within its group. DAPO applies importance weighting and clipping at the token level and introduces three main changes relative to GRPO:

A clip-higher rule that increases exploration of low-probability tokens
A token-level policy gradient loss that gives equal weight to every token
Dynamic sampling, in which groups that are all correct or all incorrect are dropped because they carry zero advantage
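The shared advantage definition and the dynamic sampling rule can be sketched together. The use of the population standard deviation and the small `eps` term are assumptions made for this illustration; the article does not specify either detail.

```python
from statistics import mean, pstdev
from typing import List, Optional


def group_advantages(rewards: List[float], eps: float = 1e-6) -> Optional[List[float]]:
    """Normalize rewards within one group of rollouts for the same problem.

    Returns None when every reward is identical (all correct or all incorrect),
    mirroring dynamic sampling: such a group carries no learning signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    if sigma == 0:
        return None
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed = group_advantages([1, -1, 1, -1])   # roughly [1, -1, 1, -1]
dropped = group_advantages([1, 1, 1, 1])   # None: nothing to learn from
```

With rewards of only +1 and -1, a group's advantages are nonzero exactly when the group contains both passing and failing rollouts.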

GSPO moves the importance weighting to the sequence level. It defines a sequence importance ratio that aggregates the token ratios over the whole program. GSPO+ keeps the sequence-level correction but rescales gradients so that tokens are weighted equally regardless of sequence length.

On LiveCodeBench v6, the differences between these objectives are small. At a context length of 81,920 tokens, DAPO reaches a Pass@1 of 67.87%, while GSPO and GSPO+ reach 66.26% and 66.52%. At 40,960 tokens, all three objectives cluster around 63% Pass@1.
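One way to aggregate token ratios into a sequence-level ratio is the length-normalized form, which is the geometric mean of the per-token ratios. The sketch below assumes that form, as used in the GSPO formulation; the article itself only states that token ratios are aggregated, and the function name is hypothetical.

```python
import math
from typing import Sequence


def sequence_importance_ratio(
    new_logprobs: Sequence[float],
    old_logprobs: Sequence[float],
) -> float:
    """Length-normalized sequence ratio: exp(mean(new_logprob - old_logprob))."""
    if len(new_logprobs) != len(old_logprobs) or not new_logprobs:
        raise ValueError("log-prob sequences must be non-empty and of equal length")
    log_ratio_sum = sum(n - o for n, o in zip(new_logprobs, old_logprobs))
    return math.exp(log_ratio_sum / len(new_logprobs))


old = [-1.2, -0.7, -2.3]
unchanged = sequence_importance_ratio(old, old)  # policy unchanged, ratio is 1.0
doubled = sequence_importance_ratio([lp + math.log(2) for lp in old], old)
```

Working in log space avoids multiplying many small probabilities, and dividing by the length keeps the ratio on a comparable scale for short and long programs.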

Iterative context extension and overlong filtering

Qwen3-14B supports long contexts, and training follows an iterative context extension schedule. The team first trains the model with a 32k context window, then continues training at the maximum Qwen3-14B context window of 40k. At each stage, they select the checkpoint with the best LiveCodeBench score at 40k context, then use YaRN context extension at evaluation time to reach 80k tokens, or 81,920 tokens exactly.

The key trick is overlong filtering. If a generated program exceeds the maximum context window, its advantage is reset to zero. This removes that rollout from the gradient signal rather than penalizing it. The research team reports that this avoids pushing the model toward shorter solutions for purely optimization reasons and helps maintain quality when the context length is scaled up at test time.
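Overlong filtering amounts to masking advantages by rollout length. This is a minimal sketch with hypothetical names; in a real trainer the mask would be applied to tensors inside the loss computation.

```python
from typing import List


def apply_overlong_filter(
    advantages: List[float],
    token_counts: List[int],
    max_context: int,
) -> List[float]:
    """Zero the advantage of any rollout longer than the context window."""
    if len(advantages) != len(token_counts):
        raise ValueError("advantages and token_counts must have the same length")
    return [0.0 if n > max_context else a for a, n in zip(advantages, token_counts)]


# Hypothetical group of three rollouts trained with a 40,960 token window.
filtered = apply_overlong_filter([1.0, -1.0, 0.5], [100, 50_000, 40_960], 40_960)
```

The second rollout exceeds the window, so its advantage becomes 0.0 and it contributes no gradient, instead of keeping a negative advantage that would discourage long solutions.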

Key takeaways

NousCoder-14B is a Qwen3-14B-based competitive programming model trained with execution-based RL. It reaches 67.87% Pass@1 on LiveCodeBench v6, 7.08 percentage points higher than the Qwen3-14B baseline of 60.79% on the same benchmark.
The model is trained on 24,000 verifiable coding problems from TACO Verified, PrimeIntellect SYNTHETIC-1, and LiveCodeBench tasks created before July 31, 2024, and evaluated on a disjoint LiveCodeBench v6 test set of 454 problems from August 1, 2024 to May 1, 2025.
The RL setup uses Atropos, with Python solutions executed in sandboxed containers, a simple reward of 1 for passing all test cases and -1 for any failure or resource limit violation, and a pipelined design in which inference and verification run asynchronously.
The Group Relative Policy Optimization objectives DAPO, GSPO, and GSPO+ are used for long-context code RL. All three operate on group-normalized rewards and perform similarly, with DAPO reaching the best Pass@1 at the longest context of 81,920 tokens.
Training uses iterative context extension, first at 32k tokens and then at 40k tokens, with YaRN-based extension to 81,920 tokens at evaluation time. It also includes overlong rollout filtering for stability and ships as a fully reproducible open stack with Apache 2.0 weights and RL pipeline code.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that stands out for in-depth coverage of machine learning and deep learning news that is both technically sound and easy for a wide audience to understand. The platform draws over 2 million views per month, reflecting its popularity among readers.
