How far can medium-scale language fashions go if actual innovation strikes from the spine to the agent scaffolding and toolstack? Researchers at Meta and Harvard College have launched Confucius Code Agent, an open-source AI software program engineer constructed on the Confucius SDK designed for industrial-scale software program repositories and long-running periods. The system targets real-world GitHub tasks, advanced testing toolchains throughout analysis, and reproducible outcomes on benchmarks similar to SWE Bench Professional and SWE Bench Verified, whereas exposing full scaffolding to builders.

Confucius SDK, scaffolding across the mannequin
The Confucius SDK is an agent growth platform that treats scaffolding as a core design downside, somewhat than a skinny wrapper round language fashions. It’s composed of three axes, Agent expertise, person expertise, developer expertise.
Agent expertise Management what the mannequin sees, together with context structure, working reminiscence, and power outcomes. Consumer expertise It focuses on readable traces, code diffs, and security measures for human engineers. Developer expertise It focuses on observability, configuration, and debugging of the agent itself.
The SDK introduces three core mechanisms: a unified orchestrator with hierarchical working reminiscence, a persistent note-taking system, and a modular enlargement interface for instruments. The meta-agent then automates the synthesis and refinement of agent configurations by a build-test-improve loop. Confucius Code Agent is one concrete instantiation of this scaffolding in software program engineering.


Hierarchical working reminiscence for long-term coding
Actual software program duties in SWE Bench Professional typically require reasoning over dozens of information and lots of interplay steps. The Confucius SDK’s orchestrator maintains a hierarchical working reminiscence, partitioning trajectories into scopes, summarizing previous steps, and preserving compressed context for later turns.
This design helps maintain prompts inside the limits of the mannequin context whereas preserving necessary artifacts similar to patches, error logs, and design selections. The important thing level is that an efficient tool-based coding agent requires an specific reminiscence structure, not only a sliding window of earlier messages.
Persistent note-taking for cross-session studying
The second mechanism is a note-taking system that makes use of a devoted agent to write down structured Markdown notes from execution traces. These notes report task-specific methods, repository guidelines, and customary failure modes and are saved as long-term reminiscence that may be reused between periods.
The analysis group ran Confucius Code Agent twice on 151 SWE Bench Professional cases with Claude 4.5 Sonnet. On the primary run, the agent solves the duty from scratch and generates notes. On the second run, the agent reads these notes. With this configuration, the typical variety of turns decreases from 64 to 61, token utilization decreases from roughly 104k to 93k, and Resolve@1 improves from 53.0 to 54.4. This exhibits that notes will not be simply logs, however act as efficient session-to-session reminiscence.
Superior utilization of modular extensions and instruments
The Confucius SDK exposes instruments for file enhancing, command execution, take a look at runners, code search, and extra as extensions. Every extension can keep its personal state and facilitate wiring.
The analysis group is finding out the impression of accelerating sophistication of device use with ablation on a pattern subset of 100 SWE Bench Professional. On Claude 4 Sonnet, Resolve@1 will increase from 42.0 to 48.6 when shifting from a configuration with out the Superior Context function to a configuration with Superior Context. In Claude 4.5 Sonnet, easy tooling configuration reaches 44.0, richer tooling reaches 51.6, and intermediate variations attain 51.0. These numbers present that how the agent chooses and orders its instruments is nearly as necessary as the selection of the spine mannequin.


Meta-agent for computerized agent design
Along with these mechanisms, the Confucius SDK features a meta-agent that takes the pure language specification of an agent and iterates by configurations, prompts, and extension units. Then run candidate brokers towards the duty, examine traces and metrics, and edit configurations in a build-test-improve loop.
The Confucius Code Agent evaluated by the analysis group is created with the assistance of this meta-agent along with guide changes. This method turns a part of the agent engineering course of itself into an LLM-guided optimization downside.
SWE Bench Professional and SWE Bench Verified outcomes
The primary analysis makes use of SWE Bench Professional. This has 731 GitHub points that require modifications to the precise repository till the exams cross. All programs in contrast share the identical repository, tooling setting, and analysis harness, so the variations lie within the scaffolding and mannequin.
The Resolve@1 scores reported by SWE Bench Professional are:
- Claude 4 Sonnet with SWE Agent, 42.7
- Claude 4 Sonnet with Confucius Code Agent, 45.5
- Claude 4.5 Sonnet with SWE Agent, 43.6
- Claude 4.5 Sonnet (with reside SWE agent), 45.8
- Claude 4.5 Sonnet with Confucius Code Agent, 52.7
- Claude 4.5 Opus, Anthropic System Card Scaffold, 52.0
- Claude 4.5 Opus (with Confucius Code Agent), 54.3
These outcomes present that the middle-tier mannequin, a robust scaffold, Claude 4.5 Sonnet with Confucius Code Agent 52.7, can outperform a weaker scaffold, sturdy mannequin, Claude 4.5 Opus 52.0.
In SWE Bench Verified, Confucius Code Agent with Claude 4 Sonnet reached Resolve@1 74.6. This compares to 66.6 for SWE Agent and 72.8 for OpenHands. The Mini SWE Agent variant with Claude 4.5 Sonnet reaches 70.6, which can be beneath the Confucius Code Agent with Claude 4 Sonnet.
The analysis group additionally reviews efficiency as a perform of the variety of edited information. For duties of enhancing 1-2 information, Confucius Code Agent’s Resolve@1 reaches 57.8, for 3-4 information it reaches 49.2, for 5-6 information it reaches 44.1, for 7-10 information it reaches 52.6, and for greater than 10 information it reaches 44.4. This exhibits secure habits for a number of file modifications in giant codebases.
Vital factors
- Scaffolding could exceed mannequin measurement: Confucius Code Agent exhibits that when utilizing sturdy scaffolding, Claude 4.5 Sonnet reached 52.7 Resolve@1 in SWE-Bench-Professional, outperforming Claude 4.5 Opus’ 52.0 utilizing weak scaffolding.
- Hierarchical working reminiscence is crucial for long-term coding: Quite than counting on a easy rolling historical past, the Confucius SDK orchestrator makes use of hierarchical working reminiscence and context compression to handle lengthy trajectories throughout giant repositories.
- Persistent notes function efficient session-to-session reminiscence: For 151 SWE-Bench-Professional duties utilizing Claude 4.5 Sonnet, reusing structured notes reduces turns from 64 to 61, token utilization from roughly 104k to 93k, and Resolve@1 will increase from 53.0 to 54.4.
- Instrument configuration vastly influences success fee: For the SWE-Bench-Professional subset of 100 duties, Resolve@1 will increase from 44.0 to 51.6 when shifting from easy to richer tooling with Claude 4.5 Sonnet. This exhibits that the device’s realized routing and restoration methods are key efficiency levers, somewhat than simply implementation particulars.
- Meta-agent automates agent design and tuning: Meta brokers iteratively recommend prompts, device units, and configurations, evaluating and enhancing them in a build-test-improve loop. The manufacturing Confucius Code Agent itself is generated by this course of along with guide tuning.
Please verify Click here for the paper. Additionally, be happy to comply with us Twitter Remember to hitch us 100,000+ ML subreddits and subscribe our newsletter. grasp on! Are you on telegram? You can now also participate by telegram.
Try the newest releases ai2025.devis a 2025-focused analytics platform that transforms mannequin launches, benchmarks, and ecosystem exercise into structured datasets that may be filtered, in contrast, and exported.
Asif Razzaq is the CEO of Marktechpost Media Inc. Asif is a visionary entrepreneur and engineer dedicated to harnessing the potential of synthetic intelligence for social good. His newest endeavor is the launch of Marktechpost, a synthetic intelligence media platform. It stands out for its thorough protection of machine studying and deep studying information, which is technically sound and simply understood by a large viewers. The platform boasts over 2 million views monthly, demonstrating its reputation amongst viewers.

