StepFun introduces Step-DeepResearch, a 32B-parameter end-to-end deep research agent designed to turn web searches into genuine research workflows with long-horizon reasoning, tool use, and structured reporting. The model is built on Qwen2.5-32B-Base and is trained to act as a single agent that plans, finds sources, examines evidence, and writes cited reports while keeping inference costs low.
From search to deep research
Most current web agents are tuned for multi-hop question-answering benchmarks: they try to match short factual answers to short questions. That is targeted search rather than actual research. Deep research tasks are different. They involve recognizing latent intent, making decisions over time, multi-turn tool use, structured reasoning, and cross-source validation under uncertainty.
Step-DeepResearch reframes this as sequential decision-making over a compact set of atomic capabilities. The research team defines four core capabilities: planning and task decomposition, deep information exploration, reflection and validation, and professional report generation. Instead of coordinating many external agents, the system internalizes this loop into a single model that decides the next action at each step.
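The single-agent loop can be pictured as a small decision procedure over the four atomic capabilities. The sketch below is purely illustrative: the capability names mirror the article, but the transition rules and state fields are invented (in the real system the model itself makes this decision at every step):

```python
from enum import Enum, auto

class Capability(Enum):
    PLAN = auto()       # planning and task decomposition
    EXPLORE = auto()    # deep information exploration
    REFLECT = auto()    # reflection and validation
    REPORT = auto()     # professional report generation

def next_action(state):
    """Toy policy: choose the next atomic capability from the run state."""
    if not state["plan"]:
        return Capability.PLAN
    if state["open_questions"]:
        return Capability.EXPLORE
    if state["unverified_claims"]:
        return Capability.REFLECT
    return Capability.REPORT

state = {"plan": ["outline"], "open_questions": [], "unverified_claims": ["claim-1"]}
print(next_action(state))  # -> Capability.REFLECT
```

The point of internalizing this loop in one model, rather than a multi-agent stack, is that the "policy" above becomes part of the model's own reasoning instead of hand-written routing code.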
Data synthesis centered on atomic capabilities
To teach these core capabilities, the research team builds a separate data pipeline for each skill. Planning data starts from high-quality technical reports, research reports, and financial analysis documents: realistic research plans and task trees are reverse-engineered from titles, abstracts, and document structure, and trajectories are generated that follow those plans. This exposes the model to long-horizon project structure, not just short question templates.
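A minimal sketch of the reverse-engineering idea: turn the heading structure of an existing report into a nested task plan. The heading-to-task mapping here is our assumption, not the paper's actual pipeline:

```python
def plan_from_outline(lines):
    """Reverse-engineer a task tree from a report's headings:
    '#' depth becomes nesting level, each heading becomes a research sub-task."""
    plan = []
    for line in lines:
        if line.startswith("#"):
            depth = len(line) - len(line.lstrip("#"))
            title = line.lstrip("# ").strip()
            plan.append({"depth": depth, "task": f"Research: {title}"})
    return plan

outline = ["# Market overview", "## Supply chain", "Some body text", "## Pricing"]
print(plan_from_outline(outline))
```

A trajectory generator can then walk this tree top-down, producing one search-and-summarize episode per sub-task.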
For deep information exploration, graph-based queries are built against knowledge graphs such as Wikidata5m and CN-DBpedia. Subgraphs are sampled, expanded via search, and used to synthesize questions that require multi-hop reasoning across entities and documents. Another pipeline uses a Wiki-style hyperlink index to drive cross-document search and evidence combination. Training focuses on hard search problems, filtering out easy questions that strong models can already solve with simple ReAct-style strategies.
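The multi-hop synthesis step can be sketched on a toy triple store: chain two edges of a knowledge graph into one question whose answer requires composing both hops. The triples and question template are illustrative stand-ins for Wikidata5m-scale data:

```python
# Toy triple store standing in for Wikidata5m / CN-DBpedia.
TRIPLES = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Poland", "currency", "złoty"),
]

def two_hop_questions(triples):
    """Enumerate 2-hop paths (A -r1-> B -r2-> C) and phrase each as a
    question that cannot be answered from a single edge."""
    by_head = {}
    for h, r, t in triples:
        by_head.setdefault(h, []).append((r, t))
    out = []
    for h, r1, mid in triples:
        for r2, tail in by_head.get(mid, []):
            q = f"Following {r1} from {h}, then {r2}: which entity do you reach?"
            out.append((q, tail))
    return out

for q, a in two_hop_questions(TRIPLES):
    print(q, "->", a)
```

Filtering would then keep only the generated questions that a baseline ReAct agent fails to answer, matching the article's focus on hard search problems.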
Reflection and validation data are generated through self-correcting loops and multi-agent teacher tracing. The teacher agent extracts claims, plans checks, verifies the facts, re-plans when discrepancies arise, and then writes a report. The resulting trajectory is cleaned and distilled into a single student-agent trace. Report generation is trained in two phases: intermediate training on query-report pairs to teach domain style and depth, followed by supervised fine-tuning with strict formatting and plan-consistency constraints.
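The claim-checking step of the teacher loop reduces to: extract claims, check each against evidence, and send failures back for another pass. A minimal sketch, with a toy evidence store and check function of our own invention:

```python
def verify_claims(claims, evidence, check):
    """Self-correcting loop core: claims that pass the check are kept,
    claims that fail are rescheduled for another research pass."""
    verified, reschedule = [], []
    for claim in claims:
        if check(claim, evidence):
            verified.append(claim)
        else:
            reschedule.append(claim)  # discrepancy -> plan a new check
    return verified, reschedule

evidence = {"param_count": "32B"}
claims = ["Step-DeepResearch has 32B parameters",
          "Step-DeepResearch has 70B parameters"]
check = lambda c, ev: ev["param_count"] in c
print(verify_claims(claims, evidence, check))
```

In the real pipeline the check is itself a tool call (search, browse, compute), and a rescheduled claim triggers new exploration steps before the report is written.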
Progressive training with Qwen2.5-32B-Base
The training pipeline has three stages: agentic intermediate training, supervised fine-tuning, and reinforcement learning. In intermediate-training stage 1, the team injects atomic capabilities without tools, using context lengths up to 32,000 tokens. The data includes active reading, synthetic reasoning traces, summaries, and reflections. The research team reports steady gains on SimpleQA, TriviaQA, and FRAMES as training scales to about 150 billion tokens, with FRAMES, which emphasizes structured reasoning, improving the most.
Stage 2 expands the context to 128k tokens and introduces explicit tool calls. The model learns tasks such as URL-grounded question answering, deep web search, long-document summarization, and long-conversation reasoning. This stage adapts the model to real-world research scenarios where search, browsing, and analysis must be interleaved in a single trajectory.
During supervised fine-tuning, the four atomic capabilities are composed into complete deep-search and deep-exploration traces. Data cleaning keeps trajectories accurate and short in both steps and tool calls. The pipeline injects controlled tool errors followed by corrections to improve robustness, and enforces citation formatting so that reports are grounded in retrieved sources.
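Controlled tool-error injection can be sketched as a small augmentation pass over a trajectory: with some probability, a clean tool call is replaced by a failed call plus its correction turn, so the model sees error-then-recovery patterns. The error rate, error payload, and step format here are all our assumptions:

```python
import random

def inject_tool_error(step, rng, rate=0.1):
    """With probability `rate`, replace a clean tool-call step with a
    failed call followed by a retry, so training data contains recoveries."""
    if rng.random() < rate:
        broken = dict(step, status="error", observation="HTTP 500")
        retry = dict(step, note="retry after tool error")
        return [broken, retry]
    return [step]

rng = random.Random(42)
steps = [{"tool": "search", "args": f"query {i}"} for i in range(1000)]
augmented = [s for step in steps for s in inject_tool_error(step, rng)]
print(len(steps), "->", len(augmented))  # roughly 10% of steps become pairs
```

Keeping the corrupted step paired with its fix in the same trajectory is what teaches recovery, rather than merely exposing the model to failures.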
Reinforcement learning then optimizes the agent in a real tool environment. The research team builds tasks and checklists through reverse synthesis and trains checklist-style rubric judges to score reports along fine-grained dimensions. The reward design maps the ternary rubric label to an asymmetric binary reward that captures both positive goals and violations. The policy is trained with PPO and a learned critic, using generalized advantage estimation with a discount factor close to 1 so that long trajectories are not effectively truncated.
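One way to read the asymmetric reward design: per-checklist-item labels collapse to a scalar where violations cost more than simple misses. The label names and weights below are illustrative assumptions, not the paper's actual mapping:

```python
def checklist_reward(labels, pos=1.0, violation_penalty=2.0):
    """Map per-item ternary rubric labels ('met', 'missed', 'violated')
    to a scalar reward. The asymmetry: a violation is penalized harder
    than a miss, so the policy avoids actively wrong claims."""
    score = 0.0
    for lab in labels:
        if lab == "met":
            score += pos
        elif lab == "violated":
            score -= violation_penalty
        # 'missed' contributes 0: absent but not wrong
    return score / len(labels)

print(checklist_reward(["met", "met", "missed", "violated"]))  # -> 0.0
```

The asymmetry matters for report generation: omitting a point is recoverable, while fabricating one corrupts the report, so the two failure modes should not be priced equally.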
Single-agent ReAct architecture and search stack
At inference time, Step-DeepResearch runs as a single ReAct-style agent, alternating between thinking, calling tools, and observing until it decides to emit the report. The tool set includes batch web search, a todo manager, shell commands, and file operations. Execution happens in a sandbox with terminal persistence via tmux. The browser uses perceptual-hash distance to skip redundant page captures, and tools for document parsing, audio transcription, and image analysis support multimodal input.
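The think-act-observe loop itself is compact. A minimal skeleton, where `model` and `tools` are stand-ins for the real LLM and sandboxed tool set:

```python
def react_loop(model, tools, task, max_steps=8):
    """Minimal ReAct skeleton: think -> act -> observe, repeated until
    the model emits a final report or the step budget runs out."""
    context = [("task", task)]
    for _ in range(max_steps):
        thought, action, args = model(context)
        context.append(("thought", thought))
        if action == "final_report":
            return args
        observation = tools[action](args)
        context.append(("observation", observation))
    return None

def toy_model(context):
    """Two-step stand-in: search once, then finish."""
    if not any(kind == "observation" for kind, _ in context):
        return "need evidence first", "search", "StepFun deep research"
    return "enough evidence", "final_report", "final report text"

tools = {"search": lambda q: f"results for {q}"}
print(react_loop(toy_model, tools, "write a report"))  # -> final report text
```

Everything the article describes (batch search, shell, file edits) slots into the `tools` dictionary; the single model supplies every `thought` and `action`.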
Information retrieval draws on two related resources. The StepFun team says its search API covers over 20 million high-quality papers and 600 premium indexes. The research team also describes a carefully curated authoritative index spanning more than 600 trusted domains, including government, academic, and institutional sites. Retrieval operates at the paragraph level and uses authority-aware ranking to prefer more authoritative domains when relevance is comparable.
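Authority-aware ranking can be approximated as a small boost applied only when relevance scores are close, so a trusted domain wins ties but cannot outrank a clearly more relevant page. The domains, margin, and scoring scheme below are our assumptions:

```python
TRUSTED = {"gov.example", "edu.example"}  # stand-in for the 600+ domain whitelist

def authority_rerank(hits, margin=0.05):
    """Sort paragraph-level hits by relevance, breaking near-ties
    (within `margin`) in favor of trusted domains."""
    def key(hit):
        boost = margin if hit["domain"] in TRUSTED else 0.0
        return hit["relevance"] + boost
    return sorted(hits, key=key, reverse=True)

hits = [
    {"domain": "blog.example", "relevance": 0.82},
    {"domain": "gov.example",  "relevance": 0.80},
]
print([h["domain"] for h in authority_rerank(hits)])  # gov.example first
```

Capping the boost at a small margin is the design choice that keeps authority a tie-breaker rather than a hard filter.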
The file tool supports patch-based editing, so the agent can update only the changed sections of a report. A summary-aware storage scheme writes full tool output to a local file and inserts only a compact summary into the context. This acts as external memory and avoids context overflow on long tasks.
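The summary-aware storage scheme reduces to a simple contract: persist the full output on disk, put only a pointer plus a short summary into the model's context. A sketch under that assumption (the truncation-based summary is a placeholder for whatever summarizer the real system uses):

```python
import os
import tempfile

def store_tool_output(output, context, summary_len=120):
    """External memory: write full tool output to a local file and append
    only a compact '[saved to <path>] <summary>' line to the context."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.write(output)
    summary = output[:summary_len].replace("\n", " ")
    context.append(f"[saved to {path}] {summary}...")
    return path

context = []
path = store_tool_output("page text\n" * 500, context)
print(len(context), "context entries;", os.path.getsize(path), "bytes on disk")
```

Because the file path stays in context, the agent can later re-open exactly the outputs it needs, instead of carrying every page it ever fetched.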
Evaluation, cost, and access
To measure deep research behavior, the team introduces ADR-Bench, a Chinese-language benchmark with 110 open-ended tasks across 9 domains. 70 tasks cover common areas such as education, science, engineering, and daily life, and are rated side by side by experts. The remaining 40 tasks in finance and law are scored with explicit rubrics built under atomicity and verifiability constraints.
On the Scale AI research rubric, Step-DeepResearch reaches a rubric compliance rate of 61.42%, comparable to OpenAI-DeepResearch and Gemini-DeepResearch, and clearly ahead of several open and proprietary baselines. On ADR-Bench, expert Elo evaluations show the 32B model outperforming larger open models such as MiniMax-M2, GLM-4.6, and DeepSeek-V3.2, and competing with systems such as Kimi-Researcher and MiniMax-Agent-Pro.
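Side-by-side expert preferences convert to Elo in the standard way: each comparison nudges the two ratings by the gap between the observed and expected outcome. A minimal version of that update (the K-factor is a conventional choice, not from the article):

```python
def elo_update(r_a, r_b, score_a, k=16):
    """One Elo update from a pairwise comparison.
    score_a: 1.0 if model A wins, 0.5 for a tie, 0.0 if A loses."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

print(elo_update(1500, 1500, 1.0))  # -> (1508.0, 1492.0)
```

Aggregating many such expert judgments is what lets a benchmark of open-ended tasks rank systems without a single ground-truth answer per task.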
Key takeaways
- Single agent, atomic-capability design: Step-DeepResearch is a 32B-parameter single agent built on Qwen2.5-32B-Base that internalizes four core capabilities, planning, deep information exploration, reflection and validation, and professional report generation, rather than relying on many external agents.
- Targeted data synthesis per skill: The research team uses plans reverse-engineered from real reports, graph-based queries over Wikidata5m and CN-DBpedia, multi-agent teacher traces, and strict report-format data to build separate pipelines for planning, deep information exploration, reflection, and reporting.
- Three-stage training with long context and RL: Training combines intermediate training, supervised fine-tuning, and reinforcement learning; intermediate training scales to about 150B tokens at 32k and then 128k context, SFT composes complete deep research trajectories, and PPO-based RL with rubric judges optimizes reports against fine-grained checklists.
- ReAct architecture with curated search and external memory: At inference, the model runs a ReAct loop calling tools for batch web search, todos, shell commands, and file operations, uses a search API covering over 20 million papers and 600+ trusted domains, and relies on patch-based editing and summary-aware storage as external memory.
- Competitive quality at low cost: On the Scale AI research rubric, the model reaches a rubric compliance rate of 61.42 percent, competitive with OpenAI-DeepResearch and Gemini-DeepResearch, and on ADR-Bench it achieves a 67.1 percent win-or-tie rate against strong baselines.

