Microsoft Research releases Webwright: Terminal-native web agent framework that scores 60.1% on Odyssey, beating base GPT-5.4’s 33.5%

by root May 24, 2026


Currently, most web agents operate the browser one action at a time. The model receives the current page state as a screenshot or DOM text and predicts the next click, keypress, or scroll. This action-at-a-time design made sense when language models had limited reasoning capabilities. As models get better at writing and debugging code, that rigid loop has become more of a constraint than a useful structure.

Microsoft Research’s AI Frontiers lab has built a different approach. Their new open source framework, Webwright, gives the agent a terminal instead of a stateful browser session. The agent writes Playwright code to control the browser, runs bash commands, inspects logs, and iteratively adjusts its scripts. Playwright is an open source browser automation library, also from Microsoft, that supports programmatic control of Chromium, Firefox, and WebKit browsers.

The Webwright difference

Webwright separates the agent from the browser and treats the browser as something the agent can launch, inspect, and destroy during program development. The persistent artifacts are the code and logs in the local workspace, not browser sessions.

This is the same model that developers use when writing RPA (Robotic Process Automation) scripts. Instead of manually clicking through the site every time, you write a script once. That script can be rerun, adjusted, and shared. Webwright applies this to LLM-based agents.

The system has three core components: a runner, a model endpoint, and a terminal environment. The runner is about 150 lines of code, the model interface is about 550 lines, and the environment is about 300 lines. There is no multi-agent orchestration or complex planning hierarchy, just a single agent loop.

All intermediate code, logs, screenshots, and results are saved to the workspace, so you can easily inspect each run.

https://www.microsoft.com/en-us/research/articles/webwright-a-terminal-is-all-you-need-for-web-agents/

The agent loop

The runner sends the current context to the model. The model returns a thought block and a shell command. The command runs in the environment and returns terminal output, logs, screenshots, or error tracebacks. These observations are fed back into the context and the loop continues.
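The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Webwright's actual runner: the names `Step`, `run_shell`, and `agent_loop`, the context format, and the convention that a missing command ends the loop are all assumptions made for the example.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One model reply: a thought plus an optional shell command (hypothetical shape)."""
    thought: str
    command: Optional[str]  # None means the model has nothing further to run


def run_shell(command: str, timeout: int = 60) -> str:
    """Run one shell command and return combined stdout/stderr as the observation."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return (result.stdout + result.stderr).strip()


def agent_loop(
    task: str,
    model: Callable[[List[str]], Step],
    execute: Callable[[str], str] = run_shell,
    max_steps: int = 100,
) -> List[str]:
    """Send context to the model, run its command, append the observation, repeat."""
    context = [f"TASK: {task}"]
    for _ in range(max_steps):
        step = model(context)
        context.append(f"THOUGHT: {step.thought}")
        if step.command is None:
            break
        observation = execute(step.command)
        context.append(f"COMMAND: {step.command}")
        context.append(f"OBSERVATION: {observation}")
    return context
```

Here `model` stands in for a call to an LLM endpoint and `execute` for the terminal environment; swapping either one out does not change the loop itself.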

Rather than issuing one primitive action at a time, coding agents can naturally express multi-step interactions, such as selecting a date or filling out an entire form, as compact programs. Loops, functions, and abstractions allow agents to generalize across similar tasks without repeatedly predicting similar sequences of low-level steps.
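As a rough illustration of a multi-step interaction expressed as a compact program, the sketch below fills a whole form in one function call. It relies only on the `fill(selector, value)` and `click(selector)` methods that Playwright's `Page` object provides; the function name `fill_and_submit` and every selector are hypothetical.

```python
from typing import Dict, Protocol


class PageLike(Protocol):
    """The two Playwright Page methods this sketch relies on."""

    def fill(self, selector: str, value: str) -> None: ...

    def click(self, selector: str) -> None: ...


def fill_and_submit(page: PageLike, fields: Dict[str, str], submit_selector: str) -> int:
    """Fill every field in one pass, then submit. Returns the number of fields filled."""
    for selector, value in fields.items():
        page.fill(selector, value)
    page.click(submit_selector)
    return len(fields)
```

An action-at-a-time agent would need one model call per field plus one for the submit click. Here the same sequence is a single script that can be rerun with different values.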

Two engineering challenges

Premature “done” and context explosion are the two core problems. Open-ended bash actions require the model to self-report completion, and models often claim success without actually finishing. The team added a gate. The agent generates a self-reflection configuration, runs a final script in a fresh folder with logs and screenshots, and must pass its own self-reflection decision, which outputs success or failure, before emitting done: true. Otherwise, the flag is removed and the agent retries.
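The gate can be pictured as a small check wrapped around the agent's completion claim. The sketch below is an assumption-laden illustration: `gate_done`, the `state` dictionary, and the two callables are hypothetical names, and the real framework's checks are not shown in the article.

```python
from typing import Callable, Dict


def gate_done(
    state: Dict[str, object],
    run_final_script: Callable[[], bool],
    self_reflect: Callable[[], bool],
) -> Dict[str, object]:
    """Keep done=True only if the final script reruns cleanly and reflection reports success."""
    if not state.get("done"):
        return state  # the agent has not claimed completion, nothing to verify
    if run_final_script() and self_reflect():
        return state  # claim verified, the flag stands
    # Verification failed: remove the flag so the loop keeps going.
    retried = dict(state)
    retried.pop("done")
    return retried
```

In this sketch `run_final_script` stands for rerunning the final script in a fresh folder, and `self_reflect` for the model's own success-or-failure judgment over the resulting logs and screenshots.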

As for context length, a long coding trajectory quickly exceeds the context limit, so every 20 steps of history are compressed into a single summary.
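One way to picture this compression is a function that replaces each full window of 20 steps with one summary entry and leaves the most recent partial window untouched. This is a sketch under assumptions: `compact_history` and `summarize` are hypothetical names, and in practice the summary would presumably come from a model call rather than the placeholder shown in the usage line.

```python
from typing import Callable, List


def compact_history(
    history: List[str],
    summarize: Callable[[List[str]], str],
    window: int = 20,
) -> List[str]:
    """Replace each full window of `window` steps with a single summary entry."""
    full = len(history) - len(history) % window  # number of steps covered by full windows
    compacted = [
        summarize(history[start:start + window]) for start in range(0, full, window)
    ]
    compacted.extend(history[full:])  # the trailing partial window stays verbatim
    return compacted


# Usage with a placeholder summarizer: 45 steps become 2 summaries + 5 raw steps.
steps = [f"step {i}" for i in range(45)]
short = compact_history(steps, lambda chunk: f"summary of {len(chunk)} steps")
print(len(short))
```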

Benchmark results

Webwright was evaluated on two benchmarks: Online-Mind2Web and Odyssey.

Online-Mind2Web contains 300 tasks across 136 widely used websites and uses an automated LLM-as-a-Judge evaluation framework. With GPT-5.4, Webwright achieves an overall accuracy of 86.67%, the best among all open-source harness recipes in the AutoEval category of the Online-Mind2Web benchmark with a budget of 100 steps. Claude Opus 4.7 reached 84.7% overall, but performed better on hard tasks with N=100 steps (80.5% versus 76.6% for GPT-5.4).

The team also reproduced a GPT-5.4 baseline with a conventional screenshot-based agent setup, in which the model predicts the X,Y coordinates of click and type actions. Using the same underlying model, Webwright achieves significant gains in all three difficulty categories, highlighting the benefit of a code-driven, terminal-based approach over step-by-step coordinate prediction.

Odyssey evaluates long-horizon browsing tasks that span multiple websites. Task instructions average 272.3 words. On the April 2026 leaderboard, the best performing model was Opus 4.6 with a top score of 44.5. Webwright with GPT-5.4 reached 60.1%, a relative improvement of 35.1% over the previous state of the art. Compared to GPT-5.4’s base performance of 33.5%, this corresponds to a relative improvement of 79.4%, or 26.6 absolute points.
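The relative and absolute improvements quoted above follow directly from the three reported scores. The short check below reproduces the arithmetic; the helper name `relative_improvement` is just for illustration.

```python
def relative_improvement(new: float, old: float) -> float:
    """Percentage gain of `new` over `old`."""
    return (new - old) / old * 100


webwright = 60.1      # Webwright with GPT-5.4 on Odyssey
previous_best = 44.5  # Opus 4.6, April 2026 leaderboard
base_gpt = 33.5       # base GPT-5.4

print(round(relative_improvement(webwright, previous_best), 1))  # 35.1
print(round(relative_improvement(webwright, base_gpt), 1))       # 79.4
print(round(webwright - base_gpt, 1))                            # 26.6
```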

Cost analysis

Claude Opus 4.7 needs fewer steps to solve each task (21.9 steps on average) than GPT-5.4 (26.3 steps on average). However, Claude Opus 4.7 is priced significantly higher than GPT-5.4 (as of April 2026, $5 vs. $2.50 per million input tokens and $25 vs. $15.00 per million output tokens) and has a higher average cost per task ($6.09 vs. $2.37). The first 50 steps yield 82% accuracy, and the next 50 steps add a further 3-4 points.
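To see how the per-token prices translate into a per-task cost, the sketch below applies the quoted April 2026 prices to a token count. The token counts in the usage lines are purely hypothetical round numbers chosen for illustration; the article does not report per-task token usage, so these outputs are not the reported $2.37 and $6.09 figures.

```python
# (input price, output price) in USD per million tokens, as quoted in the article.
PRICES = {
    "gpt-5.4": (2.50, 15.00),
    "claude-opus-4.7": (5.00, 25.00),
}


def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one task given its token usage."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical usage: 1M input tokens and 100K output tokens for one task.
print(task_cost("gpt-5.4", 1_000_000, 100_000))          # 4.0
print(task_cost("claude-opus-4.7", 1_000_000, 100_000))  # 7.5
```

Because input tokens dominate in long agent trajectories, a model that takes fewer steps can still cost more per task when its per-token prices are higher.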

Small model performance

The research team also tested Qwen3.5-9B on the hard split of Online-Mind2Web. When the task is augmented with pre-built, reusable tool scripts, Qwen3.5-9B achieves 66.2% on Online-Mind2Web websites that have five or more tools. This shows that smaller, lower-cost models can handle complex web tasks when combined with pre-built tool libraries.

Marktechpost's visual explainer

Webwright
Quick start guide

01/05 - Overview
What is Webwright?
Webwright is an open source terminal-native web agent framework from Microsoft Research. Rather than predicting one browser click at a time, the agent writes Playwright code, runs bash commands, and saves reusable scripts to your local workspace.

~1,000 lines of harness code across 3 modules, with no hidden orchestration
Single agent loop: runner, model endpoint, terminal environment
86.7% on Online-Mind2Web | 60.1% on Odyssey with GPT-5.4
Backends: OpenAI, Anthropic, OpenRouter
Reusable scripts in Claude Code, Codex, OpenClaw

# GitHub repository
github.com/microsoft/Webwright

02/05 - Prerequisites
What you need before installing
Before running the install commands, make sure you have the following in place:

Python 3.10+: minimum required runtime
Chromium: installed via Playwright in the next step
API key: OpenAI, Anthropic, or OpenRouter
Git: to clone the repository

# Check your Python version
python --version
# Must return Python 3.10 or higher

03/05 - Installation
Clone and install Webwright
Clone the repository, install it in editable mode, and then install Chromium for Playwright browser control.

# 1. Clone the repository
git clone https://github.com/microsoft/Webwright
cd Webwright

# 2. Install the package in editable mode
pip install -e .

# 3. Install Chromium for Playwright
playwright install chromium

The -e flag means that local source edits take effect immediately without reinstallation.

04/05 - Task execution
Run your first web task
Export the API key, then pass the task instructions and start URL to the CLI.

# Export your key
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

# Run a task
python -m webwright.run.cli \
  -c base.yaml -c model_openai.yaml \
  -t "Find cheapest economy flight SEA to JFK on 2026-05-15" \
  --start-url https://www.google.com/flights \
  --task-id demo_openai \
  -o outputs/default

Flag	Description
-c	Config files from src/webwright/config/ (stackable)
-t	Task instructions in plain English
--start-url	Initial URL of the browser session
--task-id	Output subfolder name
-o	Root output directory for logs and scripts
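If you want to launch runs from a Python script rather than typing the command by hand, the flags in the table above can be assembled programmatically. The helper below is a hypothetical wrapper, not part of Webwright; it only builds the argument list, using the flag names and default values shown in the command above.

```python
from typing import List, Sequence


def build_command(
    task: str,
    start_url: str,
    task_id: str,
    configs: Sequence[str] = ("base.yaml", "model_openai.yaml"),
    output_dir: str = "outputs/default",
) -> List[str]:
    """Assemble the CLI invocation as an argument list (no shell quoting needed)."""
    cmd = ["python", "-m", "webwright.run.cli"]
    for config in configs:  # -c is stackable, so repeat it once per config file
        cmd += ["-c", config]
    cmd += ["-t", task, "--start-url", start_url, "--task-id", task_id, "-o", output_dir]
    return cmd


cmd = build_command(
    task="Find cheapest economy flight SEA to JFK on 2026-05-15",
    start_url="https://www.google.com/flights",
    task_id="demo_openai",
)
print(" ".join(cmd[:3]))  # python -m webwright.run.cli
```

Passing the resulting list to `subprocess.run(cmd, check=True)` from inside the cloned repository would then start the run, assuming the package and API key are set up as described in the earlier steps.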

05/05 - Claude Code integration
Use Webwright as a Claude Code Skill
Webwright ships with a built-in Claude Code skill. No separate LLM API key is required beyond a Claude Code subscription. Claude Code reads PNG screenshots natively.

# Project-scoped (inside this repo only)
mkdir -p .claude/skills .claude/commands
ln -s "$PWD/skills/webwright" .claude/skills/webwright
ln -s "$PWD/skills/webwright/commands" .claude/commands/webwright

# User-scoped (all projects)
mkdir -p ~/.claude/skills ~/.claude/commands
ln -s "$PWD/skills/webwright" ~/.claude/skills/webwright
ln -s "$PWD/skills/webwright/commands" ~/.claude/commands/webwright

Restart Claude Code after installation and use the slash commands.

# One-shot task
/webwright:run search Google Flights SEA to JFK 2026-05-15

# Reusable parameterized CLI tool
/webwright:craft search a ticket from LAX to SFO depart June 7

Key takeaways

Rather than predicting browser actions one at a time, Webwright uses a terminal loop in which the agent writes and executes Playwright code.
With GPT-5.4, Webwright reached 86.7% on Online-Mind2Web (100-step budget) and 60.1% on Odyssey, 26.6 points higher than GPT-5.4’s base score of 33.5%.
The harness is about 1,000 lines across three modules, with no multi-agent orchestration.
Qwen3.5-9B reached 66.2% on the hard split of Online-Mind2Web when augmented with pre-built tool scripts.
Task scripts are packaged as reusable CLIs that can be shared between Claude Code, Codex, and OpenClaw.
