Open source AI Web Agent
We’ve been working hard, cooking up something new to share with you all!
Skyvern 2.0 scored state-of-the-art 85.85% on the WebVoyager Eval.
This is best-in-class performance among web agents, giving advanced closed-source web agents like Google Mariner a run for their money.
Achieving this SOTA result required expanding Skyvern’s original architecture. Skyvern 1.0 ran a single prompt in a loop, both making decisions and taking actions on a website. This approach was a good starting point, but it scored only ~45% on the WebVoyager benchmark because it had insufficient memory of previous actions and could not perform complex reasoning.
To solve this problem, we created a self-reflection feedback loop within Skyvern, which resulted in two main changes to its architecture.
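To make the idea concrete, here is a minimal sketch of a decide/act/reflect loop. It is illustrative only: the function names (`plan_next_action`, `execute_action`, `reflect_on_progress`), the data structures, and the stopping rule are assumptions for explanation, not Skyvern’s actual API.

```python
# Illustrative sketch only -- these names are assumptions, not Skyvern's real interface.
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Running memory of what the agent has done so far."""
    goal: str
    history: list[str] = field(default_factory=list)


def plan_next_action(state: AgentState, page_snapshot: str) -> str:
    """Hypothetical LLM call: decide the next action given the goal, history, and page."""
    return f"act on: {page_snapshot[:40]}"


def execute_action(action: str) -> str:
    """Hypothetical browser step: perform the action and return the new page snapshot."""
    return f"page after '{action}'"


def reflect_on_progress(state: AgentState, page_snapshot: str) -> bool:
    """Hypothetical second LLM pass: critique the trajectory so far and decide
    whether the goal is complete (the 'self-reflection' feedback signal)."""
    return len(state.history) >= 3  # toy stopping rule for the sketch


def run_agent(goal: str, start_page: str, max_steps: int = 10) -> AgentState:
    state = AgentState(goal=goal)
    page = start_page
    for _ in range(max_steps):
        # 1. Decide: the planning prompt sees the full action history, not just the page.
        action = plan_next_action(state, page)
        # 2. Act: execute the chosen action in the browser.
        page = execute_action(action)
        state.history.append(action)
        # 3. Reflect: a separate critique pass checks progress before looping again.
        if reflect_on_progress(state, page):
            break
    return state


if __name__ == "__main__":
    final = run_agent("Find the cheapest flight to Tokyo", "search results page")
    print(final.history)
```

The key difference from the single-prompt loop is step 3: a separate critique pass that sees the full trajectory and can catch mistakes before the agent keeps going.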
All tests were run in Skyvern Cloud with an async cloud browser, using a combination of GPT-4o and GPT-4o-mini as the primary decision-making LLMs. The goal of this test is to measure real-world quality: the quality represented by this benchmark is the same as what you would experience with Skyvern’s browsers running asynchronously.
💡 Why is this important? Most benchmarks are run on local browsers with a relatively safe IP address and an impressive browser fingerprint. This is not representative of how autonomous agents will run in the cloud, and we wanted our benchmark to reflect how agents behave in production.
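For concreteness, combining a larger and a smaller model per decision step could look something like the toy helper below. The routing rule and function name are assumptions for illustration; this is not how Skyvern actually divides work between GPT-4o and GPT-4o-mini.

```python
# Illustrative only: route heavier decision steps to GPT-4o, cheaper ones to GPT-4o-mini.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def decide(prompt: str, complex_step: bool) -> str:
    """Pick a model based on a (hypothetical) complexity flag and return its answer."""
    model = "gpt-4o" if complex_step else "gpt-4o-mini"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```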
In addition to the above, we’ve made a few minor tweaks to the dataset to bring it up to date:
🔍 For the curious:
The full dataset can be seen here: https://github.com/Skyvern-AI/skyvern/tree/main/evaluation/datasets
The full list of modifications can be seen here: https://github.com/Skyvern-AI/skyvern/pull/1576/commits/60dc48f4cf3b113ff1850e5267a197c84254edf1
We’re doing something out of the ordinary. In addition to the results, we’re making our entire benchmark run public.
💡 Why is this important? Most benchmarks are run behind closed doors, with impressive results published without any accompanying material to verify them. This makes it hard to understand how things like hallucinations or website drift over time affect agent performance.
We believe this isn’t aligned with our open source mission, so we’ve decided to make our entire eval results public.
📊 All individual run results can be seen here: https://eval.skyvern.com
🔍 The entire Eval dataset can be seen here: https://github.com/Skyvern-AI/skyvern/tree/main/evaluation/datasets
The WebVoyager benchmark tests a variety of prompts on 15 different websites. While this is a good first step in testing web agents, it only captures 15 hand-picked websites out of the millions of active websites on the internet.
We think there is tremendous opportunity here to better evaluate web agents against one another with a more comprehensive benchmark similar to SWE-Bench.
Browser automation is still a nascent space with tons of room for improvement. While we’ve achieved a major milestone in agent performance, a few important issues remain to be solved: