As AI agents near real-world use, how do we know what they can actually do? Reliable benchmarks are critical, but many agentic benchmarks are broken! Example: WebArena marks the answer "45+8 minutes" on a duration-calculation task as correct (the real answer: "63 minutes"). Other benchmarks misestimate agent competence by 1.6-100%. Why are the evaluation foundations for agentic systems so fragile? See below for the thread and links 1/8
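To see why that WebArena result slips through, here is a minimal sketch (hypothetical code, not WebArena's actual evaluator) of a stricter outcome check that normalizes duration answers to total minutes before comparing; it rejects the "45+8 minutes" answer that the benchmark's lenient judge accepted:

```python
from __future__ import annotations
import re

def total_minutes(answer: str) -> int | None:
    """Sum every integer in the string: '45+8 minutes' -> 53, '63 minutes' -> 63."""
    nums = [int(n) for n in re.findall(r"\d+", answer)]
    return sum(nums) if nums else None

def duration_correct(agent_answer: str, reference: str) -> bool:
    """Positive outcome only if the normalized durations actually match."""
    return total_minutes(agent_answer) == total_minutes(reference)

assert total_minutes("45+8 minutes") == 53            # not 63, so the answer is wrong
assert duration_correct("63 minutes", "63 minutes")   # exact answer still passes
assert not duration_correct("45+8 minutes", "63 minutes")
```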
Agentic evaluations differ from traditional ML benchmarks in both task formulation and outcome checking. Agentic benchmarks often rely on fragile simulators (toy websites, databases) that may contain bugs & shortcuts that skew results. Furthermore, task outcomes have no fixed "gold" labels and often require judging unstructured answers (code, API calls, long texts). 3/8
To address these challenges, agentic benchmarks should ensure that a positive evaluation result actually reflects the target agent capability. We decompose this goal into two essential validity criteria:
1. Task Validity: a task is solvable if and only if the agent possesses the target capability.
2. Outcome Validity: the evaluation result is positive if and only if the task is solved. 4/8
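Stated precisely (the notation below is ours, introduced only to pin down the two biconditionals; it is not from the paper):

```latex
% Requires amsmath. t: task, a: agent under test, c: the target capability.
\begin{align*}
\textbf{Task validity:}    &\quad \text{solvable}(t, a) \iff \text{has capability}(a, c) \\
\textbf{Outcome validity:} &\quad \text{positive result}(t, a) \iff \text{solved}(t, a)
\end{align*}
```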
Grounded in 17 popular benchmarks (e.g., SWE-bench, OSWorld, TAU-bench), we develop a 43-item agentic benchmark checklist (ABC) to quickly identify to what extent an agentic benchmark satisfies task and outcome validity. ABC: 5/8
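As a rough illustration of how checklist results could be recorded and tallied per benchmark, here is a sketch; the fields, item texts, and scoring below are invented for the example and are not the actual 43 ABC items:

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    item_id: str
    criterion: str     # "task_validity" or "outcome_validity"
    question: str      # what the reviewer checks for this item
    satisfied: bool

def summarize(items: list[ChecklistItem]) -> dict[str, float]:
    """Fraction of satisfied items per validity criterion."""
    summary: dict[str, float] = {}
    for criterion in {i.criterion for i in items}:
        group = [i for i in items if i.criterion == criterion]
        summary[criterion] = sum(i.satisfied for i in group) / len(group)
    return summary

# Example usage with made-up items:
review = [
    ChecklistItem("T1", "task_validity",
                  "Are tasks free of shortcuts that bypass the target capability?", False),
    ChecklistItem("O1", "outcome_validity",
                  "Does the checker reject near-miss answers?", True),
]
print(summarize(review))   # {'task_validity': 0.0, 'outcome_validity': 1.0}
```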
We applied ABC to 10 impactful benchmarks used to evaluate o3, Gemini 2.5, and Sonnet 4. An overview of our findings:
1. 7/10 benchmarks fail outcome validity
2. 7/10 contain hidden shortcuts or unsolvable tasks
3. Only 2/10 disclose known issues
Stay tuned: we will soon release more quantitative details and fixes for the issues we identified! 6/8
ABC empowers both benchmark & model developers to detect & fix flaws before headline results are reported. Explore the full checklist and examples, and contribute via our website and GitHub repo, so we can build benchmarks worthy of frontier AI together. 7/8
This is joint work with @maxYuxuanZhu, @yadapruksachatk, and other folks from Stanford, Berkeley, Yale, Princeton, MIT, Transluce, ML Commons, Amazon, and UK AISI. 8/8