As AI agents near real-world use, how do we know what they can actually do? Reliable benchmarks are critical, but agentic benchmarks are broken!
Example: WebArena marks "45+8 minutes" as a correct answer to a duration-calculation task (real answer: "63 minutes"); a sketch of this failure mode follows below. Other benchmarks misestimate agent competence by 1.6-100%.
Why are the evaluation foundations for agentic systems so fragile? See the thread below for details and links.
1/8
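To illustrate the kind of outcome-validity failure behind the WebArena example, here is a minimal sketch (my own illustration, not WebArena's actual grading code; the checker names and heuristics are hypothetical): a grader that only checks whether the output looks like a duration passes "45+8 minutes", while one that evaluates the arithmetic and compares it to the reference rejects it, since 45 + 8 = 53, not 63.

```python
import re

REFERENCE = "63 minutes"      # ground-truth duration for the task
PREDICTION = "45+8 minutes"   # agent output that was marked correct

def lenient_check(pred: str) -> bool:
    """Hypothetical lenient grader: passes anything that merely
    looks like a duration answer, without checking its value."""
    return bool(re.search(r"\d+.*minutes", pred))

def strict_check(pred: str, ref: str) -> bool:
    """Stricter grader: evaluate any arithmetic in the prediction
    and compare the resulting minutes against the reference."""
    expr = re.sub(r"[^\d+\-*/]", "", pred)        # keep only the arithmetic
    ref_minutes = int(re.search(r"\d+", ref).group())
    try:
        return eval(expr) == ref_minutes          # 45 + 8 = 53, not 63
    except SyntaxError:
        return False

print(lenient_check(PREDICTION))            # True  -> false positive
print(strict_check(PREDICTION, REFERENCE))  # False -> correctly rejected
```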
Agentic evaluations differ from traditional ML benchmarks in both task formulation and outcome measurement.
Agentic benchmarks often rely on fragile simulators (toy websites, databases) that can contain bugs and shortcuts that skew results. Moreover, task outcomes have no fixed "gold" labels and often require judging unstructured answers (code, API calls, long texts), as the toy sketch after this post illustrates.
3/8
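A concrete (and entirely hypothetical; the simulator and task are invented for illustration) contrast between the two settings: a traditional benchmark compares a prediction against a fixed gold label, while an agentic benchmark has to read success off the simulator's state after the agent acts, so any bug or shortcut in that simulator leaks directly into the score.

```python
from dataclasses import dataclass

# Traditional ML benchmark: compare the prediction to a fixed gold label.
def label_match(prediction: str, gold_label: str) -> bool:
    return prediction == gold_label

# Agentic benchmark (toy example): there is no gold label, so success is
# judged from the simulator's state after the agent has acted.
@dataclass
class ToyShopSimulator:
    orders: list  # orders the agent placed via the toy website

    def task_completed(self, item: str, quantity: int) -> bool:
        # If the toy backend is buggy (say it drops the quantity field),
        # this check silently rewards the wrong behaviour.
        return any(o.get("item") == item and o.get("quantity") == quantity
                   for o in self.orders)

sim = ToyShopSimulator(orders=[{"item": "laptop", "quantity": 2}])
print(label_match("cat", "cat"))                 # gold-label check
print(sim.task_completed("laptop", quantity=2))  # environment-state check
```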
To address these challenges, agentic benchmarks should ensure that a positive evaluation result actually reflects the target AI agent's capability. We decompose this goal into two essential validity criteria (written out formally after this post):
1. Task Validity: A task is solvable if and only if the agent possesses the target capability.
2. Outcome Validity: The evaluation result is positive if and only if the task is solved.
4/8
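One compact way to write the two criteria down (the notation here is mine, not from the paper): let τ be a benchmark task, A the agent under test, and c the capability the benchmark targets.

```latex
% Illustrative formalization of the two validity criteria (notation not from the paper)
\begin{align*}
  \text{Task validity:}    &\quad \mathrm{solvable}(\tau, A) \iff \mathrm{capable}(A, c) \\
  \text{Outcome validity:} &\quad \mathrm{eval}(\tau, A) = \text{positive} \iff \mathrm{solved}(\tau, A)
\end{align*}
% Together, these are meant to ensure that a positive evaluation result
% tracks the agent's target capability.
```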

Grounded in 17 popular benchmarks (e.g., SWE-bench, OSWorld, TAU-bench), we develop a 43-item agentic benchmark checklist (ABC) to quickly assess the extent to which an agentic benchmark satisfies task and outcome validity.
ABC:
5/8
We applied ABC to 10 impactful benchmarks that were used to evaluate o3, Gemini 2.5, and Sonnet 4. Here is an overview of our findings:
1. 7/10 benchmarks fail outcome validity
2. 7/10 contain hidden shortcuts/unsolvable tasks
3. Only 2/10 disclose known issues
Stay tuned. We will soon release more quantitative details and fixes for the issues identified!
6/8
ABC empowers both benchmark and model developers to detect and fix flaws before they turn into headline results.
Explore the full checklist and examples, and contribute via our website and GitHub repo, so we can build benchmarks worthy of frontier AI together.
7/8
This is joint work with @maxYuxuanZhu, @yadapruksachatk, and other folks from Stanford, Berkeley, Yale, Princeton, MIT, Transluce, ML Commons, Amazon, and UK AISI.
8/8