We won first place at the Berkeley AgentX summit for the benchmarks and evaluations track! Congrats to the team :)
Daniel Kang
9.7.2025
As AI agents near real-world use, how do we know what they can actually do? Reliable benchmarks are critical, but agentic benchmarks are broken! Example: WebArena marks "45+8 minutes" as correct on a duration-calculation task (real answer: "63 minutes"). Other benchmarks misestimate agent competence by 1.6-100%. Why are the evaluation foundations for agentic systems so fragile? See below for the thread and links 1/8