We won first place at the Berkeley AgentX summit in the Benchmarks and Evaluations track! Congratulations to the team :)
Daniel Kang · July 9, 2025
As AI agents near real-world use, how do we know what they can actually do? Reliable benchmarks are critical, but agentic benchmarks are broken! Example: WebArena marks "45+8 minutes" as correct on a duration-calculation task (real answer: "63 minutes"). Other benchmarks misestimate agent competence by 1.6-100%. Why are the evaluation foundations for agentic systems so fragile? See below for the thread and links 1/8