We won first place at the Berkeley AgentX Summit in the Benchmarks and Evaluations category! Congratulations to the team :)


July 9, 2025
As AI agents near real-world use, how do we know what they can actually do? Reliable benchmarks are critical, but agentic benchmarks are broken!
Example: WebArena marks "45+8 minutes" on a duration calculation task as correct (real answer: "63 minutes"). Other benchmarks misestimate agent competence by 1.6-100%.
Why are the evaluation foundations for agentic systems fragile? See the thread and links below.
1/8
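To make the duration example concrete: "45+8 minutes" adds up to 53, not 63, so a grader that accepts it is too lenient. The snippet below is only a minimal sketch of a stricter check, not WebArena's actual evaluator. The function names `parse_minutes` and `strict_grade` are hypothetical, and the sketch assumes the expected answer is a whole number of minutes.

```python
import re


def parse_minutes(answer: str):
    """Return the minute count if the answer is exactly '<integer> minutes', else None.

    Hypothetical helper: unevaluated expressions such as '45+8 minutes' do not match.
    """
    match = re.fullmatch(r"\s*(\d+)\s*minutes?\s*", answer)
    return int(match.group(1)) if match else None


def strict_grade(answer: str, expected_minutes: int) -> bool:
    """Mark an answer correct only if it parses to exactly the expected minute count."""
    return parse_minutes(answer) == expected_minutes


print(strict_grade("45+8 minutes", 63))  # False: not a plain minute count
print(strict_grade("63 minutes", 63))    # True
```

Exact-format matching like this trades leniency for precision: it would also reject a correct answer phrased as "1 hour 3 minutes", so a real grader would need to normalize formats before comparing.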