We won first place at the Berkeley AgentX summit for the benchmarks and evaluations track! Congrats to the team :)


July 9, 2025
As AI agents near real-world use, how do we know what they can actually do? Reliable benchmarks are critical, but agentic benchmarks are broken!
Example: WebArena marks "45+8 minutes" on a duration calculation task as correct (real answer: "63 minutes"). Other benchmarks misestimate agent competence by 1.6-100%.
Why are the evaluation foundations for agentic systems fragile? See below for the thread and links.
1/8