Benedikt Stroebl
@benediktstroebl
PhD @Princeton
ID: 1323762441713569792
https://benediktstroebl.github.io/ 03-11-2020 23:02:26
261 Tweets
487 Followers
1.1K Following
🔬 One year on: how close are today’s AI agents to truly accelerating data-driven discovery? We just incorporated ScienceAgentBench into Princeton Center for Information Technology Policy’s Holistic Agent Leaderboard (HAL) and benchmarked the latest frontier LLMs — and we are making progress! 👇 A quick tour of
🔎 We also note that higher thinking does not always lead to better performance on ScienceAgentBench, which is consistent with observations on several other benchmarks evaluated in HAL. 📄 Please check out our paper (arxiv.org/abs/2410.05080) and HAL (hal.cs.princeton.edu/scienceagentbe…) for
Fair and comprehensive agent evaluation is hard. I'm just glad that folks like Sayash Kapoor, Benedikt Stroebl, and Arvind Narayanan put in the hard work to iron out and share these thorny issues so you don't have to.