Sayash Kapoor (@sayashk)'s Twitter Profile
Sayash Kapoor

@sayashk

CS PhD candidate @PrincetonCITP and senior fellow at @Mozilla. I tweet about agents, evaluation, reproducibility, AI for science. Book: aisnakeoil.com

ID: 3084274082

Website: http://cs.princeton.edu/~sayashk · Joined: 15-03-2015 09:03:24

1.1K Tweets

9.9K Followers

1.1K Following

Peter Henderson (@peterhndrsn)'s Twitter Profile Photo

For my workflows, o3 seemed to do better than GPT-5. As a result of Cursor making GPT-5 the preferred model, that experience has deteriorated (imo). These results seem to at least partially back that feeling.

Boyi Wei (@wei_boyi)'s Twitter Profile Photo

We keep updating results for the latest frontier models on a variety of agentic benchmarks. Check hal.cs.princeton.edu for more detailed analysis!

Yu Su @#ICLR2025 (@ysu_nlp)'s Twitter Profile Photo

Excited to partner with the Princeton team on Holistic Agent Leaderboard! Claude continues to be the best choice for agent tasks, but overall we still have a long way to go as a field.

Tianci Xue (@xue_tianci)'s Twitter Profile Photo

Curious about the agentic abilities of today’s ever-emerging models? Wondering how to pick the right one for your complex tasks? 🧠 Check out our new Holistic Agent Leaderboard (HAL), a collaboration with the Princeton team. We evaluated 169 agents across 8 popular benchmarks, including

rishi (@rishibommasani)'s Twitter Profile Photo

I finished my reviews for the NeurIPS position track with an average score of 2/10 and a top score of 3/10. I support publishing position papers at AI venues, but authors (and reviewers) should realize that the track's purpose isn't to serve as a shortcut for publishing second-rate work at NeurIPS...

Kyle Chan (@kyleichan)'s Twitter Profile Photo

These are really cool AI agent benchmarks. Not just solving math problems but actually reproducing scientific papers with code, for example. Two things to highlight:
- Claude Opus 4.1 generally beats GPT-5
- DeepSeek (both V3 and R1) seems very far behind Anthropic and OpenAI

Arvind Narayanan (@random_walker)'s Twitter Profile Photo

I've said it a hundred times but I’ll keep saying it: AI adoption and behavior change are slow — and will stay slow — no matter how fast capabilities improve. The stat in the screenshot is worth pondering: nearly a year after the release of "thinking" models, only a tiny fraction

Sayash Kapoor (@sayashk)'s Twitter Profile Photo

GPT-OSS underperforms even on benchmarks that require raw tool calling. For example, CORE-Bench requires agents to run bash commands to reproduce scientific papers. DeepSeek V3 scores 18%. GPT-OSS scores 11%. x.com/natolambert/st…

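For readers unfamiliar with what "raw tool calling" means here, below is a minimal, hypothetical sketch of the kind of bash-execution loop that benchmarks like CORE-Bench exercise. It is not the actual CORE-Bench harness: `query_model` is a placeholder for whatever chat API serves the model under test (e.g. DeepSeek V3 or GPT-OSS), and the prompt format and step budget are invented for illustration.

```python
import subprocess


def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a chat call to the model under test;
    wire this to whichever model API you actually use."""
    raise NotImplementedError("connect to a model API here")


def run_bash(command: str, timeout: int = 300) -> str:
    """Run one bash command in the task environment and capture its output."""
    result = subprocess.run(
        ["bash", "-lc", command],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout + result.stderr


def attempt_reproduction(task: str, max_steps: int = 20) -> str:
    """Toy agent loop: the model proposes bash commands one at a time,
    the harness executes them and feeds the output back, and the loop
    stops when the model replies DONE or the step budget runs out."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        reply = query_model(
            transcript + "\nReply with exactly one bash command, or DONE."
        ).strip()
        if reply == "DONE":
            break
        transcript += f"\n$ {reply}\n{run_bash(reply)}"
    return transcript
```

The point of the sketch is that the model gets no scaffolding beyond command execution and transcript feedback, which is why weak tool-calling shows up directly in the reproduction score.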
Arvind Narayanan (@random_walker)'s Twitter Profile Photo

I really appreciate that AI companies are doing this type of integration work to spur adoption. Note how nonsensical this would be if AI weren't normal technology, that is, if AI companies expected to build AGI or superintelligence in a few years that they thought would sweep

Arvind Narayanan (@random_walker)'s Twitter Profile Photo

I'm giving a couple of online talks at events open to the public today (Wednesday) and tomorrow:
- A high-level overview of AI as Normal Technology at the Public AI Summit, starting soon: publicaisummit.com
- A talk on AI's impact on science at the Neuro-Symbolic AI summer

Jeremy Howard (@jeremyphoward)'s Twitter Profile Photo

Replying to Joshua Achiam: Nah, that's not what's happening. If it were, I'd be one of the people most excited about recent changes. My long tail of things that models can't do for me hasn't really decreased much in recent months.