Sayash Kapoor (@sayashk)'s Twitter Profile
Sayash Kapoor

@sayashk

CS PhD candidate @PrincetonCITP and senior fellow at @Mozilla. I tweet about agents, evaluation, reproducibility, AI for science. Book: aisnakeoil.com

ID: 3084274082

Website: http://cs.princeton.edu/~sayashk · Joined: 15-03-2015 09:03:24

1.1K Tweets

9.9K Followers

1.1K Following

Peter Henderson (@peterhndrsn)'s Twitter Profile Photo

For my workflows, o3 seemed to do better than GPT-5. As a result of Cursor making GPT-5 the preferred model, that experience has deteriorated (imo). These results seem to at least partially back that feeling.

Boyi Wei (@wei_boyi)'s Twitter Profile Photo

We keep updating results for the latest frontier models on a variety of agentic benchmarks. Check hal.cs.princeton.edu for more detailed analysis!

Yu Su @#ICLR2025 (@ysu_nlp)'s Twitter Profile Photo

Excited to partner with the Princeton team on Holistic Agent Leaderboard! Claude continues to be the best choice for agent tasks, but overall we still have a long way to go as a field.

Tianci Xue (@xue_tianci)'s Twitter Profile Photo

Curious about the agentic abilities of today’s ever-emerging models? Wondering how to pick the right one for your complex tasks? 🧠 Check out our new Holistic Agent Leaderboard (HAL), a collaboration with the Princeton team. We evaluated 169 agents across 8 popular benchmarks, including

rishi (@rishibommasani)'s Twitter Profile Photo

I finished my reviews for the NeurIPS position track with an average score of 2/10 and a top score of 3/10. I support publishing position papers at AI venues, but authors (and reviewers) should realize that the track's purpose isn't to serve as a shortcut for publishing second-rate work at NeurIPS...

Kyle Chan (@kyleichan)'s Twitter Profile Photo

These are really cool AI agent benchmarks. Not just solving math problems but actually reproducing scientific papers with code, for example. Two things to highlight:
- Claude Opus 4.1 generally beats GPT-5
- DeepSeek (both V3 and R1) seems very far behind Anthropic and OpenAI

Arvind Narayanan (@random_walker)'s Twitter Profile Photo

I've said it a hundred times but I’ll keep saying it: AI adoption and behavior change are slow — and will stay slow — no matter how fast capabilities improve. The stat in the screenshot is worth pondering: nearly a year after the release of "thinking" models, only a tiny fraction

Sayash Kapoor (@sayashk)'s Twitter Profile Photo

GPT-OSS underperforms even on benchmarks that require raw tool calling. For example, CORE-Bench requires agents to run bash commands to reproduce scientific papers. DeepSeek V3 scores 18%. GPT-OSS scores 11%. x.com/natolambert/st…

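For readers unfamiliar with what "raw tool calling" means here, below is a minimal, hypothetical sketch of the kind of bash-execution loop that benchmarks like CORE-Bench exercise. It is not the actual CORE-Bench harness: `query_model` is a placeholder for whatever chat API serves the model under test (e.g. DeepSeek V3 or GPT-OSS), and the prompt format and step budget are invented for illustration.

```python
import subprocess


def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a chat call to the model under test;
    wire this to whichever model API you actually use."""
    raise NotImplementedError("connect to a model API here")


def run_bash(command: str, timeout: int = 300) -> str:
    """Run one bash command in the task environment and capture its output."""
    result = subprocess.run(
        ["bash", "-lc", command],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout + result.stderr


def attempt_reproduction(task: str, max_steps: int = 20) -> str:
    """Toy agent loop: the model proposes bash commands one at a time,
    the harness executes them and feeds the output back, and the loop
    stops when the model replies DONE or the step budget runs out."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        reply = query_model(
            transcript + "\nReply with exactly one bash command, or DONE."
        ).strip()
        if reply == "DONE":
            break
        transcript += f"\n$ {reply}\n{run_bash(reply)}"
    return transcript
```

The point of the sketch is that the model gets no scaffolding beyond command execution and transcript feedback, which is why weak tool-calling shows up directly in the reproduction score.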
Arvind Narayanan (@random_walker)'s Twitter Profile Photo

I really appreciate that AI companies are doing this type of integration work to spur adoption. Note how nonsensical this would be if AI weren't normal technology, that is, if AI companies expected to build AGI or superintelligence in a few years that they thought would sweep

Arvind Narayanan (@random_walker)'s Twitter Profile Photo

I'm giving a couple of online talks at events open to the public today (Wednesday) and tomorrow:
- A high-level overview of AI as Normal Technology at the Public AI Summit, starting soon: publicaisummit.com
- A talk on AI's impact on science at the Neuro-Symbolic AI summer

Jeremy Howard (@jeremyphoward)'s Twitter Profile Photo

Replying to Joshua Achiam: Nah, that's not what's happening. If it were, I'd be one of the people most excited about recent changes. My long tail of things that models can't do for me hasn't really decreased much in recent months.