CLS (@chengleisi)'s Twitter Profile
CLS

@chengleisi

PhDing @stanfordnlp | automating scientific research | real AGI is the friends we made along the way

ID: 1025746328603611138

Website: https://noviscl.github.io/ · Joined: 04-08-2018 14:12:36

2.2K Tweets

4.4K Followers

3.3K Following

Rohan Paul (@rohanpaul_ai)'s Twitter Profile Photo

LLM research ideas look shiny on paper but slip when someone actually builds the project.

This 131-page work checks whether those projects still look strong once experts run every experiment.

It shows a clear drop in quality for the LLM ideas, which means judging ideas only at

CLS (@chengleisi)'s Twitter Profile Photo

Are AI scientists already better than human researchers?

We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts.

Main finding: LLM ideas result in worse projects than human ideas.

Thomas Wolf (@thom_wolf)'s Twitter Profile Photo

Can AI ideas hold up in the lab? This study from Stanford says not as well as human ones, but there's hope. With enough training/reasoning, I'm pretty sure LLMs could nail 'small-scale discoveries' — not Nobel stuff, though. Great work CLS, Tatsunori Hashimoto, Diyi Yang

Chris J. Maddison (@cjmaddison)'s Twitter Profile Photo

“Finally, maybe this is controversial but ultimately progress in science is bottlenecked by real-world experiments.” If this is controversial in SF, we’re cooked.

Nathan Labenz (@labenz)'s Twitter Profile Photo

Amazing follow-up work! After finding that AI research ideas were judged (by human experts) better than human ideas... They tested it by actually executing the research projects! Turns out human ideas are better (judges were wrong!) – but only narrowly & not statistically

alphaXiv (@askalphaxiv)'s Twitter Profile Photo

LLMs can generate research ideas that look more novel than humans’, but are they actually better?

Stanford ran a study where LLM- or human-authored ideas were tested

Human ideas were blindly rated consistently better, with LLM ideas seeing 37× larger score drops post-execution

John Bohannon (@bohannon_bot)'s Twitter Profile Photo

July 4th break in our #AI4Science seminar series. Join us next week for a talk by CLS on the epic 2-year experiment evaluating (and executing!) AI-generated scientific ideas. lu.ma/9qq72ebt

Agentica Project (@agentica_)'s Twitter Profile Photo

🚀 Introducing DeepSWE 🤖: our fully open-sourced, SOTA software engineering agent trained purely with RL on top of Qwen3-32B. DeepSWE achieves 59% on SWEBench-Verified with test-time scaling (and 42.2% Pass@1), topping the SWEBench leaderboard for open-weight models.

💪DeepSWE
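For readers unfamiliar with the Pass@1 metric mentioned above: it is commonly computed with the unbiased pass@k estimator introduced alongside HumanEval. A minimal sketch (the function name and the sample counts in the example are illustrative, not numbers from the DeepSWE release):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n sampled attempts per task,
    c of which are correct, the probability that at least one of k
    randomly drawn attempts solves the task is
    1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # fewer than k failures exist, so any k-subset contains a success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative: 10 attempts on a task, 4 correct -> pass@1 = 4/10
print(pass_at_k(10, 4, 1))  # 0.4
```

Test-time scaling in this setting amounts to raising k (or otherwise spending more samples per task), which is why the 59% figure exceeds the 42.2% Pass@1.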

Xiang Yue@ICLR2025🇸🇬 (@xiangyue96)'s Twitter Profile Photo

People are racing to push math reasoning performance in #LLMs—but have we really asked why? The common assumption is that improving math reasoning should transfer to broader capabilities in other domains. But is that actually true?

In our study (arxiv.org/pdf/2507.00432), we

Weijia Shi (@weijiashi2)'s Twitter Profile Photo

Can data owners & LM developers collaborate to build a strong shared model while each retains control of their data? Introducing FlexOlmo💪, a mixture-of-experts LM enabling:
• Flexible training on your local data without sharing it
• Flexible inference to opt in/out your data
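FlexOlmo's actual opt-in/out mechanism is its own design; purely as a toy sketch of the general idea, here is a mixture-of-experts forward pass where a boolean `active` mask disables experts (e.g. those trained on data an owner has opted out) at inference time. All names, shapes, and the use of plain linear experts are hypothetical:

```python
import numpy as np

def moe_forward(x, experts, gate_w, active):
    """Toy mixture-of-experts layer: softmax-gate over experts,
    with opted-out experts masked to -inf so they get zero weight
    and the remaining gate weights renormalize automatically."""
    logits = gate_w @ x                          # one gating score per expert
    logits = np.where(active, logits, -np.inf)   # mask opted-out experts
    w = np.exp(logits - logits[active].max())
    w = w / w.sum()                              # softmax over active experts
    # each "expert" here is just a weight matrix applied to x
    return sum(wi * (E @ x) for wi, E in zip(w, experts) if wi > 0)

rng = np.random.default_rng(0)
d, n_exp = 4, 3
experts = [rng.standard_normal((d, d)) for _ in range(n_exp)]
gate_w = rng.standard_normal((n_exp, d))
x = rng.standard_normal(d)

y_all = moe_forward(x, experts, gate_w, np.array([True, True, True]))
y_opt = moe_forward(x, experts, gate_w, np.array([True, False, True]))  # expert 1 opted out
```

The point of the sketch: opting a data owner out requires no retraining, only skipping their expert in the gate, which is the property the tweet's "flexible inference" bullet describes.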

Johnny Tian-Zheng Wei (@johntzwei)'s Twitter Profile Photo

Are you a researcher, trying to build a small GPU cluster? Did you already build one, and it sucks? I manage USC NLP’s GPU cluster and I’m happy to offer my expertise. I hope I can save you some headaches and make some friends. Please reach out!