CLS (@chengleisi)'s Twitter Profile
CLS

@chengleisi

PhDing @stanfordnlp | automating scientific research | real AGI is the friends we made along the way

ID: 1025746328603611138

Website: https://noviscl.github.io/ · Joined: 04-08-2018 14:12:36

2.2K Tweets

4.4K Followers

3.3K Following

Rohan Paul (@rohanpaul_ai)'s Twitter Profile Photo

LLM research ideas look shiny on paper but slip when someone actually builds the project.

This 131-page work checks whether those projects still look strong once experts run every experiment.

It shows a clear drop in quality for the LLM ideas, which means judging ideas only at

CLS (@chengleisi)'s Twitter Profile Photo

Are AI scientists already better than human researchers?

We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts.

Main finding: LLM ideas result in worse projects than human ideas.

Thomas Wolf (@thom_wolf)'s Twitter Profile Photo

Can AI ideas hold up in the lab? This study from Stanford says not as well as human ones, but there's hope. With enough training/reasoning, I'm pretty sure LLMs could nail 'small-scale discoveries' — not Nobel stuff, though. Great work CLS, Tatsunori Hashimoto, Diyi Yang

Chris J. Maddison (@cjmaddison)'s Twitter Profile Photo

“Finally, maybe this is controversial but ultimately progress in science is bottlenecked by real-world experiments.” If this is controversial in SF, we’re cooked.

Nathan Labenz (@labenz)'s Twitter Profile Photo

Amazing follow-up work! After finding that AI research ideas were judged (by human experts) better than human ideas... They tested it by actually executing the research projects! Turns out human ideas are better (judges were wrong!) – but only narrowly & not statistically

alphaXiv (@askalphaxiv)'s Twitter Profile Photo

LLMs can generate research ideas that look more novel than humans’, but are they actually better?

Stanford ran a study where LLM- or human-authored ideas were tested

Human ideas were blindly rated consistently better, with LLM ideas seeing 37× larger score drops post-execution

John Bohannon (@bohannon_bot)'s Twitter Profile Photo

July 4th break in our #AI4Science seminar series. Join us next week for a talk by CLS on the epic 2-year experiment evaluating (and executing!) AI-generated scientific ideas. lu.ma/9qq72ebt

Agentica Project (@agentica_)'s Twitter Profile Photo

🚀 Introducing DeepSWE 🤖: our fully open-sourced, SOTA software engineering agent trained purely with RL on top of Qwen3-32B. DeepSWE achieves 59% on SWEBench-Verified with test-time scaling (and 42.2% Pass@1), topping the SWEBench leaderboard for open-weight models.

💪DeepSWE
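For readers unfamiliar with the Pass@1 metric mentioned above: it is commonly computed with the unbiased pass@k estimator introduced alongside HumanEval. A minimal sketch (the function name and the sample counts in the example are illustrative, not numbers from the DeepSWE release):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n sampled attempts per task,
    c of which are correct, the probability that at least one of k
    randomly drawn attempts solves the task is
    1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # fewer than k failures exist, so any k-subset contains a success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative: 10 attempts on a task, 4 correct -> pass@1 = 4/10
print(pass_at_k(10, 4, 1))  # 0.4
```

Test-time scaling in this setting amounts to raising k (or otherwise spending more samples per task), which is why the 59% figure exceeds the 42.2% Pass@1.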

Xiang Yue@ICLR2025🇸🇬 (@xiangyue96)'s Twitter Profile Photo

People are racing to push math reasoning performance in #LLMs—but have we really asked why? The common assumption is that improving math reasoning should transfer to broader capabilities in other domains. But is that actually true?

In our study (arxiv.org/pdf/2507.00432), we

Weijia Shi (@weijiashi2)'s Twitter Profile Photo

Can data owners & LM developers collaborate to build a strong shared model while each retains control of their data? Introducing FlexOlmo💪, a mixture-of-experts LM enabling:
• Flexible training on your local data without sharing it
• Flexible inference to opt in/out your data
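FlexOlmo's actual opt-in/out mechanism is its own design; purely as a toy sketch of the general idea, here is a mixture-of-experts forward pass where a boolean `active` mask disables experts (e.g. those trained on data an owner has opted out) at inference time. All names, shapes, and the use of plain linear experts are hypothetical:

```python
import numpy as np

def moe_forward(x, experts, gate_w, active):
    """Toy mixture-of-experts layer: softmax-gate over experts,
    with opted-out experts masked to -inf so they get zero weight
    and the remaining gate weights renormalize automatically."""
    logits = gate_w @ x                          # one gating score per expert
    logits = np.where(active, logits, -np.inf)   # mask opted-out experts
    w = np.exp(logits - logits[active].max())
    w = w / w.sum()                              # softmax over active experts
    # each "expert" here is just a weight matrix applied to x
    return sum(wi * (E @ x) for wi, E in zip(w, experts) if wi > 0)

rng = np.random.default_rng(0)
d, n_exp = 4, 3
experts = [rng.standard_normal((d, d)) for _ in range(n_exp)]
gate_w = rng.standard_normal((n_exp, d))
x = rng.standard_normal(d)

y_all = moe_forward(x, experts, gate_w, np.array([True, True, True]))
y_opt = moe_forward(x, experts, gate_w, np.array([True, False, True]))  # expert 1 opted out
```

The point of the sketch: opting a data owner out requires no retraining, only skipping their expert in the gate, which is the property the tweet's "flexible inference" bullet describes.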

Johnny Tian-Zheng Wei (@johntzwei)'s Twitter Profile Photo

Are you a researcher, trying to build a small GPU cluster? Did you already build one, and it sucks? I manage USC NLP’s GPU cluster and I’m happy to offer my expertise. I hope I can save you some headaches and make some friends. Please reach out!