Greg Durrett (@gregd_nlp) 's Twitter Profile
Greg Durrett

@gregd_nlp

CS professor at UT Austin. Large language models and NLP. he/him

ID: 938457074278846468

Joined: 06-12-2017 17:16:17

1.1K Tweets

6.6K Followers

797 Following

XLLM-Reason-Plan (@xllmreasonplan) 's Twitter Profile Photo

📢 Announcing the first workshop on the Application of LLM Explainability to Reasoning and Planning at <a href="/COLM_conf/">Conference on Language Modeling</a>!
We welcome perspectives from LLM, XAI, and HCI!
CFP (due June 23): …reasoning-planning-workshop.github.io
David Bau (@davidbau) 's Twitter Profile Photo

Dear MAGA friends, I have been worrying about STEM in the US a lot, because right now the Senate is writing new laws that cut 75% of the STEM budget in the US. Sorry for the long post, but the issue is really important, and I want to share what I know about it. The entire

Kanishka Misra ๐ŸŒŠ (@kanishkamisra) 's Twitter Profile Photo

News๐Ÿ—ž๏ธ I will return to UT Austin as an Assistant Professor of Linguistics this fall, and join its vibrant community of Computational Linguists, NLPers, and Cognitive Scientists!๐Ÿค˜ Excited to develop ideas about linguistic and conceptual generalization! Recruitment details soon

News🗞️

I will return to UT Austin as an Assistant Professor of Linguistics this fall, and join its vibrant community of Computational Linguists, NLPers, and Cognitive Scientists!🤘

Excited to develop ideas about linguistic and conceptual generalization! Recruitment details soon
Greg Durrett (@gregd_nlp) 's Twitter Profile Photo

Great to work on this benchmark with astronomers in our NSF-Simons CosmicAI institute! What I like about it: (1) focus on data processing & visualization, a "bite-sized" AI4Sci task (not automating all of research) (2) eval with VLM-as-a-judge (possible with strong, modern VLMs)

Vaishnavh Nagarajan (@_vaishnavh) 's Twitter Profile Photo

📢 New paper on creativity & multi-token prediction! We design minimal open-ended tasks to argue:

→ LLMs are limited in creativity since they learn to predict the next token

→ creativity can be improved via multi-token learning & injecting noise ("seed-conditioning" 🌱) 1/ 🧵
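The thread doesn't spell out how seed-conditioning works; as a rough, hypothetical illustration of the noise-injection idea (prepending random tokens to the prompt so a deterministic decoder can still vary its output), here is a toy sketch. All names are made up and this is not the paper's actual setup.

```python
import random

# Toy sketch: prepend a random token sequence (a "seed") to the prompt.
# A downstream decoder could then be fully greedy/deterministic, yet
# still produce a different output for each seed. Purely illustrative.

def seed_condition(prompt, vocab, seed_len, rng):
    seed = [rng.choice(vocab) for _ in range(seed_len)]
    return " ".join(seed) + " | " + prompt

rng = random.Random(42)
vocab = ["alpha", "beta", "gamma", "delta"]
# Two draws give two differently conditioned prompts for the same task.
p1 = seed_condition("write a short story", vocab, 3, rng)
p2 = seed_condition("write a short story", vocab, 3, rng)
print(p1)
print(p2)
```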
Fangcong Yin (@fangcong_y10593) 's Twitter Profile Photo

Solving complex problems with CoT requires combining different skills.

We can do this by:
🧩 Modifying the CoT data format to be “composable” with other skills
🔥 Training models on each skill
📌 Combining those models

This leads to better 0-shot reasoning on tasks involving skill composition!
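The "combine those models" step above resembles weight-space model merging; the thread gives no implementation details, so this is only a toy sketch of one plausible combination scheme (averaging each skill model's weight delta against a shared base), with hypothetical names and scalar stand-ins for weight tensors.

```python
# Toy sketch: merge skill-specialized models by averaging their weight
# deltas relative to a shared base model. NOT the paper's method --
# just one plausible reading of "combine those models".

def combine_skill_models(base, skill_models):
    """base and each skill model: dict of param name -> float
    (scalar stand-ins for weight tensors)."""
    combined = {}
    for name, base_w in base.items():
        # Average the delta each skill-specific model learned on top of base.
        deltas = [m[name] - base_w for m in skill_models]
        combined[name] = base_w + sum(deltas) / len(deltas)
    return combined

base = {"w": 1.0}
math_model = {"w": 3.0}   # base + delta of 2.0
code_model = {"w": 2.0}   # base + delta of 1.0
merged = combine_skill_models(base, [math_model, code_model])
print(merged)  # {'w': 2.5}
```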
Greg Durrett (@gregd_nlp) 's Twitter Profile Photo

CoT is effective for in-domain reasoning tasks, but Fangcong's work takes a nice step in improving compositional generalization of CoT reasoning. We teach models that atomic CoT skills fit together like puzzle pieces so they can then combine them in novel ways. Lots to do here!

Asher Zheng (@asher_zheng00) 's Twitter Profile Photo

Language is often strategic, but LLMs tend to play nice. How strategic are they really? Probing into that is key for future safety alignment.🛟

👉 Introducing CoBRA🐍, a framework that assesses strategic language.

Work with my amazing advisors <a href="/jessyjli/">Jessy Li</a> and <a href="/David_Beaver/">David Beaver</a>!
🧵👇
CosmicAI (@cosmicai_inst) 's Twitter Profile Photo

CosmicAI collab: benchmarking the utility of LLMs in astronomy coding workflows & focusing on the key research capability of scientific visualization. Sebastian Joseph, Jessy Li, Murtaza Husain, Greg Durrett, Dr. Stephanie Juneau, paul.torrey, Adam Bolton, Stella Offner, Juan Frias, Niall Gaffney

Ryan Marten (@ryanmart3n) 's Twitter Profile Photo

Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals.

We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data
Chaitanya Malaviya (@cmalaviya11) 's Twitter Profile Photo

Ever wondered what makes language models generate overly verbose, vague, or sycophantic responses?

Our new paper investigates these and other idiosyncratic biases in preference models, and presents a simple post-training recipe to mitigate them! Thread below 🧵↓
Bespoke Labs (@bespokelabsai) 's Twitter Profile Photo

Understanding what's in the data is a high-leverage activity when it comes to training/evaluating models and agents.

This week we will drill down into a few popular benchmarks and share some custom viewers that will help pop up various insights. 

Our viewer for GPQA (Google
Xi Ye (@xiye_nlp) 's Twitter Profile Photo

🤔 Recent mech interp work showed that retrieval heads can explain some long-context behavior. But can we use this insight for retrieval?
📣 Introducing QRHeads (query-focused retrieval heads) that enhance retrieval

Main contributions:
🔍 Better head detection: we find a
Leo Liu (@zeyuliu10) 's Twitter Profile Photo

LLMs trained to memorize new facts can't use those facts well.🤔

We apply a hypernetwork to ✏️edit✏️ the gradients for fact propagation, improving accuracy by 2x on a challenging subset of RippleEdit!💡

Our approach, PropMEND, extends MEND with a new objective for propagation.
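The core MEND-style mechanism the tweet alludes to is a small learned network that transforms a raw gradient before it is applied as a weight update. As a toy sketch only: the trivial scale-and-shift "hypernetwork" below is a placeholder, not PropMEND's actual architecture or objective.

```python
# Toy sketch of hypernetwork-based editing: a learned transform rewrites
# the raw gradient, and the rewritten gradient drives the weight update.
# The scale/shift transform here is a stand-in for a real hypernetwork.

def hypernet_edit(grad, scale, shift):
    # Trivial "hypernetwork": rescale and shift each gradient entry.
    return [scale * g + shift for g in grad]

def apply_edit(weights, grad, lr=0.1, scale=2.0, shift=0.0):
    edited = hypernet_edit(grad, scale, shift)
    return [w - lr * g for w, g in zip(weights, edited)]

weights = [1.0, -0.5]
raw_grad = [0.2, 0.4]
new_weights = apply_edit(weights, raw_grad)
print([round(w, 2) for w in new_weights])  # [0.96, -0.58]
```

In the real method, the transform's parameters would themselves be trained so that the edited update makes the model propagate the new fact, not just memorize it.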
Greg Durrett (@gregd_nlp) 's Twitter Profile Photo

I'm excited about Leo's use of hypernetworks for data-efficient knowledge editing! Tweaking what a model learns from data is very powerful & useful for other goals like alignment. Haven't seen much other work building on MEND recently, but let me know what cool stuff we missed!

Greg Durrett (@gregd_nlp) 's Twitter Profile Photo

If we don't do physical work in our jobs, we go to the gym and work out. What are the gyms for skills that LLMs will automate?

Percy Liang (@percyliang) 's Twitter Profile Photo

Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team: Tatsunori Hashimoto, Marcel Rød, Neil Band, Rohith Kuditipudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything:

Xi Ye (@xiye_nlp) 's Twitter Profile Photo

There's been hot debate about (The Illusion of) The Illusion of Thinking. My take: it's not that models can't reason; they just aren't perfect at long-form generation yet. We eval reasoning models on LongProc benchmark (requiring generating 8K CoTs, see thread). Reasoning

David Hall (@dlwh) 's Twitter Profile Photo

So about a month ago, Percy posted a version of this plot of our Marin 32B pretraining run. We got a lot of feedback, both public and private, that the spikes were bad. (This is a thread about how we fixed the spikes. Bear with me. )

Lily Chen (@lilyychenn) 's Twitter Profile Photo

Are we fact-checking medical claims the right way? 🩺🤔

Probably not. In our study, even experts struggled to verify Reddit health claims using end-to-end systems.

We show why, and argue fact-checking should be a dialogue, with patients in the loop

arxiv.org/abs/2506.20876

🧵 1/