Greg Durrett (@gregd_nlp) 's Twitter Profile
Greg Durrett

@gregd_nlp

CS professor at UT Austin. Large language models and NLP. he/him

ID: 938457074278846468

Joined: 06-12-2017 17:16:17

1.1K Tweets

6.6K Followers

797 Following

XLLM-Reason-Plan (@xllmreasonplan) 's Twitter Profile Photo

📢 Announcing the first workshop on the Application of LLM Explainability to Reasoning and Planning at <a href="/COLM_conf/">Conference on Language Modeling</a>!
We welcome perspectives from LLM, XAI, and HCI!
CFP (due June 23): …reasoning-planning-workshop.github.io
David Bau (@davidbau) 's Twitter Profile Photo

Dear MAGA friends, I have been worrying about STEM in the US a lot, because right now the Senate is writing new laws that cut 75% of the STEM budget in the US. Sorry for the long post, but the issue is really important, and I want to share what I know about it. The entire

Kanishka Misra ๐ŸŒŠ (@kanishkamisra) 's Twitter Profile Photo

News๐Ÿ—ž๏ธ I will return to UT Austin as an Assistant Professor of Linguistics this fall, and join its vibrant community of Computational Linguists, NLPers, and Cognitive Scientists!๐Ÿค˜ Excited to develop ideas about linguistic and conceptual generalization! Recruitment details soon

News🗞️

I will return to UT Austin as an Assistant Professor of Linguistics this fall, and join its vibrant community of Computational Linguists, NLPers, and Cognitive Scientists!🤘

Excited to develop ideas about linguistic and conceptual generalization! Recruitment details soon
Greg Durrett (@gregd_nlp) 's Twitter Profile Photo

Great to work on this benchmark with astronomers in our NSF-Simons CosmicAI institute! What I like about it: (1) focus on data processing & visualization, a "bite-sized" AI4Sci task (not automating all of research) (2) eval with VLM-as-a-judge (possible with strong, modern VLMs)

Vaishnavh Nagarajan (@_vaishnavh) 's Twitter Profile Photo

📢 New paper on creativity & multi-token prediction! We design minimal open-ended tasks to argue:

→ LLMs are limited in creativity since they learn to predict the next token

→ creativity can be improved via multi-token learning & injecting noise ("seed-conditioning" 🌱) 1/ 🧵
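The thread doesn't spell out how seed-conditioning works; as a rough, hypothetical illustration of the noise-injection idea (prepending random tokens to the prompt so a deterministic decoder can still vary its output), here is a toy sketch. All names are made up and this is not the paper's actual setup.

```python
import random

# Toy sketch: prepend a random token sequence (a "seed") to the prompt.
# A downstream decoder could then be fully greedy/deterministic, yet
# still produce a different output for each seed. Purely illustrative.

def seed_condition(prompt, vocab, seed_len, rng):
    seed = [rng.choice(vocab) for _ in range(seed_len)]
    return " ".join(seed) + " | " + prompt

rng = random.Random(42)
vocab = ["alpha", "beta", "gamma", "delta"]
# Two draws give two differently conditioned prompts for the same task.
p1 = seed_condition("write a short story", vocab, 3, rng)
p2 = seed_condition("write a short story", vocab, 3, rng)
print(p1)
print(p2)
```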
Fangcong Yin (@fangcong_y10593) 's Twitter Profile Photo

Solving complex problems with CoT requires combining different skills.

We can do this by:
🧩 Modifying the CoT data format to be “composable” with other skills
🔥 Training models on each skill
📌 Combining those models

This leads to better 0-shot reasoning on tasks involving skill composition!
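The "combine those models" step above resembles weight-space model merging; the thread gives no implementation details, so this is only a toy sketch of one plausible combination scheme (averaging each skill model's weight delta against a shared base), with hypothetical names and scalar stand-ins for weight tensors.

```python
# Toy sketch: merge skill-specialized models by averaging their weight
# deltas relative to a shared base model. NOT the paper's method --
# just one plausible reading of "combine those models".

def combine_skill_models(base, skill_models):
    """base and each skill model: dict of param name -> float
    (scalar stand-ins for weight tensors)."""
    combined = {}
    for name, base_w in base.items():
        # Average the delta each skill-specific model learned on top of base.
        deltas = [m[name] - base_w for m in skill_models]
        combined[name] = base_w + sum(deltas) / len(deltas)
    return combined

base = {"w": 1.0}
math_model = {"w": 3.0}   # base + delta of 2.0
code_model = {"w": 2.0}   # base + delta of 1.0
merged = combine_skill_models(base, [math_model, code_model])
print(merged)  # {'w': 2.5}
```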
Greg Durrett (@gregd_nlp) 's Twitter Profile Photo

CoT is effective for in-domain reasoning tasks, but Fangcong's work takes a nice step in improving compositional generalization of CoT reasoning. We teach models that atomic CoT skills fit together like puzzle pieces so they can then combine them in novel ways. Lots to do here!

Asher Zheng (@asher_zheng00) 's Twitter Profile Photo

Language is often strategic, but LLMs tend to play nice. How strategic are they really? Probing into that is key for future safety alignment.🛟

👉 Introducing CoBRA🐍, a framework that assesses strategic language.

Work with my amazing advisors <a href="/jessyjli/">Jessy Li</a> and <a href="/David_Beaver/">David Beaver</a>!
🧵👇
CosmicAI (@cosmicai_inst) 's Twitter Profile Photo

CosmicAI collab: benchmarking the utility of LLMs in astronomy coding workflows & focusing on the key research capability of scientific visualization. Sebastian Joseph, Jessy Li, Murtaza Husain, Greg Durrett, Dr. Stephanie Juneau, paul.torrey, Adam Bolton, Stella Offner, Juan Frias, Niall Gaffney

Ryan Marten (@ryanmart3n) 's Twitter Profile Photo

Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals.

We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data
Chaitanya Malaviya (@cmalaviya11) 's Twitter Profile Photo

Ever wondered what makes language models generate overly verbose, vague, or sycophantic responses?

Our new paper investigates these and other idiosyncratic biases in preference models, and presents a simple post-training recipe to mitigate them! Thread below 🧵↓
Bespoke Labs (@bespokelabsai) 's Twitter Profile Photo

Understanding what's in the data is a high-leverage activity when it comes to training/evaluating models and agents.

This week we will drill down into a few popular benchmarks and share some custom viewers that will help pop up various insights. 

Our viewer for GPQA (Google
Xi Ye (@xiye_nlp) 's Twitter Profile Photo

🤔 Recent mech interp work showed that retrieval heads can explain some long-context behavior. But can we use this insight for retrieval?
📣 Introducing QRHeads (query-focused retrieval heads) that enhance retrieval

Main contributions:
🔍 Better head detection: we find a
Leo Liu (@zeyuliu10) 's Twitter Profile Photo

LLMs trained to memorize new facts can't use those facts well.🤔

We apply a hypernetwork to ✏️edit✏️ the gradients for fact propagation, improving accuracy by 2x on a challenging subset of RippleEdit!💡

Our approach, PropMEND, extends MEND with a new objective for propagation.
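The core MEND-style mechanism the tweet alludes to is a small learned network that transforms a raw gradient before it is applied as a weight update. As a toy sketch only: the trivial scale-and-shift "hypernetwork" below is a placeholder, not PropMEND's actual architecture or objective.

```python
# Toy sketch of hypernetwork-based editing: a learned transform rewrites
# the raw gradient, and the rewritten gradient drives the weight update.
# The scale/shift transform here is a stand-in for a real hypernetwork.

def hypernet_edit(grad, scale, shift):
    # Trivial "hypernetwork": rescale and shift each gradient entry.
    return [scale * g + shift for g in grad]

def apply_edit(weights, grad, lr=0.1, scale=2.0, shift=0.0):
    edited = hypernet_edit(grad, scale, shift)
    return [w - lr * g for w, g in zip(weights, edited)]

weights = [1.0, -0.5]
raw_grad = [0.2, 0.4]
new_weights = apply_edit(weights, raw_grad)
print([round(w, 2) for w in new_weights])  # [0.96, -0.58]
```

In the real method, the transform's parameters would themselves be trained so that the edited update makes the model propagate the new fact, not just memorize it.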
Greg Durrett (@gregd_nlp) 's Twitter Profile Photo

I'm excited about Leo's use of hypernetworks for data-efficient knowledge editing! Tweaking what a model learns from data is very powerful & useful for other goals like alignment. Haven't seen much other work building on MEND recently, but let me know what cool stuff we missed!

Greg Durrett (@gregd_nlp) 's Twitter Profile Photo

If we don't do physical work in our jobs, we go to the gym and work out. What are the gyms for skills that LLMs will automate?

Percy Liang (@percyliang) 's Twitter Profile Photo

Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team: Tatsunori Hashimoto, Marcel Rød, Neil Band, Rohith Kuditipudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything:

Xi Ye (@xiye_nlp) 's Twitter Profile Photo

There's been hot debate about (The Illusion of) The Illusion of Thinking. My take: it's not that models can't reason; they just aren't perfect at long-form generation yet. We eval reasoning models on LongProc benchmark (requiring generating 8K CoTs, see thread). Reasoning

David Hall (@dlwh) 's Twitter Profile Photo

So about a month ago, Percy posted a version of this plot of our Marin 32B pretraining run. We got a lot of feedback, both public and private, that the spikes were bad. (This is a thread about how we fixed the spikes. Bear with me. )

Lily Chen (@lilyychenn) 's Twitter Profile Photo

Are we fact-checking medical claims the right way? 🩺🤔

Probably not. In our study, even experts struggled to verify Reddit health claims using end-to-end systems.

We show why, and argue fact-checking should be a dialogue, with patients in the loop

arxiv.org/abs/2506.20876

🧵 1/