Swarnadeep Saha (@swarnanlp)'s Twitter Profile
Swarnadeep Saha

@swarnanlp

Research Scientist @AIatMeta (FAIR) working on Reasoning. Past: @Google PhD fellow @uncnlp. Gooner.

ID: 2485053080

Link: https://swarnahub.github.io/ · Joined: 09-05-2014 08:54:56

603 Tweets

1.1K Followers

816 Following

Rohan Paul (@rohanpaul_ai)

Evaluation of LLMs is difficult due to judge models using limited reasoning and suffering from biases.

This paper proposes J1, a method using reinforcement learning to train LLM judges for improved thinking and reduced bias.

Methods  🔧:

→ Convert judgment tasks, even
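
A minimal sketch of the kind of verifiable training signal an RL-trained judge like this can be optimized against: reward the verdict only when it is correct under both response orderings, which directly penalizes position bias. The `judge` callable and its signature are illustrative assumptions, not the paper's API.

```python
# Minimal sketch (not the paper's code): a verifiable reward for a pairwise
# LLM judge, scored under both response orderings to discourage position bias.
# `judge` is a hypothetical callable returning "A" or "B" for a comparison.

def pairwise_judge_reward(judge, question, resp_good, resp_bad):
    """Return 1.0 only if the judge picks the known-better response in both
    the (good, bad) ordering and the swapped (bad, good) ordering."""
    verdict_original = judge(question, a=resp_good, b=resp_bad)  # correct verdict: "A"
    verdict_swapped = judge(question, a=resp_bad, b=resp_good)   # correct verdict: "B"
    return 1.0 if (verdict_original == "A" and verdict_swapped == "B") else 0.0


# Toy usage with a stub judge that always answers "A":
stub_judge = lambda question, a, b: "A"
print(pairwise_judge_reward(stub_judge, "2+2?", "4", "5"))  # 0.0: fails the swapped check
```
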
John Schulman (@johnschulman2)

For people who don't like Claude's behavior here (and I think it's totally valid to disagree with it), I encourage you to describe your own recommended policy for what agentic models should do when users ask them to help commit heinous crimes. Your options are (1) actively try to

DAIR.AI (@dair_ai)

3. J1 Introduces a novel training approach for LLMs to act as evaluators (LLM-as-a-Judge) by explicitly incentivizing thoughtful reasoning during judgment. x.com/jaseweston/sta…

Swarnadeep Saha (@swarnanlp)

Check out our new paper where we compared offline and (Semi-)Online DPO with GRPO for post-training LLMs. This led to some interesting findings! 👇
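
For context on the comparison, this is the standard DPO objective that the offline and (semi-)online variants share; as a rough characterization, the variants differ mainly in how fresh the preference pairs are (while GRPO instead uses group-normalized scalar rewards), not in this loss. The numbers below are toy sequence log-probabilities.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss on one preference pair, given sequence log-probs under the
    policy and the frozen reference model."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin))

# Toy numbers: the policy already prefers the chosen response slightly.
print(dpo_loss(torch.tensor(-10.0), torch.tensor(-12.0),
               torch.tensor(-11.0), torch.tensor(-11.5)).item())
```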

Swarnadeep Saha (@swarnanlp)

I'm gonna be at #ICML2025 next week to present EvalPlanner (Thursday, 4:30-7 pm). Please reach out if you'd like to talk about reward models, reasoning, synthetic data, and generally the research we're doing at FAIR.

Jason Weston (@jaseweston)

We worked on a whole line of research on this:
- Self-Rewarding LMs (use self as a Judge in semi-online DPO): arxiv.org/abs/2401.10020
- Thinking LLMs (learn CoTs with a Judge with semi-online DPO): arxiv.org/abs/2410.10630 *poster at ICML this week!!*
- Mix verifiable &

Jason Weston (@jaseweston)

...is today a good day for new paper posts? 
🤖Learning to Reason for Factuality 🤖
📝: arxiv.org/abs/2508.05618
- New reward func for GRPO training of long CoTs for *factuality*
- Design stops reward hacking by favoring precision, detail AND quality
- Improves base model across
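
The actual reward function is specified in the paper; purely as an illustration of the stated design goal (precision, detail, and quality combined so none can be hacked in isolation), a reward of this general shape behaves that way. All weights, thresholds, and claim counts below are made-up placeholders.

```python
# Illustrative only -- not the paper's exact reward. The idea: factual
# precision alone can be gamed by saying almost nothing, and detail alone by
# saying too much, so combine both with an overall quality term.

def factuality_reward(num_supported, num_claims, quality_score,
                      detail_target=10, w_precision=1.0, w_detail=0.5, w_quality=0.5):
    precision = num_supported / max(num_claims, 1)       # fraction of claims that are supported
    detail = min(num_supported / detail_target, 1.0)     # saturating credit for supported detail
    return w_precision * precision + w_detail * detail + w_quality * quality_score

# A terse but fully correct answer vs. a detailed, mostly correct one:
print(factuality_reward(num_supported=2, num_claims=2, quality_score=0.8))   # ~1.50
print(factuality_reward(num_supported=9, num_claims=10, quality_score=0.8))  # ~1.75
```
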
Zhaopeng Tu (@tuzhaopeng)

Thank you for building on our overthinking and underthinking research! OptimalThinkingBench provides exactly what the field needs - a unified framework to measure the sweet spot between excessive and insufficient reasoning. The finding that current methods improve one aspect

Swarnadeep Saha (@swarnanlp)

Got a new efficient/optimally-thinking LLM? Does your model answer simple queries quickly and spend compute on the harder ones? Test it on our new benchmark, OptimalThinkingBench! 👇 Work led by the amazing Pranjal Aggarwal ✈️ COLM 🍁 during his internship!

Rohan Paul (@rohanpaul_ai)

Great AI at Meta paper.

Builds a single test that shows when LLMs think too much or too little, then scores both.

It targets a gap: reasoning models ramble on easy questions while fast models miss steps on hard ones.

They release a benchmark called OptimalThinkingBench 
 with
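
As a rough illustration only (not the benchmark's official metric), a joint over/underthinking score can be built by penalizing long reasoning traces on an easy split, requiring plain accuracy on a hard split, and averaging the two; the token budget and split definitions here are assumptions.

```python
# Rough illustration of scoring both failure modes at once.

def overthinking_score(easy_results, token_budget=100):
    """easy_results: list of (is_correct, num_thinking_tokens) on easy queries."""
    ok = [c and t <= token_budget for c, t in easy_results]
    return sum(ok) / len(easy_results)

def underthinking_score(hard_results):
    """hard_results: list of is_correct booleans on hard queries."""
    return sum(hard_results) / len(hard_results)

easy = [(True, 40), (True, 900), (False, 30)]   # second answer "rambles"
hard = [True, False, True]
print((overthinking_score(easy) + underthinking_score(hard)) / 2)
```
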
Justin Chih-Yao Chen (@cyjustinchen)

Excited to share that MAgICoRe has been accepted to #EMNLP2025 main! 🎉 Our work identifies 3 key challenges in LLM refinement for reasoning:
1) Over-correction on easy problems
2) Failure to localize and fix its own errors
3) Too few refinement iterations for harder problems

Swarnadeep Saha (@swarnanlp)

Post-training with RL causes diversity collapse!! We found a way to directly incorporate semantic diversity as an additional reward that improves both quality and diversity of outputs. 👇
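
A minimal sketch of the general idea, assuming responses are embedded with any off-the-shelf sentence encoder (random vectors stand in below); the paper's exact diversity term and weighting may differ.

```python
import numpy as np

# Sketch only: augment each sampled response's quality reward with a
# semantic-diversity bonus, here the mean cosine *dissimilarity* to the
# other responses in the same group.

def diversity_bonus(embeddings):
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = e @ e.T                                  # pairwise cosine similarities
    n = len(e)
    mean_sim_to_others = (sim.sum(axis=1) - 1.0) / (n - 1)
    return 1.0 - mean_sim_to_others                # high when a response is unlike the rest

quality = np.array([0.9, 0.7, 0.8, 0.6])           # per-response quality rewards
embeddings = np.random.randn(4, 16)                # stand-in for real sentence embeddings
rewards = quality + 0.3 * diversity_bonus(embeddings)
print(rewards)
```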

Swarnadeep Saha (@swarnanlp)

Turns out we can use RLVR to teach a model to aggregate multiple solutions. Check out our new work on parallel test-time scaling!👇
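
A minimal sketch of the training signal, under the assumption that the aggregator reads several candidate solutions and emits one final answer that is scored against a reference; majority vote is included only as the obvious non-learned baseline.

```python
from collections import Counter

# Sketch (not the paper's code): an aggregator model sees k candidate
# solutions in its prompt and emits one final answer; RLVR rewards it 1/0
# for matching the reference answer.

def verifiable_reward(aggregated_answer, reference_answer):
    return 1.0 if aggregated_answer.strip() == reference_answer.strip() else 0.0

def majority_vote(candidate_answers):
    return Counter(a.strip() for a in candidate_answers).most_common(1)[0][0]

candidates = ["42", "41", "42", "42"]
print(majority_vote(candidates))                       # "42"
print(verifiable_reward("42", reference_answer="42"))  # 1.0
```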

Gabriel Synnaeve (@syhw)

(🧵) Today, we release Meta Code World Model (CWM), a 32-billion-parameter dense LLM that enables novel research on improving code generation through agentic reasoning and planning with world models. ai.meta.com/research/publi…

Mohit Bansal (@mohitban47)

🚨 "Think the right amount" for improving both reasoning accuracy and efficiency! --> Large reasoning models under-adapt = underthink on hard problems and overthink on easy ones --> ✨TRAAC✨ is an online RL, difficulty-adaptive, attention-based compression method that prunes

Jason Weston (@jaseweston)

Hybrid Reinforcement (HERO): When Reward Is Sparse, It’s Better to Be Dense 🦸‍♂️ 💪
 📝: arxiv.org/abs/2510.07242

- HERO bridges 0–1 verifiable rewards and dense reward models into one 'hybrid' RL method
- Tackles the brittleness of binary signals and the noise of pure reward
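
One plausible way to blend the two signals, shown for illustration only (the paper's exact scheme may differ): keep the 0/1 verifiable reward as the dominant term and use the dense reward-model score only to rank responses within each verifiable group.

```python
import numpy as np

# Illustrative hybrid: the binary verifiable reward anchors the signal, and
# within-group normalized reward-model scores add fine-grained shading.

def hybrid_rewards(verifiable, dense, scale=0.2):
    verifiable = np.asarray(verifiable, dtype=float)   # 0/1 per response
    dense = np.asarray(dense, dtype=float)             # raw reward-model scores
    out = verifiable.copy()
    for flag in (0.0, 1.0):
        mask = verifiable == flag
        if mask.sum() > 1:
            d = dense[mask]
            out[mask] += scale * (d - d.mean()) / (d.std() + 1e-6)  # within-group normalization
    return out

print(hybrid_rewards(verifiable=[1, 1, 0, 0], dense=[0.9, 0.4, 0.6, 0.1]))
```
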
Mimansa Jaiswal (@mimansaj)

I was impacted by Meta layoffs today. As a Research Scientist working on LLM posttraining (reward models, DPO/GRPO) & automated evaluation pipelines, I’ve focused on understanding why/where models fail & how to make them better. I’m looking for opportunities; please reach out!

Jason Weston (@jaseweston)

🌶️SPICE: Self-Play in Corpus Environments🌶️
📝: arxiv.org/abs/2510.24684
- Challenger creates tasks based on *corpora*
- Reasoner solves them
- Both trained together ⚔️ -> automatic curriculum!
🔥 Outperforms standard (ungrounded) self-play
Grounding fixes hallucination & lack of
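
A structural sketch of the self-play loop described above; every model call is a stand-in stub, and the challenger's reward shaping is deliberately simplified relative to whatever the paper actually does.

```python
import random

# Structural sketch only: a challenger mines tasks from a corpus, a reasoner
# tries to solve them, and both receive rewards. The real reward shaping
# (e.g. targeting tasks of intermediate difficulty) is more involved.

corpus = ["Document about rivers.", "Document about prime numbers."]

def challenger_propose(doc):
    return {"question": f"Question grounded in: {doc}", "answer": "stub-answer"}

def reasoner_solve(question):
    return random.choice(["stub-answer", "wrong-answer"])

for step in range(3):
    task = challenger_propose(random.choice(corpus))
    solved = reasoner_solve(task["question"]) == task["answer"]
    reasoner_reward = 1.0 if solved else 0.0
    challenger_reward = 1.0 - reasoner_reward   # simplified: push toward tasks the reasoner fails
    print(step, solved, reasoner_reward, challenger_reward)
```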