Swarnadeep Saha (@swarnanlp)'s Twitter Profile
Swarnadeep Saha

@swarnanlp

Research Scientist @AIatMeta (FAIR) working on Reasoning. Past: @Google PhD fellow @uncnlp. Gooner.

ID: 2485053080

Link: https://swarnahub.github.io/ · Joined: 09-05-2014 08:54:56

603 Tweets

1.1K Followers

816 Following

Rohan Paul (@rohanpaul_ai)

Evaluation of LLMs is difficult due to judge models using limited reasoning and suffering from biases.

This paper proposes J1, a method using reinforcement learning to train LLM judges for improved thinking and reduced bias.

Methods  🔧:

→ Convert judgment tasks, even
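
A minimal sketch of the kind of verifiable training signal an RL-trained judge like this can be optimized against: reward the verdict only when it is correct under both response orderings, which directly penalizes position bias. The `judge` callable and its signature are illustrative assumptions, not the paper's API.

```python
# Minimal sketch (not the paper's code): a verifiable reward for a pairwise
# LLM judge, scored under both response orderings to discourage position bias.
# `judge` is a hypothetical callable returning "A" or "B" for a comparison.

def pairwise_judge_reward(judge, question, resp_good, resp_bad):
    """Return 1.0 only if the judge picks the known-better response in both
    the (good, bad) ordering and the swapped (bad, good) ordering."""
    verdict_original = judge(question, a=resp_good, b=resp_bad)  # correct verdict: "A"
    verdict_swapped = judge(question, a=resp_bad, b=resp_good)   # correct verdict: "B"
    return 1.0 if (verdict_original == "A" and verdict_swapped == "B") else 0.0


# Toy usage with a stub judge that always answers "A":
stub_judge = lambda question, a, b: "A"
print(pairwise_judge_reward(stub_judge, "2+2?", "4", "5"))  # 0.0: fails the swapped check
```
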
John Schulman (@johnschulman2)

For people who don't like Claude's behavior here (and I think it's totally valid to disagree with it), I encourage you to describe your own recommended policy for what agentic models should do when users ask them to help commit heinous crimes. Your options are (1) actively try to

DAIR.AI (@dair_ai)

3. J1 Introduces a novel training approach for LLMs to act as evaluators (LLM-as-a-Judge) by explicitly incentivizing thoughtful reasoning during judgment. x.com/jaseweston/sta…

Swarnadeep Saha (@swarnanlp)

Check out our new paper where we compared offline and (Semi-)Online DPO with GRPO for post-training LLMs. This led to some interesting findings! 👇
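
For context on the comparison, this is the standard DPO objective that the offline and (semi-)online variants share; as a rough characterization, the variants differ mainly in how fresh the preference pairs are (while GRPO instead uses group-normalized scalar rewards), not in this loss. The numbers below are toy sequence log-probabilities.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss on one preference pair, given sequence log-probs under the
    policy and the frozen reference model."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin))

# Toy numbers: the policy already prefers the chosen response slightly.
print(dpo_loss(torch.tensor(-10.0), torch.tensor(-12.0),
               torch.tensor(-11.0), torch.tensor(-11.5)).item())
```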

Swarnadeep Saha (@swarnanlp)

I'm gonna be at #ICML2025 next week to present EvalPlanner (Thursday, 4:30-7 pm). Please reach out if you'd like to talk about reward models, reasoning, synthetic data, and generally the research we're doing at FAIR.

Jason Weston (@jaseweston)

We worked on a whole line of research on this:
- Self-Rewarding LMs (use self as a Judge in semi-online DPO): arxiv.org/abs/2401.10020
- Thinking LLMs (learn CoTs with a Judge with semi-online DPO): arxiv.org/abs/2410.10630 *poster at ICML this week!!*
- Mix verifiable &

Jason Weston (@jaseweston)

...is today a good day for new paper posts? 
🤖Learning to Reason for Factuality 🤖
📝: arxiv.org/abs/2508.05618
- New reward func for GRPO training of long CoTs for *factuality*
- Design stops reward hacking by favoring precision, detail AND quality
- Improves base model across
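
The actual reward function is specified in the paper; purely as an illustration of the stated design goal (precision, detail, and quality combined so none can be hacked in isolation), a reward of this general shape behaves that way. All weights, thresholds, and claim counts below are made-up placeholders.

```python
# Illustrative only -- not the paper's exact reward. The idea: factual
# precision alone can be gamed by saying almost nothing, and detail alone by
# saying too much, so combine both with an overall quality term.

def factuality_reward(num_supported, num_claims, quality_score,
                      detail_target=10, w_precision=1.0, w_detail=0.5, w_quality=0.5):
    precision = num_supported / max(num_claims, 1)       # fraction of claims that are supported
    detail = min(num_supported / detail_target, 1.0)     # saturating credit for supported detail
    return w_precision * precision + w_detail * detail + w_quality * quality_score

# A terse but fully correct answer vs. a detailed, mostly correct one:
print(factuality_reward(num_supported=2, num_claims=2, quality_score=0.8))   # ~1.50
print(factuality_reward(num_supported=9, num_claims=10, quality_score=0.8))  # ~1.75
```
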
Zhaopeng Tu (@tuzhaopeng)

Thank you for building on our overthinking and underthinking research! OptimalThinkingBench provides exactly what the field needs - a unified framework to measure the sweet spot between excessive and insufficient reasoning. The finding that current methods improve one aspect

Swarnadeep Saha (@swarnanlp)

Got a new efficient/optimally-thinking LLM? Does your model answer simple queries quickly and spend compute on the harder ones? Test it on our new benchmark, OptimalThinkingBench! 👇 Work led by the amazing Pranjal Aggarwal ✈️ COLM 🍁 during his internship!

Rohan Paul (@rohanpaul_ai)

Great AI at Meta paper.

Builds a single test that shows when LLMs think too much or too little, then scores both.

It targets a gap: reasoning models ramble on easy questions while fast models miss steps on hard ones.

They release a benchmark called OptimalThinkingBench 
 with
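
As a rough illustration only (not the benchmark's official metric), a joint over/underthinking score can be built by penalizing long reasoning traces on an easy split, requiring plain accuracy on a hard split, and averaging the two; the token budget and split definitions here are assumptions.

```python
# Rough illustration of scoring both failure modes at once.

def overthinking_score(easy_results, token_budget=100):
    """easy_results: list of (is_correct, num_thinking_tokens) on easy queries."""
    ok = [c and t <= token_budget for c, t in easy_results]
    return sum(ok) / len(easy_results)

def underthinking_score(hard_results):
    """hard_results: list of is_correct booleans on hard queries."""
    return sum(hard_results) / len(hard_results)

easy = [(True, 40), (True, 900), (False, 30)]   # second answer "rambles"
hard = [True, False, True]
print((overthinking_score(easy) + underthinking_score(hard)) / 2)
```
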
Justin Chih-Yao Chen (@cyjustinchen)

Excited to share that MAgICoRe has been accepted to #EMNLP2025 main! 🎉 Our work identifies 3 key challenges in LLM refinement for reasoning:
1) Over-correction on easy problems
2) Failure to localize and fix its own errors
3) Too few refinement iterations for harder problems

Swarnadeep Saha (@swarnanlp)

Post-training with RL causes diversity collapse!! We found a way to directly incorporate semantic diversity as an additional reward that improves both quality and diversity of outputs. 👇
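
A minimal sketch of the general idea, assuming responses are embedded with any off-the-shelf sentence encoder (random vectors stand in below); the paper's exact diversity term and weighting may differ.

```python
import numpy as np

# Sketch only: augment each sampled response's quality reward with a
# semantic-diversity bonus, here the mean cosine *dissimilarity* to the
# other responses in the same group.

def diversity_bonus(embeddings):
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = e @ e.T                                  # pairwise cosine similarities
    n = len(e)
    mean_sim_to_others = (sim.sum(axis=1) - 1.0) / (n - 1)
    return 1.0 - mean_sim_to_others                # high when a response is unlike the rest

quality = np.array([0.9, 0.7, 0.8, 0.6])           # per-response quality rewards
embeddings = np.random.randn(4, 16)                # stand-in for real sentence embeddings
rewards = quality + 0.3 * diversity_bonus(embeddings)
print(rewards)
```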

Swarnadeep Saha (@swarnanlp)

Turns out we can use RLVR to teach a model to aggregate multiple solutions. Check out our new work on parallel test-time scaling!👇
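
A minimal sketch of the training signal, under the assumption that the aggregator reads several candidate solutions and emits one final answer that is scored against a reference; majority vote is included only as the obvious non-learned baseline.

```python
from collections import Counter

# Sketch (not the paper's code): an aggregator model sees k candidate
# solutions in its prompt and emits one final answer; RLVR rewards it 1/0
# for matching the reference answer.

def verifiable_reward(aggregated_answer, reference_answer):
    return 1.0 if aggregated_answer.strip() == reference_answer.strip() else 0.0

def majority_vote(candidate_answers):
    return Counter(a.strip() for a in candidate_answers).most_common(1)[0][0]

candidates = ["42", "41", "42", "42"]
print(majority_vote(candidates))                       # "42"
print(verifiable_reward("42", reference_answer="42"))  # 1.0
```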

Gabriel Synnaeve (@syhw)

(🧵) Today, we release Meta Code World Model (CWM), a 32-billion-parameter dense LLM that enables novel research on improving code generation through agentic reasoning and planning with world models. ai.meta.com/research/publi…

Mohit Bansal (@mohitban47)

🚨 "Think the right amount" for improving both reasoning accuracy and efficiency! --> Large reasoning models under-adapt = underthink on hard problems and overthink on easy ones --> ✨TRAAC✨ is an online RL, difficulty-adaptive, attention-based compression method that prunes

Jason Weston (@jaseweston)

Hybrid Reinforcement (HERO): When Reward Is Sparse, It’s Better to Be Dense 🦸‍♂️ 💪
 📝: arxiv.org/abs/2510.07242

- HERO bridges 0–1 verifiable rewards and dense reward models into one 'hybrid' RL method
- Tackles the brittleness of binary signals and the noise of pure reward
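
One plausible way to blend the two signals, shown for illustration only (the paper's exact scheme may differ): keep the 0/1 verifiable reward as the dominant term and use the dense reward-model score only to rank responses within each verifiable group.

```python
import numpy as np

# Illustrative hybrid: the binary verifiable reward anchors the signal, and
# within-group normalized reward-model scores add fine-grained shading.

def hybrid_rewards(verifiable, dense, scale=0.2):
    verifiable = np.asarray(verifiable, dtype=float)   # 0/1 per response
    dense = np.asarray(dense, dtype=float)             # raw reward-model scores
    out = verifiable.copy()
    for flag in (0.0, 1.0):
        mask = verifiable == flag
        if mask.sum() > 1:
            d = dense[mask]
            out[mask] += scale * (d - d.mean()) / (d.std() + 1e-6)  # within-group normalization
    return out

print(hybrid_rewards(verifiable=[1, 1, 0, 0], dense=[0.9, 0.4, 0.6, 0.1]))
```
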
Mimansa Jaiswal (@mimansaj)

I was impacted by Meta layoffs today. As a Research Scientist working on LLM posttraining (reward models, DPO/GRPO) & automated evaluation pipelines, I’ve focused on understanding why/where models fail & how to make them better. I’m looking for opportunities; please reach out!

Jason Weston (@jaseweston)

🌶️SPICE: Self-Play in Corpus Environments🌶️
📝: arxiv.org/abs/2510.24684
- Challenger creates tasks based on *corpora*
- Reasoner solves them
- Both trained together ⚔️ -> automatic curriculum!
🔥 Outperforms standard (ungrounded) self-play
Grounding fixes hallucination & lack of
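
A structural sketch of the self-play loop described above; every model call is a stand-in stub, and the challenger's reward shaping is deliberately simplified relative to whatever the paper actually does.

```python
import random

# Structural sketch only: a challenger mines tasks from a corpus, a reasoner
# tries to solve them, and both receive rewards. The real reward shaping
# (e.g. targeting tasks of intermediate difficulty) is more involved.

corpus = ["Document about rivers.", "Document about prime numbers."]

def challenger_propose(doc):
    return {"question": f"Question grounded in: {doc}", "answer": "stub-answer"}

def reasoner_solve(question):
    return random.choice(["stub-answer", "wrong-answer"])

for step in range(3):
    task = challenger_propose(random.choice(corpus))
    solved = reasoner_solve(task["question"]) == task["answer"]
    reasoner_reward = 1.0 if solved else 0.0
    challenger_reward = 1.0 - reasoner_reward   # simplified: push toward tasks the reasoner fails
    print(step, solved, reasoner_reward, challenger_reward)
```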