Weizhe Yuan (@weizhey)'s Twitter Profile
Weizhe Yuan

@weizhey

Ph.D. student at @nyuniversity. Visiting researcher at @AIatMeta. Previously: intern @cohere, MCDS @LTIatCMU. Working on ML/NLP. Painting lover 🎨.

ID: 1172234535116988416

Link: https://yyy-apple.github.io/
Joined: 12-09-2019 19:44:34

23 Tweets

301 Followers

250 Following

Jason Weston (@jaseweston):

🚨New paper!🚨
Self-Rewarding LMs
- The LM provides rewards on its own generations via LLM-as-a-Judge during Iterative DPO
- Reward modeling ability improves during training rather than staying fixed
...opens the door to superhuman feedback?
arxiv.org/abs/2401.10020
🧵(1/5)
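
As a rough illustration of the loop summarized in the tweet above, here is a minimal Python sketch of one self-rewarding iteration: the same model generates candidates, scores them as an LLM-as-a-Judge, and the best/worst pair feeds a DPO update. All helpers (`generate`, `judge_score`, `dpo_update`) are hypothetical placeholders, not the paper's code.

```python
# Minimal sketch of one Self-Rewarding iteration (hypothetical helpers, not the paper's code).
import random

def generate(model, prompt: str) -> str:
    """Placeholder: sample one response from the current model."""
    return f"response to '{prompt}' #{random.randint(0, 999)}"

def judge_score(model, prompt: str, response: str) -> float:
    """Placeholder LLM-as-a-Judge call: the *same* model scores its own response (0-5)."""
    return random.uniform(0, 5)

def dpo_update(model, preference_pairs):
    """Placeholder for one round of DPO training on (prompt, chosen, rejected) triples."""
    return model

def self_rewarding_iteration(model, prompts, n_samples: int = 4):
    pairs = []
    for prompt in prompts:
        candidates = [generate(model, prompt) for _ in range(n_samples)]
        ranked = sorted(candidates, key=lambda r: judge_score(model, prompt, r))
        pairs.append((prompt, ranked[-1], ranked[0]))  # best = chosen, worst = rejected
    return dpo_update(model, pairs)

model = object()  # stand-in for an actual LLM
for _ in range(3):  # iterative DPO: generate -> self-judge -> train, then repeat
    model = self_rewarding_iteration(model, ["Explain DPO in one sentence."])
```

The property the tweet highlights is that acting and judging live in the same model, so the reward-modeling ability is trained along with generation rather than staying fixed.
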
Matthias Gallé (@mgalle):

In this work, led by Weizhe Yuan and coming to #ACL2024, we use this capacity to leverage *criteria* for specific writing tasks (🧑‍🎓,🧑‍💻). The feedback was almost always *valid* and *contextual*, and most often *constructive* and *helpful*. arxiv.org/abs/2403.01069

Jason Weston (@jaseweston):

🚨 New paper! 🚨
Following Length Constraints in Instructions
- Shows SOTA LLMs can't follow length instructions
- Introduces LIFT-DPO that fixes the problem
- Helps solve length bias evaluation & training issues
arxiv.org/abs/2406.17744
🧵(1/7)
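
A loose sketch of the underlying idea (an assumption on my part, not the LIFT-DPO construction from the paper): length-instruction compliance is programmatically checkable, so a non-compliant generation can serve as the rejected example in a preference pair.

```python
# Rough sketch: turning a length constraint into a programmatic preference signal.
# This is an assumed simplification for illustration, not the LIFT-DPO recipe.

def satisfies_length(response: str, max_words: int) -> bool:
    """Check an instruction like 'Answer in at most N words.'"""
    return len(response.split()) <= max_words

def build_length_pair(prompt: str, responses: list[str], max_words: int):
    """Prefer a length-compliant response over a non-compliant one, if both exist."""
    ok = [r for r in responses if satisfies_length(r, max_words)]
    bad = [r for r in responses if not satisfies_length(r, max_words)]
    if ok and bad:
        return (f"{prompt} Answer in at most {max_words} words.", ok[0], bad[0])
    return None  # no contrast available for this prompt

pair = build_length_pair(
    "Summarize DPO.",
    ["DPO optimizes a policy directly from preference pairs.", " ".join(["word"] * 60)],
    max_words=30,
)
print(pair)
```
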
Jason Weston (@jaseweston):

🚨New paper!🚨
Meta-Rewarding LMs
- LM is actor, judge & meta-judge
- Learns to reward actions better by judging its own judgments (assigning *meta-rewards*)
- Improves acting & judging over time without human labels
... beats Self-Rewarding LMs
arxiv.org/abs/2407.19594
🧵(1/6)
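
A minimal sketch of the actor / judge / meta-judge roles described above, with hypothetical placeholder functions; the point is only that a single model produces responses, judgments of those responses, and meta-judgments that rank the judgments.

```python
# Sketch of the actor / judge / meta-judge roles (hypothetical helpers, not the paper's code).
import random

def act(model, prompt: str) -> str:
    """Actor role: placeholder response generation."""
    return f"answer-{random.randint(0, 99)}"

def judge(model, prompt: str, response: str) -> str:
    """Judge role: placeholder judgment text (the same model scores its own response)."""
    return f"judgment(score={random.uniform(0, 5):.1f})"

def meta_judge(model, prompt: str, response: str, judgment_a: str, judgment_b: str) -> str:
    """Meta-judge role: the same model picks the better of two judgments."""
    return judgment_a if random.random() < 0.5 else judgment_b

model = object()  # stand-in for one LLM playing all three roles
prompt = "Explain preference optimization in one sentence."
response = act(model, prompt)
j1, j2 = judge(model, prompt, response), judge(model, prompt, response)
better = meta_judge(model, prompt, response, j1, j2)
worse = j2 if better is j1 else j1
# (better, worse) is a *meta-reward* preference pair over judgments, used to train
# judging ability alongside ordinary actor preference pairs over responses.
print(better, worse)
```
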
Jason Weston (@jaseweston):

🚨New paper!🚨
Self-Taught Evaluators
- Llama 3-70B trained w/ synthetic data *only*
- Iteratively finds better judgments in training
- Best LLM-as-a-Judge model on RewardBench (88.3, 88.7 w/ maj vote)
- Outperforms bigger models or human labels
arxiv.org/abs/2408.02666
🧵(1/4)
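
A simplified sketch of one self-taught evaluator round, under the assumption that synthetic preference pairs are built by contrasting a good answer with a deliberately degraded one and keeping only judgments that pick the intended winner; all helpers are placeholders, not the paper's pipeline.

```python
# Simplified sketch of one self-taught evaluator round (placeholder helpers).
import random

def degrade(good_answer: str) -> str:
    """Create a deliberately worse 'rejected' answer for a synthetic pair (crude stand-in)."""
    return good_answer[: max(1, len(good_answer) // 2)]

def sample_judgment(model, instruction: str, answer_a: str, answer_b: str) -> str:
    """Placeholder LLM-as-a-Judge verdict: 'A' or 'B' (would include reasoning in practice)."""
    return random.choice(["A", "B"])

def finetune(model, judgment_traces):
    """Placeholder supervised fine-tuning on the kept judgment traces."""
    return model

def self_taught_round(model, seed_pairs, n_tries: int = 4):
    kept = []
    for instruction, good in seed_pairs:
        bad = degrade(good)
        # Keep only judgments that pick the known-better answer.
        for _ in range(n_tries):
            if sample_judgment(model, instruction, good, bad) == "A":
                kept.append((instruction, good, bad, "A"))
                break
    return finetune(model, kept)

model = object()
model = self_taught_round(model, [("Define entropy.", "Entropy measures uncertainty in a distribution.")])
```
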
Pengfei Liu (@stefan_fee):

The first in-depth technical report on Replicating OpenAI's o1 !!! Uncover a Treasure Trove of Trial-and-Error Insights and Hard-Won Lessons. Some highlights:

 (1) We introduce a new training paradigm called ‘journey learning’ and propose the first model that successfully
Jason Weston (@jaseweston):

🚨New work: Thinking LLMs!🚨
- Introduces Thought Preference Optimization (TPO)
- Trains LLMs to think & respond for *all* instruction following tasks, not just math
- Gives gains on AlpacaEval (beats GPT-4 & Llama3-70b) & ArenaHard with an 8B model
arxiv.org/abs/2410.10630
🧵1/4
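
A minimal sketch of the Thought Preference Optimization idea as summarized above: the model emits a thought plus a response, a judge scores only the response, and preference pairs are formed over the full outputs. Helper functions are hypothetical stand-ins.

```python
# Minimal sketch of TPO-style preference pairs (placeholder helpers).
import random

def think_and_respond(model, prompt: str) -> tuple[str, str]:
    """Placeholder: the model emits an internal thought and a final response."""
    return (f"thought-{random.randint(0, 99)}", f"response-{random.randint(0, 99)}")

def judge_response(judge, prompt: str, response: str) -> float:
    """Placeholder judge that scores ONLY the final response, never the thought."""
    return random.uniform(0, 10)

def tpo_pairs(model, judge, prompts, n_samples: int = 4):
    pairs = []
    for prompt in prompts:
        outputs = [think_and_respond(model, prompt) for _ in range(n_samples)]
        ranked = sorted(outputs, key=lambda out: judge_response(judge, prompt, out[1]))
        # Preference is over the full (thought, response) outputs,
        # even though the score depends only on the visible response.
        pairs.append((prompt, ranked[-1], ranked[0]))
    return pairs

print(tpo_pairs(object(), object(), ["Write a haiku about optimization."]))
```
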
elvis (@omarsar0):

o1 Replication Journey

These researchers report replicating the capabilities of OpenAI's o1 model.

Apparently, their journey learning technique encourages learning not just shortcuts, but the complete exploration process, including trial and error, reflection, and
Jason Weston (@jaseweston):

🚨 Self-Consistency Preference Optimization (ScPO)🚨
- New self-training method without human labels - learn to make the model more consistent!
- Works well for reasoning tasks where RMs fail to evaluate correctness.
- Close to performance of supervised methods *without* labels,
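
A small sketch of how self-consistency can produce preference pairs without labels, assuming the most frequent sampled answer is treated as chosen and a less frequent one as rejected; `solve` is a placeholder, not the paper's implementation.

```python
# Sketch of building a self-consistency preference pair (placeholder generation).
import random
from collections import Counter

def solve(model, question: str) -> str:
    """Placeholder: sample one final answer (in practice, extracted from a reasoning trace)."""
    return random.choice(["42", "42", "42", "41", "43"])

def scpo_pair(model, question: str, n_samples: int = 16):
    answers = [solve(model, question) for _ in range(n_samples)]
    common = Counter(answers).most_common(2)
    if len(common) < 2:
        return None  # all samples agree: no contrastive pair for this question
    (chosen, c_hi), (rejected, c_lo) = common
    # The most self-consistent answer is preferred; the count gap can weight the pair.
    return question, chosen, rejected, c_hi - c_lo

print(scpo_pair(object(), "What is 6 * 7?"))
```
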
elvis (@omarsar0):

o1 Replication Journey - Part 2

Shows that combining simple distillation from O1's API with supervised fine-tuning significantly boosts performance on complex math reasoning tasks.

"A base model fine-tuned on simply tens of thousands of samples O1-distilled long-thought chains
Jason Weston (@jaseweston):

💀 Introducing RIP: Rejecting Instruction Preferences💀

A method to *curate* high quality data, or *create* high quality synthetic data.

Large performance gains across benchmarks (AlpacaEval2, Arena-Hard, WildBench).

Paper 📄: arxiv.org/abs/2501.18578
Pengfei Liu (@stefan_fee):

#LIMR Less is More for RL Scaling! Less is More for RL Scaling! Less is More for RL Scaling!
- What makes a good example for RL scaling? We demonstrate that a strategically selected subset of just 1,389 samples can outperform the full 8,523-sample dataset.
- How to make a

Jason Weston (@jaseweston):

🚨 New paper & dataset! 🚨
NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions
- Synthesizes 2.8M challenging and diverse questions which require multi-step reasoning, along with reference answers
- Shows steeper data scaling curve for knowledge distillation
Jason Weston (@jaseweston):

🚨 New Paper 🚨
An Overview of Large Language Models for Statisticians
📝: arxiv.org/abs/2502.17814

- Dual perspectives on Statistics ➕ LLMs: Stat for LLM & LLM for Stat
- Stat for LLM: How statistical methods can improve LLM uncertainty quantification, interpretability,
Jason Weston (@jaseweston):

Google friends & ex-colleagues -- Google scholar seems pretty broken😔. Our most cited paper from last year "Self-Rewarding LLMs" has disappeared! Scholar has clustered it with another paper (SPIN) and it isn't in the search results. This is bad for PhD student & first author
Jason Weston (@jaseweston):

🚨Announcing RAM 2 workshop @ COLM25 - call for papers🚨 
- 10 years on, we present the sequel to the classic RAM🐏 (Reasoning, Attention, Memory) workshop that took place in 2015 at the cusp of major change in the area. Now in 2025 we reflect on what's happened and discuss the
Jason Weston (@jaseweston):

🌉 Bridging Offline & Online RL for LLMs 🌉
📝: arxiv.org/abs/2506.21495
New paper shows on verifiable & non-verifiable tasks:
- Online DPO & GRPO give similar performance.
- Semi-online (iterative) DPO with sync every s steps (more efficient!) also works very well.
- Offline DPO
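
A minimal sketch of the offline / semi-online / online spectrum mentioned above, where the model used for generation is re-synced with the trained policy every `sync_every` (= s) steps; all training functions are placeholders.

```python
# Sketch of the offline / semi-online / online spectrum for DPO-style training.
# Placeholder helpers; `sync_every` plays the role of s in the tweet.
import copy

def collect_pairs(generator, prompts):
    """Placeholder: sample responses from `generator` and build preference pairs."""
    return [(p, "chosen", "rejected") for p in prompts]

def dpo_step(policy, pairs):
    """Placeholder single DPO optimization step on `policy`."""
    return policy

def train(policy, prompt_batches, sync_every: int):
    generator = copy.deepcopy(policy)        # model that generates the training data
    for step, batch in enumerate(prompt_batches):
        policy = dpo_step(policy, collect_pairs(generator, batch))
        if (step + 1) % sync_every == 0:     # semi-online: refresh the generator every s steps
            generator = copy.deepcopy(policy)
    return policy

# sync_every=1 behaves like fully online training; a very large sync_every behaves like offline DPO.
train(policy={"params": 0}, prompt_batches=[["q1", "q2"]] * 4, sync_every=2)
```
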
Jason Weston (@jaseweston):

🌿Introducing NaturalThoughts 🌿
arxiv.org/abs/2507.01921

🎯 Data curation for general reasoning capabilities is still relatively underexplored. 
- We systematically compare different metrics for selecting high-quality and diverse reasoning traces in terms of data efficiency in
Jason Weston (@jaseweston):

🤖Introducing: CoT-Self-Instruct 🤖
📝: arxiv.org/abs/2507.23751
- Builds high-quality synthetic data via reasoning CoT + quality filtering
- Gains on reasoning tasks: MATH500, AMC23, AIME24 & GPQA-💎
- Outperforms existing train data s1k & OpenMathReasoning
- Gains on
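
A very rough sketch of a CoT-plus-filtering synthesis loop in the spirit of the summary above; the helpers and threshold are hypothetical, a simplified reading rather than the paper's pipeline.

```python
# Very rough sketch of CoT-based prompt synthesis plus quality filtering.
# Hypothetical helpers and threshold; not the paper's pipeline.
import random

def propose_with_cot(model, seed_prompt: str) -> str:
    """Placeholder: reason step by step about the seed, then emit a new synthetic prompt."""
    return f"Harder variant of: {seed_prompt}"

def quality_score(model, prompt: str) -> float:
    """Placeholder quality filter (e.g. answerability or difficulty checks)."""
    return random.random()

def synthesize(model, seed_prompts, threshold: float = 0.5):
    candidates = [propose_with_cot(model, s) for s in seed_prompts]
    return [c for c in candidates if quality_score(model, c) >= threshold]

print(synthesize(object(), ["Prove that sqrt(2) is irrational."]))
```
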