Weizhe Yuan (@weizhey)'s Twitter Profile
Weizhe Yuan

@weizhey

Ph.D. student at @nyuniversity. Visiting researcher at @AIatMeta. Previously: intern @cohere, MCDS @LTIatCMU. Working on ML/NLP. Painting lover 🎨.

ID: 1172234535116988416

Link: https://yyy-apple.github.io/
Joined: 12-09-2019 19:44:34

23 Tweets

301 Followers

250 Following

Jason Weston (@jaseweston):

🚨New paper!🚨
Self-Rewarding LMs
- The LM provides rewards on its own generations via LLM-as-a-Judge during Iterative DPO
- Reward modeling ability improves during training rather than staying fixed
...opens the door to superhuman feedback?
arxiv.org/abs/2401.10020
🧵(1/5)
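
As a rough illustration of the loop summarized in the tweet above, here is a minimal Python sketch of one self-rewarding iteration: the same model generates candidates, scores them as an LLM-as-a-Judge, and the best/worst pair feeds a DPO update. All helpers (`generate`, `judge_score`, `dpo_update`) are hypothetical placeholders, not the paper's code.

```python
# Minimal sketch of one Self-Rewarding iteration (hypothetical helpers, not the paper's code).
import random

def generate(model, prompt: str) -> str:
    """Placeholder: sample one response from the current model."""
    return f"response to '{prompt}' #{random.randint(0, 999)}"

def judge_score(model, prompt: str, response: str) -> float:
    """Placeholder LLM-as-a-Judge call: the *same* model scores its own response (0-5)."""
    return random.uniform(0, 5)

def dpo_update(model, preference_pairs):
    """Placeholder for one round of DPO training on (prompt, chosen, rejected) triples."""
    return model

def self_rewarding_iteration(model, prompts, n_samples: int = 4):
    pairs = []
    for prompt in prompts:
        candidates = [generate(model, prompt) for _ in range(n_samples)]
        ranked = sorted(candidates, key=lambda r: judge_score(model, prompt, r))
        pairs.append((prompt, ranked[-1], ranked[0]))  # best = chosen, worst = rejected
    return dpo_update(model, pairs)

model = object()  # stand-in for an actual LLM
for _ in range(3):  # iterative DPO: generate -> self-judge -> train, then repeat
    model = self_rewarding_iteration(model, ["Explain DPO in one sentence."])
```

The property the tweet highlights is that acting and judging live in the same model, so the reward-modeling ability is trained along with generation rather than staying fixed.
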
Matthias Gallé (@mgalle):

In this work, led by Weizhe Yuan and coming to #ACL2024, we use this capacity to leverage *criteria* for specific writing tasks (🧑‍🎓,🧑‍💻). The feedback was almost always *valid* and *contextual*, and most often *constructive* and *helpful*. arxiv.org/abs/2403.01069

Jason Weston (@jaseweston):

🚨 New paper! 🚨
Following Length Constraints in Instructions
- Shows SOTA LLMs can't follow length instructions
- Introduces LIFT-DPO that fixes the problem
- Helps solve length bias evaluation & training issues
arxiv.org/abs/2406.17744
🧵(1/7)
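
A loose sketch of the underlying idea (an assumption on my part, not the LIFT-DPO construction from the paper): length-instruction compliance is programmatically checkable, so a non-compliant generation can serve as the rejected example in a preference pair.

```python
# Rough sketch: turning a length constraint into a programmatic preference signal.
# This is an assumed simplification for illustration, not the LIFT-DPO recipe.

def satisfies_length(response: str, max_words: int) -> bool:
    """Check an instruction like 'Answer in at most N words.'"""
    return len(response.split()) <= max_words

def build_length_pair(prompt: str, responses: list[str], max_words: int):
    """Prefer a length-compliant response over a non-compliant one, if both exist."""
    ok = [r for r in responses if satisfies_length(r, max_words)]
    bad = [r for r in responses if not satisfies_length(r, max_words)]
    if ok and bad:
        return (f"{prompt} Answer in at most {max_words} words.", ok[0], bad[0])
    return None  # no contrast available for this prompt

pair = build_length_pair(
    "Summarize DPO.",
    ["DPO optimizes a policy directly from preference pairs.", " ".join(["word"] * 60)],
    max_words=30,
)
print(pair)
```
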
Jason Weston (@jaseweston):

🚨New paper!🚨
Meta-Rewarding LMs
- LM is actor, judge & meta-judge
- Learns to reward actions better by judging its own judgments (assigning *meta-rewards*)
- Improves acting & judging over time without human labels
... beats Self-Rewarding LMs
arxiv.org/abs/2407.19594
🧵(1/6)
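
A minimal sketch of the actor / judge / meta-judge roles described above, with hypothetical placeholder functions; the point is only that a single model produces responses, judgments of those responses, and meta-judgments that rank the judgments.

```python
# Sketch of the actor / judge / meta-judge roles (hypothetical helpers, not the paper's code).
import random

def act(model, prompt: str) -> str:
    """Actor role: placeholder response generation."""
    return f"answer-{random.randint(0, 99)}"

def judge(model, prompt: str, response: str) -> str:
    """Judge role: placeholder judgment text (the same model scores its own response)."""
    return f"judgment(score={random.uniform(0, 5):.1f})"

def meta_judge(model, prompt: str, response: str, judgment_a: str, judgment_b: str) -> str:
    """Meta-judge role: the same model picks the better of two judgments."""
    return judgment_a if random.random() < 0.5 else judgment_b

model = object()  # stand-in for one LLM playing all three roles
prompt = "Explain preference optimization in one sentence."
response = act(model, prompt)
j1, j2 = judge(model, prompt, response), judge(model, prompt, response)
better = meta_judge(model, prompt, response, j1, j2)
worse = j2 if better is j1 else j1
# (better, worse) is a *meta-reward* preference pair over judgments, used to train
# judging ability alongside ordinary actor preference pairs over responses.
print(better, worse)
```
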
Jason Weston (@jaseweston):

🚨New paper!🚨
Self-Taught Evaluators
- Llama 3-70B trained w/ synthetic data *only*
- Iteratively finds better judgments in training
- Best LLM-as-a-Judge model on RewardBench (88.3, 88.7 w/ maj vote)
- Outperforms bigger models or human labels
arxiv.org/abs/2408.02666
🧵(1/4)
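
A simplified sketch of one self-taught evaluator round, under the assumption that synthetic preference pairs are built by contrasting a good answer with a deliberately degraded one and keeping only judgments that pick the intended winner; all helpers are placeholders, not the paper's pipeline.

```python
# Simplified sketch of one self-taught evaluator round (placeholder helpers).
import random

def degrade(good_answer: str) -> str:
    """Create a deliberately worse 'rejected' answer for a synthetic pair (crude stand-in)."""
    return good_answer[: max(1, len(good_answer) // 2)]

def sample_judgment(model, instruction: str, answer_a: str, answer_b: str) -> str:
    """Placeholder LLM-as-a-Judge verdict: 'A' or 'B' (would include reasoning in practice)."""
    return random.choice(["A", "B"])

def finetune(model, judgment_traces):
    """Placeholder supervised fine-tuning on the kept judgment traces."""
    return model

def self_taught_round(model, seed_pairs, n_tries: int = 4):
    kept = []
    for instruction, good in seed_pairs:
        bad = degrade(good)
        # Keep only judgments that pick the known-better answer.
        for _ in range(n_tries):
            if sample_judgment(model, instruction, good, bad) == "A":
                kept.append((instruction, good, bad, "A"))
                break
    return finetune(model, kept)

model = object()
model = self_taught_round(model, [("Define entropy.", "Entropy measures uncertainty in a distribution.")])
```
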
Pengfei Liu (@stefan_fee):

The first in-depth technical report on Replicating OpenAI's o1 !!! Uncover a Treasure Trove of Trial-and-Error Insights and Hard-Won Lessons. Some highlights:

 (1) We introduce a new training paradigm called ‘journey learning’ and propose the first model that successfully
Jason Weston (@jaseweston):

🚨New work: Thinking LLMs!🚨
- Introduces Thought Preference Optimization (TPO)
- Trains LLMs to think & respond for *all* instruction following tasks, not just math
- Gives gains on AlpacaEval (beats GPT-4 & Llama3-70b) & ArenaHard with an 8B model
arxiv.org/abs/2410.10630
🧵1/4
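
A minimal sketch of the Thought Preference Optimization idea as summarized above: the model emits a thought plus a response, a judge scores only the response, and preference pairs are formed over the full outputs. Helper functions are hypothetical stand-ins.

```python
# Minimal sketch of TPO-style preference pairs (placeholder helpers).
import random

def think_and_respond(model, prompt: str) -> tuple[str, str]:
    """Placeholder: the model emits an internal thought and a final response."""
    return (f"thought-{random.randint(0, 99)}", f"response-{random.randint(0, 99)}")

def judge_response(judge, prompt: str, response: str) -> float:
    """Placeholder judge that scores ONLY the final response, never the thought."""
    return random.uniform(0, 10)

def tpo_pairs(model, judge, prompts, n_samples: int = 4):
    pairs = []
    for prompt in prompts:
        outputs = [think_and_respond(model, prompt) for _ in range(n_samples)]
        ranked = sorted(outputs, key=lambda out: judge_response(judge, prompt, out[1]))
        # Preference is over the full (thought, response) outputs,
        # even though the score depends only on the visible response.
        pairs.append((prompt, ranked[-1], ranked[0]))
    return pairs

print(tpo_pairs(object(), object(), ["Write a haiku about optimization."]))
```
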
elvis (@omarsar0):

o1 Replication Journey

These researchers report replicating the capabilities of OpenAI's o1 model.

Apparently, their journey learning technique encourages learning not just shortcuts, but the complete exploration process, including trial and error, reflection, and
Jason Weston (@jaseweston):

🚨 Self-Consistency Preference Optimization (ScPO)🚨
- New self-training method without human labels - learn to make the model more consistent!
- Works well for reasoning tasks where RMs fail to evaluate correctness.
- Close to performance of supervised methods *without* labels,
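
A small sketch of how self-consistency can produce preference pairs without labels, assuming the most frequent sampled answer is treated as chosen and a less frequent one as rejected; `solve` is a placeholder, not the paper's implementation.

```python
# Sketch of building a self-consistency preference pair (placeholder generation).
import random
from collections import Counter

def solve(model, question: str) -> str:
    """Placeholder: sample one final answer (in practice, extracted from a reasoning trace)."""
    return random.choice(["42", "42", "42", "41", "43"])

def scpo_pair(model, question: str, n_samples: int = 16):
    answers = [solve(model, question) for _ in range(n_samples)]
    common = Counter(answers).most_common(2)
    if len(common) < 2:
        return None  # all samples agree: no contrastive pair for this question
    (chosen, c_hi), (rejected, c_lo) = common
    # The most self-consistent answer is preferred; the count gap can weight the pair.
    return question, chosen, rejected, c_hi - c_lo

print(scpo_pair(object(), "What is 6 * 7?"))
```
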
elvis (@omarsar0):

o1 Replication Journey - Part 2

Shows that combining simple distillation from O1's API with supervised fine-tuning significantly boosts performance on complex math reasoning tasks.

"A base model fine-tuned on simply tens of thousands of samples O1-distilled long-thought chains
Jason Weston (@jaseweston):

💀 Introducing RIP: Rejecting Instruction Preferences💀

A method to *curate* high quality data, or *create* high quality synthetic data.

Large performance gains across benchmarks (AlpacaEval2, Arena-Hard, WildBench).

Paper 📄: arxiv.org/abs/2501.18578
Pengfei Liu (@stefan_fee):

#LIMR Less is More for RL Scaling! Less is More for RL Scaling! Less is More for RL Scaling!
- What makes a good example for RL scaling? We demonstrate that a strategically selected subset of just 1,389 samples can outperform the full 8,523-sample dataset.
- How to make a

Jason Weston (@jaseweston):

🚨 New paper & dataset! 🚨
NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions
- Synthesizes 2.8M challenging and diverse questions which require multi-step reasoning, along with reference answers
- Shows steeper data scaling curve for knowledge distillation
Jason Weston (@jaseweston):

🚨 New Paper 🚨
An Overview of Large Language Models for Statisticians
📝: arxiv.org/abs/2502.17814

- Dual perspectives on Statistics ➕ LLMs: Stat for LLM & LLM for Stat
- Stat for LLM: How statistical methods can improve LLM uncertainty quantification, interpretability,
Jason Weston (@jaseweston):

Google friends & ex-colleagues -- Google scholar seems pretty broken😔. Our most cited paper from last year "Self-Rewarding LLMs" has disappeared! Scholar has clustered it with another paper (SPIN) and it isn't in the search results. This is bad for PhD student & first author
Jason Weston (@jaseweston):

🚨Announcing RAM 2 workshop @ COLM25 - call for papers🚨 
- 10 years on, we present the sequel to the classic RAM🐏 (Reasoning, Attention, Memory) workshop that took place in 2015 at the cusp of major change in the area. Now in 2025 we reflect on what's happened and discuss the
Jason Weston (@jaseweston):

🌉 Bridging Offline & Online RL for LLMs 🌉
📝: arxiv.org/abs/2506.21495
New paper shows on verifiable & non-verifiable tasks:
- Online DPO & GRPO give similar performance.
- Semi-online (iterative) DPO with sync every s steps (more efficient!) also works very well.
- Offline DPO
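
A minimal sketch of the offline / semi-online / online spectrum mentioned above, where the model used for generation is re-synced with the trained policy every `sync_every` (= s) steps; all training functions are placeholders.

```python
# Sketch of the offline / semi-online / online spectrum for DPO-style training.
# Placeholder helpers; `sync_every` plays the role of s in the tweet.
import copy

def collect_pairs(generator, prompts):
    """Placeholder: sample responses from `generator` and build preference pairs."""
    return [(p, "chosen", "rejected") for p in prompts]

def dpo_step(policy, pairs):
    """Placeholder single DPO optimization step on `policy`."""
    return policy

def train(policy, prompt_batches, sync_every: int):
    generator = copy.deepcopy(policy)        # model that generates the training data
    for step, batch in enumerate(prompt_batches):
        policy = dpo_step(policy, collect_pairs(generator, batch))
        if (step + 1) % sync_every == 0:     # semi-online: refresh the generator every s steps
            generator = copy.deepcopy(policy)
    return policy

# sync_every=1 behaves like fully online training; a very large sync_every behaves like offline DPO.
train(policy={"params": 0}, prompt_batches=[["q1", "q2"]] * 4, sync_every=2)
```
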
Jason Weston (@jaseweston):

🌿Introducing NaturalThoughts 🌿
arxiv.org/abs/2507.01921

🎯 Data curation for general reasoning capabilities is still relatively underexplored. 
- We systematically compare different metrics for selecting high-quality and diverse reasoning traces in terms of data efficiency in
Jason Weston (@jaseweston):

🤖Introducing: CoT-Self-Instruct 🤖
📝: arxiv.org/abs/2507.23751
- Builds high-quality synthetic data via reasoning CoT + quality filtering
- Gains on reasoning tasks: MATH500, AMC23, AIME24 & GPQA-💎
- Outperforms existing train data s1k & OpenMathReasoning
- Gains on
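
A very rough sketch of a CoT-plus-filtering synthesis loop in the spirit of the summary above; the helpers and threshold are hypothetical, a simplified reading rather than the paper's pipeline.

```python
# Very rough sketch of CoT-based prompt synthesis plus quality filtering.
# Hypothetical helpers and threshold; not the paper's pipeline.
import random

def propose_with_cot(model, seed_prompt: str) -> str:
    """Placeholder: reason step by step about the seed, then emit a new synthetic prompt."""
    return f"Harder variant of: {seed_prompt}"

def quality_score(model, prompt: str) -> float:
    """Placeholder quality filter (e.g. answerability or difficulty checks)."""
    return random.random()

def synthesize(model, seed_prompts, threshold: float = 0.5):
    candidates = [propose_with_cot(model, s) for s in seed_prompts]
    return [c for c in candidates if quality_score(model, c) >= threshold]

print(synthesize(object(), ["Prove that sqrt(2) is irrational."]))
```
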