Wei Xiong (@weixiong_1)'s Twitter Profile
Wei Xiong

@weixiong_1

Statistical learning theory, Post-training of LLMs, RAFT, LMFlow, GSHF, and RLHFlow.

PhD Student @IllinoisCS, current @GoogleDeepMind, prev @MSFTResearch @USTC

ID: 1757977327282081792

Link: https://weixiongust.github.io/WeiXiongUST/index.html · Joined: 15-02-2024 03:57:17

165 Tweets

954 Followers

521 Following

Bowen Jin (@bowenjin13)'s Twitter Profile Photo

🚀 Introducing Search-R1 – the first reproduction of Deepseek-R1 (zero) for training reasoning and search-augmented LLM agents with reinforcement learning! This is a step towards training an open-source OpenAI “Deep

Qiusi Zhan (@zhanqiusi1)'s Twitter Profile Photo

Our NAACL 2025 findings paper demonstrates that securing AI agents requires more than off-the-shelf defenses—adaptive attacks continuously evolve to bypass them. If you have any questions or want to discuss more, feel free to reach out!

Hanze Dong @ ICLR 2025 (@hendrydong)'s Twitter Profile Photo

🤖What makes GRPO work? Rejection Sampling → Reinforce → GRPO
- RS is underrated
- Key of GRPO: implicitly remove prompts without correct answer
- Reinforce + Filtering > GRPO (better KL)
💻github.com/RLHFlow/Minima…
📄arxiv.org/abs/2504.11343
👀RAFT was invited to ICLR25! Come & Chat☕️
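A minimal sketch of the filtering idea referenced above, under the assumption that each sampled response gets a binary reward: a prompt whose group of samples is all wrong (or all right) yields zero advantage under GRPO-style group normalization, so it can be dropped before the policy update. The function and variable names below are hypothetical, not taken from the RLHFlow code.

```python
# Minimal sketch of reward-based prompt filtering (hypothetical names): keep only
# prompts whose sampled group contains both correct and incorrect answers, since
# an all-0 or all-1 group gives zero advantage under group normalization.

def filter_prompts(groups: dict[str, list[float]]) -> dict[str, list[float]]:
    """groups maps a prompt to the binary rewards of its sampled responses."""
    kept = {}
    for prompt, rewards in groups.items():
        if 0.0 < sum(rewards) < len(rewards):  # mixed correct/incorrect
            kept[prompt] = rewards
    return kept

batch = {
    "p1": [1.0, 0.0, 1.0, 0.0],  # informative group: kept
    "p2": [0.0, 0.0, 0.0, 0.0],  # all wrong: no signal, dropped
    "p3": [1.0, 1.0, 1.0, 1.0],  # all right: no signal, dropped
}
print(filter_prompts(batch))  # {'p1': [1.0, 0.0, 1.0, 0.0]}
```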

Wei Xiong (@weixiong_1)'s Twitter Profile Photo

Surprised by the small performance gap between RAFT and Reinforce/GRPO. We may need more fine-grained negative signals to better guide learning.🧐

Jiarui Yao (@explainmiracles)'s Twitter Profile Photo

We introduce Gradient Variance Minimization (GVM)-RAFT, a principled dynamic sampling strategy that minimizes gradient variance to improve the efficiency of chain-of-thought (CoT) training in LLMs.

– Achieves 2–4× faster convergence than RAFT
– Improves accuracy on math
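The exact GVM-RAFT allocation rule is in the paper; as a hedged illustration of variance-aware dynamic sampling, the sketch below assumes a simple proportional rule in which prompts with higher estimated reward variance receive a larger share of the rollout budget. The names and the Bernoulli-variance heuristic are assumptions for illustration only.

```python
import numpy as np

# Hedged illustration of dynamic sample allocation (not the exact GVM-RAFT rule):
# give each prompt a rollout share proportional to the estimated standard
# deviation of its binary reward, so high-variance prompts are sampled more.

def allocate_rollouts(success_rates: np.ndarray, total_budget: int,
                      min_per_prompt: int = 1) -> np.ndarray:
    """success_rates: per-prompt estimated probability of a correct answer."""
    std = np.sqrt(success_rates * (1.0 - success_rates))  # Bernoulli std
    weights = std + 1e-6                                   # avoid all-zero weights
    alloc = np.floor(total_budget * weights / weights.sum()).astype(int)
    return np.maximum(alloc, min_per_prompt)

p_hat = np.array([0.05, 0.5, 0.95, 0.3])
print(allocate_rollouts(p_hat, total_budget=64))
```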
Daniel Kang (@daniel_d_kang)'s Twitter Profile Photo

Reinforcement learning enables LLMs to beat humans on programming/math competitions and has driven recent advances (OpenAI's o-series, Anthropic's Claude 4). Will RL enable broad generalization in the same way that pretraining does? Not with current techniques. 🧵 1/7

Kaiyu Yang (@kaiyuyang4)'s Twitter Profile Photo

🚀 Excited to share that the Workshop on Mathematical Reasoning and AI (MATH‑AI) will be at NeurIPS 2025!
📅 Dec 6 or 7 (TBD), 2025
🌴 San Diego, California
Jason Weston (@jaseweston)'s Twitter Profile Photo

🪜Introducing: StepWiser🦉
📝: arxiv.org/abs/2508.19229
- Reframes stepwise reward modeling as a reasoning task: outputs CoT + judgment.
- Trained by RL using relative outcomes of rollouts.
Results:
(1) SOTA performance on ProcessBench!
(2) Improves policy at train time.
(3)
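As a hedged illustration of "relative outcomes of rollouts" as step-level supervision, the sketch below uses the generic Monte-Carlo recipe: estimate the success rate of completing the solution from each step prefix and label a step by the change it induces. This is not the StepWiser training procedure, and rollout_success_rate is a hypothetical stand-in for sampling completions from the policy.

```python
import random

# Generic sketch (not the StepWiser implementation) of scoring reasoning steps by
# relative rollout outcomes: estimate the success rate of finishing correctly from
# each step prefix, then label each step by the change in that success rate.

def rollout_success_rate(prefix_steps: list[str], n_rollouts: int = 8) -> float:
    """Hypothetical stand-in: sample n completions conditioned on the prefix and
    return the fraction that reach a correct final answer."""
    return random.random()  # placeholder so the sketch runs end to end

def step_labels(steps: list[str]) -> list[float]:
    values = [rollout_success_rate(steps[: i + 1]) for i in range(len(steps))]
    prev = rollout_success_rate([])  # baseline: no steps taken yet
    deltas = []
    for v in values:
        deltas.append(v - prev)  # positive: the step helped; negative: it hurt
        prev = v
    return deltas

print(step_labels(["define variables", "set up equation", "solve for x"]))
```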
Chenlu Ye (@ye_chenlu)'s Twitter Profile Photo

PROF🌀Right answer, flawed reason?🤔🌀
📄arxiv.org/pdf/2509.03403
Excited to share our work: PROF-PRocess cOnsistency Filter! 🚀
Challenge: ORM is blind to flawed logic, and PRM suffers from reward hacking. Our method harmonizes strengths of PRM & ORM. #LLM #ReinforcementLearning
Hanze Dong @ ICLR 2025 (@hendrydong)'s Twitter Profile Photo

💥Thrilled to share our new work Reinforce-Ada, which fixes signal collapse in GRPO

🥳No more blind oversampling or dead updates. Just sharper gradients, faster convergence, and stronger models.

⚙️ One-line drop-in. Real gains.
arxiv.org/html/2510.0499…

github.com/RLHFlow/Reinfo…
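As an illustration of avoiding "blind oversampling" and "dead updates", the sketch below runs a generic adaptive-sampling loop: keep drawing responses for a prompt until its group contains both a correct and an incorrect answer, or a budget is exhausted. This illustrates the signal-collapse problem rather than the Reinforce-Ada algorithm itself; sample_response is a hypothetical policy call.

```python
import random

# Generic adaptive-sampling loop (an illustration, not the Reinforce-Ada algorithm):
# keep sampling until the group mixes correct and incorrect responses, so the
# group-normalized advantage does not collapse to zero, or until max_n is reached.

def sample_response(prompt: str) -> float:
    """Hypothetical policy call returning a binary correctness reward."""
    return float(random.random() < 0.2)

def adaptive_group(prompt: str, min_n: int = 4, max_n: int = 32) -> list[float]:
    rewards = [sample_response(prompt) for _ in range(min_n)]
    while len(set(rewards)) < 2 and len(rewards) < max_n:
        rewards.append(sample_response(prompt))
    return rewards  # mixed group -> nonzero advantages (when the budget allows)

print(adaptive_group("prove that ..."))
```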
Fengzhuo Zhang (@fengzhuozhang)'s Twitter Profile Photo

Why does Muon outperform Adam—and how?

🚀Answer: Muon Outperforms Adam in Tail-End Associative Memory Learning

Three Key Findings:

> Associative memory parameters are the main beneficiaries of Muon, compared to Adam.

> Muon yields more isotropic weights than Adam.

> In
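The "more isotropic weights" observation can be probed with a standard singular-value diagnostic such as stable rank; the sketch below computes it for a roughly isotropic matrix and a rank-1 one. This is a generic measure, not necessarily the metric used in the paper.

```python
import numpy as np

# Stable rank ||W||_F^2 / ||W||_2^2 as a simple isotropy diagnostic: it equals 1
# for a rank-1 (highly anisotropic) matrix and approaches min(m, n) when all
# singular values are equal. A standard measure, not necessarily the paper's.

def stable_rank(W: np.ndarray) -> float:
    svals = np.linalg.svd(W, compute_uv=False)  # sorted in descending order
    return float((svals ** 2).sum() / (svals[0] ** 2))

rng = np.random.default_rng(0)
isotropic = rng.standard_normal((256, 256))              # roughly isotropic
anisotropic = np.outer(rng.standard_normal(256),
                       rng.standard_normal(256))          # rank-1
print(stable_rank(isotropic), stable_rank(anisotropic))   # large vs. 1.0
```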
Chenghao Yang (@chrome1996)'s Twitter Profile Photo

Where is exploration most impactful in LLM reasoning? The initial tokens! They shape a sequence's entire semantic direction, making early exploration crucial.

Our new work, Exploratory Annealed Decoding (EAD), is built on this insight. By starting with high temperature and
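A minimal sketch of per-position temperature annealing at sampling time, in the spirit of the description above: early tokens are sampled at high temperature to explore, later tokens at low temperature to stay coherent. The linear schedule and its parameters are assumptions for illustration, not the EAD settings.

```python
import numpy as np

# Generic per-position temperature annealing for sampling (illustrative schedule,
# not the EAD hyperparameters): the temperature decays linearly from t_start to
# t_end over the first anneal_steps token positions.

def annealed_temperature(step: int, t_start: float = 1.5, t_end: float = 0.7,
                         anneal_steps: int = 64) -> float:
    frac = min(step / anneal_steps, 1.0)
    return t_start + frac * (t_end - t_start)

def sample_token(logits: np.ndarray, step: int, rng: np.random.Generator) -> int:
    t = annealed_temperature(step)
    scaled = logits / t
    probs = np.exp(scaled - np.max(scaled))  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = rng.standard_normal(50)
print([sample_token(logits, s, rng) for s in (0, 16, 64)])
```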
Jason Weston (@jaseweston)'s Twitter Profile Photo

🌀Agent Learning via Early Experience🌀
📝: arxiv.org/abs/2510.08558
- SFT for agents is sparse; RL on long-horizons is hard
We provide new mid-training signals that work:
1) Implicit next state world modeling task
2) Self-reflection on alternate states 
- Strong improvements over
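As a hedged illustration of the two mid-training signals listed above, the sketch below builds supervised examples from logged (state, action, next_state) transitions: one for implicit next-state prediction and one for reflection on an alternate action. This is not the paper's data pipeline; all formats shown are assumptions.

```python
# Generic sketch of building mid-training examples from logged agent transitions
# (an illustration of the two signals described above, not the paper's pipeline).

def world_model_example(state: str, action: str, next_state: str) -> dict:
    # Signal 1: implicit next-state world modeling.
    return {"prompt": f"State: {state}\nAction: {action}\nPredict the next state:",
            "target": next_state}

def reflection_example(state: str, taken: str, alternative: str,
                       outcome_taken: str, outcome_alt: str) -> dict:
    # Signal 2: self-reflection on an alternate action and its outcome.
    return {"prompt": f"State: {state}\nYou did: {taken} -> {outcome_taken}\n"
                      f"What if you had done: {alternative}?",
            "target": outcome_alt}

print(world_model_example("page=search results", "click(link_3)", "page=article"))
```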