Wei Xiong (@weixiong_1)'s Twitter Profile
Wei Xiong

@weixiong_1

Statistical learning theory, Post-training of LLMs, RAFT, LMFlow, GSHF, and RLHFlow.

PhD Student @IllinoisCS, current @GoogleDeepMind, prev @MSFTResearch @USTC

ID: 1757977327282081792

Link: https://weixiongust.github.io/WeiXiongUST/index.html · Joined: 15-02-2024 03:57:17

165 Tweets

954 Followers

521 Following

Bowen Jin (@bowenjin13)'s Twitter Profile Photo

🚀 Introducing Search-R1 – the first reproduction of Deepseek-R1 (zero) for training reasoning and search-augmented LLM agents with reinforcement learning! This is a step towards training an open-source OpenAI “Deep

Qiusi Zhan (@zhanqiusi1)'s Twitter Profile Photo

Our NAACL 2025 findings paper demonstrates that securing AI agents requires more than off-the-shelf defenses—adaptive attacks continuously evolve to bypass them. If you have any questions or want to discuss more, feel free to reach out!

Hanze Dong @ ICLR 2025 (@hendrydong)'s Twitter Profile Photo

🤖What makes GRPO work? Rejection Sampling → Reinforce → GRPO
- RS is underrated
- Key of GRPO: implicitly remove prompts without correct answer
- Reinforce + Filtering > GRPO (better KL)
💻github.com/RLHFlow/Minima…
📄arxiv.org/abs/2504.11343
👀RAFT was invited to ICLR25! Come & Chat☕️
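A minimal sketch of the filtering idea referenced above, under the assumption that each sampled response gets a binary reward: a prompt whose group of samples is all wrong (or all right) yields zero advantage under GRPO-style group normalization, so it can be dropped before the policy update. The function and variable names below are hypothetical, not taken from the RLHFlow code.

```python
# Minimal sketch of reward-based prompt filtering (hypothetical names): keep only
# prompts whose sampled group contains both correct and incorrect answers, since
# an all-0 or all-1 group gives zero advantage under group normalization.

def filter_prompts(groups: dict[str, list[float]]) -> dict[str, list[float]]:
    """groups maps a prompt to the binary rewards of its sampled responses."""
    kept = {}
    for prompt, rewards in groups.items():
        if 0.0 < sum(rewards) < len(rewards):  # mixed correct/incorrect
            kept[prompt] = rewards
    return kept

batch = {
    "p1": [1.0, 0.0, 1.0, 0.0],  # informative group: kept
    "p2": [0.0, 0.0, 0.0, 0.0],  # all wrong: no signal, dropped
    "p3": [1.0, 1.0, 1.0, 1.0],  # all right: no signal, dropped
}
print(filter_prompts(batch))  # {'p1': [1.0, 0.0, 1.0, 0.0]}
```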

Wei Xiong (@weixiong_1)'s Twitter Profile Photo

Surprised by the small performance gap between RAFT and Reinforce/GRPO. We may need more fine-grained negative signals to better guide learning.🧐

Jiarui Yao (@explainmiracles)'s Twitter Profile Photo

We introduce Gradient Variance Minimization (GVM)-RAFT, a principled dynamic sampling strategy that minimizes gradient variance to improve the efficiency of chain-of-thought (CoT) training in LLMs.

– Achieves 2–4× faster convergence than RAFT
– Improves accuracy on math
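The exact GVM-RAFT allocation rule is in the paper; as a hedged illustration of variance-aware dynamic sampling, the sketch below assumes a simple proportional rule in which prompts with higher estimated reward variance receive a larger share of the rollout budget. The names and the Bernoulli-variance heuristic are assumptions for illustration only.

```python
import numpy as np

# Hedged illustration of dynamic sample allocation (not the exact GVM-RAFT rule):
# give each prompt a rollout share proportional to the estimated standard
# deviation of its binary reward, so high-variance prompts are sampled more.

def allocate_rollouts(success_rates: np.ndarray, total_budget: int,
                      min_per_prompt: int = 1) -> np.ndarray:
    """success_rates: per-prompt estimated probability of a correct answer."""
    std = np.sqrt(success_rates * (1.0 - success_rates))  # Bernoulli std
    weights = std + 1e-6                                   # avoid all-zero weights
    alloc = np.floor(total_budget * weights / weights.sum()).astype(int)
    return np.maximum(alloc, min_per_prompt)

p_hat = np.array([0.05, 0.5, 0.95, 0.3])
print(allocate_rollouts(p_hat, total_budget=64))
```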
Daniel Kang (@daniel_d_kang)'s Twitter Profile Photo

Reinforcement learning enables LLMs to beat humans on programming/math competitions and has driven recent advances (OpenAI's o-series, Anthropic's Claude 4). Will RL enable broad generalization in the same way that pretraining does? Not with current techniques. 🧵 1/7

Kaiyu Yang (@kaiyuyang4)'s Twitter Profile Photo

🚀 Excited to share that the Workshop on Mathematical Reasoning and AI (MATH‑AI) will be at NeurIPS 2025!
📅 Dec 6 or 7 (TBD), 2025
🌴 San Diego, California
Jason Weston (@jaseweston)'s Twitter Profile Photo

🪜Introducing: StepWiser🦉
📝: arxiv.org/abs/2508.19229
- Reframes stepwise reward modeling as a reasoning task: outputs CoT + judgment.
- Trained by RL using relative outcomes of rollouts.
Results:
(1) SOTA performance on ProcessBench!
(2) Improves policy at train time.
(3)
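As a hedged illustration of "relative outcomes of rollouts" as step-level supervision, the sketch below uses the generic Monte-Carlo recipe: estimate the success rate of completing the solution from each step prefix and label a step by the change it induces. This is not the StepWiser training procedure, and rollout_success_rate is a hypothetical stand-in for sampling completions from the policy.

```python
import random

# Generic sketch (not the StepWiser implementation) of scoring reasoning steps by
# relative rollout outcomes: estimate the success rate of finishing correctly from
# each step prefix, then label each step by the change in that success rate.

def rollout_success_rate(prefix_steps: list[str], n_rollouts: int = 8) -> float:
    """Hypothetical stand-in: sample n completions conditioned on the prefix and
    return the fraction that reach a correct final answer."""
    return random.random()  # placeholder so the sketch runs end to end

def step_labels(steps: list[str]) -> list[float]:
    values = [rollout_success_rate(steps[: i + 1]) for i in range(len(steps))]
    prev = rollout_success_rate([])  # baseline: no steps taken yet
    deltas = []
    for v in values:
        deltas.append(v - prev)  # positive: the step helped; negative: it hurt
        prev = v
    return deltas

print(step_labels(["define variables", "set up equation", "solve for x"]))
```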
Chenlu Ye (@ye_chenlu)'s Twitter Profile Photo

PROF🌀Right answer, flawed reason?🤔🌀
📄arxiv.org/pdf/2509.03403
Excited to share our work: PROF-PRocess cOnsistency Filter! 🚀
Challenge: ORM is blind to flawed logic, and PRM suffers from reward hacking. Our method harmonizes strengths of PRM & ORM. #LLM #ReinforcementLearning
Hanze Dong @ ICLR 2025 (@hendrydong)'s Twitter Profile Photo

💥Thrilled to share our new work Reinforce-Ada, which fixes signal collapse in GRPO

🥳No more blind oversampling or dead updates. Just sharper gradients, faster convergence, and stronger models.

⚙️ One-line drop-in. Real gains.
arxiv.org/html/2510.0499…

github.com/RLHFlow/Reinfo…
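As an illustration of avoiding "blind oversampling" and "dead updates", the sketch below runs a generic adaptive-sampling loop: keep drawing responses for a prompt until its group contains both a correct and an incorrect answer, or a budget is exhausted. This illustrates the signal-collapse problem rather than the Reinforce-Ada algorithm itself; sample_response is a hypothetical policy call.

```python
import random

# Generic adaptive-sampling loop (an illustration, not the Reinforce-Ada algorithm):
# keep sampling until the group mixes correct and incorrect responses, so the
# group-normalized advantage does not collapse to zero, or until max_n is reached.

def sample_response(prompt: str) -> float:
    """Hypothetical policy call returning a binary correctness reward."""
    return float(random.random() < 0.2)

def adaptive_group(prompt: str, min_n: int = 4, max_n: int = 32) -> list[float]:
    rewards = [sample_response(prompt) for _ in range(min_n)]
    while len(set(rewards)) < 2 and len(rewards) < max_n:
        rewards.append(sample_response(prompt))
    return rewards  # mixed group -> nonzero advantages (when the budget allows)

print(adaptive_group("prove that ..."))
```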
Fengzhuo Zhang (@fengzhuozhang)'s Twitter Profile Photo

Why does Muon outperform Adam—and how?

🚀Answer: Muon Outperforms Adam in Tail-End Associative Memory Learning

Three Key Findings:

> Associative memory parameters are the main beneficiaries of Muon, compared to Adam.

> Muon yields more isotropic weights than Adam.

> In
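The "more isotropic weights" observation can be probed with a standard singular-value diagnostic such as stable rank; the sketch below computes it for a roughly isotropic matrix and a rank-1 one. This is a generic measure, not necessarily the metric used in the paper.

```python
import numpy as np

# Stable rank ||W||_F^2 / ||W||_2^2 as a simple isotropy diagnostic: it equals 1
# for a rank-1 (highly anisotropic) matrix and approaches min(m, n) when all
# singular values are equal. A standard measure, not necessarily the paper's.

def stable_rank(W: np.ndarray) -> float:
    svals = np.linalg.svd(W, compute_uv=False)  # sorted in descending order
    return float((svals ** 2).sum() / (svals[0] ** 2))

rng = np.random.default_rng(0)
isotropic = rng.standard_normal((256, 256))              # roughly isotropic
anisotropic = np.outer(rng.standard_normal(256),
                       rng.standard_normal(256))          # rank-1
print(stable_rank(isotropic), stable_rank(anisotropic))   # large vs. 1.0
```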
Chenghao Yang (@chrome1996)'s Twitter Profile Photo

Where is exploration most impactful in LLM reasoning? The initial tokens! They shape a sequence's entire semantic direction, making early exploration crucial.

Our new work, Exploratory Annealed Decoding (EAD), is built on this insight. By starting with high temperature and
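A minimal sketch of per-position temperature annealing at sampling time, in the spirit of the description above: early tokens are sampled at high temperature to explore, later tokens at low temperature to stay coherent. The linear schedule and its parameters are assumptions for illustration, not the EAD settings.

```python
import numpy as np

# Generic per-position temperature annealing for sampling (illustrative schedule,
# not the EAD hyperparameters): the temperature decays linearly from t_start to
# t_end over the first anneal_steps token positions.

def annealed_temperature(step: int, t_start: float = 1.5, t_end: float = 0.7,
                         anneal_steps: int = 64) -> float:
    frac = min(step / anneal_steps, 1.0)
    return t_start + frac * (t_end - t_start)

def sample_token(logits: np.ndarray, step: int, rng: np.random.Generator) -> int:
    t = annealed_temperature(step)
    scaled = logits / t
    probs = np.exp(scaled - np.max(scaled))  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = rng.standard_normal(50)
print([sample_token(logits, s, rng) for s in (0, 16, 64)])
```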
Jason Weston (@jaseweston)'s Twitter Profile Photo

🌀Agent Learning via Early Experience🌀
📝: arxiv.org/abs/2510.08558
- SFT for agents is sparse; RL on long-horizons is hard
We provide new mid-training signals that work:
1) Implicit next state world modeling task
2) Self-reflection on alternate states 
- Strong improvements over
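As a hedged illustration of the two mid-training signals listed above, the sketch below builds supervised examples from logged (state, action, next_state) transitions: one for implicit next-state prediction and one for reflection on an alternate action. This is not the paper's data pipeline; all formats shown are assumptions.

```python
# Generic sketch of building mid-training examples from logged agent transitions
# (an illustration of the two signals described above, not the paper's pipeline).

def world_model_example(state: str, action: str, next_state: str) -> dict:
    # Signal 1: implicit next-state world modeling.
    return {"prompt": f"State: {state}\nAction: {action}\nPredict the next state:",
            "target": next_state}

def reflection_example(state: str, taken: str, alternative: str,
                       outcome_taken: str, outcome_alt: str) -> dict:
    # Signal 2: self-reflection on an alternate action and its outcome.
    return {"prompt": f"State: {state}\nYou did: {taken} -> {outcome_taken}\n"
                      f"What if you had done: {alternative}?",
            "target": outcome_alt}

print(world_model_example("page=search results", "click(link_3)", "page=article"))
```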