Yuchen Zhang (@yuchenzhan84564)'s Twitter Profile
Yuchen Zhang

@yuchenzhan84564

Ph.D. student at Peking University | THU-C3I | Shanghai AI Lab

ID: 1775818201554972672

Link: http://yuczhang.com | Joined: 04-04-2024 09:30:39

18 Tweets

40 Followers

17 Following

Ning Ding (@stingning)'s Twitter Profile Photo

Interesting paper! I believe the reason Figure 1 works essentially comes down to a situation we call a "lucky hit": regardless of the ground truth, most sampled rollouts are wrong and thus receive zero reward, and these zero rewards are actually correct. For example,
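A toy simulation of that argument (my illustration, not code from the paper): with a binary verifier reward and a low pass rate, even a corrupted ground-truth label assigns the same zero reward to most rollouts, so the reward signal it produces mostly agrees with the true one. The pass rate, answer strings, and pool of wrong answers below are all made-up assumptions.

```python
import random

# "Lucky hit" toy model: most rollouts are wrong, so they get reward 0
# whether the label is correct or corrupted.
random.seed(0)

TRUE_ANSWER = "42"
WRONG_LABEL = "17"   # hypothetical corrupted ground truth
PASS_RATE = 0.1      # assumed probability that a rollout is correct
N_ROLLOUTS = 10_000

def sample_rollout():
    # A rollout either hits the true answer or some other wrong string.
    if random.random() < PASS_RATE:
        return TRUE_ANSWER
    return str(random.randint(100, 999))  # wrong answer, never the corrupted label

agree = 0
for _ in range(N_ROLLOUTS):
    ans = sample_rollout()
    r_true = int(ans == TRUE_ANSWER)    # reward under the correct label
    r_wrong = int(ans == WRONG_LABEL)   # reward under the corrupted label
    agree += int(r_true == r_wrong)

# Agreement is roughly 1 - PASS_RATE: nearly all zero rewards stay correct.
print(f"reward agreement with corrupted label: {agree / N_ROLLOUTS:.1%}")
```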

DailyPapers (@huggingpapers)'s Twitter Profile Photo

New from PRIME-RL: The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models investigates and offers solutions for the collapse of policy entropy!

Ning Ding (@stingning)'s Twitter Profile Photo

Language models are trading entropy for rewards in reinforcement learning, meaning that uncertainty is being converted into certainty. The trade is even quantitatively predictable: R = -a * exp(H) + b. In our latest paper, we find that we should, and we can scientifically
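For intuition, here is a minimal sketch of fitting that relation with scipy on synthetic (H, R) points. The coefficients and noise level are made-up assumptions for illustration, not data from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def entropy_reward(H, a, b):
    # The relation quoted in the tweet: R = -a * exp(H) + b
    return -a * np.exp(H) + b

# Synthetic training trajectory: entropy decays while reward rises.
H = np.linspace(1.2, 0.2, 20)
rng = np.random.default_rng(0)
R = entropy_reward(H, a=0.3, b=1.0) + rng.normal(0, 0.01, H.size)

(a_fit, b_fit), _ = curve_fit(entropy_reward, H, R)
print(f"fitted a={a_fit:.3f}, b={b_fit:.3f}")
# Once a and b are fitted early in training, the curve predicts the reward
# attainable as entropy is traded away: H -> 0 gives a ceiling of R = b - a.
```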

Andrew Zhao (@andrewz45732491)'s Twitter Profile Photo

LLMs as internet/knowledge base, no need for external tools. Reminiscent of older work from AI2/UW, Rainer arxiv.org/pdf/2210.03078 and CRYSTAL arxiv.org/abs/2310.04921 arxiv.org/abs/2508.10874

Lifan Yuan (@lifan__yuan)'s Twitter Profile Photo

🧩New blog: From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones

Do LLMs learn new skills through RL, or just activate existing patterns? Answer: RL teaches the powerful meta-skill of composition when properly incentivized.

🔗:husky-morocco-f72.notion.site/From-f-x-and-g…
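A deliberately tiny rendering of the f(g(x)) framing (my sketch, not the blog's code): treat two "skills" the model already has as plain functions, and the capability gained in RL as their composition rather than a new primitive.

```python
def g(x: str) -> list[int]:
    """Existing skill: extract the numbers mentioned in a problem."""
    return [int(tok) for tok in x.split() if tok.isdigit()]

def f(nums: list[int]) -> int:
    """Existing skill: sum a list of numbers."""
    return sum(nums)

def composed(x: str) -> int:
    """The 'new' skill is no new primitive, just f applied after g."""
    return f(g(x))

print(composed("add 3 and 4 and 5"))  # 12
```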
Yuxin Zuo (@zuo_yuxin)'s Twitter Profile Photo

🧭 Thinking about proposing a new RL algorithm? We introduce UPGE, a deep dive into post-training, to give you a boost!

🤔 Many recent works mix RL with SFT, but with drastically different loss functions, why should they be used together?

We introduce Unified Policy Gradient
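As a rough sketch of why SFT and RL objectives can share one gradient form (my generic illustration; the actual Unified Policy Gradient derivation is in the paper): both losses push up token log-probabilities and differ only in the per-sequence weight. The tensor shapes and the unified_loss helper below are hypothetical.

```python
import torch
import torch.nn.functional as F

def unified_loss(logits, target_ids, weights):
    """logits: (B, T, V); target_ids: (B, T); weights: (B,).

    weights = 1 on demonstration data recovers the SFT cross-entropy loss;
    weights = advantage estimates recovers a REINFORCE-style RL loss.
    """
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)
    return -(weights.unsqueeze(-1) * token_logp).mean()

# Toy usage with random tensors standing in for model outputs.
B, T, V = 2, 5, 11
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))
sft = unified_loss(logits, targets, torch.ones(B))             # SFT view
rl = unified_loss(logits, targets, torch.tensor([0.8, -0.3]))  # RL view
print(sft.item(), rl.item())
```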