Yuchen Zhang (@yuchenzhan84564)'s Twitter Profile
Yuchen Zhang

@yuchenzhan84564

Ph.D. student at Peking University | THU-C3I | Shanghai AI Lab

ID: 1775818201554972672

Link: http://yuczhang.com | Joined: 04-04-2024 09:30:39

18 Tweets

40 Followers

17 Following

Ning Ding (@stingning)'s Twitter Profile Photo

Interesting paper! I believe the reason Figure 1 works essentially comes down to a situation we call a "lucky hit": regardless of the ground truth, most sampled rollouts are wrong and thus receive zero reward, and these zero rewards are actually correct. For example,
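A toy simulation of that argument (my illustration, not code from the paper): with a binary verifier reward and a low pass rate, even a corrupted ground-truth label assigns the same zero reward to most rollouts, so the reward signal it produces mostly agrees with the true one. The pass rate, answer strings, and pool of wrong answers below are all made-up assumptions.

```python
import random

# "Lucky hit" toy model: most rollouts are wrong, so they get reward 0
# whether the label is correct or corrupted.
random.seed(0)

TRUE_ANSWER = "42"
WRONG_LABEL = "17"   # hypothetical corrupted ground truth
PASS_RATE = 0.1      # assumed probability that a rollout is correct
N_ROLLOUTS = 10_000

def sample_rollout():
    # A rollout either hits the true answer or some other wrong string.
    if random.random() < PASS_RATE:
        return TRUE_ANSWER
    return str(random.randint(100, 999))  # wrong answer, never the corrupted label

agree = 0
for _ in range(N_ROLLOUTS):
    ans = sample_rollout()
    r_true = int(ans == TRUE_ANSWER)    # reward under the correct label
    r_wrong = int(ans == WRONG_LABEL)   # reward under the corrupted label
    agree += int(r_true == r_wrong)

# Agreement is roughly 1 - PASS_RATE: nearly all zero rewards stay correct.
print(f"reward agreement with corrupted label: {agree / N_ROLLOUTS:.1%}")
```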

DailyPapers (@huggingpapers)'s Twitter Profile Photo

New from PRIME-RL: The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models investigates and offers solutions for the collapse of policy entropy!

Ning Ding (@stingning)'s Twitter Profile Photo

Language models are trading entropy for rewards in reinforcement learning, meaning that uncertainty is being converted into certainty. The trade is even quantitatively predictable: R = -a * exp(H) + b. In our latest paper, we find that we should, and we can scientifically
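For intuition, here is a minimal sketch of fitting that relation with scipy on synthetic (H, R) points. The coefficients and noise level are made-up assumptions for illustration, not data from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def entropy_reward(H, a, b):
    # The relation quoted in the tweet: R = -a * exp(H) + b
    return -a * np.exp(H) + b

# Synthetic training trajectory: entropy decays while reward rises.
H = np.linspace(1.2, 0.2, 20)
rng = np.random.default_rng(0)
R = entropy_reward(H, a=0.3, b=1.0) + rng.normal(0, 0.01, H.size)

(a_fit, b_fit), _ = curve_fit(entropy_reward, H, R)
print(f"fitted a={a_fit:.3f}, b={b_fit:.3f}")
# Once a and b are fitted early in training, the curve predicts the reward
# attainable as entropy is traded away: H -> 0 gives a ceiling of R = b - a.
```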

Andrew Zhao (@andrewz45732491)'s Twitter Profile Photo

LLMs as internet/knowledge base, no need for external tools. Reminiscent of older work from AI2/UW, Rainer arxiv.org/pdf/2210.03078 and CRYSTAL arxiv.org/abs/2310.04921 arxiv.org/abs/2508.10874

Lifan Yuan (@lifan__yuan)'s Twitter Profile Photo

🧩New blog: From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones

Do LLMs learn new skills through RL, or just activate existing patterns? Answer: RL teaches the powerful meta-skill of composition when properly incentivized.

🔗:husky-morocco-f72.notion.site/From-f-x-and-g…
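A deliberately tiny rendering of the f(g(x)) framing (my sketch, not the blog's code): treat two "skills" the model already has as plain functions, and the capability gained in RL as their composition rather than a new primitive.

```python
def g(x: str) -> list[int]:
    """Existing skill: extract the numbers mentioned in a problem."""
    return [int(tok) for tok in x.split() if tok.isdigit()]

def f(nums: list[int]) -> int:
    """Existing skill: sum a list of numbers."""
    return sum(nums)

def composed(x: str) -> int:
    """The 'new' skill is no new primitive, just f applied after g."""
    return f(g(x))

print(composed("add 3 and 4 and 5"))  # 12
```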
Yuxin Zuo (@zuo_yuxin)'s Twitter Profile Photo

🧭 Thinking about proposing a new RL algorithm? We introduce UPGE, a deep dive into post-training, to give you a boost!

🤔 Many recent works mix RL with SFT, but with drastically different loss functions, why should they be used together?

We introduce Unified Policy Gradient
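As a rough sketch of why SFT and RL objectives can share one gradient form (my generic illustration; the actual Unified Policy Gradient derivation is in the paper): both losses push up token log-probabilities and differ only in the per-sequence weight. The tensor shapes and the unified_loss helper below are hypothetical.

```python
import torch
import torch.nn.functional as F

def unified_loss(logits, target_ids, weights):
    """logits: (B, T, V); target_ids: (B, T); weights: (B,).

    weights = 1 on demonstration data recovers the SFT cross-entropy loss;
    weights = advantage estimates recovers a REINFORCE-style RL loss.
    """
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)
    return -(weights.unsqueeze(-1) * token_logp).mean()

# Toy usage with random tensors standing in for model outputs.
B, T, V = 2, 5, 11
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))
sft = unified_loss(logits, targets, torch.ones(B))             # SFT view
rl = unified_loss(logits, targets, torch.tensor([0.8, -0.3]))  # RL view
print(sft.item(), rl.item())
```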