Zhenghao Xu (@zhenghaoxu0) Twitter Tweets • TwiCopy

Zhenghao Xu

@zhenghaoxu0

ID: 1567748825758023680

Joined: 08-09-2022 05:37:13

1 Tweet

2 Followers

89 Following

Zhenghao Xu

@zhenghaoxu0

2 months ago

Nice to see more off-policy RL work for LLM post-training! One thing I found interesting is that OAPL uses different β's for advantage estimation and regularization. In our investigation (arxiv.org/pdf/2602.05933), OAPL (PMD-part) with the same β leads to instability, suggesting this decoupling is crucial.

4 Likes

0 Replies

0 Retweets
