Zhenghao Xu (@zhenghaoxu0)'s Twitter Profile
Zhenghao Xu

@zhenghaoxu0

ID: 1567748825758023680

08-09-2022 05:37:13

1 Tweet

2 Followers

89 Following

Nice to see more off-policy RL works for LLM post-training! One thing I found interesting is that OAPL uses different β's for adv est. and reg. In our investigation (arxiv.org/pdf/2602.05933), OAPL (PMD-part) w/ same β leads to instability, suggesting this decoupling is crucial.