Nice to see more off-policy RL work for LLM post-training! One thing I found interesting is that OAPL uses different β's for advantage estimation and regularization. In our investigation (arxiv.org/pdf/2602.05933), OAPL (PMD-part) with the same β for both leads to instability, suggesting this decoupling is crucial.