@kempelab : PILAF (Policy-Interpolated Learning for Aligned Feedback): our response sampling scheme that provably aligns LLM preference learning w maximizing the underlying oracle reward! arxiv.org/abs/2502.04270 @feeelix_feng @ArielKwiatkowsk @KunhaoZ @YaqiDuanPKU @AIatMeta @NYUDataScience

@kempelab

Silver Professor at NYU Courant and CDS, Research Scientist at FAIR
Research in Machine Learning, past in Quantum Computing & Finance. Posts my own.

ID: 1782793458027036672

Account created: 23-04-2024 15:27:58

115 Tweets

1.1K Followers

124 Following

7 months ago

PILAF (Policy-Interpolated Learning for Aligned Feedback): our response sampling scheme that provably aligns LLM preference learning w maximizing the underlying oracle reward! arxiv.org/abs/2502.04270 Yunzhen Feng Ariel Kwiatkowski Kunhao Zheng Yaqi Duan AI at Meta NYU Center for Data Science

78 likes

0 replies

10 reposts
