@rm_rafailov: We have a new preprint out - your language model is not a reward, it’s a Q function! 1. The likelihood of the preferred answer must go down - it’s a policy divergence 2. MCTS guided decoding on language is equivalent to likelihood search on DPO 3. DPO learns credit assignment

Rafael Rafailov @ NeurIPS

@rm_rafailov

Ph.D. Student at @StanfordAILab. I work on Foundation Models and Decision Making. Previously @GoogleDeepMind @UCBerkeley

ID: 1660344669916786688

Link: https://rmrafailov.github.io/

Joined: 21-05-2023 18:11:57

1.1K Tweets

6.6K Followers

776 Following

a year ago

We have a new preprint out - your language model is not a reward, it’s a Q function!

1. The likelihood of the preferred answer must go down - it’s a policy divergence
2. MCTS guided decoding on language is equivalent to likelihood search on DPO
3. DPO learns credit assignment

945 Likes

16 Replies

156 Retweets
