@damekdavis: In this note w/ @beenwrekt we look at RL problems with 0/1 rewards, showing that popular methods maximize the average (transformed) probability of correctly answering a prompt x: max_θ 𝔼ₓ h(Prob(correct ∣ x; θ)) for certain functions h. Weirdly, h is arcsin(√t) in GRPO.

Damek

@damekdavis


Optimization and Machine Learning //
Assoc Prof @Wharton Statistics and Data Science //
Teaching "Optimization in PyTorch"
damekdavis.com/STAT-4830

ID: 1526271178310045698

http://damekdavis.com
16-05-2022 18:39:46

1.1K Tweets

3.3K Followers

834 Following

Damek

@damekdavis

a month ago

In this note w/ Ben Recht we look at RL problems with 0/1 rewards, showing that popular methods maximize the average (transformed) probability of correctly answering a prompt x: max_θ 𝔼ₓ h(Prob(correct ∣ x; θ)) for certain functions h. Weirdly, h is arcsin(√t) in GRPO.
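
One way to see where arcsin(√t) could come from (a sketch under stated assumptions, not taken from the note itself): for a 0/1 reward with success probability p, the reward's standard deviation is √(p(1−p)), and the derivative of arcsin(√t) is 1/(2√(t(1−t))). So a policy gradient divided by the reward standard deviation, as in GRPO's advantage normalization, would point along the gradient of arcsin(√p) up to a factor of 2, assuming exact population statistics rather than a finite group of samples. The Python check below only verifies the calculus identity numerically; all function names are illustrative.

```python
import math

def h(t):
    # Candidate transform from the post: h(t) = arcsin(sqrt(t)), for t in (0, 1).
    return math.asin(math.sqrt(t))

def h_prime_closed_form(t):
    # Chain rule: d/dt arcsin(sqrt(t)) = 1 / (2 * sqrt(t * (1 - t))).
    return 1.0 / (2.0 * math.sqrt(t * (1.0 - t)))

def h_prime_numeric(t, eps=1e-6):
    # Central finite difference, used only as an independent check.
    return (h(t + eps) - h(t - eps)) / (2.0 * eps)

def bernoulli_std(p):
    # Standard deviation of a 0/1 reward with success probability p.
    return math.sqrt(p * (1.0 - p))

if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5, 0.9):
        closed = h_prime_closed_form(p)
        numeric = h_prime_numeric(p)
        # Assumption being illustrated: 1 / std(reward) equals 2 * h'(p).
        inv_std = 1.0 / bernoulli_std(p)
        print(f"p={p}: h'={closed:.6f}, finite diff={numeric:.6f}, 1/std={inv_std:.6f}")
```

The weight 1/std blows up as p approaches 0 or 1, so under this reading prompts that are almost always or almost never answered correctly get the largest per-unit push.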


356 likes · 9 replies · 39 reposts