Damek (@damekdavis)'s Twitter Profile
Damek

@damekdavis

Optimization and Machine Learning //
Assoc Prof @Wharton Statistics and Data Science //
Teaching "Optimization in PyTorch"
damekdavis.com/STAT-4830

ID: 1526271178310045698

Link: http://damekdavis.com
Joined: 16-05-2022 18:39:46

1.1K Tweets

3.3K Followers

834 Following

In this note w/ Ben Recht (@beenwrekt) we look at RL problems with 0/1 rewards, showing that popular methods maximize the average (transformed) probability of correctly answering a prompt x:

max_θ 𝔼ₓ h(Prob(correct ∣ x; θ))

for certain functions h. Weirdly, h is arcsin(√t) in GRPO.
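
To make the objective concrete, here is a minimal sketch that evaluates the average transformed success probability for two choices of h: the identity and arcsin(√t). The function names and the per-prompt probabilities are hypothetical, chosen only for illustration; this is not code from the note.

```python
import math

def h_identity(t):
    # Untransformed case: the objective is the plain average success probability.
    return t

def h_arcsin_sqrt(t):
    # The transform the tweet attributes to GRPO.
    return math.asin(math.sqrt(t))

def transformed_objective(probs, h):
    # Empirical version of E_x h(Prob(correct | x; theta)),
    # averaging h over a finite list of per-prompt success probabilities.
    return sum(h(p) for p in probs) / len(probs)

# Hypothetical per-prompt success probabilities (assumed values, not data).
probs = [0.0, 0.25, 1.0]

plain = transformed_objective(probs, h_identity)
grpo_like = transformed_objective(probs, h_arcsin_sqrt)
print(plain, grpo_like)
```

Since d/dt arcsin(√t) = 1/(2√(t(1−t))), this h puts more weight on improving prompts whose success probability is close to 0 or 1 than the identity does.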