WHY do we need the causal mask in large-scale language modeling?
In our latest blog post, we challenge common misconceptions, historically motivate alternatives and state the main challenge in moving beyond the causal mask.
pdoom.org/causal_mask.ht…
A very simple observation explains a large chunk of the research progress of the last decade and permits extrapolating what ideas will truly matter in the future:
Neural networks do not generalize out of distribution.
pdoom.org/thesis.html
PPO is not an on-policy algorithm and the strictly on-policy version of PPO (using a single optimizer step per PPO update) reduces to the vanilla policy gradient with baseline.
pdoom.org/ppo_off_policy…