Jiawei Wang
@jarvismsustc
Joint PhD Candidate @USTC and @MSFTResearch. Intern @MSFTResearch @deepseek_ai @BytedanceTalk
ID: 1460775848718610439
https://jarvisustc.github.io/ 17-11-2021 01:04:52
28 Tweets
42 Followers
201 Following
Daniel Han, glad you liked the post! You're spot on to suspect lower-level implementation issues. That's exactly what we found in the original blog. The disable_cascade_attn finding (Sec 4.2.4) was the symptom, but the root cause was that silent FlashAttention-2 kernel bug
🚨 UPDATE to the "1 bit per episode" analysis (inspired by John Schulman's post at Thinking Machines): After discussion with mgostIH, I need to point out that the limit only applies to *scalar advantage*! REINFORCE with per-timestep advantages can learn O(T) bits when rewards are