Jack Zhang
@jcz42
ID: 1812253076025389056
13-07-2024 22:29:50
7 Tweets
55 Followers
215 Following
Announcing Gram Newton-Schulz, a new way to implement Muon that's 2x faster.

Trick 1: rejigger Newton-Schulz to replace rectangular matmuls with square symmetric ones.

Trick 2: collaborate with CUDA kings Jack Zhang and Berlin Chen from Tri Dao's lab.

Blog: dao-ailab.github.io/blog/2026/gram…
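The blog post itself isn't quoted in this thread, so the following is only one plausible reading of Trick 1, not the authors' actual scheme: each quintic Newton-Schulz update has the form X ← (aI + bA + cA²)X with A = XXᵀ, so the whole loop can be carried out on the m×m Gram matrix (square, symmetric operands), with the accumulated polynomial applied to the rectangular X just once at the end. A NumPy sketch of that idea, using the quintic coefficients common in public Muon implementations (the blog's coefficients and kernel strategy may differ):

```python
import numpy as np

# Quintic coefficients commonly used in Muon's Newton-Schulz step;
# the Gram Newton-Schulz blog post may use different ones.
A_COEF, B_COEF, C_COEF = 3.4445, -4.7750, 2.0315

def newton_schulz_standard(X, steps=5):
    """Reference quintic Newton-Schulz: X <- a*X + (b*A + c*A@A) @ X, A = X X^T.

    Assumes X is m x n with m <= n. Per step: two rectangular matmuls
    (X @ X.T and B @ X) plus one square one (A @ A).
    """
    X = X / (np.linalg.norm(X) + 1e-7)  # Frobenius norm bounds the spectral norm by 1
    for _ in range(steps):
        A = X @ X.T
        B = B_COEF * A + C_COEF * (A @ A)
        X = A_COEF * X + B @ X
    return X

def newton_schulz_gram(X, steps=5):
    """Hypothetical "Gram" form: iterate on S = X X^T and an accumulated
    polynomial P, so every in-loop matmul is m x m with symmetric operands.

    Uses X_{k+1} = C X_k with C = a*I + b*S + c*S@S (C is a polynomial in the
    symmetric S, hence symmetric), which gives S_{k+1} = C S C^T.
    """
    X = X / (np.linalg.norm(X) + 1e-7)
    m = X.shape[0]
    S = X @ X.T                 # one rectangular matmul up front
    P = np.eye(m)               # accumulated polynomial, applied to X at the end
    for _ in range(steps):
        C = A_COEF * np.eye(m) + B_COEF * S + C_COEF * (S @ S)
        P = C @ P               # X_{k+1} = C X_k  =>  P_{k+1} = C P_k
        S = C @ S @ C.T         # S_{k+1} = X_{k+1} X_{k+1}^T
    return P @ X                # one rectangular matmul at the end
```

In exact arithmetic the two functions agree; the Gram form trades the two per-step rectangular matmuls for square ones whose operands and outputs are symmetric, which is exactly the shape specialized CUDA symmetric-matmul kernels can exploit. Whether this matches the blog's actual trick is an assumption here.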
Jack Zhang btw it seems that fp16 is used instead of bf16; which one is better?
You Jiacheng Theoretically (and on some test cases), fp16 is better since it's more precise, and the tradeoff of a smaller range doesn't affect Newton-Schulz since the magnitudes shouldn't get so high that they exit fp16's smaller range. In practice, we haven't seen a training run where fp16
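The tradeoff in that reply is the standard one between the two 16-bit formats: fp16 spends more bits on the mantissa (10 vs bf16's 7, so finer precision) at the cost of a much smaller exponent range (max finite value 65504 vs roughly 3.4e38 for bf16). A small illustrative demo, independent of the blog; `to_bf16` is a hypothetical helper that emulates bfloat16 by truncating a float32 to its top 16 bits (truncation rather than round-to-nearest, which is close enough for the demonstration):

```python
import numpy as np

def to_bf16(x):
    """Emulate bfloat16 by zeroing the low 16 mantissa bits of a float32.

    Hypothetical helper for illustration: real bf16 hardware rounds to
    nearest, while this truncates, but the precision/range picture is the same.
    """
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (u & np.uint32(0xFFFF0000)).view(np.float32)

# Precision: fp16 represents 1 + 2**-10 exactly (10 mantissa bits);
# bf16's 7 mantissa bits cannot, so it collapses to 1.0.
tiny_step = 1.0 + 2**-10

# Range: fp16 overflows past 65504, while bf16 keeps fp32's exponent range,
# so a value like 70000 stays finite (if coarsely rounded) in bf16.
big = 70000.0
```

The reply's point is that Newton-Schulz iterates stay O(1) in magnitude (the input is normalized first), so fp16's narrow range is rarely hit and its extra mantissa precision wins.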
Fixing the Berlin Chen tag; go follow him and check out his other work on Mamba3 :)