Jack Zhang (@jcz42) Twitter Tweets • TwiCopy

Tri Dao

24 days ago

It's my favorite kind of work: linear algebra insight + fast kernels. When playing with Muon a while ago, we were thinking: why not speed it up by operating on the small square matrix X X^T instead of the large rectangular matrix X? Jack, Noah, and Berlin spent many months

Likes: 1.1K | Replies: 2 | Reposts: 95

noahamsel

@noahamsel

24 days ago

Announcing Gram Newton-Schulz, a new way to implement Muon that's 2x faster.
Trick 1: rejigger Newton-Schulz to replace rectangular matmuls with square symmetric ones.
Trick 2: collaborate with CUDA kings Jack Zhang and Berlin Chen from Tri Dao's lab.
Blog: dao-ailab.github.io/blog/2026/gram…

Likes: 114 | Replies: 1 | Reposts: 15
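To make "Trick 1" concrete, here is a minimal NumPy sketch of one way a Newton-Schulz iteration can be rearranged so that the loop touches only small square matrices. It is an illustration under stated assumptions, not the authors' algorithm or kernels: the function names are hypothetical, the default coefficients are the classical cubic Newton-Schulz ones (1.5, -0.5, 0) rather than anything taken from the blog post, and everything runs in float64 on the CPU. The idea is that each step multiplies X by a polynomial Q in the Gram matrix G = X X^T, so one can track G and the accumulated product P of the Q factors, and apply P to the original X once at the end.

```python
import numpy as np


def newton_schulz_standard(X, steps=5, a=1.5, b=-0.5, c=0.0):
    """Reference iteration: every step does rectangular (m x n) matmuls."""
    X = X / np.linalg.norm(X)  # Frobenius norm, so all singular values <= 1
    I = np.eye(X.shape[0])
    for _ in range(steps):
        G = X @ X.T                          # (m x n) @ (n x m)
        X = (a * I + b * G + c * G @ G) @ X  # (m x m) @ (m x n)
    return X


def newton_schulz_gram(X, steps=5, a=1.5, b=-0.5, c=0.0):
    """Same iterates, but the loop only multiplies m x m symmetric matrices.

    Hypothetical sketch: assumes X has shape (m, n) with m <= n.
    """
    X0 = X / np.linalg.norm(X)
    I = np.eye(X0.shape[0])
    G = X0 @ X0.T      # Gram matrix, the only rectangular product up front
    P = I.copy()       # accumulated polynomial, so that X_k = P @ X0
    for _ in range(steps):
        Q = a * I + b * G + c * G @ G  # symmetric, commutes with G
        P = Q @ P
        G = Q @ G @ Q                  # Gram matrix of the next iterate
    return P @ X0      # single rectangular matmul at the end
```

In exact arithmetic the two functions return the same matrix. In low precision the Gram form can drift, which is consistent with the numerical issues (and the restart fix) mentioned in the replies, although this sketch does not implement any restart.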

You Jiacheng

@youjiacheng

24 days ago

oh, it has numerical issues, but they are fixed by restart x.com/jcz42/status/2…

Likes: 20 | Replies: 2 | Reposts: 1

Charles 🎉 Frye

@charles_irl

24 days ago

now that's what i call a kernel trick

Likes: 176 | Replies: 4 | Reposts: 14

Jack Zhang

@jcz42

24 days ago

Special thanks to noahamsel for being an amazing theory mentor and collaborator - for context, we had independently derived different fast algorithms for Newton-Schulz, but his version ended up being strictly faster than our original algorithm (and theoretically better

Likes: 29 | Replies: 1 | Reposts: 1

You Jiacheng

@youjiacheng

24 days ago

Jack Zhang btw it seems that fp16 instead of bf16 is used, which one is better?

Likes: 4 | Replies: 1 | Reposts: 1

Jack Zhang

@jcz42

24 days ago

You Jiacheng Theoretically (and on some test cases), fp16 is better since it’s more precise, and the tradeoff of a smaller range doesn’t affect Newton-Schulz since the magnitudes shouldn’t get so high that they exit fp16’s smaller range. In practice, we haven’t seen a training run where fp16

Likes: 15 | Replies: 0 | Reposts: 1
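The precision-versus-range tradeoff in the reply above can be checked with a small NumPy sketch. NumPy has no native bf16 type, so the `to_bf16` helper below is a hypothetical stand-in that rounds float32 values to bf16 precision (8 significand bits, float32's exponent range) by bit manipulation. The test value 1.003 is arbitrary and chosen only for illustration.

```python
import numpy as np


def to_bf16(x):
    """Round float32 values to bf16 precision (round-to-nearest-even).

    Hypothetical helper: the result is stored as float32, but only the
    upper 16 bits of each value are kept, which is what bf16 represents.
    """
    bits = np.array(x, dtype=np.float32, ndmin=1).view(np.uint32)
    rounding = ((bits >> 16) & 1) + np.uint32(0x7FFF)
    bits = (bits + rounding) & np.uint32(0xFFFF0000)
    return bits.view(np.float32)


value = 1.003
err_fp16 = abs(float(np.float16(value)) - value)  # 10 stored mantissa bits
err_bf16 = abs(float(to_bf16(value)[0]) - value)  # 7 stored mantissa bits

fp16_max = float(np.finfo(np.float16).max)  # largest finite fp16 value
fp16_eps = float(np.finfo(np.float16).eps)  # spacing just above 1.0

print(f"fp16 error: {err_fp16:.2e}, bf16 error: {err_bf16:.2e}")
print(f"fp16 max: {fp16_max}, fp16 eps: {fp16_eps}")
```

Near 1.0, fp16 resolves steps of 2^-10 while bf16 resolves steps of 2^-7, so fp16 keeps about three more bits. The cost is a maximum finite value of 65504, which only matters if intermediate magnitudes grow that large.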

Jack Zhang

@jcz42

24 days ago

Fixing Berlin Chen tag, go follow him and check out his other work on Mamba3 :)

Likes: 23 | Replies: 0 | Reposts: 1

Aaron Gokaslan

@skyli0n

24 days ago

Amazing! Was trying to get the fast symmetric kernels working in FlashMUON, only to be disappointed that they only supported symmetric matrices. Apparently converting them to small square matrices is still faster! github.com/nil0x9/flash-m…

Likes: 16 | Replies: 0 | Reposts: 2

Tri Dao

@tri_dao

17 days ago

Fast Muon optimizer coming to consumer cards. All the code was written as matmul + epilogue, so once the mainloop was implemented for Blackwell consumer cards, all the fancy symmetric matmul just works and gets speed-of-light

Likes: 309 | Replies: 6 | Reposts: 30

david yan

@dzyan01

11 days ago

Stereo depth is important in robotics, and relies heavily on synthetic data. But what actually makes for good synthetic data? In WMGStereo, we study dataset design and discover a powerful data recipe - just 500 samples of our data can match 40k Sceneflow samples! 🧵[1/7]

Likes: 254 | Replies: 4 | Reposts: 39

Wentao Guo

@wentaoguo7

a day ago

🚀SonicMoE🚀 now runs at peak throughput on NVIDIA Blackwell GPUs 😃 54% & 35% higher fwd/bwd TFLOPS than the DeepGEMM baseline and 21% higher fwd TFLOPS than the official Triton example. SonicMoE still maintains its minimum activation memory footprint: the same as a dense model

Likes: 293 | Replies: 14 | Reposts: 56

Jack Zhang

@jcz42

10 hours ago

Congrats noahamsel!

Likes: 4 | Replies: 0 | Reposts: 1