Jack Zhang (@jcz42)'s Twitter Profile
Jack Zhang

@jcz42

ID: 1812253076025389056

13-07-2024 22:29:50

7 Tweets

55 Followers

215 Following

Tri Dao (@tri_dao)

It's my favorite kind of work: linear algebra insight + fast kernels. When playing with Muon a while ago, we wondered: why not speed it up by operating on the small square matrix X X^T instead of the large rectangular matrix X? Jack, Noah, and Berlin spent many months

noahamsel (@noahamsel)

Announcing Gram Newton-Schulz, a new way to implement Muon that's 2x faster.
Trick 1: rejigger Newton-Schulz to replace rectangular matmuls with square symmetric ones.
Trick 2: collaborate with CUDA kings Jack Zhang and Berlin Chen from Tri Dao's lab.
Blog: dao-ailab.github.io/blog/2026/gram…

Jack Zhang (@jcz42)

Special thanks to noahamsel for being an amazing theory mentor and collaborator - for context, we had independently derived different fast algorithms for Newton-Schulz, but his version ended up being strictly faster than our original algorithm (and theoretically better

Jack Zhang (@jcz42)

You Jiacheng Theoretically (and on some test cases), fp16 is better since it’s more precise, and the tradeoff of a smaller range doesn’t affect Newton-Schulz, since the magnitudes shouldn’t get so high that they exceed fp16’s range. In practice, we haven’t seen a training run where fp16

Aaron Gokaslan (@skyli0n)

Amazing! Was trying to get the fast symmetric kernels working in FlashMUON, only to be disappointed that they only supported symmetric matrices. Apparently converting them to small square matrices is still faster! github.com/nil0x9/flash-m…

Tri Dao (@tri_dao)

Fast Muon optimizer coming to consumer cards. All the code was written as matmul + epilogue, so once the mainloop was implemented for Blackwell consumer cards, all the fancy symmetric matmul just works and gets speed-of-light

david yan (@dzyan01)

Stereo depth is important in robotics, and relies heavily on synthetic data. But what actually makes for good synthetic data? In WMGStereo, we study dataset design and discover a powerful data recipe - just 500 samples of our data can match 40k Sceneflow samples! 🧵[1/7]

Wentao Guo (@wentaoguo7)

🚀SonicMoE🚀now runs at peak throughput on NVIDIA Blackwell GPUs 😃 54% & 35% higher fwd/bwd TFLOPS than the DeepGEMM baseline and 21% higher fwd TFLOPS than the triton official example. SonicMoE still maintains its minimum activation memory footprint: the same as a dense model
