Anthonix (@zealandic1)'s Twitter Profile
Anthonix

@zealandic1

// Previously Neural Engine Architect @Apple // Reformed High Frequency Trader // CS PhD

ID: 1380287387742298113

Joined: 08-04-2021 22:32:19

174 Tweets

343 Followers

134 Following

Anthonix (@zealandic1):

🔥 PyTorch 2.4.0 on AMD is a *huge* improvement! E.g., for the same setup as the plot below, perf is now at ~245,000 toks/sec!
- compile() works
- flash attention works
- perf tuning works
It is now faster than the hipified llm.c! 🎉 Thank you AMD, PyTorch & Soumith Chintala!
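
Not the author's actual benchmark, but a minimal sketch of the two features called out above working together on ROCm as of PyTorch 2.4: torch.compile() and the flash-attention backend of scaled_dot_product_attention. The toy module, tensor sizes, and dtype below are illustrative assumptions, not the setup behind the quoted numbers.

```python
import torch
import torch.nn.functional as F

# Toy attention block, only to exercise the two features named in the tweet:
# torch.compile() and the flash-attention SDPA path, both usable on AMD GPUs.
class TinyAttention(torch.nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.heads = heads
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.out = torch.nn.Linear(dim, dim)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, head_dim)
        q, k, v = (z.view(b, t, self.heads, d // self.heads).transpose(1, 2)
                   for z in (q, k, v))
        # dispatches to a fused/flash attention kernel where supported
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, d))

model = TinyAttention().cuda().to(torch.bfloat16)   # "cuda" device also covers ROCm/HIP builds
model = torch.compile(model)                        # compile() now works on AMD GPUs

x = torch.randn(8, 1024, 512, device="cuda", dtype=torch.bfloat16)
y = model(x)
```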

Anthonix (@zealandic1):

Wow, I just assumed that Zuck had created his own masterpiece outta chicken wire and bondo... turns out it is the work of a real artist

Anthonix (@zealandic1):

Finally got around to trying out llm.c on MI300x... the code I had tuned on MI250x gets decent perf straight outta the gate.

But wtf is going on with PyTorch perf on MI300x? Tried 2.4 & nightly, with ROCm 6.1 & 6.2... using full autotuning, flash attention, etc. All are so slow.
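
For context, a rough, self-contained sketch of the kind of "full autotuning" timing run alluded to above: compile a stand-in model with Inductor's max-autotune mode and measure forward-pass throughput. The model, shapes, and iteration counts are placeholders, not the llm.c comparison setup.

```python
import time
import torch

# Stand-in model just to exercise max-autotune compilation and a timing loop.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda().to(torch.bfloat16)

compiled = torch.compile(model, mode="max-autotune")  # full Inductor autotuning

x = torch.randn(64, 1024, device="cuda", dtype=torch.bfloat16)

# warm-up triggers compilation and kernel autotuning
for _ in range(3):
    compiled(x)
torch.cuda.synchronize()

t0 = time.time()
for _ in range(50):
    compiled(x)
torch.cuda.synchronize()
print(f"{50 * x.shape[0] / (time.time() - t0):.0f} samples/sec")
```
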
Anthonix (@zealandic1):

Awesome! Also for local training, this would enable training across a bunch of machines without expensive interconnect :)

Anthonix (@zealandic1):

Hitting ~3.2M toks/sec on MI300x for tiny llama3 training. Would love to see @pytorch training on MI300x get some massive improvements so I don't have to write my own kernels!
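
For reference, a headline tokens-per-second figure like the one above is ordinarily just tokens processed divided by wall-clock time. A tiny sketch of that bookkeeping follows; the batch size, sequence length, step count, and elapsed time are made-up values for illustration, not the actual run's configuration.

```python
# Tokens/sec as usually reported: tokens processed divided by wall-clock time.
# All numbers below are hypothetical, chosen only to show the arithmetic.
batch_size = 32          # sequences per step
seq_len = 2048           # tokens per sequence
steps = 1000             # optimizer steps timed
elapsed_sec = 21.0       # measured wall-clock time for those steps (hypothetical)

tokens = batch_size * seq_len * steps
print(f"{tokens / elapsed_sec:,.0f} toks/sec")   # ~3.1M toks/sec for these numbers
```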