Mark Saroufim (@marksaroufim)'s Twitter Profile
Mark Saroufim

@marksaroufim

@pytorch dev broadly interested in performance https://t.co/6KJ328JUwv

ID:35473191

http://marksaroufim.substack.com
26-04-2009 14:20:43

1.6K Tweets

8.9K Followers

656 Following

Mark Saroufim (@marksaroufim)

llm.cpp was finally published today. It's very much CUDA C++ the good parts.
Code: github.com/gevtushenko/ll…
Talk: youtube.com/watch?v=WiB_3C…
Speakers: twitter.com/g_evtushenko and Jake Hemstad

Kartikay Khandelwal (@kakemeister)

Really excited to officially release torchtune: a PyTorch-native library for easily fine-tuning LLMs!

Code: github.com/pytorch/torcht…
Blog: pytorch.org/blog/torchtune…
Tutorials: pytorch.org/torchtune/stab…

[1/5]

Mark Saroufim (@marksaroufim)

Got a sneak peek, best triton tutorial I've read so far. Grokked the differences between the triton & CUDA programming model. Gentler than official triton docs and goes into advanced topics like swizzling by the end

Tomorrow Saturday April 13 at noon PST discord.gg/cudamode

Mark Saroufim (@marksaroufim)

Weird that we haven't found better naming conventions for quantization algorithms. A name like 'int4' is vague: that's the weight dtype, but it's only applied to some layers or parts of them, accumulation is always in fp32, and the gradient, optimizer, and activation dtypes are all different too.
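
To make the naming problem concrete, here is a minimal pure-Python sketch (not taken from any PyTorch library) of what one 'int4' scheme might look like. Only the weights are stored as 4-bit integers, with one scale per group, while the activations stay in floating point and the accumulation runs in higher precision. The function names, the group size, and the symmetric per-group scheme are all illustrative assumptions, and Python floats stand in for fp32.

```python
def quantize_int4(weights, group_size=4):
    """Symmetric per-group quantization to the signed 4-bit range [-8, 7].

    Hypothetical helper for illustration; returns (int4 values, per-group scales).
    """
    q_values, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to 7; avoid a zero scale.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        q_values.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q_values, scales


def int4_dot(q_values, scales, activations, group_size=4):
    """Dot product with int4 weights, float activations, float accumulation."""
    acc = 0.0  # accumulator stays in floating point (stand-in for fp32)
    for i, (q, a) in enumerate(zip(q_values, activations)):
        weight = q * scales[i // group_size]  # dequantize on the fly
        acc += weight * a
    return acc


weights = [0.7, -0.35, 0.0, 0.175, 1.4, -0.2, 0.9, 0.05]
activations = [1.0, 2.0, -1.0, 0.5, 0.25, -3.0, 1.5, 2.0]
q_values, scales = quantize_int4(weights)
print("int4 weights:", q_values)
print("per-group scales:", scales)
print("approximate dot product:", int4_dot(q_values, scales, activations))
```

Even in this toy version, 'int4' describes only the stored weights: the scales, the activations, and the accumulator each have their own precision, which is the ambiguity the tweet points at.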

Mark Saroufim (@marksaroufim)

If you're looking to influence PyTorch's roadmap for lower precision dtypes, quantization and sparsity algorithms please leave some feedback on github.com/pytorch-labs/a…

This is from the team that brought you the sam-fast and gpt-fast quantization kernels

William Falcon ⚡️ (@_willfalcon)

Highly recommend this video on writing optimized cuda kernels

by Mark Saroufim from the PyTorch team.

Perf checklist:
- coalesced global memory access
- maximize occupancy
- memory or compute bound
- minimize control divergence
... + 4 other items

youtube.com/watch?v=SGhfUh…
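
The "memory or compute bound" item can be checked with back-of-the-envelope arithmetic. The sketch below uses the standard roofline-style comparison of a kernel's arithmetic intensity (FLOPs per byte moved) against the hardware's ratio of peak FLOP/s to peak memory bandwidth. It is not taken from the video, the function names are made up for illustration, and the hardware numbers are hypothetical placeholders rather than the specs of any real GPU.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte read from or written to global memory."""
    return flops / bytes_moved


def classify_kernel(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Roofline-style estimate: compare intensity with the machine balance."""
    machine_balance = peak_flops_per_s / peak_bytes_per_s  # FLOPs per byte
    if arithmetic_intensity(flops, bytes_moved) >= machine_balance:
        return "compute bound"
    return "memory bound"


# Hypothetical hardware numbers, chosen only to make the arithmetic easy.
PEAK_FLOPS_PER_S = 100e12  # assumed 100 TFLOP/s
PEAK_BYTES_PER_S = 2e12    # assumed 2 TB/s

# Elementwise square of n fp32 values: 1 FLOP, 4 bytes read, 4 bytes written.
n = 1 << 20
elementwise = classify_kernel(n, 8 * n, PEAK_FLOPS_PER_S, PEAK_BYTES_PER_S)

# Square fp32 matmul of size m: 2*m^3 FLOPs, ideally 3 matrices of m*m floats moved.
m = 4096
matmul = classify_kernel(2 * m**3, 3 * 4 * m * m, PEAK_FLOPS_PER_S, PEAK_BYTES_PER_S)

print("elementwise square:", elementwise)
print("4096x4096 matmul:", matmul)
```

The estimate assumes ideal data reuse, so a real kernel can land on the other side of the line; a profiler is still needed to confirm which limit applies.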

Mark Saroufim (@marksaroufim)

I've often heard 'I wish PyTorch had more dev internals documentation' when in reality the problem is we have too much. PyTorch is a deep project and it touches on pretty much all aspects of computer science so here are my favorite references

Intro
Christian S. Perone for an overview of…

Mark Saroufim (@marksaroufim)

On the subject of codegen I also wanna plug

from torch.utils.cpp_extension import load_inline

pass it a cuda kernel as a string and it'll generate the right build scripts for you
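
A minimal sketch of that workflow, assuming PyTorch is installed together with a CUDA toolkit and a C++ compiler. The kernel, the extension name `square_ext`, and the helper `build_square_extension` are illustrative choices rather than anything from the tweet, and the kernel assumes a contiguous float32 tensor that already lives on the GPU.

```python
# CUDA source passed to load_inline as a plain string: one elementwise kernel
# plus the C++ wrapper that launches it.
CUDA_SOURCE = r"""
__global__ void square_kernel(const float* x, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = x[i] * x[i];
    }
}

torch::Tensor square(torch::Tensor x) {
    auto out = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    square_kernel<<<blocks, threads>>>(x.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""

# Declaration of the wrapper so the generated binding code can see it.
CPP_SOURCE = "torch::Tensor square(torch::Tensor x);"


def build_square_extension():
    """Compile the strings above and return the loaded Python module."""
    # Imported here so this file can be read or imported without PyTorch.
    from torch.utils.cpp_extension import load_inline

    return load_inline(
        name="square_ext",        # hypothetical extension name
        cpp_sources=CPP_SOURCE,
        cuda_sources=CUDA_SOURCE,
        functions=["square"],     # wrapper functions to expose to Python
        with_cuda=True,
    )


# Usage on a machine with a CUDA-capable GPU:
#   import torch
#   ext = build_square_extension()
#   y = ext.square(torch.randn(1024, device="cuda"))
```

The first call compiles the sources, which can take a while; later calls with unchanged sources normally reuse the cached build.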

Andreas Köpf (@neurosp1ke)

โค๏ธโ€๐Ÿ”ฅCUDA MODE
Lecture 1: How to profile CUDA in PyTorch

Mark Saroufim lays the foundation: How to build & call a cuda kernel from torch, how to profile it.

Today, Jan 13
12:00 PM PST (Bay Area)
9:00 PM CET (Berlin)

Join us here: discord.gg/rTFYjfzp?event…

Ashvini Jindal (@akjindal53244)

🌟 First time at NeurIPS! 🌟

🚀 Excited to announce that our team Upaya (Ankur, pawan rajpoot, Ashvini Jindal) secured first rank 🏆 in NeurIPS LLM Efficiency Challenge: 1 LLM + 1 GPU + 1 Day: llm-efficiency-challenge.github.io organized by…
