SzymonOzog (@szymonozog_)'s Twitter Profile
SzymonOzog

@szymonozog_

Maximizing throughput at @Aleph__Alpha

Educating people about GPUs at youtube.com/@szymonozog7862

ID: 1911051072468312064

12-04-2025 13:37:51

1 Tweet

19 Followers

43 Following

SzymonOzog (@szymonozog_)

Penny is now a working group! If you want to make the world a better place by creating a well-documented, performant and minimalistic AllReduce example, join the GPU MODE Discord server!

Aleksa Gordić (水平问题) (@gordic_aleksa)

New in-depth blog post time: "Inside NVIDIA GPUs: Anatomy of high performance matmul kernels". If you want to deeply understand how one writes state of the art matmul kernels in CUDA read along. (Remember matmul is the single most important operation that transformers execute

SzymonOzog (@szymonozog_)

Played around a bit with oneshot allreduce. Already getting good results on small buffers (80% of NCCL), and it's just a lazy version, so I should be able to optimize this further.
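
For readers unfamiliar with the term, here is a minimal CPU-only sketch of what a one-shot AllReduce computes. It assumes "oneshot" refers to the common scheme in which every rank reads each peer's full input buffer and reduces it locally in a single step, as opposed to a two-step reduce-scatter followed by all-gather. The function name `one_shot_allreduce` and the list-of-lists layout are hypothetical and are not taken from Penny or NCCL.

```python
def one_shot_allreduce(rank_buffers):
    """Simulate a one-shot sum AllReduce in pure Python (no GPU involved).

    rank_buffers[r] is the input buffer owned by rank r. Every rank reads
    the full buffer of every peer and reduces locally, so the result is
    available after one communication step.
    """
    world_size = len(rank_buffers)
    length = len(rank_buffers[0])
    assert all(len(buf) == length for buf in rank_buffers), "buffers must match"

    outputs = []
    for rank in range(world_size):
        # On real hardware this would be one kernel per rank reading peer memory.
        out = [0.0] * length
        for peer in range(world_size):
            src = rank_buffers[peer]
            for i in range(length):
                out[i] += src[i]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    results = one_shot_allreduce([[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]])
    print(results[0])  # every rank ends up with the same summed buffer
```

In this scheme each rank pulls `(world_size - 1) * length` elements from its peers, which is one plausible reason the approach tends to be attractive for small, latency-bound buffers rather than large ones.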

SzymonOzog (@szymonozog_)

Interesting shift in GPU programming is the shift from parallel to parallel + async. 

Ampere was async loads

Hopper was async loads + async wgmma ops

Blackwell doesn't return values to registers but tensor memory

When do we shrink the register file to save chip space?
SzymonOzog (@szymonozog_)

Did some work on speeding up oneshot reduction in Penny, huge gains on small buffers. Time to crack midsize buffers and update the worklog

SzymonOzog (@szymonozog_)

Visualisation of achieved MFU for FP16 matmul across different shapes. You can clearly see the effect of wave quantization and tile quantization, AKA pick your matrix shape wisely.
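
A rough way to reason about the two effects is sketched below. It assumes a simple model in which the output matrix is split into fixed-size tiles and each SM processes one tile at a time. The tile size (128 x 128) and SM count (132) are illustrative assumptions, not values from the tweet, and the function name `matmul_quantization` is hypothetical.

```python
import math


def matmul_quantization(m, n, tile_m=128, tile_n=128, num_sms=132):
    """Estimate tile and wave quantization efficiency for an m x n output.

    Assumed model: one output tile per SM at a time, fixed tile shape.
    """
    tiles_m = math.ceil(m / tile_m)
    tiles_n = math.ceil(n / tile_n)
    num_tiles = tiles_m * tiles_n

    # Tile quantization: edge tiles are padded, so part of their work is wasted.
    tile_eff = (m * n) / (num_tiles * tile_m * tile_n)

    # Wave quantization: tiles run in waves of num_sms; the last wave may
    # leave some SMs idle.
    waves = math.ceil(num_tiles / num_sms)
    wave_eff = num_tiles / (waves * num_sms)

    return {"tiles": num_tiles, "waves": waves,
            "tile_eff": tile_eff, "wave_eff": wave_eff}


if __name__ == "__main__":
    print(matmul_quantization(1408, 1536))  # 11 x 12 tiles fill one wave
    print(matmul_quantization(1408, 1537))  # one extra column adds a second wave
```

Under these assumptions, growing `n` from 1536 to 1537 adds a thirteenth column of mostly padded tiles and pushes the tile count from 132 to 143, so a second, mostly idle wave is needed. That is the kind of cliff the shape sweep would show.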
