SzymonOzog (@szymonozog_) Twitter Tweets • TwiCopy

Penny is now a working group! If you want to make the world a better place by creating a well-documented, performant, and minimalistic AllReduce example, join the GPU MODE Discord server!

Likes: 60 · Replies: 2 · Reposts: 9

New in-depth blog post time: "Inside NVIDIA GPUs: Anatomy of high performance matmul kernels". If you want to deeply understand how one writes state-of-the-art matmul kernels in CUDA, read along. (Remember matmul is the single most important operation that transformers execute

Likes: 2.2K · Replies: 47 · Reposts: 390

SzymonOzog

@szymonozog_

2 months ago

Played around a bit with oneshot allreduce. Already getting good results on small buffers (80% of NCCL), and it's just a lazy version, so I should be able to optimize this further.

Likes: 2 · Replies: 0 · Reposts: 0
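For readers unfamiliar with the term, here is a minimal CPU-side sketch of what "oneshot" allreduce usually means: every rank reads every peer's buffer and reduces locally in a single step, instead of passing partial sums around a ring. This is an illustration only, not Penny's implementation; the function name `oneshot_allreduce` and the list-of-lists layout are assumptions made for the example.

```python
def oneshot_allreduce(buffers):
    """Simulate a one-shot sum allreduce.

    buffers[r] is the local input of rank r (all the same length).
    Returns a list where entry r is the reduced output held by rank r.
    """
    world_size = len(buffers)
    length = len(buffers[0])
    outputs = []
    for rank in range(world_size):
        # Each rank independently reads all peers' buffers (itself included)
        # and accumulates them locally. There is no multi-step exchange.
        out = [0.0] * length
        for peer in range(world_size):
            for i in range(length):
                out[i] += buffers[peer][i]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    ranks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(oneshot_allreduce(ranks))
```

The trade-off this shows is that every rank reads `world_size` full buffers, so traffic grows with the number of ranks. That is tolerable for small, latency-bound buffers and gets expensive for large ones.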

SzymonOzog

@szymonozog_

2 months ago

An interesting shift in GPU programming is the move from parallel to parallel + async. Ampere was async loads. Hopper was async loads + async wgmma ops. Blackwell doesn't return values to registers but to tensor memory. When do we shrink the register file to save chip space?

Likes: 250 · Replies: 9 · Reposts: 14

SzymonOzog

@szymonozog_

2 months ago

Did some work on speeding up oneshot reduction in Penny, with huge gains on small buffers. Time to crack midsize buffers and update the worklog.

Likes: 3 · Replies: 0 · Reposts: 0

SzymonOzog

@szymonozog_

2 months ago

Visualisation of achieved MFU for FP16 matmul across different shapes. You can clearly see the effect of wave quantization and tile quantization, AKA pick your matrix shape wisely.

Likes: 200 · Replies: 2 · Reposts: 15
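The two quantization effects named above can be estimated with simple arithmetic. The sketch below is a rough model, not the method behind the visualisation. It assumes 128x128 output tiles, 132 SMs, and one tile per SM per wave; these values and the function names are illustrative assumptions. Tile quantization is the work wasted when the matrix does not divide evenly into tiles, and wave quantization is the idle capacity when the tile count does not divide evenly into full waves.

```python
import math


def tile_efficiency(m, n, tile_m=128, tile_n=128):
    """Fraction of computed output elements that are actually needed.

    Edge tiles are computed in full even when the matrix only partly
    covers them, so efficiency drops when m or n is not a tile multiple.
    """
    tiles_m = math.ceil(m / tile_m)
    tiles_n = math.ceil(n / tile_n)
    return (m * n) / (tiles_m * tile_m * tiles_n * tile_n)


def wave_efficiency(m, n, tile_m=128, tile_n=128, num_sms=132):
    """Fraction of SM slots doing work, assuming one tile per SM per wave.

    If the tile count is just above a multiple of num_sms, the last wave
    runs nearly empty while still costing a full wave of time.
    """
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    waves = math.ceil(tiles / num_sms)
    return tiles / (waves * num_sms)


if __name__ == "__main__":
    for shape in [(256, 256), (257, 256), (128 * 132, 128), (128 * 133, 128)]:
        print(shape, tile_efficiency(*shape), wave_efficiency(*shape))
```

Under these assumptions, growing one dimension from 256 to 257 drops tile efficiency from 1.0 to about 0.67. Going from 132 to 133 tiles drops wave efficiency from 1.0 to about 0.50, because the extra tile needs a second wave.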

SzymonOzog

@szymonozog_

2 months ago

Fact: 90% of researchers kill their run before the model is about to recover

Likes: 3 · Replies: 0 · Reposts: 0
