Vijay (@__tensorcore__) Twitter Tweets • TwiCopy

Vijay

@__tensorcore__


MLIR, CUTLASS, Tensor Core arch @NVIDIA. Mechanic @hpcgarage. Exercise of any 1st amendment rights are for none other than myself.

ID: 3280272739

Website: https://thakkarv.dev · Joined: 15-07-2015 03:34:51

1.1K Tweets

1.1K Followers

493 Following

Daniel Galvez

@memorypaladin

7 months ago

Most exciting addition in CUDA 12.9 for me is CUDA_LOG_FILE. You can finally get error strings to describe the error you received from a CUDA API call in more detail than a generic CUDA_ERROR_INVALID_VALUE! docs.nvidia.com/cuda/cuda-c-pr…

15 likes · 1 reply · 1 retweet

Vijay

@__tensorcore__

7 months ago

🚨🔥 CUTLASS 4.0 is released 🔥🚨

pip install nvidia-cutlass-dsl

4.0 marks a major shift for CUTLASS: towards native GPU programming in Python

docs.nvidia.com/cutlass/media/…

407 likes · 15 replies · 81 retweets

Vijay

@__tensorcore__

7 months ago

We believe low level access to hardware is extremely important. High level generators rob away the freedom of programmers to experiment with new ideas and kernel designs while C++ is too slow to compile, learn, and debug. CuTe DSL provides the best of both worlds ⚡

27 likes · 2 replies · 2 retweets

Tri Dao

@tri_dao

7 months ago

I love Cutlass, and this new Python DSL looks very well-designed. Will for sure accelerate kernel dev + exploring new ideas in ML + GPU. I'm already playing with it and having fun

225 likes · 4 replies · 25 retweets

Elliot Arledge

@elliotarledge

7 months ago

timelapse #58 (14.5 hrs):
- used cutlass python DSL to increase elementwise add/mul memory throughput (from pytorch 500GB/s to cutlass 850GB/s)
- diving into cutlass 4.0 (minus tile abstractions)
- cuda book design decisions with maharshi (महर्षि)
- restructure of 5 chapters
-

75 likes · 3 replies · 3 retweets

Jinay

@jinaycodes

7 months ago

Introducing soarXiv ✈️, the most beautiful way to explore human knowledge

Take any paper's URL and replace arxiv with soarxiv (show in video) to teleport to its place in the universe

I've embedded all 2.8M papers up until April 2025

Try it at: soarxiv dot org

9.9K likes · 153 replies · 1.1K retweets

Vijay

@__tensorcore__

7 months ago

45 likes · 0 replies · 4 retweets

Vijay

@__tensorcore__

7 months ago

Every GPU kernel writer in shambles

117 likes · 4 replies · 8 retweets

Moon

@moonl88537

6 months ago

did i mention that this is totally nuts?

6.6K likes · 189 replies · 447 retweets

Tri Dao

@tri_dao

6 months ago

We've been thinking about what the "ideal" architecture should look like in the era where inference is driving AI progress. GTA & GLA are steps in this direction: attention variants tailored for inference: high arithmetic intensity (make GPUs go brr even during decoding), easy to

447 likes · 7 replies · 50 retweets

Vijay

@__tensorcore__

6 months ago

Another 🔥 blog about CUTLASS from Colfax International, this time focusing on the gory details of block-scaled MXFP and NVFP data types and Blackwell kernels for them. research.colfax-intl.com/cutlass-tutori…

156 likes · 0 replies · 34 retweets