Vijay (@__tensorcore__) 's Twitter Profile
Vijay

@__tensorcore__

MLIR, CUTLASS, Tensor Core arch @NVIDIA. Mechanic @hpcgarage. Exercise of any 1st amendment rights are for none other than myself.

ID: 3280272739

Link: https://thakkarv.dev
Joined: 15-07-2015 03:34:51

1.1K Tweets

1.1K Followers

493 Following

Daniel Galvez (@memorypaladin) 's Twitter Profile Photo

Most exciting addition in CUDA 12.9 for me is CUDA_LOG_FILE. You can finally get error strings to describe the error you received from a CUDA API call in more detail than a generic CUDA_ERROR_INVALID_VALUE! docs.nvidia.com/cuda/cuda-c-pr…

Vijay (@__tensorcore__) 's Twitter Profile Photo

🚨🔥 CUTLASS 4.0 is released 🔥🚨

pip install nvidia-cutlass-dsl

4.0 marks a major shift for CUTLASS: towards native GPU programming in Python

docs.nvidia.com/cutlass/media/…
Vijay (@__tensorcore__) 's Twitter Profile Photo

We believe low level access to hardware is extremely important. High level generators rob away the freedom of programmers to experiment with new ideas and kernel designs while C++ is too slow to compile, learn, and debug. CuTe DSL provides the best of both worlds ⚡

Tri Dao (@tri_dao) 's Twitter Profile Photo

I love Cutlass, and this new Python DSL looks very well-designed. Will for sure accelerate kernel dev + exploring new ideas in ML + GPU. I'm already playing with it and having fun

Elliot Arledge (@elliotarledge) 's Twitter Profile Photo

timelapse #58 (14.5 hrs): - used cutlass python DSL to increase elementwise add/mul memory throughput (from pytorch 500GB/s to cutlass 850GB/s) - diving into cutlass 4.0 (minus tile abstractions) - cuda book design decisions with maharshi (महर्षि) - restructure of 5 chapters -

Jinay (@jinaycodes) 's Twitter Profile Photo

Introducing soarXiv ✈️, the most beautiful way to explore human knowledge Take any paper's URL and replace arxiv with soarxiv (show in video) to teleport to its place in the universe I've embedded all 2.8M papers up until April 2025 Try it at: soarxiv dot org

Tri Dao (@tri_dao) 's Twitter Profile Photo

We've been thinking about what the "ideal" architecture should look like in the era where inference is driving AI progress. GTA & GLA are steps in this direction: attention variants tailored for inference: high arithmetic intensity (make GPUs go brr even during decoding), easy to

Vijay (@__tensorcore__) 's Twitter Profile Photo

Another 🔥 blog about CUTLASS from Colfax International, this time focusing on the gory details of block-scaled MXFP and NVFP data types and Blackwell kernels for them. research.colfax-intl.com/cutlass-tutori…