Manish Gupta (@bigmannim9) 's Twitter Profile
Manish Gupta

@bigmannim9

Software Engineer, Compiler Lover, Fortune Cookie Writer

ID: 149962387

https://www.linkedin.com/in/mguptaiitr/ · Joined: 30-05-2010 17:51:10

506 Tweets

529 Followers

630 Following

Haicheng Wu (@asdf1234_0) 's Twitter Profile Photo

Big shout out to Manish Gupta from Google, who contributed the int8 x fp16 GEMM on Ampere in the newly released CUTLASS 3.3. CUTLASS 3.3 also allows non-128-bit-aligned GEMMs to use WGMMA on Hopper. linkedin.com/posts/thakkarv…
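
What a mixed-input GEMM computes is easy to state in a few lines; here is a minimal numpy sketch (the shapes and the fp32 accumulation are illustrative assumptions, and the real CUTLASS kernel upconverts the int8 operand inside its MMA pipeline rather than up front):

```python
import numpy as np

# Minimal reference for a mixed-input int8 x fp16 GEMM (illustration only).
M, N, K = 128, 256, 64          # assumed problem size
rng = np.random.default_rng(0)

A = rng.integers(-128, 128, size=(M, K), dtype=np.int8)   # int8 operand
B = rng.standard_normal((K, N)).astype(np.float16)        # fp16 operand

# Upconvert the narrower operand and accumulate in fp32.
C = A.astype(np.float32) @ B.astype(np.float32)
print(C.shape, C.dtype)   # (128, 256) float32
```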

Andrej Karpathy (@karpathy) 's Twitter Profile Photo

🔥llm.c update: Our single file of 2,000 ~clean lines of C/CUDA code now trains GPT-2 (124M) on GPU at speeds ~matching PyTorch (fp32, no flash attention) github.com/karpathy/llm.c… On my A100 I'm seeing 78ms/iter for llm.c and 80ms/iter for PyTorch. Keeping in mind this is fp32,

Horace He (@chhillee) 's Twitter Profile Photo

For too long, users have lived under the software lottery tyranny of fused attention implementations. No longer. Introducing FlexAttention, a new PyTorch API allowing for many attention variants to enjoy fused kernels in a few lines of PyTorch. pytorch.org/blog/flexatten… 1/10
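
A score_mod in this API is just a Python function of the raw attention score and the batch/head/query/key indices; below is a minimal sketch (toy sizes and the CUDA device are assumptions, interface as described in the FlexAttention blog post for PyTorch 2.5+):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Toy sizes; tensors are (batch, heads, seq_len, head_dim).
B, H, S, D = 2, 4, 256, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

# A score_mod receives the raw score plus (batch, head, q_idx, kv_idx)
# and returns a modified score -- here, masking out future positions.
def causal(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

out = flex_attention(q, k, v, score_mod=causal)
# Wrapping it in torch.compile(flex_attention) is what produces the fused kernel.
```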

Haicheng Wu (@asdf1234_0) 's Twitter Profile Photo

CUTLASS 3.7 is tagged. The highlights are from our long-term contributors and friends: Manish Gupta's block-scaling FP8 GEMM and Ali Hassani's distributed GEMM. This is our last Hopper-focused release. More exciting releases will come in 2025. github.com/NVIDIA/cutlass…
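
As a back-of-the-envelope picture of what block scaling means here (a numpy sketch, not the CUTLASS implementation; the e4m3 max of 448 and the 128-wide K blocks are assumptions), each block of an operand carries its own scale factor and the GEMM applies both scales while accumulating:

```python
import numpy as np

FP8_MAX, BLK = 448.0, 128   # e4m3 max normal value; assumed K-block granularity

def quantize_blocks(X):
    """One scale per row per K block; rounded integers crudely stand in for e4m3 casting."""
    M, K = X.shape
    Xb = X.reshape(M, K // BLK, BLK)
    scales = np.maximum(np.abs(Xb).max(axis=2), 1e-12) / FP8_MAX
    return np.round(Xb / scales[..., None]), scales

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256)).astype(np.float32)
B = rng.standard_normal((256, 32)).astype(np.float32)

Aq, As = quantize_blocks(A)
Bq, Bs = quantize_blocks(B.T)   # scale B along its K dimension as well

# Reference: accumulate block by block, applying both per-block scales;
# a fused kernel performs the same rescaling inside its mainloop.
C = np.zeros((64, 32), dtype=np.float32)
for j in range(256 // BLK):
    C += (Aq[:, j] * As[:, [j]]) @ (Bq[:, j] * Bs[:, [j]]).T
print(np.abs(C - A @ B).max())   # small quantization error
```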

Vijay (@__tensorcore__) 's Twitter Profile Photo

CUDA 12.8 just dropped with Blackwell support. TensorCore 5th Generation Family Instructions: docs.nvidia.com/cuda/parallel-…
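
For orientation: Blackwell GPUs report compute capability 10.x (the data-center parts that carry the 5th-generation Tensor Cores) or 12.x (GeForce). A quick way to see what the local GPU reports (assumes a CUDA-enabled PyTorch install):

```python
import torch

# Prints e.g. "compute capability 10.0 (sm_100)" on a data-center Blackwell GPU.
major, minor = torch.cuda.get_device_capability()
print(f"compute capability {major}.{minor} (sm_{major}{minor})")
```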

Andrew Kerr (@arkerr) 's Twitter Profile Photo

CUTLASS 3.8 is out with full support for optimal Blackwell matrix computations and 5th-generation Tensor Cores. Update your builds to use new numeric types, ample support for fused kernels, and CuTe enhancements for the Blackwell architecture. github.com/nvidia/cutlass

Haicheng Wu (@asdf1234_0) 's Twitter Profile Photo

CUTLASS is at the center of the CUDA Blackwell release blog. As always, we work hand in hand with the CUDA team to deliver the next level of performance. developer.nvidia.com/blog/cuda-tool…

Hieu Pham (@hyhieu226) 's Twitter Profile Photo

Among many great things about the Blackwell chip, having CUTLASS as its central programming model is my favorite. That framework is so good it makes programming low-level kernels enjoyable.

Manish Gupta (@bigmannim9) 's Twitter Profile Photo

Loved the talk, the pace, and the poker example explaining learning (training-time) vs. thinking (test-time). Great talk by Noam Brown youtu.be/MG9oqntiJKg?si…

Manish Gupta (@bigmannim9) 's Twitter Profile Photo

It is all unfolding now with the #Grok3 release. My prediction is that Elon Musk is going to make an offer again to buy OpenAI this week for $48.7b, 6 months later for $24.35b, and a year later for $12.17b. If they accept the offer he will fire the entire team.

Chief Nerd (@thechiefnerd) 's Twitter Profile Photo

CHAMATH: “The quality of Grok 3 relative to the amount of time that they've spent on this problem is to me, what's staggering”

Andrej Karpathy (@karpathy) 's Twitter Profile Photo

Zvi Mowshowitz My reaction is that there is an evaluation crisis. I don't really know what metrics to look at right now. MMLU was good and useful for a few years, but that's long over. SWE-Bench Verified (real, practical, verified problems) I really like, and it's great, but it's too narrow on its own.

Vijay (@__tensorcore__) 's Twitter Profile Photo

pip install nvidia-cutlass-dsl 👀 CUTLASS 4.0 is on the horizon, and the future is all Pythonic! Come talk to us at GTC to learn more and attend our two talks.

Mehdi Amini (@jokereph) 's Twitter Profile Photo

Almost 2 years at Nvidia, and the Tile IR project has been a very large part of my time here! So happy to see it finally coming to light. The CUDA GPU driver will now include an #MLIR-based JIT compiler! :) More MLIR-based announcements at GTC tomorrow in the CUTLASS 4.0 session!

Vijay (@__tensorcore__) 's Twitter Profile Photo

🚨🔥 CUTLASS 4.0 is released 🔥🚨 pip install nvidia-cutlass-dsl 4.0 marks a major shift for CUTLASS: towards native GPU programming in Python. docs.nvidia.com/cutlass/media/…
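
The linked docs open with a Python-native hello world; roughly, it looks like the sketch below (the cutlass.cute module path, the @cute.jit decorator, and cute.printf are recalled from the CUTLASS 4.0 documentation and are worth verifying against the link above):

```python
# Rough sketch of the CuTe DSL hello world (module names and decorators
# recalled from the CUTLASS 4.0 docs -- verify against the linked page).
import cutlass
import cutlass.cute as cute

@cute.jit
def hello_world():
    cute.printf("Hello from the CUTLASS Python DSL")

hello_world()   # JIT-compiles and executes the function
```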
