Manish Gupta (@bigmannim9) 's Twitter Profile
Manish Gupta

@bigmannim9

Software Engineer, Compiler Lover, Fortune Cookie Writer

ID: 149962387

https://www.linkedin.com/in/mguptaiitr/ · Joined: 30-05-2010 17:51:10

506 Tweets

529 Followers

630 Following

Haicheng Wu (@asdf1234_0) 's Twitter Profile Photo

Big shout out to Manish Gupta from Google, who contributed the int8 x fp16 GEMM on Ampere in the newly released CUTLASS 3.3. CUTLASS 3.3 also allows non-128-bit-aligned GEMMs to use WGMMA on Hopper. linkedin.com/posts/thakkarv…
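
What a mixed-input GEMM computes is easy to state in a few lines; here is a minimal numpy sketch (the shapes and the fp32 accumulation are illustrative assumptions, and the real CUTLASS kernel upconverts the int8 operand inside its MMA pipeline rather than up front):

```python
import numpy as np

# Minimal reference for a mixed-input int8 x fp16 GEMM (illustration only).
M, N, K = 128, 256, 64          # assumed problem size
rng = np.random.default_rng(0)

A = rng.integers(-128, 128, size=(M, K), dtype=np.int8)   # int8 operand
B = rng.standard_normal((K, N)).astype(np.float16)        # fp16 operand

# Upconvert the narrower operand and accumulate in fp32.
C = A.astype(np.float32) @ B.astype(np.float32)
print(C.shape, C.dtype)   # (128, 256) float32
```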

Andrej Karpathy (@karpathy) 's Twitter Profile Photo

🔥llm.c update: Our single file of 2,000 ~clean lines of C/CUDA code now trains GPT-2 (124M) on GPU at speeds ~matching PyTorch (fp32, no flash attention) github.com/karpathy/llm.c… On my A100 I'm seeing 78ms/iter for llm.c and 80ms/iter for PyTorch. Keeping in mind this is fp32,

Horace He (@chhillee) 's Twitter Profile Photo

For too long, users have lived under the software lottery tyranny of fused attention implementations. No longer. Introducing FlexAttention, a new PyTorch API allowing for many attention variants to enjoy fused kernels in a few lines of PyTorch. pytorch.org/blog/flexatten… 1/10
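
A score_mod in this API is just a Python function of the raw attention score and the batch/head/query/key indices; below is a minimal sketch (toy sizes and the CUDA device are assumptions, interface as described in the FlexAttention blog post for PyTorch 2.5+):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Toy sizes; tensors are (batch, heads, seq_len, head_dim).
B, H, S, D = 2, 4, 256, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

# A score_mod receives the raw score plus (batch, head, q_idx, kv_idx)
# and returns a modified score -- here, masking out future positions.
def causal(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

out = flex_attention(q, k, v, score_mod=causal)
# Wrapping it in torch.compile(flex_attention) is what produces the fused kernel.
```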

Haicheng Wu (@asdf1234_0) 's Twitter Profile Photo

CUTLASS 3.7 is tagged. The highlights are from our long-term contributors and friends: Manish Gupta's block-scaling FP8 GEMM and Ali Hassani's distributed GEMM. This is our last Hopper-focused release. More exciting releases will come in 2025. github.com/NVIDIA/cutlass…
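
As a back-of-the-envelope picture of what block scaling means here (a numpy sketch, not the CUTLASS implementation; the e4m3 max of 448 and the 128-wide K blocks are assumptions), each block of an operand carries its own scale factor and the GEMM applies both scales while accumulating:

```python
import numpy as np

FP8_MAX, BLK = 448.0, 128   # e4m3 max normal value; assumed K-block granularity

def quantize_blocks(X):
    """One scale per row per K block; rounded integers crudely stand in for e4m3 casting."""
    M, K = X.shape
    Xb = X.reshape(M, K // BLK, BLK)
    scales = np.maximum(np.abs(Xb).max(axis=2), 1e-12) / FP8_MAX
    return np.round(Xb / scales[..., None]), scales

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256)).astype(np.float32)
B = rng.standard_normal((256, 32)).astype(np.float32)

Aq, As = quantize_blocks(A)
Bq, Bs = quantize_blocks(B.T)   # scale B along its K dimension as well

# Reference: accumulate block by block, applying both per-block scales;
# a fused kernel performs the same rescaling inside its mainloop.
C = np.zeros((64, 32), dtype=np.float32)
for j in range(256 // BLK):
    C += (Aq[:, j] * As[:, [j]]) @ (Bq[:, j] * Bs[:, [j]]).T
print(np.abs(C - A @ B).max())   # small quantization error
```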

Vijay (@__tensorcore__) 's Twitter Profile Photo

CUDA 12.8 just dropped with Blackwell support. TensorCore 5th Generation Family Instructions: docs.nvidia.com/cuda/parallel-…
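
For orientation: Blackwell GPUs report compute capability 10.x (the data-center parts that carry the 5th-generation Tensor Cores) or 12.x (GeForce). A quick way to see what the local GPU reports (assumes a CUDA-enabled PyTorch install):

```python
import torch

# Prints e.g. "compute capability 10.0 (sm_100)" on a data-center Blackwell GPU.
major, minor = torch.cuda.get_device_capability()
print(f"compute capability {major}.{minor} (sm_{major}{minor})")
```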

Andrew Kerr (@arkerr) 's Twitter Profile Photo

CUTLASS 3.8 is out with full support for optimal Blackwell matrix computations and 5th-generation Tensor Cores. Update your builds to use new numeric types, ample support for fused kernels, and CuTe enhancements for the Blackwell architecture. github.com/nvidia/cutlass

Haicheng Wu (@asdf1234_0) 's Twitter Profile Photo

CUTLASS is at the center of the CUDA Blackwell release blog. As always, we work hand in hand with the CUDA team to deliver the next level of performance. developer.nvidia.com/blog/cuda-tool…

Hieu Pham (@hyhieu226) 's Twitter Profile Photo

Among many great things about the Blackwell chip, having CUTLASS as its central programming model is my favorite. That framework is so good it makes programming low-level kernels enjoyable.

Manish Gupta (@bigmannim9) 's Twitter Profile Photo

Loved the talk, the pace, and the poker example explaining learning (training-time) vs. thinking (test-time). Great talk by Noam Brown youtu.be/MG9oqntiJKg?si…

Manish Gupta (@bigmannim9) 's Twitter Profile Photo

It is all unfolding now with the #Grok3 release. My prediction is that Elon Musk is going to make an offer again to buy OpenAI this week for $48.7b, 6 months later for $24.35b, and a year later for $12.17b. If they accept the offer he will fire the entire team.

Chief Nerd (@thechiefnerd) 's Twitter Profile Photo

CHAMATH: “The quality of Grok 3 relative to the amount of time that they've spent on this problem is to me, what's staggering”

Andrej Karpathy (@karpathy) 's Twitter Profile Photo

Zvi Mowshowitz My reaction is that there is an evaluation crisis. I don't really know what metrics to look at right now. MMLU was good and useful for a few years, but that's long over. SWE-Bench Verified (real, practical, verified problems) I really like, and it's great, but it's too narrow on its own.

Vijay (@__tensorcore__) 's Twitter Profile Photo

pip install nvidia-cutlass-dsl 👀 CUTLASS 4.0 is on the horizon, and the future is all Pythonic! Come talk to us at GTC to learn more and attend our two talks.

Mehdi Amini (@jokereph) 's Twitter Profile Photo

Almost 2 years at Nvidia, and the Tile IR project has been a very large part of my time here! So happy to see it finally coming to light. The CUDA GPU driver will now include an #MLIR-based JIT compiler! :) More MLIR-based announcements at GTC tomorrow in the CUTLASS 4.0 session!

Vijay (@__tensorcore__) 's Twitter Profile Photo

🚨🔥 CUTLASS 4.0 is released 🔥🚨 pip install nvidia-cutlass-dsl 4.0 marks a major shift for CUTLASS: towards native GPU programming in Python. docs.nvidia.com/cutlass/media/…
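
The linked docs open with a Python-native hello world; roughly, it looks like the sketch below (the cutlass.cute module path, the @cute.jit decorator, and cute.printf are recalled from the CUTLASS 4.0 documentation and are worth verifying against the link above):

```python
# Rough sketch of the CuTe DSL hello world (module names and decorators
# recalled from the CUTLASS 4.0 docs -- verify against the linked page).
import cutlass
import cutlass.cute as cute

@cute.jit
def hello_world():
    cute.printf("Hello from the CUTLASS Python DSL")

hello_world()   # JIT-compiles and executes the function
```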
