Pradeep Ramani (@_prrama) 's Twitter Profile
Pradeep Ramani

@_prrama

15 Trillion Human Cells + 100 Trillion Bacterial cells + 1 consciousness.
Opinions are my own.

Sr. Architect @NVIDIA | CUTLASS | CUDA | GPGPU

ID: 470874233

Link: https://www.linkedin.com/in/pradeep-ramani/
Joined: 22-01-2012 07:34:45

91 Tweets

209 Followers

153 Following

Soumith Chintala (@soumithchintala) 's Twitter Profile Photo

I have taken off today and yesterday at work, because I am not able to focus. I can't imagine having a paper deadline right now, and I can't imagine the personal stress my black friends are in. NeurIPS Conference consider extending the deadline, even if it's selectively done.

Pradeep Ramani (@_prrama) 's Twitter Profile Photo

Trying to book evacuation flights via Air India is probably the worst experience one can ever have dealing with any business! If you are incapable of providing ANY level of service, don't do it! Zero leadership, Zero Service, Zero transparency! #AirIndiaSucks

Pradeep Ramani (@_prrama) 's Twitter Profile Photo

People are already so stressed out, stranded in the US with no visa and no medical insurance, and booking evacuation flights via @airindiain is a nightmare! No clarity, horrible customer service, dead website links and phone numbers! FIX IT! PMO India @airindiain #AllowPvt

Sundar Pichai (@sundarpichai) 's Twitter Profile Photo

Immigration has contributed immensely to America’s economic success, making it a global leader in tech, and also Google the company it is today. Disappointed by today’s proclamation - we’ll continue to stand with immigrants and work to expand opportunity for all.

Andrea Ventura (@aventura71) 's Twitter Profile Photo

A very sad day for US science and innovation. We will pay a hefty price for this demagogic insanity. 90% of my lab, myself included, is made of immigrants.

Andrew Ng (@andrewyng) 's Twitter Profile Photo

New U.S. Immigration and Customs Enforcement policy regarding F-1 visa international students is horrible & will hurt the US, students, and universities. It pushes universities to offer in-person classes even when doing so is unsafe or has no pedagogical benefit, and pushes students to leave the US amid the pandemic and risk being unable to return.

PyTorch (@pytorch) 's Twitter Profile Photo

v1.6: native mixed-precision support from NVIDIA (~2x perf improvement), distributed perf improvements, a new profiling tool for memory consumption, and Microsoft committing to develop and maintain Windows PyTorch. Release Notes: github.com/pytorch/pytorc… Blog: pytorch.org/blog/pytorch-1…
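For context, a minimal sketch of the torch.cuda.amp API that shipped in v1.6; the tiny model, optimizer, and random data below are placeholders for illustration, not PyTorch's own example.

```python
# Minimal sketch of PyTorch 1.6's native mixed precision (torch.cuda.amp).
# The linear model, SGD optimizer, and random tensors are placeholders.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # scales the loss so FP16 gradients don't underflow

for _ in range(10):
    x = torch.randn(64, 1024, device="cuda")
    target = torch.randn(64, 1024, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # selected ops run in FP16, the rest stay FP32
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()     # backward pass on the scaled loss
    scaler.step(optimizer)            # unscales gradients, then steps the optimizer
    scaler.update()                   # adjusts the loss scale for the next iteration
```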

Greg Siskind (@gsiskind) 's Twitter Profile Photo

I'm part of the pro bono litigation effort planning to quickly file a lawsuit challenging the onerous DOL wage rule impacting H-1Bs and PERMs. We need employers, employees, and membership organizations to volunteer as plaintiffs. If interested, go to docs.google.com/forms/d/e/1FAI….

Jason Turner (@lefticus) 's Twitter Profile Photo

Find Carbon interesting? Want a modern approach to language design? WITH a compiler you can play with today? AND is prioritizing safety? AND has C++ interop? WHY haven't you looked at github.com/SerenityOS/jakt from @jntrnr and Andreas Kling ?

Dylan Patel ✈️ ICLR (@dylan522p) 's Twitter Profile Photo

If you work in AI this is the highest alpha channel out there. What are you doing anon? Binge these videos now. youtube.com/@cudamode?si=M…

Tri Dao (@tri_dao) 's Twitter Profile Photo

FlashAttention is widely used to accelerate Transformers, already making attention 4-8x faster, but has yet to take advantage of modern GPUs. We’re releasing FlashAttention-3: 1.5-2x faster on FP16, up to 740 TFLOPS on H100 (75% util), and FP8 gets close to 1.2 PFLOPS! 1/
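For context, a hedged sketch of calling a FlashAttention fused kernel from PyTorch through the flash_attn package's FA-2 style flash_attn_func entry point; the FlashAttention-3 release may expose a similar but separate interface, and the shapes, dtype, and causal flag below are illustrative assumptions.

```python
# Hedged sketch: fused attention via the flash_attn package (FA-2 style API).
# FlashAttention-3 may ship its own entry point; shapes and dtype are illustrative.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 4096, 16, 128
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Computes softmax(q @ k^T / sqrt(headdim)) @ v without materializing the
# full seqlen x seqlen attention matrix.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (batch, seqlen, nheads, headdim)
```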

Haicheng Wu (@asdf1234_0) 's Twitter Profile Photo

CUTLASS reached 5K stars this summer with 3.5M downloads per month. Thank you for your support! github.com/NVIDIA/cutlass/

Vijay (@__tensorcore__) 's Twitter Profile Photo

🔥🚨 CUTLASS Blackwell is here 🚨🔥 The 3.8 release is loaded with support for new Blackwell features, even an attention kernel 👀 Go check it out here: github.com/nvidia/cutlass Can't wait to see what y'all end up cooking with this over the next few months and years 💚

Haicheng Wu (@asdf1234_0) 's Twitter Profile Photo

CUTLASS is at the center of the CUDA Blackwell release blog. As always, we work hand in hand with the CUDA team to deliver the next level of performance. developer.nvidia.com/blog/cuda-tool…

Vijay (@__tensorcore__) 's Twitter Profile Photo

🚨🔥 CUTLASS 4.0 is released 🔥🚨 pip install nvidia-cutlass-dsl. 4.0 marks a major shift for CUTLASS: towards native GPU programming in Python. docs.nvidia.com/cutlass/media/…
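To make the "native GPU programming in Python" claim concrete, here is a hedged hello-world sketch in the CuTe DSL, reconstructed from memory of the CUTLASS 4.0 docs; treat the decorator and launch names (cute.kernel, cute.jit, kernel().launch, cutlass.cuda.initialize_cuda_context) as assumptions and check the linked docs for the authoritative example.

```python
# Hedged sketch of a CuTe DSL "hello world" after `pip install nvidia-cutlass-dsl`.
# API names are reconstructed from memory of the CUTLASS 4.0 docs; verify against
# docs.nvidia.com/cutlass before relying on them.
import cutlass
import cutlass.cute as cute

@cute.kernel
def kernel():
    tidx, _, _ = cute.arch.thread_idx()  # per-thread index, CUDA-style
    if tidx == 0:
        cute.printf("Hello from a GPU thread")

@cute.jit
def hello_world():
    cute.printf("Hello from the host-side JIT")
    kernel().launch(grid=(1, 1, 1), block=(32, 1, 1))  # one block, one warp

if __name__ == "__main__":
    cutlass.cuda.initialize_cuda_context()
    hello_world()
```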

NVIDIA HPC Developer (@nvidiahpcdev) 's Twitter Profile Photo

🎉CUTLASS 4.0 is here, bringing native #Python support for device-side kernel design, for ops like GEMM, Flash Attention, and more, powered by the new CuTe DSL. For the first time, you can write high-performance GPU kernels in Python with the same abstractions, APIs, and…

Wentao Guo (@wentaoguo7) 's Twitter Profile Photo

🦆🚀QuACK🦆🚀: a new SOL mem-bound kernel library without a single line of CUDA C++, all straight in Python thanks to the CuTe DSL. On H100 with 3 TB/s, it performs 33%-50% faster than highly optimized libraries like PyTorch's torch.compile and Liger. 🤯 With Ted Zadouri and Tri Dao
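For context on "SOL mem-bound" (speed-of-light, i.e. limited by the ~3 TB/s of HBM bandwidth on H100 rather than by compute): below is a hedged sketch of the kind of bandwidth-bound op and torch.compile baseline such kernels are benchmarked against. This is not QuACK's own API; the RMSNorm here is just an illustrative memory-bound workload.

```python
# Hedged sketch of a memory-bound op (RMSNorm) under torch.compile, i.e. the kind
# of baseline QuACK-style CuTe DSL kernels are compared against. Not QuACK's API.
import torch

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Reads x once for the mean-of-squares reduction and once for scaling:
    # on H100 this is bandwidth-bound, not compute-bound.
    rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x * rms * weight

compiled_rmsnorm = torch.compile(rmsnorm)

x = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
w = torch.ones(8192, device="cuda", dtype=torch.bfloat16)
out = compiled_rmsnorm(x, w)
print(out.shape)  # torch.Size([8192, 8192])
```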
