Colfax International (@colfaxintl)'s Twitter Profile
Colfax International

@colfaxintl

HPC & AI Solutions (colfax-intl.com) | Research (research.colfax-intl.com) | Colfax Experience Center (experience.colfax-intl.com)

ID: 21418950

Link: https://colfax-intl.com | Joined: 20-02-2009 18:17:08

650 Tweets

964 Followers

8 Following

PyTorch (@pytorch):

Introducing FlashAttention-3 🚀 Fast and Accurate Attention with Asynchrony and Low-precision.

Thank you to Colfax International, AI at Meta, NVIDIA AI and Together AI for the collaboration here 🙌

Read more in our blog: hubs.la/Q02Gf4hf0
Vijay (@__tensorcore__):

FlashAttention-3 is released! Over the last few months, I got the opportunity to collaborate on this amazing effort to implement FA-2 from scratch for H100 with Tri Dao, Colfax International, Meta, and Pradeep Ramani. We can get up to 760 TFLOP/s on head dim 128 forward pass!

Tri Dao (@tri_dao):

FlashAttention is widely used to accelerate Transformers, already making attention 4-8x faster, but has yet to take advantage of modern GPUs. We’re releasing FlashAttention-3: 1.5-2x faster on FP16, up to 740 TFLOPS on H100 (75% util), and FP8 gets close to 1.2 PFLOPS! 1/

Tri Dao (@tri_dao):

This project is a collab with Jay Shah & Ganesh Bikshandi (Colfax International), Ying Zhang (@meta), @DROP_ALL_TABLES and Pradeep Ramani (NVIDIA). Huge thanks to the CUTLASS team, cuDNN team, Together AI, and Princeton PLI for their support! 9/9

Together AI (@togethercompute):

We are thrilled to release FlashAttention-3 in partnership with Meta, NVIDIA, Princeton University, and Colfax International. The improvements from FlashAttention-3 will result in:
• More efficient GPU Utilization - up to 75% from 35%.
• Better performance with lower precision while…

Colfax International (@colfaxintl):

Free NVIDIA® H100 NVL Test Drive
Supercharge #LLM Inference

#Colfax is offering a FREE test drive that provides you remote access to a Colfax server with 4 #NVIDIA #H100 NVL Tensor Core #GPUs.

👉 Learn more: experience.colfax-intl.com/project/nvidia…

#AI
Hieu Pham (@hyhieu226):

📚🧑‍🎓New tutorial on WGMMA (WarpGroup Matrix Multiplication and Accumulation) research.colfax-intl.com/cutlass-tutori…

If you have run PyTorch, Jax, or FlashAttention-3 on an H100 GPU, you have used WGMMA.

Arguably the most important primitive in the H100's Hopper architecture, WGMMA is the…
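
For orientation, here is a plain-CUDA sketch of the unit of work one WGMMA instruction covers (an illustration only, not real WGMMA usage, which goes through wgmma PTX or CUTLASS/CuTe atoms): all 128 threads of a warpgroup, i.e. four warps, cooperate on a single matrix multiply-accumulate tile such as m64n64k16, and the hardware instruction performs it asynchronously on the tensor cores, typically reading its operands from shared memory.

    // Illustration only, NOT real WGMMA: the 128 threads of one warpgroup jointly
    // compute C[64][64] += A[64][16] * B[16][64], the tile shape covered by a
    // single m64n64k16 wgmma instruction. The real instruction runs on tensor
    // cores, asynchronously, with operands described by shared-memory descriptors.
    __device__ void warpgroup_mma_reference(const float (&A)[64][16],
                                            const float (&B)[16][64],
                                            float (&C)[64][64]) {
      int lane = threadIdx.x % 128;          // thread index within the warpgroup
      // 64*64 = 4096 accumulators split across 128 threads: 32 per thread.
      for (int idx = lane; idx < 64 * 64; idx += 128) {
        int m = idx / 64, n = idx % 64;
        float acc = C[m][n];
        for (int k = 0; k < 16; ++k)
          acc += A[m][k] * B[k][n];          // the k = 16 reduction of one MMA step
        C[m][n] = acc;
      }
    }
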
Colfax International (@colfaxintl):

Introducing Colfax Access Hub: Securely validate and apply custom configurations to your systems, install specialized OS and SW, and much more — all before they are shipped to you.

The service is free for all Colfax customers.

experience.colfax-intl.com/access-hub/
Hieu Pham (@hyhieu226):

Replying to Cosmin Negruseri and Tri Dao: I am still learning it. Just like nobody knows all C++, nobody knows CUDA. For resources, I highly recommend the series of tutorials using CUTLASS and CuTe by Colfax International. I am obviously not biased 🤣

Hieu Pham (@hyhieu226):

Check out our newest CUDA tutorial: research.colfax-intl.com/cutlass-tutori… The topic is software pipelining: overlap memory copies with compute to hide latency. We present the concept in the context of GEMM (matmul), but it's applicable everywhere, e.g., Flash Attention 2 and 3. Conceptually…
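
A minimal plain-CUDA sketch of the idea (my own illustration, not code from the tutorial, which builds its pipeline with CUTLASS/CuTe and Hopper's asynchronous copy hardware): the kernel double-buffers its shared-memory tiles, prefetching the next K-tile into registers while the current one is being consumed, so global-memory latency is hidden behind compute. Assumes row-major float matrices, blockDim = (32, 32), grid = (N/32, M/32), and M, N, K divisible by the tile size.

    #define TILE 32  // tile size; assumes blockDim == dim3(TILE, TILE)

    __global__ void gemm_pipelined(const float* A, const float* B, float* C,
                                   int M, int N, int K) {
      // Two shared-memory buffers per operand: compute on one buffer while the
      // next K-tile is staged into the other (software pipelining / double buffering).
      __shared__ float As[2][TILE][TILE];
      __shared__ float Bs[2][TILE][TILE];

      int row = blockIdx.y * TILE + threadIdx.y;   // C row owned by this thread
      int col = blockIdx.x * TILE + threadIdx.x;   // C column owned by this thread
      int numTiles = K / TILE;                     // assumes K % TILE == 0
      float acc = 0.0f;

      // Prologue: load K-tile 0 into buffer 0.
      As[0][threadIdx.y][threadIdx.x] = A[row * K + threadIdx.x];
      Bs[0][threadIdx.y][threadIdx.x] = B[threadIdx.y * N + col];
      __syncthreads();

      // Main loop over K-tiles.
      for (int kt = 0; kt < numTiles; ++kt) {
        int cur = kt & 1, nxt = cur ^ 1;

        // Prefetch the NEXT K-tile from global memory into registers; these loads
        // are issued early so they overlap with the compute below.
        float a_next = 0.0f, b_next = 0.0f;
        if (kt + 1 < numTiles) {
          a_next = A[row * K + (kt + 1) * TILE + threadIdx.x];
          b_next = B[((kt + 1) * TILE + threadIdx.y) * N + col];
        }

        // Consume the CURRENT K-tile from shared memory.
        for (int k = 0; k < TILE; ++k)
          acc += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];

        // Stage the prefetched tile for the next iteration, then synchronize once.
        if (kt + 1 < numTiles) {
          As[nxt][threadIdx.y][threadIdx.x] = a_next;
          Bs[nxt][threadIdx.y][threadIdx.x] = b_next;
        }
        __syncthreads();
      }

      // Epilogue: write the accumulator to global memory.
      C[row * N + col] = acc;
    }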

Hieu Pham (@hyhieu226):

research.colfax-intl.com/epilogue_visit… A chess game typically has 3 phases: opening, middle game, and endgame. A GEMM (matmul) kernel typically has a prologue, a main loop, and an epilogue. Just like chess grandmasters must master endgames, GEMM programmers must master epilogue techniques to write…
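
To make the analogy concrete with a small, hedged example (illustrative names only, not code from the post): in the tiled GEMM sketched after the previous tweet, the first shared-memory loads are the prologue, the loop over K-tiles is the main loop, and the single global store at the end is the epilogue. A fused epilogue transforms the accumulator in registers before that store, for instance D = relu(alpha*acc + beta*C + bias):

    // Hypothetical fused epilogue for the tiled GEMM sketched above: after the
    // main loop leaves the dot product in `acc`, apply scaling, a bias, and an
    // activation in registers so the result is written to global memory only once.
    __device__ __forceinline__
    float fused_epilogue(float acc, float c_in, float bias, float alpha, float beta) {
      float d = alpha * acc + beta * c_in + bias;   // classic epilogue: D = alpha*A*B + beta*C (+ bias)
      return fmaxf(d, 0.0f);                        // fused ReLU, saving a separate elementwise kernel
    }

    // Inside the kernel, the plain store `C[row * N + col] = acc;` becomes:
    //   D[row * N + col] = fused_epilogue(acc, C[row * N + col], bias[col], alpha, beta);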

Colfax International (@colfaxintl):

In this GPU MODE lecture, Jay Shah, Research Scientist at Colfax International, presents his joint work on FlashAttention-3 and how to implement the main compute loop in the algorithm using CUTLASS. research.colfax-intl.com/gpu-mode-cutla…

Colfax International (@colfaxintl):

In this blog post, Jay Shah, Research Scientist at Colfax International, collaborated with Character.AI to explain two techniques (INT8 Quantization and Query Head Packing for MQA/GQA) that are important for using FlashAttention-3 for inference: research.character.ai/optimizing-ai-…

Colfax International (@colfaxintl):

CUTLASS Tutorial: Persistent Kernels and Stream-K

Final part of our three-part series on writing optimized GEMM kernels for NVIDIA GPUs using CUTLASS library abstractions. research.colfax-intl.com/cutlass-tutori…

Colfax International (@colfaxintl):

Colfax now offers NVIDIA Blackwell-based servers

👉 8U/10U servers
• NVIDIA HGX™ B200 8-GPU baseboard
• 2x AMD EPYC™ 9004/9005 OR 2x 4th/5th Gen Intel® Xeon® Scalable OR 2x Intel® Xeon® 6900 series

Learn more: colfax-intl.com/ServerList.asp…
Colfax International (@colfaxintl):

The DeepSeek technical reports contain a wealth of information on performance optimization techniques for NVIDIA GPUs. In this short blog, we explain two aspects of their FP8 mixed-precision training methodology that build on the techniques we've been teaching in our earlier…
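
As one hedged illustration of the flavor of technique involved, here is a simplified host-side sketch of fine-grained, per-block scaling for FP8 data (my own example, not code from the blog or from DeepSeek): each 128x128 block of a weight matrix gets its own scale, chosen so the block's absolute maximum maps to the largest FP8 e4m3 value (448). The scaled values are kept in float below; a real implementation would cast them to an FP8 type and carry the per-block scales into the GEMM for dequantization.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    constexpr int   kBlock      = 128;     // block size for per-block scaling
    constexpr float kFp8E4M3Max = 448.0f;  // largest finite e4m3 value

    // Quantize a rows x cols row-major matrix with one scale per 128x128 block.
    // Outputs: Wq (scaled values, float stand-ins for FP8) and the per-block scales.
    void quantize_block_scaled(const std::vector<float>& W, int rows, int cols,
                               std::vector<float>& Wq, std::vector<float>& scales) {
      int brows = (rows + kBlock - 1) / kBlock;
      int bcols = (cols + kBlock - 1) / kBlock;
      Wq.assign(static_cast<size_t>(rows) * cols, 0.0f);
      scales.assign(static_cast<size_t>(brows) * bcols, 1.0f);

      for (int br = 0; br < brows; ++br) {
        for (int bc = 0; bc < bcols; ++bc) {
          int r0 = br * kBlock, r1 = std::min(rows, r0 + kBlock);
          int c0 = bc * kBlock, c1 = std::min(cols, c0 + kBlock);

          // 1) Absolute maximum of this block.
          float amax = 0.0f;
          for (int r = r0; r < r1; ++r)
            for (int c = c0; c < c1; ++c)
              amax = std::max(amax, std::fabs(W[static_cast<size_t>(r) * cols + c]));

          // 2) One scale per block, so amax maps onto the e4m3 maximum.
          float scale = (amax > 0.0f) ? amax / kFp8E4M3Max : 1.0f;
          scales[static_cast<size_t>(br) * bcols + bc] = scale;

          // 3) Scale the block; a real kernel would now cast to FP8 (the cast does
          //    the e4m3 rounding) and keep `scale` for dequantization in the GEMM.
          for (int r = r0; r < r1; ++r)
            for (int c = c0; c < c1; ++c)
              Wq[static_cast<size_t>(r) * cols + c] =
                  W[static_cast<size_t>(r) * cols + c] / scale;
        }
      }
    }
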
Vijay (@__tensorcore__):

Replying to xjdr: Sure! Colfax just dropped a banger today: research.colfax-intl.com/cutlass-tutori… We have a series of really nice incremental tutorials: github.com/NVIDIA/cutlass… The two GTC talks, as I mentioned, are the richest source out there on Blackwell MMAs. The TK blog is nice too.