Colfax International (@colfaxintl)'s Twitter Profile
Colfax International

@colfaxintl

HPC & AI Solutions (colfax-intl.com) | Research (research.colfax-intl.com) | Colfax Experience Center (experience.colfax-intl.com)

ID: 21418950

Link: https://colfax-intl.com | Joined: 20-02-2009 18:17:08

650 Tweets

964 Followers

8 Following

PyTorch (@pytorch):

Introducing FlashAttention-3 🚀 Fast and Accurate Attention with Asynchrony and Low-precision.

Thank you to Colfax International, AI at Meta, NVIDIA AI and Together AI for the collaboration here 🙌

Read more in our blog: hubs.la/Q02Gf4hf0
Vijay (@__tensorcore__):

FlashAttention-3 is released! Over the last few months, I got the opportunity to collaborate on this amazing effort to implement FA-2 from scratch for H100 with Tri Dao, Colfax International, Meta, and Pradeep Ramani. We can get up to 760 TFLOP/s on head dim 128 forward pass!

Tri Dao (@tri_dao):

FlashAttention is widely used to accelerate Transformers, already making attention 4-8x faster, but has yet to take advantage of modern GPUs. We’re releasing FlashAttention-3: 1.5-2x faster on FP16, up to 740 TFLOPS on H100 (75% util), and FP8 gets close to 1.2 PFLOPS! 1/

Tri Dao (@tri_dao):

This project is a collab with Jay Shah & Ganesh Bikshandi (Colfax International), Ying Zhang (@meta), @DROP_ALL_TABLES and Pradeep Ramani (NVIDIA). Huge thanks to the CUTLASS team, cuDNN team, Together AI, and Princeton PLI for their support! 9/9

Together AI (@togethercompute):

We are thrilled to release FlashAttention-3 in partnership with Meta, NVIDIA, Princeton University, and Colfax International. The improvements from FlashAttention-3 will result in:
• More efficient GPU Utilization - up to 75% from 35%.
• Better performance with lower precision while…

Colfax International (@colfaxintl):

Free NVIDIA® H100 NVL Test Drive
Supercharge #LLM Inference

#Colfax is offering a FREE test drive that provides you remote access to a Colfax server with 4 #NVIDIA #H100 NVL Tensor Core #GPUs.

👉 Learn more: experience.colfax-intl.com/project/nvidia…

#AI
Hieu Pham (@hyhieu226):

📚🧑‍🎓New tutorial on WGMMA (WarpGroup Matrix Multiplication and Accumulation) research.colfax-intl.com/cutlass-tutori…

If you have run PyTorch, Jax, or FlashAttention-3 on an H100 GPU, you have used WGMMA.

Arguably the most important primitive in the H100's Hopper architecture, WGMMA is the…
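
For orientation, here is a plain-CUDA sketch of the unit of work one WGMMA instruction covers (an illustration only, not real WGMMA usage, which goes through wgmma PTX or CUTLASS/CuTe atoms): all 128 threads of a warpgroup, i.e. four warps, cooperate on a single matrix multiply-accumulate tile such as m64n64k16, and the hardware instruction performs it asynchronously on the tensor cores, typically reading its operands from shared memory.

    // Illustration only, NOT real WGMMA: the 128 threads of one warpgroup jointly
    // compute C[64][64] += A[64][16] * B[16][64], the tile shape covered by a
    // single m64n64k16 wgmma instruction. The real instruction runs on tensor
    // cores, asynchronously, with operands described by shared-memory descriptors.
    __device__ void warpgroup_mma_reference(const float (&A)[64][16],
                                            const float (&B)[16][64],
                                            float (&C)[64][64]) {
      int lane = threadIdx.x % 128;          // thread index within the warpgroup
      // 64*64 = 4096 accumulators split across 128 threads: 32 per thread.
      for (int idx = lane; idx < 64 * 64; idx += 128) {
        int m = idx / 64, n = idx % 64;
        float acc = C[m][n];
        for (int k = 0; k < 16; ++k)
          acc += A[m][k] * B[k][n];          // the k = 16 reduction of one MMA step
        C[m][n] = acc;
      }
    }
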
Colfax International (@colfaxintl):

Introducing Colfax Access Hub: Securely validate and apply custom configurations to your systems, install specialized OS and SW, and much more — all before they are shipped to you.

The service is free for all Colfax customers.

experience.colfax-intl.com/access-hub/
Hieu Pham (@hyhieu226):

Replying to Cosmin Negruseri and Tri Dao: I am still learning it. Just like nobody knows all C++, nobody knows CUDA. For resources, I highly recommend the series of tutorials using CUTLASS and CuTe by Colfax International. I am obviously not biased 🤣

Hieu Pham (@hyhieu226):

Check out our newest CUDA tutorial: research.colfax-intl.com/cutlass-tutori… The topic is software pipelining: overlap memory copies with compute to hide latency. We present the concept in the context of GEMM (matmul), but it's applicable everywhere, e.g., Flash Attention 2 and 3. Conceptually…
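
A minimal plain-CUDA sketch of the idea (my own illustration, not code from the tutorial, which builds its pipeline with CUTLASS/CuTe and Hopper's asynchronous copy hardware): the kernel double-buffers its shared-memory tiles, prefetching the next K-tile into registers while the current one is being consumed, so global-memory latency is hidden behind compute. Assumes row-major float matrices, blockDim = (32, 32), grid = (N/32, M/32), and M, N, K divisible by the tile size.

    #define TILE 32  // tile size; assumes blockDim == dim3(TILE, TILE)

    __global__ void gemm_pipelined(const float* A, const float* B, float* C,
                                   int M, int N, int K) {
      // Two shared-memory buffers per operand: compute on one buffer while the
      // next K-tile is staged into the other (software pipelining / double buffering).
      __shared__ float As[2][TILE][TILE];
      __shared__ float Bs[2][TILE][TILE];

      int row = blockIdx.y * TILE + threadIdx.y;   // C row owned by this thread
      int col = blockIdx.x * TILE + threadIdx.x;   // C column owned by this thread
      int numTiles = K / TILE;                     // assumes K % TILE == 0
      float acc = 0.0f;

      // Prologue: load K-tile 0 into buffer 0.
      As[0][threadIdx.y][threadIdx.x] = A[row * K + threadIdx.x];
      Bs[0][threadIdx.y][threadIdx.x] = B[threadIdx.y * N + col];
      __syncthreads();

      // Main loop over K-tiles.
      for (int kt = 0; kt < numTiles; ++kt) {
        int cur = kt & 1, nxt = cur ^ 1;

        // Prefetch the NEXT K-tile from global memory into registers; these loads
        // are issued early so they overlap with the compute below.
        float a_next = 0.0f, b_next = 0.0f;
        if (kt + 1 < numTiles) {
          a_next = A[row * K + (kt + 1) * TILE + threadIdx.x];
          b_next = B[((kt + 1) * TILE + threadIdx.y) * N + col];
        }

        // Consume the CURRENT K-tile from shared memory.
        for (int k = 0; k < TILE; ++k)
          acc += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];

        // Stage the prefetched tile for the next iteration, then synchronize once.
        if (kt + 1 < numTiles) {
          As[nxt][threadIdx.y][threadIdx.x] = a_next;
          Bs[nxt][threadIdx.y][threadIdx.x] = b_next;
        }
        __syncthreads();
      }

      // Epilogue: write the accumulator to global memory.
      C[row * N + col] = acc;
    }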

Hieu Pham (@hyhieu226):

research.colfax-intl.com/epilogue_visit… A chess game typically has 3 phases: opening, middle game, and endgame. A GEMM (matmul) kernel typically has a prologue, a main loop, and an epilogue. Just like chess grandmasters must master endgames, GEMM programmers must master epilogue techniques to write…
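
To make the analogy concrete with a small, hedged example (illustrative names only, not code from the post): in the tiled GEMM sketched after the previous tweet, the first shared-memory loads are the prologue, the loop over K-tiles is the main loop, and the single global store at the end is the epilogue. A fused epilogue transforms the accumulator in registers before that store, for instance D = relu(alpha*acc + beta*C + bias):

    // Hypothetical fused epilogue for the tiled GEMM sketched above: after the
    // main loop leaves the dot product in `acc`, apply scaling, a bias, and an
    // activation in registers so the result is written to global memory only once.
    __device__ __forceinline__
    float fused_epilogue(float acc, float c_in, float bias, float alpha, float beta) {
      float d = alpha * acc + beta * c_in + bias;   // classic epilogue: D = alpha*A*B + beta*C (+ bias)
      return fmaxf(d, 0.0f);                        // fused ReLU, saving a separate elementwise kernel
    }

    // Inside the kernel, the plain store `C[row * N + col] = acc;` becomes:
    //   D[row * N + col] = fused_epilogue(acc, C[row * N + col], bias[col], alpha, beta);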

Colfax International (@colfaxintl):

In this GPU MODE lecture, Jay Shah, Research Scientist at Colfax International, presents his joint work on FlashAttention-3 and how to implement the main compute loop in the algorithm using CUTLASS. research.colfax-intl.com/gpu-mode-cutla…

Colfax International (@colfaxintl):

In this blog post, Jay Shah, Research Scientist at Colfax International, collaborated with Character.AI to explain two techniques (INT8 Quantization and Query Head Packing for MQA/GQA) that are important for using FlashAttention-3 for inference: research.character.ai/optimizing-ai-…

Colfax International (@colfaxintl):

CUTLASS Tutorial: Persistent Kernels and Stream-K

Final part of our three-part series on writing optimized GEMM kernels for NVIDIA GPUs using CUTLASS library abstractions. research.colfax-intl.com/cutlass-tutori…

Colfax International (@colfaxintl):

Colfax now offers NVIDIA Blackwell-based servers

👉 8U/10U servers
• NVIDIA HGX™ B200 8-GPU baseboard
• 2x AMD EPYC™ 9004/9005 OR 2x 4th/5th Gen Intel® Xeon® Scalable OR 2x Intel® Xeon® 6900 series

Learn more: colfax-intl.com/ServerList.asp…
Colfax International (@colfaxintl):

The DeepSeek technical reports contain a wealth of information on performance optimization techniques for NVIDIA GPUs. In this short blog, we explain two aspects of their FP8 mixed-precision training methodology that build on the techniques we've been teaching in our earlier…
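
As one hedged illustration of the flavor of technique involved, here is a simplified host-side sketch of fine-grained, per-block scaling for FP8 data (my own example, not code from the blog or from DeepSeek): each 128x128 block of a weight matrix gets its own scale, chosen so the block's absolute maximum maps to the largest FP8 e4m3 value (448). The scaled values are kept in float below; a real implementation would cast them to an FP8 type and carry the per-block scales into the GEMM for dequantization.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    constexpr int   kBlock      = 128;     // block size for per-block scaling
    constexpr float kFp8E4M3Max = 448.0f;  // largest finite e4m3 value

    // Quantize a rows x cols row-major matrix with one scale per 128x128 block.
    // Outputs: Wq (scaled values, float stand-ins for FP8) and the per-block scales.
    void quantize_block_scaled(const std::vector<float>& W, int rows, int cols,
                               std::vector<float>& Wq, std::vector<float>& scales) {
      int brows = (rows + kBlock - 1) / kBlock;
      int bcols = (cols + kBlock - 1) / kBlock;
      Wq.assign(static_cast<size_t>(rows) * cols, 0.0f);
      scales.assign(static_cast<size_t>(brows) * bcols, 1.0f);

      for (int br = 0; br < brows; ++br) {
        for (int bc = 0; bc < bcols; ++bc) {
          int r0 = br * kBlock, r1 = std::min(rows, r0 + kBlock);
          int c0 = bc * kBlock, c1 = std::min(cols, c0 + kBlock);

          // 1) Absolute maximum of this block.
          float amax = 0.0f;
          for (int r = r0; r < r1; ++r)
            for (int c = c0; c < c1; ++c)
              amax = std::max(amax, std::fabs(W[static_cast<size_t>(r) * cols + c]));

          // 2) One scale per block, so amax maps onto the e4m3 maximum.
          float scale = (amax > 0.0f) ? amax / kFp8E4M3Max : 1.0f;
          scales[static_cast<size_t>(br) * bcols + bc] = scale;

          // 3) Scale the block; a real kernel would now cast to FP8 (the cast does
          //    the e4m3 rounding) and keep `scale` for dequantization in the GEMM.
          for (int r = r0; r < r1; ++r)
            for (int c = c0; c < c1; ++c)
              Wq[static_cast<size_t>(r) * cols + c] =
                  W[static_cast<size_t>(r) * cols + c] / scale;
        }
      }
    }
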
Vijay (@__tensorcore__):

Replying to xjdr: Sure! Colfax just dropped a banger today: research.colfax-intl.com/cutlass-tutori… We have a series of really nice incremental tutorials: github.com/NVIDIA/cutlass… The two GTC talks, as I mentioned, are the richest source out there on Blackwell MMAs. The TK blog is nice too.