Colfax International (@colfaxintl)'s Twitter Profile
Colfax International

@colfaxintl

HPC & AI Solutions (colfax-intl.com) | Research (research.colfax-intl.com) | Colfax Experience Center (experience.colfax-intl.com)

ID: 21418950

https://colfax-intl.com | Joined: 20-02-2009 18:17:08

650 Tweets

964 Followers

8 Following

PyTorch (@pytorch)

Introducing FlashAttention-3 🚀 Fast and Accurate Attention with Asynchrony and Low-precision.

Thank you to Colfax International, AI at Meta, NVIDIA AI and Together AI for the collaboration here 🙌

Read more in our blog: hubs.la/Q02Gf4hf0
Vijay (@__tensorcore__)

FlashAttention-3 is released! Over the last few months, I got the opportunity to collaborate on this amazing effort to implement FA-2 from scratch for H100 with Tri Dao, Colfax International, Meta, and my colleague Pradeep Ramani. We can get up to 760 TFLOP/s on head dim 128 forward pass!

Tri Dao (@tri_dao)

FlashAttention is widely used to accelerate Transformers, already making attention 4-8x faster, but has yet to take advantage of modern GPUs. We're releasing FlashAttention-3: 1.5-2x faster on FP16, up to 740 TFLOPS on H100 (75% util), and FP8 gets close to 1.2 PFLOPS! 1/

Tri Dao (@tri_dao)

This project is a collab with Jay Shah & Ganesh Bikshandi (Colfax International), Ying Zhang (@meta), @DROP_ALL_TABLES and Pradeep Ramani (NVIDIA). Huge thanks to the CUTLASS team, cuDNN team, Together AI, and Princeton PLI for their support! 9/9

Together AI (@togethercompute)

We are thrilled to release FlashAttention-3 in partnership with Meta, NVIDIA, Princeton University, and Colfax International. The improvements from FlashAttention-3 will result in:

• More efficient GPU utilization - up to 75% from 35%.
• Better performance with lower precision while

Colfax International (@colfaxintl)

๐—™๐—ฟ๐—ฒ๐—ฒ ๐—ก๐—ฉ๐—œ๐——๐—œ๐—”ยฎ ๐—›๐Ÿญ๐Ÿฌ๐Ÿฌ ๐—ก๐—ฉ๐—Ÿ ๐—ง๐—ฒ๐˜€๐˜ ๐——๐—ฟ๐—ถ๐˜ƒ๐—ฒ Supercharge #LLM Inference #Colfax is offering a FREE test drive that provides you remote access to a Colfax server with 4 #NVIDIA #H100 NVL Tensor Core #GPUs. ๐Ÿ‘‰๐˜“๐˜ฆ๐˜ข๐˜ณ๐˜ฏ ๐˜ฎ๐˜ฐ๐˜ณ๐˜ฆ experience.colfax-intl.com/project/nvidiaโ€ฆ #AI

๐—™๐—ฟ๐—ฒ๐—ฒ ๐—ก๐—ฉ๐—œ๐——๐—œ๐—”ยฎ ๐—›๐Ÿญ๐Ÿฌ๐Ÿฌ ๐—ก๐—ฉ๐—Ÿ ๐—ง๐—ฒ๐˜€๐˜ ๐——๐—ฟ๐—ถ๐˜ƒ๐—ฒ
Supercharge #LLM Inference

#Colfax is offering a FREE test drive that provides you remote access to a Colfax server with 4 #NVIDIA #H100 NVL Tensor Core #GPUs.

👉 Learn more: experience.colfax-intl.com/project/nvidia…

#AI
Hieu Pham (@hyhieu226)

📚🧑‍🎓 New tutorial on WGMMA (WarpGroup Matrix Multiplication and Accumulation) research.colfax-intl.com/cutlass-tutori…

If you have run PyTorch, Jax, or FlashAttention-3 on an H100 GPU, you have used WGMMA.

Arguably the most important primitive in the H100's Hopper architecture, WGMMA is the
Colfax International (@colfaxintl)

๐—œ๐—ป๐˜๐—ฟ๐—ผ๐—ฑ๐˜‚๐—ฐ๐—ถ๐—ป๐—ด ๐—–๐—ผ๐—น๐—ณ๐—ฎ๐˜… ๐—”๐—ฐ๐—ฐ๐—ฒ๐˜€๐˜€ ๐—›๐˜‚๐—ฏ: Securely validate and apply custom configurations to your systems, install specialized OS and SW, and much more โ€” all before they are shipped to you The service is free for all Colfax customers. experience.colfax-intl.com/access-hub/

๐—œ๐—ป๐˜๐—ฟ๐—ผ๐—ฑ๐˜‚๐—ฐ๐—ถ๐—ป๐—ด ๐—–๐—ผ๐—น๐—ณ๐—ฎ๐˜… ๐—”๐—ฐ๐—ฐ๐—ฒ๐˜€๐˜€ ๐—›๐˜‚๐—ฏ: Securely validate and apply custom configurations to your systems, install specialized OS and SW, and much more โ€” all before they are shipped to you

The service is free for all Colfax customers.

experience.colfax-intl.com/access-hub/
Hieu Pham (@hyhieu226)

Cosmin Negruseri Tri Dao I am still learning it. Just like nobody knows all of C++, nobody knows all of CUDA. For resources, I highly recommend the series of tutorials using CUTLASS and CuTe by Colfax International. I am obviously not biased 🤣

Hieu Pham (@hyhieu226)

Check out our newest CUDA tutorial. research.colfax-intl.com/cutlass-tutori… The topic is software pipelining: overlap memory copies with compute to hide latency. We present the concept in the context of GEMM (matmul), but it's applicable everywhere, e.g., Flash Attention 2 and 3. Conceptually
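For readers who want to see the shape of that idea in code, here is a minimal, hedged sketch of a double-buffered main loop: the copy of tile t+1 is issued into one shared-memory buffer while the current iteration computes out of the other. It deliberately uses plain loads, __syncthreads(), and a reduction as a stand-in for the real MMA work; production kernels (and the tutorial) use cp.async/TMA and CUTLASS pipeline abstractions instead. The kernel name, tile sizes, and layout below are illustrative assumptions, not the tutorial's code.

```cuda
#include <cuda_runtime.h>

#define TILE_M 64   // rows of A staged per block (assumed to divide M)
#define TILE_K 32   // K-extent of one staged tile (assumed to divide K)

// Double-buffered "software pipeline" skeleton. Launch with, e.g.,
// <<<M / TILE_M, 128>>>; `out` is assumed zero-initialized.
__global__ void pipelined_mainloop(const float* __restrict__ A,  // [M, K], row-major
                                   float* __restrict__ out,      // [M / TILE_M]
                                   int M, int K) {
    __shared__ float sA[2][TILE_M * TILE_K];   // two buffers: one being read, one being filled
    const int row0 = blockIdx.x * TILE_M;
    int buf = 0;

    // Prologue: stage tile 0 into buffer 0.
    for (int i = threadIdx.x; i < TILE_M * TILE_K; i += blockDim.x)
        sA[buf][i] = A[(size_t)(row0 + i / TILE_K) * K + (i % TILE_K)];
    __syncthreads();

    float acc = 0.0f;
    const int num_tiles = K / TILE_K;
    for (int t = 0; t < num_tiles; ++t) {
        const int next = buf ^ 1;
        // Issue the copy of tile t+1 into the *other* buffer...
        if (t + 1 < num_tiles) {
            const int k0 = (t + 1) * TILE_K;
            for (int i = threadIdx.x; i < TILE_M * TILE_K; i += blockDim.x)
                sA[next][i] = A[(size_t)(row0 + i / TILE_K) * K + (k0 + i % TILE_K)];
        }
        // ...while the "compute" (a reduction standing in for MMAs) reads tile t.
        for (int i = threadIdx.x; i < TILE_M * TILE_K; i += blockDim.x)
            acc += sA[buf][i];
        __syncthreads();   // tile t+1 is now fully staged; swap buffers
        buf = next;
    }
    atomicAdd(&out[blockIdx.x], acc);   // keep a per-block result so the work isn't optimized away
}
```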

Hieu Pham (@hyhieu226)

research.colfax-intl.com/epilogue_visit… A chess game typically has 3 phases: opening, middle game, and endgame. A GEMM (matmul) kernel typically has: prologue, main loop, and epilogue. Just like chess grandmasters must master endgames, GEMM programmers must master epilogue techniques to write
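To make the chess analogy concrete, here is a hedged sketch of a deliberately naive tiled GEMM with the three phases labeled. The epilogue here is just alpha/beta scaling; in real CUTLASS kernels that slot is where fused bias, activation, type conversion, and the Epilogue Visitor Tree machinery from the linked post live. The kernel name, tile size, and launch shape are illustrative assumptions, not the post's code.

```cuda
#include <cuda_runtime.h>

#define BK 32   // square tile size; launch with a 2D grid and dim3(BK, BK) threads per block

// D = alpha * A*B + beta * C, with the prologue / main loop / epilogue phases marked.
__global__ void gemm_three_phases(const float* A, const float* B, const float* C,
                                  float* D, int M, int N, int K,
                                  float alpha, float beta) {
    // ---- Prologue: shared-memory tiles, output coordinates, zeroed accumulator.
    __shared__ float sA[BK][BK];
    __shared__ float sB[BK][BK];
    const int row = blockIdx.y * BK + threadIdx.y;
    const int col = blockIdx.x * BK + threadIdx.x;
    float acc = 0.0f;

    // ---- Main loop: march over K in BK-wide steps, accumulating in registers.
    for (int k0 = 0; k0 < K; k0 += BK) {
        sA[threadIdx.y][threadIdx.x] =
            (row < M && k0 + threadIdx.x < K) ? A[(size_t)row * K + k0 + threadIdx.x] : 0.0f;
        sB[threadIdx.y][threadIdx.x] =
            (col < N && k0 + threadIdx.y < K) ? B[(size_t)(k0 + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < BK; ++k)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();
    }

    // ---- Epilogue: the "endgame". Scale, combine with C, and store; fused bias,
    // activations, and precision conversion would all be applied here.
    if (row < M && col < N)
        D[(size_t)row * N + col] = alpha * acc + beta * C[(size_t)row * N + col];
}
```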

Colfax International (@colfaxintl)

In this GPU MODE lecture, Jay Shah, Research Scientist at Colfax International, presents his joint work on FlashAttention-3 and how to implement the main compute loop in the algorithm using CUTLASS. research.colfax-intl.com/gpu-mode-cutla…

Colfax International (@colfaxintl)

In this blog post, Jay Shah, Research Scientist at Colfax International, collaborated with Character.AI to explain two techniques (INT8 Quantization and Query Head Packing for MQA/GQA) that are important for using FlashAttention-3 for inference: research.character.ai/optimizing-ai-…
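As a rough illustration of the first technique, below is a hedged CUDA sketch of symmetric per-head INT8 quantization (scale = max|x| / 127), the kind of preprocessing an INT8 attention path needs before its low-precision matmuls. It is not the blog's or FlashAttention-3's actual code; the kernel name, tensor layout, and per-head scaling granularity are assumptions, and query head packing for MQA/GQA (a pure layout transformation) is not shown.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// One block per head. Launch with a power-of-two block size and
// blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void quantize_head_int8(const float* __restrict__ x,   // [num_heads, head_elems]
                                   int8_t* __restrict__ q,        // [num_heads, head_elems]
                                   float* __restrict__ scales,    // [num_heads]
                                   int head_elems) {
    extern __shared__ float red[];                       // reduction scratch
    const float* xh = x + (size_t)blockIdx.x * head_elems;
    int8_t*      qh = q + (size_t)blockIdx.x * head_elems;

    // 1) Block-wide max-abs reduction over this head's elements.
    float m = 0.0f;
    for (int i = threadIdx.x; i < head_elems; i += blockDim.x)
        m = fmaxf(m, fabsf(xh[i]));
    red[threadIdx.x] = m;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            red[threadIdx.x] = fmaxf(red[threadIdx.x], red[threadIdx.x + s]);
        __syncthreads();
    }
    const float scale = red[0] / 127.0f + 1e-12f;        // epsilon avoids divide-by-zero
    if (threadIdx.x == 0) scales[blockIdx.x] = scale;

    // 2) Quantize: q = round(x / scale), clamped to the symmetric int8 range.
    for (int i = threadIdx.x; i < head_elems; i += blockDim.x) {
        const float v = rintf(xh[i] / scale);
        qh[i] = (int8_t)fminf(fmaxf(v, -127.0f), 127.0f);
    }
}
```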

Colfax International (@colfaxintl)

CUTLASS Tutorial: Persistent Kernels and Stream-K

Final part of our three-part series on writing optimized GEMM kernels for NVIDIA GPUs using CUTLASS library abstractions. research.colfax-intl.com/cutlass-tutori…

Colfax International (@colfaxintl)

Colfax now offers NVIDIA Blackwell-based servers

👉 8U/10U servers
• NVIDIA HGX™ B200 8-GPU baseboard
• 2x AMD EPYC™ 9004/9005 OR 2x 4th/5th Gen Intel® Xeon® Scalable OR 2x Intel® Xeon® 6900 series

Learn more: colfax-intl.com/ServerList.asp…
Colfax International (@colfaxintl)

The DeepSeek technical reports contain a wealth of information on performance optimization techniques for NVIDIA GPUs. In this short blog, we explain two aspects of their FP8 mixed-precision training methodology that build on the techniques we've been teaching in our earlier

Vijay (@__tensorcore__)

xjdr sure! colfax just dropped a banger today: research.colfax-intl.com/cutlass-tutori… We have a series of really nice incremental tutorials: github.com/NVIDIA/cutlass… The two GTC talks as I mentioned are the richest source out there on Blackwell MMAs. TK blog is nice too.

Hieu Pham (@hyhieu226)

research.colfax-intl.com/cutlass-tutori… Their content always comes out in great quantity and quality ❤️