
Colfax International
@colfaxintl
HPC & AI Solutions (colfax-intl.com) | Research (research.colfax-intl.com) | Colfax Experience Center (experience.colfax-intl.com)
ID: 21418950
https://colfax-intl.com 20-02-2009 18:17:08
650 Tweets
964 Followers
8 Following

Introducing FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. Thank you to Colfax International, AI at Meta, NVIDIA AI, and Together AI for the collaboration here. Read more in our blog: hubs.la/Q02Gf4hf0


FlashAttention-3 is released! Over the last few months, I got the opportunity to collaborate on this amazing effort to implement FA-2 from scratch for H100 with Tri Dao, Colfax International, Meta, and my colleague Pradeep Ramani. We can get up to 760 TFLOP/s on the head dim 128 forward pass!



This project is a collab with Jay Shah & Ganesh Bikshandi (Colfax International), Ying Zhang (@meta), @DROP_ALL_TABLES, and Pradeep Ramani (NVIDIA). Huge thanks to the CUTLASS team, the cuDNN team, Together AI, and Princeton PLI for their support! 9/9

We are thrilled to release FlashAttention-3 in partnership with Meta, NVIDIA, Princeton University, and Colfax International. The improvements from FlashAttention-3 will result in: • More efficient GPU utilization, up to 75% from 35%. • Better performance with lower precision while…

Free NVIDIA® H100 NVL Test Drive. Supercharge #LLM Inference. #Colfax is offering a FREE test drive that provides you remote access to a Colfax server with 4 #NVIDIA #H100 NVL Tensor Core #GPUs. Learn more: experience.colfax-intl.com/project/nvidia… #AI


New tutorial on WGMMA (WarpGroup Matrix Multiplication and Accumulation): research.colfax-intl.com/cutlass-tutori… If you have run PyTorch, JAX, or FlashAttention-3 on an H100 GPU, you have used WGMMA. Arguably the most important primitive in the H100's Hopper architecture, WGMMA is the…
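For a flavor of what the tutorial covers, here is a hedged sketch of the asynchronous issue pattern WGMMA follows in CuTe (CUTLASS 3.x). This is not the tutorial's code, and the operand setup is elided: sA and sB stand for shared-memory tensors in a GMMA-compatible swizzled layout and tCrC for an accumulator fragment, all assumed to come from prior setup.

```
// Hedged sketch of the WGMMA issue pattern in CuTe (CUTLASS 3.x); not the
// tutorial's code. sA, sB, and tCrC are assumed from prior setup: sA/sB are
// shared-memory tensors in a GMMA-compatible swizzled layout, tCrC is the
// accumulator fragment. The key idea is asynchrony: the matmul is issued
// and batched, and the warpgroup waits on it only when the result is needed.
using namespace cute;

auto tiled_mma = make_tiled_mma(SM90_64x64x16_F32F16F16_SS{});  // one-warpgroup MMA
auto thr_mma   = tiled_mma.get_thread_slice(threadIdx.x);

auto tCrA = thr_mma.partition_fragment_A(sA);  // descriptors over smem A
auto tCrB = thr_mma.partition_fragment_B(sB);  // descriptors over smem B

warpgroup_arrive();                  // fence: smem writes become visible to WGMMA
gemm(tiled_mma, tCrA, tCrB, tCrC);   // issues wgmma.mma_async instructions
warpgroup_commit_batch();            // closes the batch of in-flight async MMAs
warpgroup_wait<0>();                 // waits until 0 batches remain outstanding
```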


Introducing Colfax Access Hub: Securely validate and apply custom configurations to your systems, install a specialized OS and software, and much more, all before the systems are shipped to you. The service is free for all Colfax customers. experience.colfax-intl.com/access-hub/


Cosmin Negruseri Tri Dao I am still learning it. Just like nobody knows all of C++, nobody knows all of CUDA. For resources, I highly recommend the series of tutorials using CUTLASS and CuTe by Colfax International. I am obviously not biased 🤣


Check out our newest CUDA tutorial: research.colfax-intl.com/cutlass-tutori… The topic is software pipelining: overlapping memory copies with compute to hide latency. We present the concept in the context of GEMM (matmul), but it's applicable everywhere, e.g., FlashAttention-2 and 3. Conceptually…
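As a toy illustration of the idea (a minimal sketch with made-up tile size and names, not the tutorial's code), a double-buffered tiled GEMM keeps two shared-memory buffers per operand: the loads for tile t+1 are issued before the math on tile t, so the outstanding memory requests can overlap with the FMAs.

```
// A minimal sketch of software pipelining via double buffering (illustrative
// tile size and names, not the tutorial's code). Two shared-memory buffers
// per operand let the loads for tile t+1 be in flight while tile t is being
// multiplied, hiding global-memory latency behind compute.
#include <cuda_runtime.h>

#define TILE 32  // launch with blockDim = (TILE, TILE)

__global__ void gemm_double_buffered(const float* A, const float* B, float* C,
                                     int M, int N, int K) {
    __shared__ float As[2][TILE][TILE];  // buffers 0/1 for A tiles
    __shared__ float Bs[2][TILE][TILE];  // buffers 0/1 for B tiles

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int numTiles = (K + TILE - 1) / TILE;
    float acc = 0.0f;

    // Prologue: fill buffer 0 with the first K-tile.
    As[0][threadIdx.y][threadIdx.x] =
        (row < M && threadIdx.x < K) ? A[row * K + threadIdx.x] : 0.0f;
    Bs[0][threadIdx.y][threadIdx.x] =
        (threadIdx.y < K && col < N) ? B[threadIdx.y * N + col] : 0.0f;
    __syncthreads();

    for (int t = 0; t < numTiles; ++t) {
        int cur = t & 1, nxt = cur ^ 1;
        // Issue loads for the NEXT tile before computing on the current one.
        if (t + 1 < numTiles) {
            int ka = (t + 1) * TILE + threadIdx.x;
            int kb = (t + 1) * TILE + threadIdx.y;
            As[nxt][threadIdx.y][threadIdx.x] =
                (row < M && ka < K) ? A[row * K + ka] : 0.0f;
            Bs[nxt][threadIdx.y][threadIdx.x] =
                (kb < K && col < N) ? B[kb * N + col] : 0.0f;
        }
        // Multiply-accumulate on the current tile; the loads above are
        // independent of these FMAs, so the hardware can overlap them.
        for (int k = 0; k < TILE; ++k)
            acc += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];
        __syncthreads();  // next buffer fully written before it becomes current
    }

    // Epilogue: one store per output element.
    if (row < M && col < N) C[row * N + col] = acc;
}
```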

research.colfax-intl.com/epilogue_visit… A chess game typically has 3 phases: opening, middle game, and endgame. A GEMM (matmul) kernel typically has: prologue, main loop, and epilogue. Just like chess grandmasters must master endgames, GEMM programmers must master epilogue techniques to write…
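To make the analogy concrete, here is a minimal sketch (illustrative names, not the post's code) of a naive GEMM kernel with the three phases labeled. The epilogue is where fusion pays off: scaling, bias, and activation ride along with the single store of each output element instead of costing an extra pass over memory.

```
// A minimal sketch (illustrative, not the post's code) labeling a GEMM
// kernel's three phases for D = relu(alpha*A*B + beta*C + bias).
#include <cuda_runtime.h>

__global__ void gemm_fused_epilogue(const float* A, const float* B,
                                    const float* C, const float* bias,
                                    float* D, int M, int N, int K,
                                    float alpha, float beta) {
    // Prologue: coordinates and accumulator setup.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;
    float acc = 0.0f;

    // Main loop: multiply-accumulate over the K dimension.
    for (int k = 0; k < K; ++k)
        acc += A[row * K + k] * B[k * N + col];

    // Epilogue: linear combination, bias, and ReLU fused into one store.
    float d = alpha * acc + beta * C[row * N + col] + bias[col];
    D[row * N + col] = fmaxf(d, 0.0f);
}
```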


In this blog post, Jay Shah, Research Scientist at Colfax International, collaborated with Character.AI to explain two techniques (INT8 Quantization and Query Head Packing for MQA/GQA) that are important for using FlashAttention-3 for inference: research.character.ai/optimizing-ai-…
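As a rough illustration of the first technique (a minimal sketch, not the blog's actual code), symmetric per-tensor INT8 quantization maps values to int8 through a single scale, scale = max|x| / 127, and dequantizes by multiplying the scale back; in low-precision attention the rescale typically happens in the int32-accumulating epilogue.

```
// A minimal sketch of symmetric per-tensor INT8 quantization (illustrative,
// not the blog's code). The host computes scale = max|x| / 127 once; the
// kernel maps each value to int8, and dequantization multiplies the scale
// back.
#include <cuda_runtime.h>
#include <stdint.h>

__global__ void quantize_int8(const float* x, int8_t* q, float scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = rintf(x[i] / scale);         // round to nearest integer
        v = fminf(fmaxf(v, -127.0f), 127.0f);  // clamp to the int8 range
        q[i] = (int8_t)v;
    }
}

__global__ void dequantize_int8(const int8_t* q, float* x, float scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = q[i] * scale;            // x is recovered up to rounding
}
```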

CUTLASS Tutorial: Persistent Kernels and Stream-K. The final part of our three-part series on writing optimized GEMM kernels for NVIDIA GPUs using CUTLASS library abstractions. research.colfax-intl.com/cutlass-tutori…
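The persistent-kernel idea in miniature (a hedged sketch, not the tutorial's CUTLASS code): launch only as many blocks as the GPU keeps resident, and let each block pull tiles from a global work counter rather than launching one block per output tile.

```
// A hedged sketch of the persistent-kernel idea (not the tutorial's CUTLASS
// code). Each resident block repeatedly claims the next tile from a global
// counter until the work queue is drained. The host must zero *next_tile
// before launch.
#include <cuda_runtime.h>

__global__ void persistent_worker(float* out, int num_tiles, int tile_elems,
                                  int* next_tile) {
    __shared__ int tile;
    while (true) {
        if (threadIdx.x == 0) tile = atomicAdd(next_tile, 1);  // claim work
        __syncthreads();
        if (tile >= num_tiles) break;  // queue empty: the block retires
        // Stand-in for a GEMM tile's main loop + epilogue.
        for (int i = threadIdx.x; i < tile_elems; i += blockDim.x)
            out[tile * tile_elems + i] = (float)tile;
        __syncthreads();  // all reads of `tile` done before it is reused
    }
}
```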

Colfax now offers NVIDIA Blackwell-based Servers: 8U/10U servers • NVIDIA HGX™ B200 8-GPU baseboard • 2x AMD EPYC™ 9004/9005 OR 2x 4th/5th Gen Intel® Xeon® Scalable OR 2x Intel® Xeon® 6900 series. Learn more: colfax-intl.com/ServerList.asp…



