Aaryan Singhal (@aaryansinghal4) 's Twitter Profile
Aaryan Singhal

@aaryansinghal4

cs @ stanford

ID: 1342483051641794561

Link: https://www.aaryan-singhal.com/ · Joined: 25-12-2020 14:51:29

156 Tweets

429 Followers

563 Following

hazyresearch (@hazyresearch) 's Twitter Profile Photo

The Great American AI Race. I wrote something about how we need a holistic AI effort from academia, industry, and the US government to have the best shot at a freer, better educated, and healthier world in AI. I’m a mega bull on the US and open source AI. Maybe we’re cooking

Together AI (@togethercompute) 's Twitter Profile Photo

Our latest joint work w/ SandyResearch @ UCSD: training-free acceleration of Diffusion Transformers w/ dynamic sparsity, led by Austin Silveria and soham! ⚡️ 3.7x faster video and 1.6x faster image generation while preserving quality! 🧵 Open-source code & CUDA kernels!

Austin Silveria (@austinsilveria) 's Twitter Profile Photo

Training-free acceleration of Diffusion Transformers with dynamic sparsity and cross-step attention/MLP deltas--collaboration with soham and Dan Fu! ⚡️ 3.7x faster video and 1.6x faster image generation while preserving quality! 🧵 Open-source code & CUDA kernels!

Dan Fu (@realdanfu) 's Twitter Profile Photo

Super excited to share Chipmunk 🐿️- training-free acceleration of diffusion transformers (video, image generation) with dynamic attention & MLP sparsity! Led by Austin Silveria, soham - 3.7x faster video gen, 1.6x faster image gen. Kernels written in TK ⚡️🐱 1/

soham (@sohamgovande) 's Twitter Profile Photo

introducing chipmunk—a training-free algorithm making ai video generation 3.7x & image gen 1.6x faster! ⚡️ our kernels for column-sparse attention are 9.3x faster than FlashAttention-3 and column-sparse GEMM is 2.5x faster vs. cuBLAS. a thread on the GPU kernel optimizations 🧵
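
For intuition, here is a rough dense-PyTorch emulation of the column-sparse attention pattern described in the thread: each query attends only to a selected subset of key/value columns. The function name, shapes, and random column selection below are illustrative assumptions; the 9.3x-vs-FlashAttention-3 figure comes from the hand-written CUDA kernels, not from anything like this sketch.

```python
import math
import torch

def column_sparse_attention(q, k, v, col_idx):
    # q: [B, H, Lq, D]; k, v: [B, H, Lk, D]; col_idx: 1-D indices into Lk.
    # Dense emulation only: gather the kept key/value columns, then run
    # ordinary softmax attention over that subset.
    k_sub = k[:, :, col_idx, :]
    v_sub = v[:, :, col_idx, :]
    scores = q @ k_sub.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v_sub

# Toy shapes; in Chipmunk the kept columns are chosen dynamically per step,
# here a random subset simply stands in for that selection.
B, H, L, D = 1, 8, 1024, 64
q, k, v = (torch.randn(B, H, L, D) for _ in range(3))
cols = torch.randperm(L)[: L // 4]               # keep 25% of the columns
out = column_sparse_attention(q, k, v, cols)     # [B, H, L, D]
```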

Benjamin F Spector (@bfspector) 's Twitter Profile Photo

(1/5) We’ve never enjoyed watching people chop Llamas into tiny pieces.

So, we’re excited to be releasing our Low-Latency-Llama Megakernel! We run the whole forward pass in a single kernel.

Megakernels are faster & more humane. Here’s how to treat your Llamas ethically:

(Joint
Stuart Sul (@stuart_sul) 's Twitter Profile Photo

GPU kernel launches are expensive--so we fused the entire Llama-1B into a single kernel. Very excited to kick off our megakernel framework series with Thunderkittens hazyresearch. More coming soon!

Jordan Juravsky (@jordanjuravsky) 's Twitter Profile Photo

We wrote a megakernel! Excited to share how we fused Llama-1B into a single kernel to reach SOTA latency. Check out our blog post and code below!

Andrej Karpathy (@karpathy) 's Twitter Profile Photo

So so so cool. Llama 1B batch one inference in one single CUDA kernel, deleting synchronization boundaries imposed by breaking the computation into a series of kernels called in sequence. The *optimal* orchestration of compute and memory is only achievable in this way.
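
As a loose illustration of the synchronization boundaries being deleted (not the megakernel itself, which is hand-written CUDA), the toy PyTorch timing below issues the same arithmetic either as many tiny ops with a forced host sync after each, or as one fused op; the gap is essentially launch/sync overhead.

```python
import time
import torch

# Toy microbenchmark: many small ops with a host sync after each (mimicking
# hard boundaries between separately launched kernels) vs. the same work in
# a single op. Only illustrates the overhead a megakernel removes.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, device=device)

def many_boundaries(x, n=1000):
    for _ in range(n):
        x = x + 1.0
        if device == "cuda":
            torch.cuda.synchronize()   # artificial per-op boundary
    return x

def one_launch(x, n=1000):
    return x + float(n)                # same result, one op

for fn in (many_boundaries, one_launch):
    start = time.perf_counter()
    fn(x)
    if device == "cuda":
        torch.cuda.synchronize()
    print(f"{fn.__name__}: {time.perf_counter() - start:.4f}s")
```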

Owen Dugan (@owendugan) 's Twitter Profile Photo

A megakernel for Llama!🦙 We built a single kernel for the entire Llama 1B forward pass, enabling >1000 tokens/s on a single H100 and almost 1500 tokens/s on a single B200! Check it out!

Austin Silveria (@austinsilveria) 's Twitter Profile Photo

chipmunk is up on arxiv!

across HunyuanVideo and Flux.1-dev, 5-25% of the intermediate activation values in attention and MLPs account for 70-90% of the change in activations across steps

caching + sparsity speeds up generation by only recomputing fast changing activations
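
A minimal sketch of that caching + sparsity idea for a position-wise MLP, assuming outputs only need refreshing where inputs moved the most since the previous diffusion step; the name sparse_delta_mlp and the keep_frac parameter are made up for illustration, and the real Chipmunk kernels do not materialize tensors this way.

```python
import torch

def sparse_delta_mlp(mlp, x, prev_x, prev_y, keep_frac=0.25):
    # Score each position by how far its input moved since the previous step,
    # recompute the MLP only at the fastest-changing positions, and reuse the
    # cached output everywhere else. Toy sketch for a position-wise MLP.
    delta = (x - prev_x).abs().sum(dim=-1)              # [batch, seq]
    k = max(1, int(keep_frac * delta.shape[-1]))
    idx = delta.topk(k, dim=-1).indices                 # [batch, k]
    rows = idx.unsqueeze(-1).expand(-1, -1, x.shape[-1])
    y = prev_y.clone()
    y.scatter_(1, rows, mlp(torch.gather(x, 1, rows)))  # refresh selected rows
    return y

# Hypothetical usage with a tiny MLP and a small perturbation between "steps".
mlp = torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.GELU(),
                          torch.nn.Linear(256, 64))
prev_x = torch.randn(2, 128, 64)
prev_y = mlp(prev_x)
x = prev_x + 0.01 * torch.randn_like(prev_x)
y = sparse_delta_mlp(mlp, x, prev_x, prev_y)
```
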
Dan Fu (@realdanfu) 's Twitter Profile Photo

Some updates to Chipmunk! 🐿️ Chipmunk now supports Wan 2.1, with up to 2.67x speedup - completely training-free! The paper is up on arXiv - take a look to see more in-depth analysis of sparsity in video models. Only 5-25% of activations account for >90% of the output!
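
For reference, a minimal sketch of how one might measure that kind of concentration statistic on a flattened activation-delta vector; the synthetic tensor below is just a stand-in, so the printed shares will not reproduce the 5-25% / >90% figures reported in the paper.

```python
import torch

# Synthetic stand-in: most activations barely move between steps, a few move a lot.
torch.manual_seed(0)
prev_act = torch.randn(1_000_000)
curr_act = prev_act + 0.001 * torch.randn(1_000_000)
hot = torch.randperm(1_000_000)[:100_000]          # 10% "fast-changing" entries
curr_act[hot] += 0.5 * torch.randn(100_000)

delta = (curr_act - prev_act).abs()
sorted_delta, _ = delta.sort(descending=True)
total = delta.sum()
for p in (0.05, 0.10, 0.25):
    k = int(p * delta.numel())
    share = sorted_delta[:k].sum() / total
    print(f"top {p:.0%} of deltas carry {share:.1%} of the total change")
```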

Tanvir Bhathal (@bhathaltanvir0) 's Twitter Profile Photo

Super excited to announce Weaver! Check it out to see the strongest way to verify LM Generations while maintaining compute efficiency!

Brendan McLaughlin (@brendanm0407) 's Twitter Profile Photo

Thrilled to share that I’ve joined Reflection AI! We’re building superintelligent autonomous systems by co-designing research and product. Today, we’re launching Asimov. As AI benchmarks saturate, evaluation will increasingly live inside real-world products that are

Robby Manihani (@robbymanihani) 's Twitter Profile Photo

Today we're announcing Pace, where we are building the world's first Agent Process Outsourcer for insurance operations. Traditional industry runs on legacy BPOs and consultants, and we're reimagining it. Our agent can handle documents of any length, conduct complex reasoning, and