Cade Daniel 🇺🇸 (@cdnamz) 's Twitter Profile
Cade Daniel 🇺🇸

@cdnamz

systems performance

ID: 721517072

Link: https://www.linkedin.com/in/cade-daniel · Joined: 28-07-2012 04:53:31

343 Tweets

1.1K Followers

962 Following

Arthur Douillard (@ar_douillard) 's Twitter Profile Photo

KV Prediction for Improved Time to First Token

LLM inference can be split into two phases: Prefilling and Decoding.

The decoding phase is autoregressive: tokens are generated one by one, re-using the previous Key/Value tensors held in the KV-cache. To speed up that
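For anyone new to that split, here is a minimal single-head sketch of the baseline behaviour being described: prefill builds the KV-cache for the whole prompt in one pass, and decode appends one Key/Value pair per generated token. All dimensions and weights are toy placeholders; this is not the paper's KV Prediction method.

```python
import torch

# Toy single-head attention illustrating prefill vs. decode with a KV-cache.
# Everything here (dimensions, weights) is a placeholder; this is NOT the
# paper's KV Prediction method, just the baseline behaviour it speeds up.

torch.manual_seed(0)
d = 64                                   # head dimension (illustrative)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attend(q, K, V):
    # q: (1, d), K/V: (t, d) -> (1, d)
    scores = (q @ K.T) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ V

# Prefill: process every prompt token in one pass and build the KV-cache.
prompt = torch.randn(10, d)              # 10 prompt-token embeddings
K_cache, V_cache = prompt @ Wk, prompt @ Wv

# Decode: one token per step, re-using the cache instead of recomputing
# K/V for the whole prefix.
x = torch.randn(1, d)                    # stand-in for the first generated token
for _ in range(5):
    K_cache = torch.cat([K_cache, x @ Wk])   # append this step's Key
    V_cache = torch.cat([V_cache, x @ Wv])   # append this step's Value
    x = attend(x @ Wq, K_cache, V_cache)     # stand-in for the next token's embedding
```

Since prefill has to run over the entire prompt before the first output token appears, it is the phase that dominates time-to-first-token, which is what the paper targets.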
Simran Arora (@simran_s_arora) 's Twitter Profile Photo

Want Llama 405B, but wish it scaled linearly in sequence length??? Enter LoLCATS: an efficient method for "turning Transformers to linear attention models", all on an academic budget!! 

We use LoLCATS to linearize the *full Llama 3.1 model family* for the first time – 20+ points
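The "scales linearly" part comes from the standard linear-attention identity: replace softmax(QKᵀ)V with φ(Q)(φ(K)ᵀV) for some feature map φ, and matmul associativity lets you build a small (d × d) state φ(K)ᵀV instead of the (n × n) score matrix. Below is a minimal non-causal sketch of that generic trick; the ELU+1 feature map is a common placeholder I chose, and none of this is the LoLCATS training recipe itself (which learns its feature maps by matching the original softmax attention).

```python
import torch

# Generic (non-causal) linear attention: O(n) in sequence length because the
# n x n score matrix is never formed. Illustrative only; not LoLCATS itself.

def phi(x):
    return torch.nn.functional.elu(x) + 1          # common positive feature map (placeholder)

def linear_attention(Q, K, V):
    # Q, K: (n, d), V: (n, d_v)
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                                   # (d, d_v) state: cost independent of n
    Z = Qf @ Kf.sum(dim=0, keepdim=True).T          # (n, 1) normalizer
    return (Qf @ KV) / (Z + 1e-6)

n, d = 4096, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
print(linear_attention(Q, K, V).shape)              # torch.Size([4096, 64])
```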
Andreas Köpf (@neurosp1ke) 's Twitter Profile Photo

If you are interested in the latest GPU MODE news (upcoming lectures, videos etc.) please follow our new official twitter/x account: GPU MODE

vLLM (@vllm_project) 's Twitter Profile Photo

Speculative decoding is one of the best tools in vLLM's suite of inference optimizations, accelerating inference without accuracy loss. Check out our blog post for more details on the state of spec decode in vLLM today! 🧵 blog.vllm.ai/2024/10/17/spe…
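The core idea, for readers who haven't seen it: a cheap draft model proposes several tokens, and the large target model verifies all of them in a single forward pass, keeping the longest prefix it agrees with. Here is a greedy toy sketch of one such step; the model callables are stand-ins I made up, and vLLM's real implementation uses rejection sampling so the output distribution matches the target model exactly rather than this greedy check.

```python
import torch

# Toy draft-then-verify step (greedy variant). `draft_model` / `target_model`
# are stand-ins for any callables mapping a 1-D tensor of token ids to
# per-position next-token logits of shape (seq_len, vocab). This illustrates
# the idea, not vLLM's actual spec-decode implementation.

def speculative_step(target_model, draft_model, tokens, k=4):
    # 1) Draft: the small model cheaply proposes k tokens autoregressively.
    draft = tokens
    for _ in range(k):
        nxt = draft_model(draft)[-1].argmax()
        draft = torch.cat([draft, nxt.view(1)])
    proposed = draft[len(tokens):]

    # 2) Verify: ONE target-model pass scores all k proposed positions at once.
    logits = target_model(draft)                               # (len(draft), vocab)
    target_next = logits[len(tokens) - 1 : -1].argmax(dim=-1)  # target's pick at each proposed slot

    # 3) Accept the longest matching prefix; the first mismatch (or a bonus
    #    token if everything matched) comes from the target model for free.
    n_accept = 0
    for p, t in zip(proposed.tolist(), target_next.tolist()):
        if p != t:
            break
        n_accept += 1
    if n_accept == k:
        accepted = torch.cat([proposed, logits[-1].argmax().view(1)])
    else:
        accepted = target_next[: n_accept + 1]
    return torch.cat([tokens, accepted])
```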

Michael Matthews @ ICLR 2025 (@mitrma) 's Twitter Profile Photo

🍎 The core of Kinetix is our new 2D rigid body physics engine: Jax2D. This is a minimal rewrite of the classic Box2D engine made by Erin Catto. Jax2D allows us to run thousands of heterogeneous parallel environments on a single GPU (yes, you can vmap over different tasks!) 8/

𝚐𝔪𝟾𝚡𝚡𝟾 (@gm8xx8) 's Twitter Profile Photo

Pie: Pooling CPU Memory for LLM Inference
paper: arxiv.org/abs/2411.09317

Pie is an LLM inference framework that tackles the memory challenges of large models by enabling efficient GPU-CPU memory swapping and adaptive expansion. It optimizes memory usage without increasing
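Not Pie's API or its policies, but the plain-PyTorch mechanics this kind of swapping builds on look roughly like this: park a cache block in pinned host memory and bring it back on a side stream so the copy can overlap with compute.

```python
import torch

# Plain-PyTorch sketch of GPU<->CPU swapping mechanics (illustrative only;
# Pie's own runtime, policies, and "adaptive expansion" are not shown here).

assert torch.cuda.is_available()
block = torch.randn(256, 16, 128, device="cuda")      # a hypothetical KV-cache block

# Swap out: device -> pinned host memory (pinning allows asynchronous DMA copies).
cpu_slot = torch.empty(block.shape, dtype=block.dtype, pin_memory=True)
cpu_slot.copy_(block, non_blocking=True)

# Swap in: issue the copy on a side stream so it can overlap with other kernels,
# then make the compute stream wait only when the block is actually needed.
prefetch = torch.cuda.Stream()
with torch.cuda.stream(prefetch):
    block_again = cpu_slot.to("cuda", non_blocking=True)
torch.cuda.current_stream().wait_stream(prefetch)
```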

Rohan Choudhury (@rchoudhury997) 's Twitter Profile Photo

Excited to finally release our NeurIPS 2024 (spotlight) paper! We introduce Run-Length Tokenization (RLT), a simple way to significantly speed up your vision transformer on video with no loss in performance!

Vima Gupta (@vima_gupta) 's Twitter Profile Photo

1/7 🧵 MoEs: A tale of expectation vs reality

Marketing: "Only compute the expert parameters you need!"
Reality: Batch 16 requests → ALL experts activate
At serving time (vLLM/TGI), arithmetic intensity:
AI ≈ (num_tokens * top_k) / total_experts
In simpler terms: Your decode
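Plugging small numbers into that estimate makes the point concrete. The configuration below (8 experts, top-2 routing, uniform routing) is my own illustrative choice, not a number from the thread.

```python
# Back-of-envelope check of the arithmetic-intensity estimate above.
# 8 experts with top-2 routing and uniform routing are illustrative
# assumptions, not numbers from the thread.

num_tokens, top_k, total_experts = 16, 2, 8

# Tokens each activated expert processes per decode step, on average:
ai = num_tokens * top_k / total_experts
print(f"AI ~ {ai:.1f} tokens per expert")       # ~4.0: each expert's weights are
                                                # streamed from HBM for only ~4 tokens

# Probability that a given expert receives no token at all (uniform routing):
p_idle = (1 - top_k / total_experts) ** num_tokens
print(f"P(expert idle) ~ {p_idle:.3f}")         # ~0.010: at batch 16, essentially
                                                # every expert gets activated
```

So at batch 16 virtually every expert's weights are read from HBM, yet each read is amortized over only a handful of tokens, which is why MoE decode tends to stay memory-bound rather than enjoying the advertised compute savings.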
Suhail (@suhail) 's Twitter Profile Photo

Once the AI labs realize they need to make products for survival, they will immediately reformulate their strategy to competing with the most obvious working thing that is vaguely under the guise of the original mission. You should presume you will be ruthlessly copied.

Simon Guo 🦝 (@simonguozirui) 's Twitter Profile Photo

LLMs for GPU kernel🌽generation have been getting Pop🍿ular since our preview last Dec; excited to announce 📢 our full paper 📃 for KernelBench!

Turns out KernelBench is quite challenging 🧠 — frontier models outperform the PyTorch Eager baseline <20% of the time.

More 🧵👇
Shanli Xing (@0xsling0) 's Twitter Profile Photo

🚀Meet flashinfer.sampling—our sorting-free GPU kernels for lightning-fast #LLM sampling.

Our implementation achieves over 50% reduction in sampling time.

Blog post: flashinfer.ai/2025/03/10/sam…
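A standard top-p implementation sorts the whole vocabulary every step; the sorting-free approach replaces that with rejection sampling. Below is a rough, unoptimized sketch of one such rejection scheme, my own illustration of the general idea rather than FlashInfer's fused kernel.

```python
import torch

# Rejection-style top-p sampling with no vocabulary sort (illustrative sketch;
# FlashInfer's actual GPU kernels are fused and far more efficient).

def top_p_sample_sorting_free(probs, p=0.9):
    pivot = 0.0
    while True:
        # Sample from the distribution restricted to tokens above the current pivot.
        masked = torch.where(probs > pivot, probs, torch.zeros_like(probs))
        idx = int(torch.multinomial(masked, 1))
        # Accept iff the sampled token lies inside the top-p nucleus, i.e. the
        # mass of strictly-more-probable tokens is still below p.
        if probs[probs > probs[idx]].sum() < p:
            return idx
        pivot = float(probs[idx])   # rejected: tokens this likely or less are outside the nucleus

probs = torch.softmax(torch.randn(32000), dim=-1)   # toy next-token distribution
print(top_p_sample_sorting_free(probs, p=0.9))
```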
Hongyang Zhang (@hongyangzh) 's Twitter Profile Photo

Jointly announcing EAGLE-3 with SGLang: Setting a new record in LLM inference acceleration!
- 5x 🚀 than vanilla (on HF)
- 1.4x 🚀 than EAGLE-2 (on HF)
- A record of ~400 TPS on Llama 3.1 8B with a single H100 (on SGLang)
- 1.65x 🚀 in latency even for large bs=64 (on SGLang)
- A new

Hrishbh Dalal (@hrishbhdalal) 's Twitter Profile Photo

What if we could teach LLMs to be algorithm inventors?

I trained an LLM to improve sorting algorithms through pure reinforcement learning - and it discovered optimizations giving 47.92x speedups over an optimized Python-based Timsort baseline! No cold-start data needed.

I used
Jonathan Frankle (@jefrankle) 's Twitter Profile Photo

The hardest part about finetuning LLMs is that people generally don't have high-quality labeled data. Today, Databricks introduced TAO, a new finetuning method that only needs inputs, no labels necessary. Best of all, it actually beats supervised finetuning on labeled data.

Simran Arora (@simran_s_arora) 's Twitter Profile Photo

BASED ✌️ turns 1! One year since its launch at NeurIPS 2023, and it's helped shape the new wave of efficient LMs.
⚡️ Fastest linear attention kernels
🧠 405B models trained on 16 GPUs
💥 Inspired Mamba-v2, RWKVs, MiniMax
Check out our retrospective below!