Cade Daniel 🇺🇸 (@cdnamz) 's Twitter Profile
Cade Daniel 🇺🇸

@cdnamz

systems performance

ID: 721517072

Link: https://www.linkedin.com/in/cade-daniel · Joined: 28-07-2012 04:53:31

343 Tweets

1.1K Followers

962 Following

Arthur Douillard (@ar_douillard) 's Twitter Profile Photo

KV Prediction for Improved Time to First Token

LLM inference can be split into two phases: Prefilling and Decoding.

The decoding phase is autoregressive: tokens are generated one by one, re-using the previous Key/Value tensors held in the KV-cache. To speed up that
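For anyone new to that split, here is a minimal single-head sketch of the baseline behaviour being described: prefill builds the KV-cache for the whole prompt in one pass, and decode appends one Key/Value pair per generated token. All dimensions and weights are toy placeholders; this is not the paper's KV Prediction method.

```python
import torch

# Toy single-head attention illustrating prefill vs. decode with a KV-cache.
# Everything here (dimensions, weights) is a placeholder; this is NOT the
# paper's KV Prediction method, just the baseline behaviour it speeds up.

torch.manual_seed(0)
d = 64                                   # head dimension (illustrative)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attend(q, K, V):
    # q: (1, d), K/V: (t, d) -> (1, d)
    scores = (q @ K.T) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ V

# Prefill: process every prompt token in one pass and build the KV-cache.
prompt = torch.randn(10, d)              # 10 prompt-token embeddings
K_cache, V_cache = prompt @ Wk, prompt @ Wv

# Decode: one token per step, re-using the cache instead of recomputing
# K/V for the whole prefix.
x = torch.randn(1, d)                    # stand-in for the first generated token
for _ in range(5):
    K_cache = torch.cat([K_cache, x @ Wk])   # append this step's Key
    V_cache = torch.cat([V_cache, x @ Wv])   # append this step's Value
    x = attend(x @ Wq, K_cache, V_cache)     # stand-in for the next token's embedding
```

Since prefill has to run over the entire prompt before the first output token appears, it is the phase that dominates time-to-first-token, which is what the paper targets.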
Simran Arora (@simran_s_arora) 's Twitter Profile Photo

Want Llama 405B, but wish it scaled linearly in sequence length??? Enter LoLCATS: an efficient method for "turning Transformers to linear attention models", all on an academic budget!! 

We use LoLCATS to linearize the *full Llama 3.1 model family* for the first time – 20+ points
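The "scales linearly" part comes from the standard linear-attention identity: replace softmax(QKᵀ)V with φ(Q)(φ(K)ᵀV) for some feature map φ, and matmul associativity lets you build a small (d × d) state φ(K)ᵀV instead of the (n × n) score matrix. Below is a minimal non-causal sketch of that generic trick; the ELU+1 feature map is a common placeholder I chose, and none of this is the LoLCATS training recipe itself (which learns its feature maps by matching the original softmax attention).

```python
import torch

# Generic (non-causal) linear attention: O(n) in sequence length because the
# n x n score matrix is never formed. Illustrative only; not LoLCATS itself.

def phi(x):
    return torch.nn.functional.elu(x) + 1          # common positive feature map (placeholder)

def linear_attention(Q, K, V):
    # Q, K: (n, d), V: (n, d_v)
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                                   # (d, d_v) state: cost independent of n
    Z = Qf @ Kf.sum(dim=0, keepdim=True).T          # (n, 1) normalizer
    return (Qf @ KV) / (Z + 1e-6)

n, d = 4096, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
print(linear_attention(Q, K, V).shape)              # torch.Size([4096, 64])
```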
Andreas Köpf (@neurosp1ke) 's Twitter Profile Photo

If you are interested in the latest GPU MODE news (upcoming lectures, videos etc.) please follow our new official twitter/x account: GPU MODE

vLLM (@vllm_project) 's Twitter Profile Photo

Speculative decoding is one of the best tools in vLLM's suite of inference optimizations, accelerating inference without accuracy loss. Check out our blog post for more details on the state of spec decode in vLLM today! 🧵 blog.vllm.ai/2024/10/17/spe…
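The core idea, for readers who haven't seen it: a cheap draft model proposes several tokens, and the large target model verifies all of them in a single forward pass, keeping the longest prefix it agrees with. Here is a greedy toy sketch of one such step; the model callables are stand-ins I made up, and vLLM's real implementation uses rejection sampling so the output distribution matches the target model exactly rather than this greedy check.

```python
import torch

# Toy draft-then-verify step (greedy variant). `draft_model` / `target_model`
# are stand-ins for any callables mapping a 1-D tensor of token ids to
# per-position next-token logits of shape (seq_len, vocab). This illustrates
# the idea, not vLLM's actual spec-decode implementation.

def speculative_step(target_model, draft_model, tokens, k=4):
    # 1) Draft: the small model cheaply proposes k tokens autoregressively.
    draft = tokens
    for _ in range(k):
        nxt = draft_model(draft)[-1].argmax()
        draft = torch.cat([draft, nxt.view(1)])
    proposed = draft[len(tokens):]

    # 2) Verify: ONE target-model pass scores all k proposed positions at once.
    logits = target_model(draft)                               # (len(draft), vocab)
    target_next = logits[len(tokens) - 1 : -1].argmax(dim=-1)  # target's pick at each proposed slot

    # 3) Accept the longest matching prefix; the first mismatch (or a bonus
    #    token if everything matched) comes from the target model for free.
    n_accept = 0
    for p, t in zip(proposed.tolist(), target_next.tolist()):
        if p != t:
            break
        n_accept += 1
    if n_accept == k:
        accepted = torch.cat([proposed, logits[-1].argmax().view(1)])
    else:
        accepted = target_next[: n_accept + 1]
    return torch.cat([tokens, accepted])
```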

Michael Matthews @ ICLR 2025 (@mitrma) 's Twitter Profile Photo

🍎 The core of Kinetix is our new 2D rigid body physics engine: Jax2D. This is a minimal rewrite of the classic Box2D engine made by Erin Catto. Jax2D allows us to run thousands of heterogeneous parallel environments on a single GPU (yes, you can vmap over different tasks!) 8/

𝚐𝔪𝟾𝚡𝚡𝟾 (@gm8xx8) 's Twitter Profile Photo

Pie: Pooling CPU Memory for LLM Inference
paper: arxiv.org/abs/2411.09317

Pie is an LLM inference framework that tackles the memory challenges of large models by enabling efficient GPU-CPU memory swapping and adaptive expansion. It optimizes memory usage without increasing
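Not Pie's API or its policies, but the plain-PyTorch mechanics this kind of swapping builds on look roughly like this: park a cache block in pinned host memory and bring it back on a side stream so the copy can overlap with compute.

```python
import torch

# Plain-PyTorch sketch of GPU<->CPU swapping mechanics (illustrative only;
# Pie's own runtime, policies, and "adaptive expansion" are not shown here).

assert torch.cuda.is_available()
block = torch.randn(256, 16, 128, device="cuda")      # a hypothetical KV-cache block

# Swap out: device -> pinned host memory (pinning allows asynchronous DMA copies).
cpu_slot = torch.empty(block.shape, dtype=block.dtype, pin_memory=True)
cpu_slot.copy_(block, non_blocking=True)

# Swap in: issue the copy on a side stream so it can overlap with other kernels,
# then make the compute stream wait only when the block is actually needed.
prefetch = torch.cuda.Stream()
with torch.cuda.stream(prefetch):
    block_again = cpu_slot.to("cuda", non_blocking=True)
torch.cuda.current_stream().wait_stream(prefetch)
```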

Rohan Choudhury (@rchoudhury997) 's Twitter Profile Photo

Excited to finally release our NeurIPS 2024 (spotlight) paper! We introduce Run-Length Tokenization (RLT), a simple way to significantly speed up your vision transformer on video with no loss in performance!

Vima Gupta (@vima_gupta) 's Twitter Profile Photo

1/7 🧵 MoEs: A tale of expectation vs reality

Marketing: "Only compute the expert parameters you need!"
Reality: Batch 16 requests → ALL experts activate
At serving time (vLLM/TGI), arithmetic intensity:
AI ≈ (num_tokens * top_k) / total_experts
In simpler terms: Your decode
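Plugging small numbers into that estimate makes the point concrete. The configuration below (8 experts, top-2 routing, uniform routing) is my own illustrative choice, not a number from the thread.

```python
# Back-of-envelope check of the arithmetic-intensity estimate above.
# 8 experts with top-2 routing and uniform routing are illustrative
# assumptions, not numbers from the thread.

num_tokens, top_k, total_experts = 16, 2, 8

# Tokens each activated expert processes per decode step, on average:
ai = num_tokens * top_k / total_experts
print(f"AI ~ {ai:.1f} tokens per expert")       # ~4.0: each expert's weights are
                                                # streamed from HBM for only ~4 tokens

# Probability that a given expert receives no token at all (uniform routing):
p_idle = (1 - top_k / total_experts) ** num_tokens
print(f"P(expert idle) ~ {p_idle:.3f}")         # ~0.010: at batch 16, essentially
                                                # every expert gets activated
```

So at batch 16 virtually every expert's weights are read from HBM, yet each read is amortized over only a handful of tokens, which is why MoE decode tends to stay memory-bound rather than enjoying the advertised compute savings.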
Suhail (@suhail) 's Twitter Profile Photo

Once the AI labs realize they need to make products for survival, they will immediately reformulate their strategy to competing with the most obvious working thing that is vaguely under the guise of the original mission. You should presume you will be ruthlessly copied.

Simon Guo 🦝 (@simonguozirui) 's Twitter Profile Photo

LLMs for GPU kernel🌽generation have been getting Pop🍿ular since our preview last Dec; excited to announce 📢 our full paper 📃 for KernelBench!

Turns out KernelBench is quite challenging 🧠 — frontier models outperform the PyTorch Eager baseline <20% of the time.

More 🧵👇
Shanli Xing (@0xsling0) 's Twitter Profile Photo

🚀Meet flashinfer.sampling—our sorting-free GPU kernels for lightning-fast #LLM sampling.

Our implementation achieves over 50% reduction in sampling time.

Blog post: flashinfer.ai/2025/03/10/sam…
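A standard top-p implementation sorts the whole vocabulary every step; the sorting-free approach replaces that with rejection sampling. Below is a rough, unoptimized sketch of one such rejection scheme, my own illustration of the general idea rather than FlashInfer's fused kernel.

```python
import torch

# Rejection-style top-p sampling with no vocabulary sort (illustrative sketch;
# FlashInfer's actual GPU kernels are fused and far more efficient).

def top_p_sample_sorting_free(probs, p=0.9):
    pivot = 0.0
    while True:
        # Sample from the distribution restricted to tokens above the current pivot.
        masked = torch.where(probs > pivot, probs, torch.zeros_like(probs))
        idx = int(torch.multinomial(masked, 1))
        # Accept iff the sampled token lies inside the top-p nucleus, i.e. the
        # mass of strictly-more-probable tokens is still below p.
        if probs[probs > probs[idx]].sum() < p:
            return idx
        pivot = float(probs[idx])   # rejected: tokens this likely or less are outside the nucleus

probs = torch.softmax(torch.randn(32000), dim=-1)   # toy next-token distribution
print(top_p_sample_sorting_free(probs, p=0.9))
```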
Hongyang Zhang (@hongyangzh) 's Twitter Profile Photo

Jointly announcing EAGLE-3 with SGLang: Setting a new record in LLM inference acceleration!
- 5x 🚀 than vanilla (on HF)
- 1.4x 🚀 than EAGLE-2 (on HF)
- A record of ~400 TPS on Llama 3.1 8B with a single H100 (on SGLang)
- 1.65x 🚀 in latency even for large bs=64 (on SGLang)
- A new

Hrishbh Dalal (@hrishbhdalal) 's Twitter Profile Photo

What if we could teach LLMs to be algorithm inventors?

I trained an LLM to improve sorting algorithms through pure reinforcement learning - and it discovered optimizations giving 47.92x speedups over an optimized Python-based Timsort baseline! No cold-start data needed.

I used
Jonathan Frankle (@jefrankle) 's Twitter Profile Photo

The hardest part about finetuning LLMs is that people generally don't have high-quality labeled data. Today, Databricks introduced TAO, a new finetuning method that only needs inputs, no labels necessary. Best of all, it actually beats supervised finetuning on labeled data.

Simran Arora (@simran_s_arora) 's Twitter Profile Photo

BASED ✌️ turns 1! One year since its launch at NeurIPS 2023, and it's helped shape the new wave of efficient LMs.
⚡️ Fastest linear attention kernels
🧠 405B models trained on 16 GPUs
💥 Inspired Mamba-v2, RWKVs, MiniMax
Check out our retrospective below!