Beidi Chen (@beidichen)'s Twitter Profile
Beidi Chen

@beidichen

Asst. Prof @CarnegieMellon, Visiting Researcher @Meta, Postdoc @Stanford, Ph.D. @RiceUniversity, Large-Scale ML, a fan of Dota2.

ID: 424387623

Link: https://www.andrew.cmu.edu/user/beidic/ · Joined: 29-11-2011 18:22:36

461 Tweets

14.14K Followers

375 Following

Jordan Juravsky (@jordanjuravsky)'s Twitter Profile Photo

Happy Throughput Thursday! We’re excited to release Tokasaurus: an LLM inference engine designed from the ground up for high-throughput workloads with large and small models.

(Joint work with Ayush Chakravarthy, Ryan Ehrlich, Sabri Eyuboglu, Bradley Brown, Joseph Shetaye,
Beidi Chen (@beidichen)'s Twitter Profile Photo

📢 Can't be more excited about this scaling law study. It reveals two important points: (1) the current test-time strategies are not scalable (bottlenecked by O(N^2) memory access) w.r.t. the nature of hardware (FLOPS grows much faster than memory bandwidth); (2) while…
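
The "memory grows faster than compute" point is easy to sanity-check with rough numbers. Below is a back-of-the-envelope sketch in Python; every figure in it (layer count, KV heads, head dim, H100-class FLOP/s and HBM bandwidth) is an illustrative assumption, not a number from the study:

```python
# Back-of-the-envelope: per decoded token, attention re-reads the whole KV
# cache, so KV bytes over an N-token generation grow as O(N^2), while
# per-token FLOPs stay roughly constant (O(N) total).
# All model/hardware numbers below are assumptions for illustration.

def decode_costs(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                 bytes_per_elem=2, params=8e9):
    # Bytes to read K and V for one cached position across all layers.
    kv_bytes_per_pos = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    total_kv_bytes = kv_bytes_per_pos * n_tokens * (n_tokens + 1) // 2  # O(N^2)
    total_flops = 2 * params * n_tokens                                 # O(N)
    return total_kv_bytes, total_flops

for n in (1_000, 10_000, 100_000):
    kv, fl = decode_costs(n)
    # Assumed H100-class GPU: ~1e15 FLOP/s dense BF16, ~3.35e12 B/s HBM.
    print(f"N={n:>7}: KV-read time {kv / 3.35e12:7.1f}s  vs  FLOP time {fl / 1e15:5.1f}s")
```

Under these assumed numbers, KV-read time overtakes FLOP time well before 10K generated tokens, which is exactly the memory-bandwidth wall the tweet is pointing at.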

Tim Dettmers (@tim_dettmers)'s Twitter Profile Photo

You gotta love those scaling laws! Very insightful work. Also one of the few applications of sparsity that seems to work really well. Would love to see more work like this!

gm8xx8 (@gm8xx8)'s Twitter Profile Photo

Kinetics: Rethinking Test-Time Scaling Laws

Dense strategies like BoN and Long-CoT hit an O(N²) KV bottleneck; TTS isn't FLOP-bound, and never really was.

- block top-k sparse attention cuts per-token cost (see the sketch after this list)
- enables longer generations and more parallel trials on…
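
As a rough illustration of the block top-k idea: score KV blocks cheaply, keep only the best few, and attend inside those. This is a minimal PyTorch sketch for a single decode step; the shapes, the block-mean scoring heuristic, and all names are assumptions, not the paper's kernel:

```python
import torch
import torch.nn.functional as F

def block_topk_attention(q, k, v, block_size=64, topk=4):
    # q: (heads, dim); k, v: (heads, seq, dim). Assumes seq divides block_size.
    h, s, d = k.shape
    n_blocks = s // block_size
    kb = k[:, : n_blocks * block_size].view(h, n_blocks, block_size, d)
    # Score each block by the query's dot product with the block's mean key.
    block_scores = torch.einsum("hd,hnd->hn", q, kb.mean(dim=2))
    sel = block_scores.topk(min(topk, n_blocks), dim=-1).indices   # (heads, topk)
    # Expand block ids to position ids and gather only those keys/values:
    # per-token reads scale with topk * block_size, not with seq.
    idx = (sel.unsqueeze(-1) * block_size + torch.arange(block_size)).view(h, -1)
    k_sel = torch.gather(k, 1, idx.unsqueeze(-1).expand(-1, -1, d))
    v_sel = torch.gather(v, 1, idx.unsqueeze(-1).expand(-1, -1, d))
    attn = F.softmax(torch.einsum("hd,hsd->hs", q, k_sel) / d ** 0.5, dim=-1)
    return torch.einsum("hs,hsd->hd", attn, v_sel)

q, k, v = torch.randn(8, 128), torch.randn(8, 4096, 128), torch.randn(8, 4096, 128)
out = block_topk_attention(q, k, v)  # attends to 4*64 of 4096 positions per head
```

A production kernel would fuse the scoring and gathering and cache block summaries across steps; the sketch only shows why per-token memory traffic stops growing with sequence length.
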
Xun Huang (@xunhuang1995)'s Twitter Profile Photo

Real-time video generation is finally real — without sacrificing quality. Introducing Self-Forcing, a new paradigm for training autoregressive diffusion models. The key to high quality? Simulate the inference process during training by unrolling transformers with KV caching.
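
The "unroll the model during training, with a KV cache, exactly as at inference" recipe can be shown on a toy model. Everything below (the one-layer cached decoder, the placeholder loss) is a stand-in to make the training loop concrete; the actual work applies this to autoregressive video diffusion:

```python
import torch
import torch.nn as nn

class TinyCachedDecoder(nn.Module):
    """Toy stand-in: one attention layer decoded step by step with a KV cache."""
    def __init__(self, dim=32):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def step(self, x, cache):
        # x: (batch, dim) for the current step; cache accumulates past K, V.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        cache["k"].append(k); cache["v"].append(v)
        K, V = torch.stack(cache["k"], 1), torch.stack(cache["v"], 1)  # (batch, t, dim)
        w = torch.softmax((q.unsqueeze(1) * K).sum(-1) / K.shape[-1] ** 0.5, dim=1)
        return self.out((w.unsqueeze(-1) * V).sum(1))

model = TinyCachedDecoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Training-time rollout: feed the model its own outputs, as at inference.
x, cache, outs = torch.randn(4, 32), {"k": [], "v": []}, []
for t in range(8):
    x = model.step(x, cache)   # next input is the model's own previous output
    outs.append(x)

opt.zero_grad()
loss = torch.stack(outs).pow(2).mean()  # placeholder loss for the sketch
loss.backward()                         # gradients flow through the whole rollout
opt.step()
```

The point of the pattern is that the training-time computation graph is the same unrolled, KV-cached loop the model will run at inference, so there is no train/test mismatch.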

Aviral Kumar (@aviral_kumar2)'s Twitter Profile Photo

Our view on test-time scaling has been to train models to discover algos that enable them to solve harder problems.

<a href="/setlur_amrith/">Amrith Setlur</a> &amp; <a href="/matthewyryang/">Matthew Yang</a>'s new work e3 shows how RL done with this view produces best &lt;2B LLM on math that extrapolates beyond training budget. 🧵⬇️
Beidi Chen (@beidichen)'s Twitter Profile Photo

Say hello to Multiverse — the Everything Everywhere All At Once of generative modeling. 💥 Lossless, adaptive, and gloriously parallel 🌀 Now open-sourced: multiverse4fm.github.io

I was amazed how easily we could extract the intrinsic parallelism of even SOTA autoregressive…

Xinyu Yang (@xinyu2ml)'s Twitter Profile Photo

🚀 Super excited to share Multiverse! 🏃 It's been a long journey exploring the space between model design and hardware efficiency. What excites me most is realizing that, beyond optimizing existing models, we can discover better model architectures by embracing system-level…

chen zhuoming (@chenzhuoming911)'s Twitter Profile Photo

The calculation of the scaling is, unfortunately, wrong. As we discussed in a recent paper, Kinetics (arxiv.org/abs/2506.05333), the bottleneck of inference-time scaling is KV memory access, rather than FLOPs! Unless your target scenario is Ollama for a single user. (That's…
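
To see why the accounting flips conclusions, compare one token budget spent two ways. This is a hedged illustration using the same O(L²) KV-read approximation as above, not the paper's exact cost model:

```python
# A budget of B tokens costs the same FLOPs (~2 * params * B) either way,
# but KV reads differ: a chain of length L reads ~L^2/2 cached positions.
B = 32_000
one_long_chain    = B ** 2 / 2                 # one chain, L = B
four_short_chains = 4 * (B // 4) ** 2 / 2      # four parallel chains, L = B/4
print(one_long_chain / four_short_chains)      # -> 4.0: long CoT reads 4x more KV
```

So a FLOP-based cost model rates a long chain and parallel short chains as equally expensive, while a memory-based one does not.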

Beidi Chen (@beidichen)'s Twitter Profile Photo

Hello MiniMax (official), exciting model but a questionable claim about its better reasoning scaling than DeepSeek and Qwen. Nice try on reasoning longer to be SOTA, but using FLOPs to quantify the cost of test-time scaling doesn't work for hybrid models 🫣 chen zhuoming has…

Haizhong (@haizhong_zheng)'s Twitter Profile Photo

Rollouts are a major bottleneck in RL training for LLMs. Our newly proposed RL training method, GRESO, lets RL focus on high-value prompts, greatly reducing rollout time and accelerating training. 🚀
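
One way to make "focus on high-value prompts" concrete: skip prompts whose recent rollout groups all earned the same reward, since a zero-variance group gives approximately zero advantage and thus no GRPO-style gradient. The sketch below is a generic, hypothetical illustration of that selective-rollout idea; the function names and filtering rule are not GRESO's actual algorithm:

```python
import random
from collections import defaultdict

reward_history = defaultdict(list)  # prompt_id -> list of past rollout-group rewards

def worth_rolling_out(prompt_id, window=2, explore_prob=0.1):
    recent = reward_history[prompt_id][-window:]
    if len(recent) < window:
        return True  # too little evidence: always roll out new prompts
    # Mixed rewards within a group => nonzero advantages => useful gradient.
    informative = any(len(set(group)) > 1 for group in recent)
    # Occasionally re-check skipped prompts in case the policy has improved.
    return informative or random.random() < explore_prob

def training_step(prompts, rollout_fn):
    selected = [p for p in prompts if worth_rolling_out(p)]
    for p in selected:
        rewards = rollout_fn(p)            # e.g. a GRPO group of sampled answers
        reward_history[p].append(rewards)
    return selected  # only these prompts consumed rollout compute
```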