hazyresearch (@hazyresearch) 's Twitter Profile
hazyresearch

@hazyresearch

A research group in @StanfordAILab working on the foundations of machine learning & systems. hazyresearch.stanford.edu Ostensibly supervised by Chris Ré

ID: 747538968

Link: http://cs.stanford.edu/people/chrismre/ · Joined: 09-08-2012 16:46:27

1.1K Tweets

8.8K Followers

1.1K Following

Austin Silveria (@austinsilveria) 's Twitter Profile Photo

chipmunk is up on arxiv!

across HunyuanVideo and Flux.1-dev, 5-25% of the intermediate activation values in attention and MLPs account for 70-90% of the change in activations across steps

caching + sparsity speeds up generation by recomputing only the fast-changing activations
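
A minimal sketch of the caching + sparsity idea above, using a hypothetical DeltaCache wrapper and a compute_columns callback rather than Chipmunk's actual API: reuse last step's activations and refresh only the columns that changed most on the previous step.

```python
# Illustrative cross-step activation caching with sparse recomputation
# (hypothetical interface, not Chipmunk's kernels).
import torch

class DeltaCache:
    def __init__(self, keep_frac=0.25):
        self.keep_frac = keep_frac
        self.prev = None    # activations cached from the previous diffusion step
        self.delta = None   # per-column change observed on the previous step

    def step(self, compute_columns, num_cols):
        # compute_columns(idx) is assumed to return the requested output columns
        # of an attention/MLP block, shape (rows, len(idx)).
        if self.delta is None:
            out = compute_columns(torch.arange(num_cols))   # warm-up: dense compute
        else:
            k = max(1, int(self.keep_frac * num_cols))
            idx = self.delta.topk(k).indices                # fastest-changing columns
            out = self.prev.clone()
            out[:, idx] = compute_columns(idx)              # recompute only ~5-25%
        if self.prev is not None:
            self.delta = (out - self.prev).abs().mean(dim=0)
        self.prev = out.detach()
        return out
```

A real kernel would fuse the gather/scatter into the attention and MLP matmuls and periodically run a full dense step to refresh the cache; this sketch only shows the bookkeeping that decides what gets recomputed.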
Jordan Juravsky (@jordanjuravsky) 's Twitter Profile Photo

Happy Throughput Thursday! We’re excited to release Tokasaurus: an LLM inference engine designed from the ground up for high-throughput workloads with large and small models.

(Joint work with Ayush Chakravarthy, Ryan Ehrlich, Sabri Eyuboglu, Bradley Brown, Joseph Shetaye,
Azalia Mirhoseini (@azaliamirh) 's Twitter Profile Photo

In the test-time scaling era, we all would love a higher-throughput serving engine! Introducing Tokasaurus, an LLM inference engine for high-throughput workloads with large and small models!

Led by Jordan Juravsky, in collaboration with hazyresearch and an amazing team!
Infini-AI-Lab (@infiniailab) 's Twitter Profile Photo

🥳 Happy to share our new work –  Kinetics: Rethinking Test-Time Scaling Laws

🤔How to effectively build a powerful reasoning agent?

Existing compute-optimal scaling laws suggest 64K thinking tokens + 1.7B model > 32B model.
But it only shows half of the picture!

🚨 The O(N²)
Beidi Chen (@beidichen) 's Twitter Profile Photo

📢 Can't be more excited about this scaling law study. It reveals two important points:

(1) The current test-time strategies are not scalable (bottlenecked by O(N²) memory access) w.r.t. the nature of hardware (FLOPS grows much faster than memory bandwidth)
(2) While
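
A back-of-the-envelope illustration of point (1), with made-up model numbers rather than the paper's accounting: per-token parameter FLOPs stay flat as the chain of thought grows, while each new token must read the entire KV cache built so far, so total attention memory traffic scales as O(N²).

```python
# Rough numbers (assumed ~1.7B-class shape, fp16 KV cache, full multi-head KV;
# GQA/MQA shrink the constant but not the O(N^2) growth).
layers, d_model, bytes_per_val = 28, 2048, 2
params = 1.7e9
N = 64_000                                                   # thinking tokens

kv_bytes_per_token_per_layer = 2 * d_model * bytes_per_val   # one K row + one V row
total_kv_read = layers * kv_bytes_per_token_per_layer * N * N / 2  # token t reads ~t rows
total_param_flops = 2 * params * N                           # ~2 FLOPs per param per token

print(f"KV-cache reads:  {total_kv_read / 1e12:.0f} TB   (grows ~N^2)")
print(f"Parameter FLOPs: {total_param_flops / 1e15:.2f} PFLOPs (grows ~N)")
```

Because hardware FLOPS have grown much faster than memory bandwidth, the quadratic read traffic, not the raw compute, becomes the bottleneck for long reasoning traces on small models.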

CMU School of Computer Science (@scsatcmu) 's Twitter Profile Photo

Virginia Smith, the Leonardo Associate Professor of Machine Learning, has received the Air Force Office of Scientific Research 2025 Young Investigator award. cs.cmu.edu/news/2025/smit…

Sabri Eyuboglu (@eyuboglusabri) 's Twitter Profile Photo

When we put lots of text (eg a code repo) into LLM context, cost soars b/c of the KV cache’s size.

What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory by 39x on average.
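
A rough sketch of what training a document-specific KV cache offline could look like, written against a generic HuggingFace-style interface; the cache shapes, the training loop, and the synthesize_qa helper are my placeholders, not the actual self-study recipe.

```python
# Hypothetical sketch: distill a corpus into a small, trainable KV cache for a
# frozen model (generic HF-style interface; synthesize_qa is a placeholder helper).
import torch

def train_cartridge(model, tokenizer, corpus_chunks, cache_len=512, steps=1000, lr=1e-3):
    cfg = model.config
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    for p in model.parameters():
        p.requires_grad_(False)                      # the base model stays frozen
    # One trainable (K, V) pair per layer, far shorter than the raw corpus context.
    cache = [(torch.nn.Parameter(0.02 * torch.randn(1, cfg.num_attention_heads, cache_len, head_dim)),
              torch.nn.Parameter(0.02 * torch.randn(1, cfg.num_attention_heads, cache_len, head_dim)))
             for _ in range(cfg.num_hidden_layers)]
    opt = torch.optim.Adam([p for kv in cache for p in kv], lr=lr)

    for _ in range(steps):
        # "Self-study" idea: quiz the model about its own corpus and train the cache so
        # the frozen model answers correctly *without* the full documents in context.
        question, answer = synthesize_qa(model, tokenizer, corpus_chunks)  # placeholder
        ids = tokenizer(question + answer, return_tensors="pt").input_ids
        out = model(ids, past_key_values=cache, labels=ids)
        out.loss.backward()                          # gradients flow only into the cache
        opt.step(); opt.zero_grad()
    return cache   # reuse at inference instead of re-prefilling the full corpus
```

The cache is trained once per corpus offline and then served to every request that needs those documents, which is where the savings over a full prefilled KV cache would come from.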
Simran Arora (@simran_s_arora) 's Twitter Profile Photo

There’s been tons of work on KV-cache compression and KV-cache free Transformer-alternatives (SSMs, linear attention) models for long-context, but we know there’s no free lunch with these methods. The quality-memory tradeoffs are annoying. *Is all lost?* Introducing CARTRIDGES:
Hermann (@kumbonghermann) 's Twitter Profile Photo

Excited to be presenting our new work–HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation– at #CVPR2025 this week.

VAR (Visual Autoregressive Modelling) introduced a very nice way to formulate autoregressive image generation as a next-scale prediction task (from
Dan Fu (@realdanfu) 's Twitter Profile Photo

Announcing HMAR - Efficient Hierarchical Masked Auto-Regressive Image Generation, led by Hermann! HMAR is hardware-efficient: it reformulates autoregressive image generation in a way that can take advantage of tensor cores. Hermann is presenting it at CVPR this week!
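
For context, a high-level sketch of the next-scale prediction loop that VAR introduced and HMAR builds on; `transformer`, `quantizer_decode`, and the scale schedule are illustrative placeholders, not the released code.

```python
# Illustrative VAR-style next-scale generation loop (placeholder components).
import torch

def generate(transformer, quantizer_decode, scales=(1, 2, 4, 8, 16)):
    """Autoregress over scales: each step predicts an entire token map at the next
    resolution in parallel, conditioned on all coarser maps generated so far."""
    context = []                                         # token maps from coarser scales
    for s in scales:
        logits = transformer(context, target_scale=s)    # (s*s, vocab_size)
        tokens = torch.distributions.Categorical(logits=logits).sample()
        context.append(tokens.view(s, s))                # finer scales condition on this map
    return quantizer_decode(context)                     # VQ decoder: token maps -> pixels
```

Per its title, HMAR layers masked prediction within each scale on top of this coarse-to-fine loop and restructures the computation to be friendlier to tensor cores; the sketch above only shows the plain next-scale ordering.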

Cartesia (@cartesia_ai) 's Twitter Profile Photo

Building voice agents? Meet Ink-Whisper: the fastest, most affordable streaming speech-to-text model. 

 🌎 Optimized for accuracy in real-world conditions
 👯 Pair with our Sonic text-to-speech → fastest duo in voice AI
 🔌 Plugs into Vapi, Pipecat AI, LiveKit 

Read more:
Chris Lattner (@clattner_llvm) 's Twitter Profile Photo

From a single (small!) binary, Modular provides industry-leading performance on AMD MI300/325 (up to 50% faster than vLLM 0.9!), runs at top speed on NVIDIA H100, and is previewing SotA Blackwell support. It’s also the best way to accel trad Python! 😘

Alex Ratner (@ajratner) 's Twitter Profile Photo

Scale alone is not enough for AI data. Quality and complexity are equally critical. Excited to support all of these for LLM developers with Snorkel AI Data-as-a-Service, and to share our new leaderboard! — Our decade-plus of research and work in AI data has a simple point:

James Zou (@james_y_zou) 's Twitter Profile Photo

Excited to introduce Open Data Scientist:
✅ outperforms Gemini data science agent
✅ solves real Kaggle tasks
✅ fully open source, easy to adapt
✅ sandbox for safe exec

Step-by-step tutorial on building our agent: together.ai/blog/building-…

Great job Federico Bianchi Shang Zhu

Cartesia (@cartesia_ai) 's Twitter Profile Photo

👑 We’re #1! Sonic-2 leads @Labelbox’s Speech Generation Leaderboard, topping out in speech quality, word error rate, and naturalness. Build your real-time voice apps with the 🥇 best voice AI model. ➡️ labelbox.com/leaderboards/s…

Beidi Chen (@beidichen) 's Twitter Profile Photo

Say hello to Multiverse — the Everything Everywhere All At Once of generative modeling.

💥 Lossless, adaptive, and gloriously parallel
🌀 Now open-sourced: multiverse4fm.github.io

I was amazed how easily we could extract the intrinsic parallelism of even SOTA autoregressive

soham (@sohamgovande) 's Twitter Profile Photo

Chipmunks can now hop across multiple GPU architectures (sm_80, sm_89, sm_90). You can get a 1.4-3x lossless speedup when generating videos on A100s, 4090s, and H100s!

Chipmunks also play with more open-source models: Mochi, Wan, & others (w/ tutorials for integration) 🐿️