hazyresearch (@hazyresearch) 's Twitter Profile
hazyresearch

@hazyresearch

A research group in @StanfordAILab working on the foundations of machine learning & systems. hazyresearch.stanford.edu Ostensibly supervised by Chris Ré

ID: 747538968

Link: http://cs.stanford.edu/people/chrismre/ · Joined: 09-08-2012 16:46:27

1.1K Tweets

8.8K Followers

1.1K Following

Austin Silveria (@austinsilveria) 's Twitter Profile Photo

chipmunk is up on arxiv!

across HunyuanVideo and Flux.1-dev, 5-25% of the intermediate activation values in attention and MLPs account for 70-90% of the change in activations across steps

caching + sparsity speeds up generation by recomputing only the fast-changing activations
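
A minimal sketch of the caching + sparsity idea above, using a hypothetical DeltaCache wrapper and a compute_columns callback rather than Chipmunk's actual API: reuse last step's activations and refresh only the columns that changed most on the previous step.

```python
# Illustrative cross-step activation caching with sparse recomputation
# (hypothetical interface, not Chipmunk's kernels).
import torch

class DeltaCache:
    def __init__(self, keep_frac=0.25):
        self.keep_frac = keep_frac
        self.prev = None    # activations cached from the previous diffusion step
        self.delta = None   # per-column change observed on the previous step

    def step(self, compute_columns, num_cols):
        # compute_columns(idx) is assumed to return the requested output columns
        # of an attention/MLP block, shape (rows, len(idx)).
        if self.delta is None:
            out = compute_columns(torch.arange(num_cols))   # warm-up: dense compute
        else:
            k = max(1, int(self.keep_frac * num_cols))
            idx = self.delta.topk(k).indices                # fastest-changing columns
            out = self.prev.clone()
            out[:, idx] = compute_columns(idx)              # recompute only ~5-25%
        if self.prev is not None:
            self.delta = (out - self.prev).abs().mean(dim=0)
        self.prev = out.detach()
        return out
```

A real kernel would fuse the gather/scatter into the attention and MLP matmuls and periodically run a full dense step to refresh the cache; this sketch only shows the bookkeeping that decides what gets recomputed.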
Jordan Juravsky (@jordanjuravsky) 's Twitter Profile Photo

Happy Throughput Thursday! We’re excited to release Tokasaurus: an LLM inference engine designed from the ground up for high-throughput workloads with large and small models.

(Joint work with Ayush Chakravarthy, Ryan Ehrlich, Sabri Eyuboglu, Bradley Brown, Joseph Shetaye,
Azalia Mirhoseini (@azaliamirh) 's Twitter Profile Photo

In the test-time scaling era, we all would love a higher-throughput serving engine! Introducing Tokasaurus, an LLM inference engine for high-throughput workloads with large and small models!

Led by Jordan Juravsky, in collaboration with hazyresearch and an amazing team!
Infini-AI-Lab (@infiniailab) 's Twitter Profile Photo

🥳 Happy to share our new work –  Kinetics: Rethinking Test-Time Scaling Laws

🤔How to effectively build a powerful reasoning agent?

Existing compute-optimal scaling laws suggest 64K thinking tokens + 1.7B model > 32B model.
But it only shows half of the picture!

🚨 The O(N²)
Beidi Chen (@beidichen) 's Twitter Profile Photo

📢 Can't be more excited about this scaling law study. It reveals two important points:

(1) The current test-time strategies are not scalable (bottlenecked by O(N²) memory access) w.r.t. the nature of hardware (FLOPS grows much faster than memory bandwidth)
(2) While
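
A back-of-the-envelope illustration of point (1), with made-up model numbers rather than the paper's accounting: per-token parameter FLOPs stay flat as the chain of thought grows, while each new token must read the entire KV cache built so far, so total attention memory traffic scales as O(N²).

```python
# Rough numbers (assumed ~1.7B-class shape, fp16 KV cache, full multi-head KV;
# GQA/MQA shrink the constant but not the O(N^2) growth).
layers, d_model, bytes_per_val = 28, 2048, 2
params = 1.7e9
N = 64_000                                                   # thinking tokens

kv_bytes_per_token_per_layer = 2 * d_model * bytes_per_val   # one K row + one V row
total_kv_read = layers * kv_bytes_per_token_per_layer * N * N / 2  # token t reads ~t rows
total_param_flops = 2 * params * N                           # ~2 FLOPs per param per token

print(f"KV-cache reads:  {total_kv_read / 1e12:.0f} TB   (grows ~N^2)")
print(f"Parameter FLOPs: {total_param_flops / 1e15:.2f} PFLOPs (grows ~N)")
```

Because hardware FLOPS have grown much faster than memory bandwidth, the quadratic read traffic, not the raw compute, becomes the bottleneck for long reasoning traces on small models.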

CMU School of Computer Science (@scsatcmu) 's Twitter Profile Photo

Virginia Smith, the Leonardo Associate Professor of Machine Learning, has received the Air Force Office of Scientific Research 2025 Young Investigator award. cs.cmu.edu/news/2025/smit…

Sabri Eyuboglu (@eyuboglusabri) 's Twitter Profile Photo

When we put lots of text (eg a code repo) into LLM context, cost soars b/c of the KV cache’s size.

What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory by 39x on average.
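
A rough sketch of what training a document-specific KV cache offline could look like, written against a generic HuggingFace-style interface; the cache shapes, the training loop, and the synthesize_qa helper are my placeholders, not the actual self-study recipe.

```python
# Hypothetical sketch: distill a corpus into a small, trainable KV cache for a
# frozen model (generic HF-style interface; synthesize_qa is a placeholder helper).
import torch

def train_cartridge(model, tokenizer, corpus_chunks, cache_len=512, steps=1000, lr=1e-3):
    cfg = model.config
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    for p in model.parameters():
        p.requires_grad_(False)                      # the base model stays frozen
    # One trainable (K, V) pair per layer, far shorter than the raw corpus context.
    cache = [(torch.nn.Parameter(0.02 * torch.randn(1, cfg.num_attention_heads, cache_len, head_dim)),
              torch.nn.Parameter(0.02 * torch.randn(1, cfg.num_attention_heads, cache_len, head_dim)))
             for _ in range(cfg.num_hidden_layers)]
    opt = torch.optim.Adam([p for kv in cache for p in kv], lr=lr)

    for _ in range(steps):
        # "Self-study" idea: quiz the model about its own corpus and train the cache so
        # the frozen model answers correctly *without* the full documents in context.
        question, answer = synthesize_qa(model, tokenizer, corpus_chunks)  # placeholder
        ids = tokenizer(question + answer, return_tensors="pt").input_ids
        out = model(ids, past_key_values=cache, labels=ids)
        out.loss.backward()                          # gradients flow only into the cache
        opt.step(); opt.zero_grad()
    return cache   # reuse at inference instead of re-prefilling the full corpus
```

The cache is trained once per corpus offline and then served to every request that needs those documents, which is where the savings over a full prefilled KV cache would come from.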
Simran Arora (@simran_s_arora) 's Twitter Profile Photo

There’s been tons of work on KV-cache compression and KV-cache free Transformer-alternatives (SSMs, linear attention) models for long-context, but we know there’s no free lunch with these methods. The quality-memory tradeoffs are annoying. *Is all lost?* Introducing CARTRIDGES:
Hermann (@kumbonghermann) 's Twitter Profile Photo

Excited to be presenting our new work–HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation– at #CVPR2025 this week.

VAR (Visual Autoregressive Modelling) introduced a very nice way to formulate autoregressive image generation as a next-scale prediction task (from
Dan Fu (@realdanfu) 's Twitter Profile Photo

Announcing HMAR - Efficient Hierarchical Masked Auto-Regressive Image Generation, led by Hermann! HMAR is hardware-efficient: it reformulates autoregressive image generation in a way that can take advantage of tensor cores. Hermann is presenting it at CVPR this week!
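
For context, a high-level sketch of the next-scale prediction loop that VAR introduced and HMAR builds on; `transformer`, `quantizer_decode`, and the scale schedule are illustrative placeholders, not the released code.

```python
# Illustrative VAR-style next-scale generation loop (placeholder components).
import torch

def generate(transformer, quantizer_decode, scales=(1, 2, 4, 8, 16)):
    """Autoregress over scales: each step predicts an entire token map at the next
    resolution in parallel, conditioned on all coarser maps generated so far."""
    context = []                                         # token maps from coarser scales
    for s in scales:
        logits = transformer(context, target_scale=s)    # (s*s, vocab_size)
        tokens = torch.distributions.Categorical(logits=logits).sample()
        context.append(tokens.view(s, s))                # finer scales condition on this map
    return quantizer_decode(context)                     # VQ decoder: token maps -> pixels
```

Per its title, HMAR layers masked prediction within each scale on top of this coarse-to-fine loop and restructures the computation to be friendlier to tensor cores; the sketch above only shows the plain next-scale ordering.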

Cartesia (@cartesia_ai) 's Twitter Profile Photo

Building voice agents? Meet Ink-Whisper: the fastest, most affordable streaming speech-to-text model. 

 🌎 Optimized for accuracy in real-world conditions
 👯 Pair with our Sonic text-to-speech → fastest duo in voice AI
 🔌 Plugs into Vapi, Pipecat AI, LiveKit 

Read more:
Chris Lattner (@clattner_llvm) 's Twitter Profile Photo

From a single (small!) binary, Modular provides industry-leading performance on AMD MI300/325 (up to 50% faster than vLLM 0.9!), runs at top speed on NVIDIA H100, and is previewing SotA Blackwell support. It’s also the best way to accel trad Python! 😘

Alex Ratner (@ajratner) 's Twitter Profile Photo

Scale alone is not enough for AI data. Quality and complexity are equally critical. Excited to support all of these for LLM developers with Snorkel AI Data-as-a-Service, and to share our new leaderboard! — Our decade-plus of research and work in AI data has a simple point:

James Zou (@james_y_zou) 's Twitter Profile Photo

Excited to introduce Open Data Scientist:
✅ outperforms Gemini data science agent
✅ solves real Kaggle tasks
✅ fully open source, easy to adapt
✅ sandbox for safe exec

Step-by-step tutorial on building our agent: together.ai/blog/building-…

Great job Federico Bianchi Shang Zhu

Cartesia (@cartesia_ai) 's Twitter Profile Photo

👑 We’re #1! Sonic-2 leads @Labelbox’s Speech Generation Leaderboard, topping out in speech quality, word error rate, and naturalness. Build your real-time voice apps with the 🥇 best voice AI model. ➡️ labelbox.com/leaderboards/s…

Beidi Chen (@beidichen) 's Twitter Profile Photo

Say hello to Multiverse — the Everything Everywhere All At Once of generative modeling.

💥 Lossless, adaptive, and gloriously parallel
🌀 Now open-sourced: multiverse4fm.github.io

I was amazed how easily we could extract the intrinsic parallelism of even SOTA autoregressive

soham (@sohamgovande) 's Twitter Profile Photo

Chipmunks can now hop across multiple GPU architectures (sm_80, sm_89, sm_90). You can get a 1.4-3x lossless speedup when generating videos on A100s, 4090s, and H100s!

Chipmunks also play with more open-source models: Mochi, Wan, & others (w/ tutorials for integration) 🐿️