Zhuohan Li (@zhuohan123) 's Twitter Profile
Zhuohan Li

@zhuohan123

👨🏻‍💻 cs phd @ 🌁 uc berkeley |
building @vllm_project |
machine learning systems |
the real agi is the friends we made along the way

ID: 243517888

Link: https://zhuohan.li | Joined: 27-01-2011 06:29:13

187 Tweets

5.5K Followers

797 Following

NomoreID (@hangsiin) 's Twitter Profile Photo

It seems my concerns were valid.

This is the result of re-running the tests after changing the provider setting from the default (which automatically routed to Groq) to Fireworks.

To emphasize again, the only thing I changed was explicitly fixing the provider in the code. All
Romain Huet (@romainhuet) 's Twitter Profile Photo

Heads-up for developers trying gpt-oss: performance and correctness can vary a bit across providers and runtimes right now due to implementation differences. We’re working with inference providers to make sure gpt-oss performs at its best everywhere, and we’d love your feedback!

vLLM (@vllm_project) 's Twitter Profile Photo

👀 we care a lot about correctness, ran many evals and stared at many tensors to compare them. numerics of vLLM on hopper should be solid and verified! if you run into any correctness issue on vLLM, we would love to know and debug them!
clem 🤗 (@clementdelangue) 's Twitter Profile Photo

Lots of conflicting takes about gpt-oss (yay open-source in the spotlight)! 

We’re powering the official @openai demo gpt-oss.com with HF inference providers thanks to Fireworks AI, Cerebras, Groq Inc and Together AI so we have a front-row seat of
Costa Huang (@vwxyzjn) 's Twitter Profile Photo

To demonstrate vLLM is a hackable for-loop.

You *can* add requests in the middle of generation while still doing batching properly.
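
A rough sketch of that for-loop, for readers who want to try it (an illustrative example, not Costa's code; argument names follow recent vLLM versions and may differ slightly in yours):

```python
# Drive vLLM's LLMEngine step-by-step and inject a new request mid-generation;
# the engine schedules it into the same batch as the requests already in flight.
from vllm import EngineArgs, LLMEngine, SamplingParams

engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-125m"))  # any small model
params = SamplingParams(max_tokens=64)

engine.add_request("req-0", "Write a haiku about GPUs.", params)

step = 0
while engine.has_unfinished_requests():
    outputs = engine.step()              # one scheduling + forward pass
    step += 1
    if step == 5:                        # add another request mid-generation
        engine.add_request("req-1", "Explain paged attention in one line.", params)
    for out in outputs:
        if out.finished:
            print(out.request_id, out.outputs[0].text)
```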
dominik kundel (@dkundel) 's Twitter Profile Photo

Inference providers have worked hard in the last week to make gpt-oss work well on their platforms. We just released a guide to help you verify API-compatibility & run your own evals. Additionally, Artificial Analysis started releasing per-provider evals for AIME, GPQA & IFBench 🧵

Artificial Analysis (@artificialanlys) 's Twitter Profile Photo

We've launched benchmarks of the accuracy of providers offering APIs for gpt-oss-120b

We compare providers by running GPQA Diamond 16 times, AIME25 32 times, and IFBench 8 times. We report the median score across these runs alongside minimum, 25th percentile, 75th percentile and
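
For context, this reporting scheme amounts to summarizing repeated runs per provider with a median and percentile spread; a minimal sketch (the scores below are made-up placeholders, not real benchmark results):

```python
import numpy as np

def summarize(scores):
    """Median plus spread across repeated eval runs for one provider."""
    s = np.asarray(scores, dtype=float)
    return {
        "min": float(s.min()),
        "p25": float(np.percentile(s, 25)),
        "median": float(np.median(s)),
        "p75": float(np.percentile(s, 75)),
        "max": float(s.max()),
    }

# e.g. 16 GPQA Diamond runs for a single provider (placeholder numbers)
gpqa_runs = [0.71, 0.69, 0.72, 0.70, 0.68, 0.73, 0.71, 0.70,
             0.69, 0.72, 0.71, 0.70, 0.74, 0.69, 0.71, 0.70]
print(summarize(gpqa_runs))
```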
dominik kundel (@dkundel) 's Twitter Profile Photo

30 days left to participate in our gpt-oss hackathon with Hugging Face NVIDIA ollama and vLLM

🏆 best overall
🤖 best in robotics
🧰 weirdest hardware
🔌 best local agent
⚙️ most useful fine-tune
🫠 best unexpected use
💕 best use for humanity

Don't miss out 👇
Aleksa Gordić (水平问题) (@gordic_aleksa) 's Twitter Profile Photo

New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in-depth explanation of how LLM inference engines, and vLLM in particular, work!

Took me a while to get this level of understanding of the codebase and then to write up
vLLM (@vllm_project) 's Twitter Profile Photo

How does DeepSeek Sparse Attention (DSA) work? 

It has 2 components: the Lightning Indexer and Sparse Multi-Latent Attention (MLA). The indexer keeps a small key cache of 128 per token (vs. 512 for MLA). It scores incoming queries, and the top-2048 tokens are passed to Sparse MLA.
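
A simplified sketch of that two-stage idea (shapes and names are assumptions for illustration; real MLA decompresses the latent cache and this skips all of that):

```python
import torch

def dsa_step(q_idx, q_mla, idx_keys, mla_latents, top_k=2048):
    """One decode step of the indexer-then-sparse-attention scheme.

    q_idx:       [128]     query projected into the indexer space
    q_mla:       [512]     query projected into the MLA latent space
    idx_keys:    [T, 128]  small per-token indexer key cache
    mla_latents: [T, 512]  compressed MLA latent cache
    """
    # Stage 1: Lightning Indexer cheaply scores every cached token.
    scores = idx_keys @ q_idx                     # [T]
    keep = torch.topk(scores, min(top_k, scores.numel())).indices

    # Stage 2: Sparse MLA attends only over the selected tokens.
    sel = mla_latents[keep]                       # [top_k, 512]
    attn = torch.softmax(sel @ q_mla / 512 ** 0.5, dim=0)
    return attn @ sel                             # [512] attention output

# Example: 8192 cached tokens, keep the top 2048.
out = dsa_step(torch.randn(128), torch.randn(512),
               torch.randn(8192, 128), torch.randn(8192, 512))
```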
NVIDIA AI Developer (@nvidiaaidev) 's Twitter Profile Photo

Big shoutout to the vLLM team for an exceptional showing in the SemiAnalysis InferenceMAX benchmark on NVIDIA Blackwell GPUs 👏 Built through close collaboration with our engineers, vLLM delivered consistently strong Blackwell performance gains across the Pareto

vLLM (@vllm_project) 's Twitter Profile Photo

Announcing the completely reimagined vLLM TPU! In collaboration with Google, we've launched a new high-performance TPU backend unifying PyTorch and JAX under a single lowering path for amazing performance and flexibility.

🚀 What's New?
- JAX + Pytorch: Run PyTorch models on
Andrej Karpathy (@karpathy) 's Twitter Profile Photo

I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe a bit worse than dots), and yes data collection etc., but anyway it doesn't matter. The more interesting part for me (esp as a computer vision at heart who is temporarily masquerading as a natural language

vLLM (@vllm_project) 's Twitter Profile Photo

🚀 Excited to share our work on batch-invariant inference in vLLM! 
Now you can get identical results regardless of batch size with just one flag: VLLM_BATCH_INVARIANT=1
No more subtle differences between bs=1 and bs=N (including prefill!). Let's dive into how we built this 🧵👇
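
A minimal sketch of what that flag promises in practice (model name and setup are assumptions; the env var is the one from the thread and should be set before vLLM is imported):

```python
import os
os.environ["VLLM_BATCH_INVARIANT"] = "1"   # flag from the announcement above

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")            # any supported model
params = SamplingParams(temperature=0.0, max_tokens=64)

prompt = "Explain KV caching in one paragraph."
filler = [f"Unrelated prompt {i}" for i in range(31)]

solo = llm.generate([prompt], params)[0].outputs[0].text              # bs=1
batched = llm.generate([prompt] + filler, params)[0].outputs[0].text  # bs=32

assert solo == batched, "with batch-invariant mode, outputs match token-for-token"
```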
Igor Babuschkin (@ibab) 's Twitter Profile Photo

A common mistake that AI companies make nowadays is to not give their engineers enough time and mental calm to do their best work. Constant deadlines, pressure and distractions from daily AI news are poison for writing good code and systems that scale well. That’s why most AI

Zhuohan Li (@zhuohan123) 's Twitter Profile Photo

Excited to share this work by Bram Wasti and team. Precise numerics are fundamental to stable RL, and we now have the core infra in OSS as well.