Zhuohan Li (@zhuohan123) 's Twitter Profile
Zhuohan Li

@zhuohan123

👨🏻‍💻 cs phd @ 🌁 uc berkeley |
building @vllm_project |
machine learning systems |
the real agi is the friends we made along the way

ID: 243517888

Link: https://zhuohan.li | Joined: 27-01-2011 06:29:13

187 Tweets

5.5K Followers

797 Following

NomoreID (@hangsiin) 's Twitter Profile Photo

It seems my concerns were valid.

This is the result of re-running the tests after changing the provider setting from the default (which automatically routed to Groq) to Fireworks.

To emphasize again, the only thing I changed was explicitly fixing the provider in the code. All
Romain Huet (@romainhuet) 's Twitter Profile Photo

Heads-up for developers trying gpt-oss: performance and correctness can vary a bit across providers and runtimes right now due to implementation differences. We’re working with inference providers to make sure gpt-oss performs at its best everywhere, and we’d love your feedback!

vLLM (@vllm_project) 's Twitter Profile Photo

👀 we care a lot about correctness, ran many evals and stared at many tensors to compare them. numerics of vLLM on hopper should be solid and verified! if you run into any correctness issue on vLLM, we would love to know and debug them!
clem 🤗 (@clementdelangue) 's Twitter Profile Photo

Lots of conflicting takes about gpt-oss (yay open-source in the spotlight)! 

We’re powering the official @openai demo gpt-oss.com with HF inference providers thanks to Fireworks AI, Cerebras, Groq Inc and Together AI so we have a front-row seat of
Costa Huang (@vwxyzjn) 's Twitter Profile Photo

To demonstrate vLLM is a hackable for-loop.

You *can* add requests in the middle of generation while still doing batching properly.
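
A rough sketch of that for-loop, for readers who want to try it (an illustrative example, not Costa's code; argument names follow recent vLLM versions and may differ slightly in yours):

```python
# Drive vLLM's LLMEngine step-by-step and inject a new request mid-generation;
# the engine schedules it into the same batch as the requests already in flight.
from vllm import EngineArgs, LLMEngine, SamplingParams

engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-125m"))  # any small model
params = SamplingParams(max_tokens=64)

engine.add_request("req-0", "Write a haiku about GPUs.", params)

step = 0
while engine.has_unfinished_requests():
    outputs = engine.step()              # one scheduling + forward pass
    step += 1
    if step == 5:                        # add another request mid-generation
        engine.add_request("req-1", "Explain paged attention in one line.", params)
    for out in outputs:
        if out.finished:
            print(out.request_id, out.outputs[0].text)
```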
dominik kundel (@dkundel) 's Twitter Profile Photo

Inference providers have worked hard in the last week to make gpt-oss work well on their platforms. We just released a guide to help you verify API-compatibility & run your own evals. Additionally, Artificial Analysis started releasing per-provider evals for AIME, GPQA & IFBench 🧵

Artificial Analysis (@artificialanlys) 's Twitter Profile Photo

We've launched benchmarks of the accuracy of providers offering APIs for gpt-oss-120b

We compare providers by running GPQA Diamond 16 times, AIME25 32 times, and IFBench 8 times. We report the median score across these runs alongside minimum, 25th percentile, 75th percentile and
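
For context, this reporting scheme amounts to summarizing repeated runs per provider with a median and percentile spread; a minimal sketch (the scores below are made-up placeholders, not real benchmark results):

```python
import numpy as np

def summarize(scores):
    """Median plus spread across repeated eval runs for one provider."""
    s = np.asarray(scores, dtype=float)
    return {
        "min": float(s.min()),
        "p25": float(np.percentile(s, 25)),
        "median": float(np.median(s)),
        "p75": float(np.percentile(s, 75)),
        "max": float(s.max()),
    }

# e.g. 16 GPQA Diamond runs for a single provider (placeholder numbers)
gpqa_runs = [0.71, 0.69, 0.72, 0.70, 0.68, 0.73, 0.71, 0.70,
             0.69, 0.72, 0.71, 0.70, 0.74, 0.69, 0.71, 0.70]
print(summarize(gpqa_runs))
```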
dominik kundel (@dkundel) 's Twitter Profile Photo

30 days left to participate in our gpt-oss hackathon with Hugging Face NVIDIA ollama and vLLM

🏆 best overall
🤖 best in robotics
🧰 weirdest hardware
🔌 best local agent
⚙️ most useful fine-tune
🫠 best unexpected use
💕 best use for humanity

Don't miss out 👇
Aleksa Gordić (水平问题) (@gordic_aleksa) 's Twitter Profile Photo

New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in-depth explanation of how LLM inference engines, and vLLM in particular, work!

Took me a while to get this level of understanding of the codebase and then to write up
vLLM (@vllm_project) 's Twitter Profile Photo

How does DeepSeek Sparse Attention (DSA) work? 

It has 2 components: the Lightning Indexer and Sparse Multi-Latent Attention (MLA). The indexer keeps a small key cache of 128 per token (vs. 512 for MLA). It scores incoming queries, and the top-2048 tokens are passed to Sparse MLA.
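
A simplified sketch of that two-stage idea (shapes and names are assumptions for illustration; real MLA decompresses the latent cache and this skips all of that):

```python
import torch

def dsa_step(q_idx, q_mla, idx_keys, mla_latents, top_k=2048):
    """One decode step of the indexer-then-sparse-attention scheme.

    q_idx:       [128]     query projected into the indexer space
    q_mla:       [512]     query projected into the MLA latent space
    idx_keys:    [T, 128]  small per-token indexer key cache
    mla_latents: [T, 512]  compressed MLA latent cache
    """
    # Stage 1: Lightning Indexer cheaply scores every cached token.
    scores = idx_keys @ q_idx                     # [T]
    keep = torch.topk(scores, min(top_k, scores.numel())).indices

    # Stage 2: Sparse MLA attends only over the selected tokens.
    sel = mla_latents[keep]                       # [top_k, 512]
    attn = torch.softmax(sel @ q_mla / 512 ** 0.5, dim=0)
    return attn @ sel                             # [512] attention output

# Example: 8192 cached tokens, keep the top 2048.
out = dsa_step(torch.randn(128), torch.randn(512),
               torch.randn(8192, 128), torch.randn(8192, 512))
```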
NVIDIA AI Developer (@nvidiaaidev) 's Twitter Profile Photo

Big shoutout to the vLLM team for an exceptional showing in the SemiAnalysis InferenceMAX benchmark on NVIDIA Blackwell GPUs 👏 Built through close collaboration with our engineers, vLLM delivered consistently strong Blackwell performance gains across the Pareto

vLLM (@vllm_project) 's Twitter Profile Photo

Announcing the completely reimagined vLLM TPU! In collaboration with Google, we've launched a new high-performance TPU backend unifying PyTorch and JAX under a single lowering path for amazing performance and flexibility.

🚀 What's New?
- JAX + Pytorch: Run PyTorch models on
Andrej Karpathy (@karpathy) 's Twitter Profile Photo

I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe a bit worse than dots), and yes data collection etc., but anyway it doesn't matter. The more interesting part for me (esp as a computer vision at heart who is temporarily masquerading as a natural language

vLLM (@vllm_project) 's Twitter Profile Photo

🚀 Excited to share our work on batch-invariant inference in vLLM! 
Now you can get identical results regardless of batch size with just one flag: VLLM_BATCH_INVARIANT=1
No more subtle differences between bs=1 and bs=N (including prefill!). Let's dive into how we built this 🧵👇
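
A minimal sketch of what that flag promises in practice (model name and setup are assumptions; the env var is the one from the thread and should be set before vLLM is imported):

```python
import os
os.environ["VLLM_BATCH_INVARIANT"] = "1"   # flag from the announcement above

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")            # any supported model
params = SamplingParams(temperature=0.0, max_tokens=64)

prompt = "Explain KV caching in one paragraph."
filler = [f"Unrelated prompt {i}" for i in range(31)]

solo = llm.generate([prompt], params)[0].outputs[0].text              # bs=1
batched = llm.generate([prompt] + filler, params)[0].outputs[0].text  # bs=32

assert solo == batched, "with batch-invariant mode, outputs match token-for-token"
```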
Igor Babuschkin (@ibab) 's Twitter Profile Photo

A common mistake that AI companies make nowadays is to not give their engineers enough time and mental calm to do their best work. Constant deadlines, pressure and distractions from daily AI news are poison for writing good code and systems that scale well. That’s why most AI

Zhuohan Li (@zhuohan123) 's Twitter Profile Photo

Excited to share this work by Bram Wasti and team. Precise numerics are fundamental to stable RL, and we now have the core infra in OSS as well.