Robert Shaw (@robertshaw21) 's Twitter Profile
Robert Shaw

@robertshaw21

@redhat | @neuralmagic | @vllm_project

ID: 1524186223102676992

Joined: 11-05-2022 00:35:04

745 Tweets

455 Followers

357 Following

SemiAnalysis (@semianalysis_) 's Twitter Profile Photo

The Red Hat AI team contributes a lot to vLLM and does amazing work for the open-source community. Great to see vLLM performing so well compared to TRT-LLM on H200! vLLM comes pretty close to B200, with the NVIDIA AI team working on closing the gap for GPTOSS within the next

vLLM (@vllm_project) 's Twitter Profile Photo

Our first official vLLM Meetup is coming to Europe on Nov 6! 🇨🇭 Meet vLLM committers Michael Goin, Tyler Michael Smith, Thomas Parnell, + speakers from Red Hat AI, IBM, Mistral AI. Topics: vLLM updates, quantization, Mistral+vLLM, hybrid models, distributed inference luma.com/0gls27kb

Kimi.ai (@kimi_moonshot) 's Twitter Profile Photo

The Kimi Linear Tech Report has dropped! 🚀 huggingface.co/moonshotai/Kim… Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance—ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi

vLLM (@vllm_project) 's Twitter Profile Photo

🎉 Congrats to Kimi.ai! vLLM Day-0 model support expands! Now supporting Kimi Linear — hybrid linear attention with Kimi Delta Attention (KDA):

- RULER 128k context: 84.3 perf + 3.98× speedup
- Up to 6× faster decoding & 6.3× faster TPOT (1M tokens)
- 75% KV cache reduction

💡
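For readers who want to try a Day-0 supported model like this one, a minimal offline-inference sketch with vLLM's Python API might look like the following. The repository name is a placeholder (the Hugging Face link in the report tweet is truncated), so substitute the exact moonshotai model ID.

```python
# Hedged sketch: offline inference with vLLM's Python API for a newly supported model.
# The model ID below is a placeholder -- replace it with the exact moonshotai repository
# from the Hugging Face link referenced above.
from vllm import LLM, SamplingParams

MODEL_ID = "moonshotai/Kimi-Linear-..."  # placeholder, not a verified repo name

llm = LLM(
    model=MODEL_ID,
    trust_remote_code=True,   # hybrid/linear-attention models often ship custom code
    max_model_len=131072,     # long-context setting; adjust to available GPU memory
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the benefits of linear attention."], params)
print(outputs[0].outputs[0].text)
```
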
vLLM (@vllm_project) 's Twitter Profile Photo

🚀Excited to team up with NVIDIA AI Developer to bring Nemotron Nano 2 VL to vLLM - a multimodal model powered by a hybrid Transformer–Mamba language backbone, built for video understanding and document intelligence✨ Full post here👇blog.vllm.ai/2025/10/31/run…
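For the multimodal case, vLLM's offline API accepts image inputs through `multi_modal_data`. The sketch below is an assumption-laden illustration: the repository name and the image-placeholder token are not given in the tweet, so check the model card and the linked blog post for the exact values.

```python
# Hedged sketch: offline multimodal inference with vLLM. The repo name and the
# image-placeholder token are assumptions -- consult the model card / linked blog post.
from PIL import Image
from vllm import LLM, SamplingParams

MODEL_ID = "nvidia/Nemotron-Nano-..."   # placeholder repository name
image = Image.open("page_scan.png")     # e.g. a document page for document intelligence

llm = LLM(model=MODEL_ID, trust_remote_code=True)

# The prompt must contain the model-specific image placeholder token.
prompt = "<image>\nExtract the table on this page as CSV."
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=512),
)
print(outputs[0].outputs[0].text)
```
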

Hao AI Lab (@haoailab) 's Twitter Profile Photo

🔥 New Blog: “Disaggregated Inference: 18 Months Later” 18 months in LLM inference feels like a new Moore’s Law cycle – but this time not just 2x per year: 💸 Serving cost ↓10–100x 🚀 Throughput ↑10x ⚡ Latency ↓5x A big reason? Disaggregated Inference. From DistServe, our
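A rough way to see what prefill/decode ("P/D") disaggregation separates: prefill is one compute-bound pass over the whole prompt that builds the KV cache, while decode is many small memory-bound steps that reuse it. The toy sketch below uses the Hugging Face transformers API rather than vLLM or DistServe internals, and runs in a single process purely to make the hand-off point explicit; real disaggregated serving ships the KV cache from a prefill worker to a separate decode worker.

```python
# Toy, single-process illustration of the prefill/decode split -- not vLLM's or
# DistServe's implementation. In real disaggregated serving, the `past_key_values`
# produced by the prefill worker are transferred to a separate decode worker.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # tiny model, demo only
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("Disaggregated inference separates", return_tensors="pt").input_ids

# --- "prefill worker": one dense pass over the prompt, produces the KV cache ---
with torch.no_grad():
    out = model(prompt_ids, use_cache=True)
kv_cache = out.past_key_values            # this is what gets handed to the decode side
next_id = out.logits[:, -1].argmax(-1, keepdim=True)

# --- "decode worker": many one-token steps that only read and extend the KV cache ---
generated = [next_id]
with torch.no_grad():
    for _ in range(20):
        out = model(generated[-1], past_key_values=kv_cache, use_cache=True)
        kv_cache = out.past_key_values
        generated.append(out.logits[:, -1].argmax(-1, keepdim=True))

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```
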

vLLM (@vllm_project) 's Twitter Profile Photo

Love the retrospective on disaggregated inference. If you wonder where the technique named "PD" in vLLM comes from, read on! Thank you Hao AI Lab for pushing the idea forward.

Robert Shaw (@robertshaw21) 's Twitter Profile Photo

Check out Mathis Felardos and Mickaël Seznec's amazing discussion of how Mistral AI uses P/D disaggregation to optimize their vLLM deployment youtube.com/live/6m6ZE6yVE…

Shengyuan (@shengyuans) 's Twitter Profile Photo

Hi Dzmitry, our INT4 QAT is weight-only with fake-quantization: we keep the original BF16 weights in memory, during the forward pass we on-the-fly quantize them to INT4 and immediately de-quantize back to BF16 for the actual computation. The original unquantized BF16 weight is
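The fake-quantization scheme described here (BF16 master weights kept in memory, INT4 quantize-then-dequantize applied on the fly in the forward pass) can be sketched in plain PyTorch. This is a generic per-channel symmetric version for illustration only, not the actual QAT code being discussed.

```python
# Minimal sketch of weight-only INT4 fake quantization as described above: the BF16
# master weight stays in memory, and the forward pass quantizes it to INT4 and
# immediately de-quantizes back to BF16 before the matmul. Generic illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant_int4(w: torch.Tensor) -> torch.Tensor:
    # Per-output-channel symmetric quantization to the INT4 range [-8, 7].
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return (q * scale).to(w.dtype)           # de-quantize straight back to BF16

class FakeQuantLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # BF16 master weights; a real QAT setup would wrap round() in a
        # straight-through estimator so gradients reach these parameters.
        self.weight = nn.Parameter(
            (torch.randn(out_features, in_features) * 0.02).to(torch.bfloat16)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = fake_quant_int4(self.weight)    # on-the-fly quantize + de-quantize
        return F.linear(x, w_q)

layer = FakeQuantLinear(64, 32)
y = layer(torch.randn(4, 64, dtype=torch.bfloat16))
print(y.shape)  # torch.Size([4, 32])
```
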

vLLM (@vllm_project) 's Twitter Profile Photo

Thanks to GitHub for spotlighting vLLM in the Octoverse 2025 report — one of the fastest-growing open-source AI projects this year.

🏆 Top OSS by contributors
🚀 Fastest-growing by contributors
🌱 Attracting the most first-time contributors

Trusted by leading open model
vLLM (@vllm_project) 's Twitter Profile Photo

More inference workloads now mix autoregressive and diffusion models in a single pipeline to process and generate multiple modalities - text, image, audio, and video. Today we’re releasing vLLM-Omni: an open-source framework that extends vLLM’s easy, fast, and cost-efficient

Red Hat AI (@redhat_ai) 's Twitter Profile Photo

Congrats to Mistral AI on launching the Mistral 3 family under the Apache 2.0 license. We worked together to enable upstream vLLM support and collaborated on creating the FP8 and NVFP4 Mistral Large 3 checkpoints through llm-compressor for efficient deployment. 🚀
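llm-compressor's one-shot flow for producing an FP8 checkpoint looks roughly like the snippet below, modeled on the project's published examples. Import paths and scheme names vary between releases, and the exact recipes behind the Mistral Large 3 FP8/NVFP4 checkpoints aren't shown in the tweet, so treat this as an assumption-laden sketch rather than the recipe actually used.

```python
# Hedged sketch of an llm-compressor one-shot FP8 quantization run, based on the
# project's published examples -- not the exact recipe used for Mistral Large 3.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot                       # older releases: llmcompressor.transformers
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "mistralai/..."   # placeholder; use the actual Mistral 3 repository name

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Dynamic FP8 weight+activation quantization of all Linear layers except the LM head.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

save_dir = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```
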

vLLM (@vllm_project) 's Twitter Profile Photo

🎉 Congratulations to the Mistral team on launching the Mistral 3 family!

We’re proud to share that Mistral AI, NVIDIA AI Developer, Red Hat AI, and vLLM worked closely together to deliver full Day-0 support for the entire Mistral 3 lineup.

This collaboration enabled:
• NVFP4
vLLM (@vllm_project) 's Twitter Profile Photo

🤝 Proud to share the first production-ready vLLM plugin for Gaudi, developed in close collaboration with the Intel team and fully aligned with upstream vLLM.

🔧 This release is validated and ready for deployment, with support for the latest vLLM version coming soon.

📘 The
vLLM (@vllm_project) 's Twitter Profile Photo

We’re taking CUDA debugging to the next level. 🚀

Building on our previous work with CUDA Core Dumps, we are releasing a new guide on tracing hanging and complicated kernels down to the source code.

As kernels get more complex (deep inlining, async memory access), standard
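For context on the CUDA Core Dump workflow this builds on: the driver can be asked to dump GPU state when a kernel hits an exception, via environment variables that must be set before the CUDA context is created, and the resulting file is opened in cuda-gdb with `target cudacore <file>`. A minimal way to arm this from Python might look like the sketch below; the guide itself goes further, into tracing hangs and deeply inlined kernels.

```python
# Hedged sketch: arming CUDA core dumps from Python before any CUDA context exists.
# CUDA_ENABLE_COREDUMP_ON_EXCEPTION / CUDA_COREDUMP_FILE are standard CUDA driver
# settings; the resulting dump is opened in cuda-gdb with `target cudacore <file>`.
import os

# These must be set before the process makes its first CUDA call.
os.environ["CUDA_ENABLE_COREDUMP_ON_EXCEPTION"] = "1"
os.environ["CUDA_COREDUMP_FILE"] = "cuda_coredump_%h_%p"  # %h = hostname, %p = pid

import torch  # imported only after the env vars are in place

# ... run the workload you want to debug here; if a kernel hits an exception
# (illegal address, device-side assert, ...), the driver writes the core dump file.
x = torch.ones(4, device="cuda") * 2
print(x)
```
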
llm-d (@_llm_d_) 's Twitter Profile Photo

🚀 Announcing llm-d v0.4!

This release focuses on achieving SOTA inference performance across accelerators.

From ultra-low latency for MoE models to new auto-scaling capabilities, we’re pushing the boundaries of open-source inference.

Blog: llm-d.ai/blog/llm-d-v0.…

🧵👇