Robert Shaw (@robertshaw21) 's Twitter Profile
Robert Shaw

@robertshaw21

@redhat | @neuralmagic | @vllm_project

ID: 1524186223102676992

Joined: 11-05-2022 00:35:04

745 Tweets

455 Followers

357 Following

SemiAnalysis (@semianalysis_) 's Twitter Profile Photo

The Red Hat AI team contributes a lot to vLLM and does amazing work for the open-source community. Great to see vLLM performing so well compared to TRT-LLM on H200! vLLM comes pretty close to B200, with the NVIDIA AI team working on closing the gap for GPTOSS within the next

vLLM (@vllm_project) 's Twitter Profile Photo

Our first official vLLM Meetup is coming to Europe on Nov 6! 🇨🇭 Meet vLLM committers Michael Goin, Tyler Michael Smith, Thomas Parnell, + speakers from Red Hat AI, IBM, Mistral AI. Topics: vLLM updates, quantization, Mistral+vLLM, hybrid models, distributed inference luma.com/0gls27kb

Kimi.ai (@kimi_moonshot) 's Twitter Profile Photo

The Kimi Linear Tech Report has dropped! 🚀 huggingface.co/moonshotai/Kim… Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance—ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi

vLLM (@vllm_project) 's Twitter Profile Photo

🎉 Congrats to Kimi.ai! vLLM Day-0 model support expands! Now supporting Kimi Linear — hybrid linear attention with Kimi Delta Attention (KDA):

- RULER 128k context: 84.3 perf + 3.98× speedup
- Up to 6× faster decoding & 6.3× faster TPOT (1M tokens)
- 75% KV cache reduction

💡
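For readers who want to try a Day-0 supported model like this one, a minimal offline-inference sketch with vLLM's Python API might look like the following. The repository name is a placeholder (the Hugging Face link in the report tweet is truncated), so substitute the exact moonshotai model ID.

```python
# Hedged sketch: offline inference with vLLM's Python API for a newly supported model.
# The model ID below is a placeholder -- replace it with the exact moonshotai repository
# from the Hugging Face link referenced above.
from vllm import LLM, SamplingParams

MODEL_ID = "moonshotai/Kimi-Linear-..."  # placeholder, not a verified repo name

llm = LLM(
    model=MODEL_ID,
    trust_remote_code=True,   # hybrid/linear-attention models often ship custom code
    max_model_len=131072,     # long-context setting; adjust to available GPU memory
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the benefits of linear attention."], params)
print(outputs[0].outputs[0].text)
```
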
vLLM (@vllm_project) 's Twitter Profile Photo

🚀Excited to team up with NVIDIA AI Developer to bring Nemotron Nano 2 VL to vLLM - a multimodal model powered by a hybrid Transformer–Mamba language backbone, built for video understanding and document intelligence✨ Full post here👇blog.vllm.ai/2025/10/31/run…
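For the multimodal case, vLLM's offline API accepts image inputs through `multi_modal_data`. The sketch below is an assumption-laden illustration: the repository name and the image-placeholder token are not given in the tweet, so check the model card and the linked blog post for the exact values.

```python
# Hedged sketch: offline multimodal inference with vLLM. The repo name and the
# image-placeholder token are assumptions -- consult the model card / linked blog post.
from PIL import Image
from vllm import LLM, SamplingParams

MODEL_ID = "nvidia/Nemotron-Nano-..."   # placeholder repository name
image = Image.open("page_scan.png")     # e.g. a document page for document intelligence

llm = LLM(model=MODEL_ID, trust_remote_code=True)

# The prompt must contain the model-specific image placeholder token.
prompt = "<image>\nExtract the table on this page as CSV."
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=512),
)
print(outputs[0].outputs[0].text)
```
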

Hao AI Lab (@haoailab) 's Twitter Profile Photo

🔥 New Blog: “Disaggregated Inference: 18 Months Later” 18 months in LLM inference feels like a new Moore’s Law cycle – but this time not just 2x per year: 💸 Serving cost ↓10–100x 🚀 Throughput ↑10x ⚡ Latency ↓5x A big reason? Disaggregated Inference. From DistServe, our
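A rough way to see what prefill/decode ("P/D") disaggregation separates: prefill is one compute-bound pass over the whole prompt that builds the KV cache, while decode is many small memory-bound steps that reuse it. The toy sketch below uses the Hugging Face transformers API rather than vLLM or DistServe internals, and runs in a single process purely to make the hand-off point explicit; real disaggregated serving ships the KV cache from a prefill worker to a separate decode worker.

```python
# Toy, single-process illustration of the prefill/decode split -- not vLLM's or
# DistServe's implementation. In real disaggregated serving, the `past_key_values`
# produced by the prefill worker are transferred to a separate decode worker.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # tiny model, demo only
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("Disaggregated inference separates", return_tensors="pt").input_ids

# --- "prefill worker": one dense pass over the prompt, produces the KV cache ---
with torch.no_grad():
    out = model(prompt_ids, use_cache=True)
kv_cache = out.past_key_values            # this is what gets handed to the decode side
next_id = out.logits[:, -1].argmax(-1, keepdim=True)

# --- "decode worker": many one-token steps that only read and extend the KV cache ---
generated = [next_id]
with torch.no_grad():
    for _ in range(20):
        out = model(generated[-1], past_key_values=kv_cache, use_cache=True)
        kv_cache = out.past_key_values
        generated.append(out.logits[:, -1].argmax(-1, keepdim=True))

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```
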

vLLM (@vllm_project) 's Twitter Profile Photo

Love the retrospective on disaggregated inference. If you wonder where the technique named "PD" in vLLM comes from, read on! Thank you Hao AI Lab for pushing the idea forward.

Robert Shaw (@robertshaw21) 's Twitter Profile Photo

Check out Mathis Felardos and Mickaël Seznec's amazing discussion of how Mistral AI uses P/D disaggregation to optimize their vLLM deployment youtube.com/live/6m6ZE6yVE…

Shengyuan (@shengyuans) 's Twitter Profile Photo

Hi Dzmitry, our INT4 QAT is weight-only with fake-quantization: we keep the original BF16 weights in memory, during the forward pass we on-the-fly quantize them to INT4 and immediately de-quantize back to BF16 for the actual computation. The original unquantized BF16 weight is
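The fake-quantization scheme described here (BF16 master weights kept in memory, INT4 quantize-then-dequantize applied on the fly in the forward pass) can be sketched in plain PyTorch. This is a generic per-channel symmetric version for illustration only, not the actual QAT code being discussed.

```python
# Minimal sketch of weight-only INT4 fake quantization as described above: the BF16
# master weight stays in memory, and the forward pass quantizes it to INT4 and
# immediately de-quantizes back to BF16 before the matmul. Generic illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant_int4(w: torch.Tensor) -> torch.Tensor:
    # Per-output-channel symmetric quantization to the INT4 range [-8, 7].
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return (q * scale).to(w.dtype)           # de-quantize straight back to BF16

class FakeQuantLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # BF16 master weights; a real QAT setup would wrap round() in a
        # straight-through estimator so gradients reach these parameters.
        self.weight = nn.Parameter(
            (torch.randn(out_features, in_features) * 0.02).to(torch.bfloat16)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = fake_quant_int4(self.weight)    # on-the-fly quantize + de-quantize
        return F.linear(x, w_q)

layer = FakeQuantLinear(64, 32)
y = layer(torch.randn(4, 64, dtype=torch.bfloat16))
print(y.shape)  # torch.Size([4, 32])
```
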

vLLM (@vllm_project) 's Twitter Profile Photo

Thanks to GitHub for spotlighting vLLM in the Octoverse 2025 report — one of the fastest-growing open-source AI projects this year.

🏆 Top OSS by contributors
🚀 Fastest-growing by contributors
🌱 Attracting the most first-time contributors

Trusted by leading open model
vLLM (@vllm_project) 's Twitter Profile Photo

More inference workloads now mix autoregressive and diffusion models in a single pipeline to process and generate multiple modalities - text, image, audio, and video. Today we’re releasing vLLM-Omni: an open-source framework that extends vLLM’s easy, fast, and cost-efficient

Red Hat AI (@redhat_ai) 's Twitter Profile Photo

Congrats to Mistral AI on launching the Mistral 3 family under the Apache 2.0 license. We worked together to enable upstream vLLM support and collaborated on creating the FP8 and NVFP4 Mistral Large 3 checkpoints through llm-compressor for efficient deployment. 🚀
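llm-compressor's one-shot flow for producing an FP8 checkpoint looks roughly like the snippet below, modeled on the project's published examples. Import paths and scheme names vary between releases, and the exact recipes behind the Mistral Large 3 FP8/NVFP4 checkpoints aren't shown in the tweet, so treat this as an assumption-laden sketch rather than the recipe actually used.

```python
# Hedged sketch of an llm-compressor one-shot FP8 quantization run, based on the
# project's published examples -- not the exact recipe used for Mistral Large 3.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot                       # older releases: llmcompressor.transformers
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "mistralai/..."   # placeholder; use the actual Mistral 3 repository name

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Dynamic FP8 weight+activation quantization of all Linear layers except the LM head.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

save_dir = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```
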

vLLM (@vllm_project) 's Twitter Profile Photo

🎉 Congratulations to the Mistral team on launching the Mistral 3 family!

We’re proud to share that Mistral AI, NVIDIA AI Developer, Red Hat AI, and vLLM worked closely together to deliver full Day-0 support for the entire Mistral 3 lineup.

This collaboration enabled:
• NVFP4
vLLM (@vllm_project) 's Twitter Profile Photo

🤝 Proud to share the first production-ready vLLM plugin for Gaudi, developed in close collaboration with the Intel team and fully aligned with upstream vLLM.

🔧 This release is validated and ready for deployment, with support for the latest vLLM version coming soon.

📘 The
vLLM (@vllm_project) 's Twitter Profile Photo

We’re taking CUDA debugging to the next level. 🚀

Building on our previous work with CUDA Core Dumps, we are releasing a new guide on tracing hanging and complicated kernels down to the source code.

As kernels get more complex (deep inlining, async memory access), standard
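For context on the CUDA Core Dump workflow this builds on: the driver can be asked to dump GPU state when a kernel hits an exception, via environment variables that must be set before the CUDA context is created, and the resulting file is opened in cuda-gdb with `target cudacore <file>`. A minimal way to arm this from Python might look like the sketch below; the guide itself goes further, into tracing hangs and deeply inlined kernels.

```python
# Hedged sketch: arming CUDA core dumps from Python before any CUDA context exists.
# CUDA_ENABLE_COREDUMP_ON_EXCEPTION / CUDA_COREDUMP_FILE are standard CUDA driver
# settings; the resulting dump is opened in cuda-gdb with `target cudacore <file>`.
import os

# These must be set before the process makes its first CUDA call.
os.environ["CUDA_ENABLE_COREDUMP_ON_EXCEPTION"] = "1"
os.environ["CUDA_COREDUMP_FILE"] = "cuda_coredump_%h_%p"  # %h = hostname, %p = pid

import torch  # imported only after the env vars are in place

# ... run the workload you want to debug here; if a kernel hits an exception
# (illegal address, device-side assert, ...), the driver writes the core dump file.
x = torch.ones(4, device="cuda") * 2
print(x)
```
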
llm-d (@_llm_d_) 's Twitter Profile Photo

🚀 Announcing llm-d v0.4!

This release focuses on achieving SOTA inference performance across accelerators.

From ultra-low latency for MoE models to new auto-scaling capabilities, we’re pushing the boundaries of open-source inference.

Blog: llm-d.ai/blog/llm-d-v0.…

🧵👇