EmbeddedLLM (@embeddedllm)'s Twitter Profile
EmbeddedLLM

@embeddedllm

Your open-source AI ally. We specialize in integrating LLMs into your business.

ID: 1716394660636295168

Joined: 23-10-2023 10:02:43

303 Tweets

621 Followers

1.1K Following

Red Hat AI (@redhat_ai):

BIG NEWS! 🎉 Compressed Tensors is officially joining vLLM! Built on top of the excellent Hugging Face safetensors framework, Compressed Tensors extends it with efficient storage and management of compressed tensor data for model quantization and sparsity. Why
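
For context, a minimal sketch (not from the tweet) of how a checkpoint stored in the compressed-tensors format is typically loaded through vLLM's Python API; the model ID below is only a placeholder, and the quantization argument is normally auto-detected from the checkpoint config:

    from vllm import LLM, SamplingParams

    # Placeholder model ID (assumption): any checkpoint produced with
    # llm-compressor / compressed-tensors should load the same way.
    llm = LLM(
        model="neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16",
        quantization="compressed-tensors",  # usually inferred automatically from the config
    )

    out = llm.generate(["What do compressed tensors store?"],
                       SamplingParams(temperature=0.0, max_tokens=64))
    print(out[0].outputs[0].text)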

vLLM (@vllm_project):

💡 vLLM @ Open Source AI Week!
1⃣ Wednesday, Oct 23 & Thursday, Oct 24: vLLM @ PyTorch Conference 2025
🚀 Explore vLLM at PyTorch Conference 2025!
📅 Sessions to catch:
1. Easy, Fast, Cheap LLM Serving for Everyone – Simon Mo, Room 2004/2006
2. Open Source Post-Training Stack:

vLLM (@vllm_project):

We are excited about an open ABI and FFI for ML Systems from Tianqi Chen. In our experience with vLLM, such an interop layer is definitely needed!

vLLM (@vllm_project):

kvcached works directly with vLLM, and you can use it to serve multiple models on the same GPU; the models share unused KV cache blocks. Check it out!

vLLM (@vllm_project):

🚀 Excited to share our work on batch-invariant inference in vLLM! 
Now you can get identical results regardless of batch size with just one flag: VLLM_BATCH_INVARIANT=1
No more subtle differences between bs=1 and bs=N (including prefill!). Let's dive into how we built this 🧵👇
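
A minimal sketch of what that looks like in practice, using the environment variable named in the tweet and a placeholder model ID; set the variable before vLLM is imported:

    import os

    # Flag name taken from the tweet above; must be set before importing vLLM.
    os.environ["VLLM_BATCH_INVARIANT"] = "1"

    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")  # placeholder model (assumption)
    params = SamplingParams(temperature=0.0, max_tokens=32)

    prompt = "Explain the KV cache in one sentence."
    solo = llm.generate([prompt], params)[0].outputs[0].text                    # bs=1
    batched = llm.generate([prompt, "a", "b", "c"], params)[0].outputs[0].text  # bs=4
    assert solo == batched  # with batch invariance enabled, the outputs should be identical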

EmbeddedLLM (@embeddedllm):

Catch us in the special vLLM track at Anyscale Ray Summit 2025, SF Marriott Marquis
vLLM Deep Dive: Architecture, Performance & Contributing by Tan TJian
Can’t wait to see all vLLM friends. anyscale.com/ray-summit/202…

vLLM (@vllm_project):

vLLM Sleep Mode 😴 → ⚡ Zero-reload model switching for multi-model serving. Benchmarks: 18–200× faster switches and 61–88% faster first inference vs cold starts. Explanation blog by EmbeddedLLM 👇
Why it’s fast: we keep the process alive, preserving the allocator, CUDA graphs,
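
A rough sketch of how sleep mode is driven from the Python API, as far as I understand it (argument and method names may differ slightly between vLLM versions; the model ID is a placeholder):

    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_sleep_mode=True)  # placeholder model
    print(llm.generate(["warm up"], SamplingParams(max_tokens=8))[0].outputs[0].text)

    llm.sleep(level=1)  # release weights/KV cache but keep the process, allocator and CUDA graphs alive
    # ... switch to or serve another model while this one sleeps ...
    llm.wake_up()       # restore the model without a cold start
    print(llm.generate(["back again"], SamplingParams(max_tokens=8))[0].outputs[0].text)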

vLLM (@vllm_project):

🔥 Following our big announcement — here’s the full vLLM takeover at Ray Summit 2025!
📍 San Francisco • Nov 3–5 • Hosted by Anyscale
Get ready for deep dives into high-performance inference, unified backends, prefix caching, MoE serving, and large-scale

vLLM (@vllm_project):

🎉 Congrats to Kimi.ai! vLLM Day-0 model support expands! Now supporting Kimi Linear — hybrid linear attention with Kimi Delta Attention (KDA):

- RULER 128k context: 84.3 perf + 3.98× speedup
- Up to 6× faster decoding & 6.3× faster TPOT (1M tokens)
- 75% KV cache reduction

💡
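
A hedged sketch of trying Kimi Linear through vLLM's offline API; the Hugging Face model ID and the hardware settings below are assumptions, not taken from the tweet:

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="moonshotai/Kimi-Linear-48B-A3B-Instruct",  # assumed model ID
        trust_remote_code=True,    # hybrid-attention models often ship custom modeling code
        max_model_len=131072,      # exercise the 128k-context path mentioned above
        tensor_parallel_size=4,    # assumption: a model this size usually needs several GPUs
    )

    out = llm.generate(["Summarize the idea behind Kimi Delta Attention."],
                       SamplingParams(temperature=0.0, max_tokens=64))
    print(out[0].outputs[0].text)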

Roger Wang (@rogerw0108):

There are actually quite a few optimizations specific to multimodal models other than just faster kernels, and I didn’t pull those all-nighters for nothing 🥲

vLLM (@vllm_project):

Wow, excited to see PewDiePie using vLLM to serve language models locally 😃 vLLM brings easy, fast, and cheap LLM serving for everyone 🥰

vLLM (@vllm_project):

🔥 Highly requested by the community, PaddleOCR-VL is now officially supported on vLLM! 🚀 Check out our recipe for this model to get started! 👇 docs.vllm.ai/projects/recip…
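
A sketch of what offline inference for an OCR VLM like this might look like in vLLM; the model ID, prompt format, and image file are assumptions, so follow the linked recipe for the exact setup:

    from PIL import Image
    from vllm import LLM, SamplingParams

    # Model ID assumed; the recipe linked above is the authoritative reference.
    llm = LLM(model="PaddlePaddle/PaddleOCR-VL", trust_remote_code=True)

    image = Image.open("invoice.png")  # any local document image
    out = llm.generate(
        {"prompt": "OCR this document.", "multi_modal_data": {"image": image}},
        SamplingParams(max_tokens=256),
    )
    print(out[0].outputs[0].text)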

vLLM (@vllm_project):

Amazing work by Rui-Jie (Ridger) Zhu and the ByteDance Seed team — Scaling Latent Reasoning via Looped LMs introduces looped reasoning as a new scaling dimension. 🔥 The Ouro model is now runnable on vLLM (nightly version) — bringing efficient inference to this new paradigm of latent
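
Since the tweet points at the nightly build, a hedged sketch of picking it up and loading an Ouro checkpoint; both the nightly index URL and the model ID are my assumptions:

    # Assumed nightly install command; check the vLLM installation docs:
    #   pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
    from vllm import LLM, SamplingParams

    llm = LLM(model="ByteDance-Seed/Ouro-1.4B", trust_remote_code=True)  # assumed model ID
    out = llm.generate(["What is latent reasoning?"], SamplingParams(max_tokens=48))
    print(out[0].outputs[0].text)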

vLLM (@vllm_project):

Sawadekap (hello), Bangkok! Ready to glow? ✨ vLLM Meetup — 21 Nov 2025, hosted by EmbeddedLLM, AMD & Red Hat. Members from the vLLM maintainer team will join us to share their latest insights and roadmap — straight from the source! We've also invited local Thai

vLLM (@vllm_project):

Thanks to GitHub for spotlighting vLLM in the Octoverse 2025 report — one of the fastest-growing open-source AI projects this year.

🏆 Top OSS by contributors
🚀 Fastest-growing by contributors
🌱 Attracting the most first-time contributors

Trusted by leading open model

EmbeddedLLM (@embeddedllm):

Big night at the vLLM × Meta × AMD meetup in Palo Alto 💥
So fun hanging out IRL with fellow vLLM folks Woosuk Kwon and Simon Mo, and the AMD crew, Anush Elangovan and Ramine Roane.

Bonus: heading home with a signed AMD Radeon AI PRO R9700 to squeeze even more tokens/sec out of AMD