Cody Yu (@codyhaoyu)'s Twitter Profile
Cody Yu

@codyhaoyu

MLSys, LLM Serving, Deep Learning Compiler

ID: 836647857490804736

Link: https://www.linkedin.com/in/cody-hao-yu
Joined: 28-02-2017 18:42:48

106 Tweets

176 Followers

27 Following

Anyscale (@anyscalecompute):

🦙 We're excited to host Meta Llama-3 8b and 70b on Anyscale Endpoints!

➕ Fine-tuning, JSON mode and function calling support coming soon as well!

Pricing: 
- 8B: $0.15 / Million tokens
- 70B: $1.00 / Million tokens
Yangqing Jia (@jiayq):

I applaud OctoAI's newest achievement, but clear fact checking is really needed. The blog post's title figure shows "multi-user throughput" and then proceeds to say "concurrency=1". What? I know we are all competing on LLM inference speeds, but let me say this: vLLM is a solid

Anyscale (@anyscalecompute):

Recently, we've contributed chunked prefill to vLLM, leading to up to 2x speedup for higher QPS regimes!

In vLLM, prefilling, which fills the KV cache, and decoding, which outputs new tokens, can interfere with each other, resulting in latency degradation. 1/n
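
The thread names the feature but not how to turn it on; below is a minimal sketch (not from the thread, names and values are illustrative) of enabling chunked prefill through vLLM's offline `LLM` API, assuming a release where the `enable_chunked_prefill` and `max_num_batched_tokens` engine arguments are available.

```python
from vllm import LLM, SamplingParams

# Illustrative sketch: split long prompt prefills into chunks so they can be
# batched together with decode steps instead of stalling them.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model choice
    enable_chunked_prefill=True,                  # turn on chunked prefill
    max_num_batched_tokens=2048,                  # per-step token budget shared by prefill and decode
)

outputs = llm.generate(
    ["Summarize why chunked prefill helps at high QPS."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```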
Cade Daniel 🇺🇸 (@cdnamz):

We've got great projects at Anyscale, come work with us. We've shipped:
• Chunked prefill (Sang Cho)
• Multi-LoRA (Antoni Baum)
• Dynamic spec decode (Lily Liu)
• FP8 (Cody Yu)
• MoE optimization (Philipp Moritz)
• Ray, dist. compute framework used to train ChatGPT (Robert Nishihara et al)

Anyscale (@anyscalecompute):

There has been so much excitement and activity around this topic, that we are adding a vLLM track to the Ray Summit!

If you contribute to or use vLLM, we want to hear from you.

raysummit.anyscale.com
Anyscale (@anyscalecompute):

We've recently contributed FP8 support to vLLM in collaboration with @neuralmagic.

With this feature, you can see up to a 1.8x reduction in inter-token latency, with >99% accuracy preservation!

1/n
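
As a rough illustration of what this looks like from the user side, here is a hedged sketch (not from the thread) that requests vLLM's FP8 path via the `quantization` engine argument; it assumes an FP8-capable GPU such as an H100, and the model name is illustrative.

```python
from vllm import LLM, SamplingParams

# Illustrative sketch: ask vLLM to quantize weights/activations to FP8 at load
# time. Assumes a GPU and vLLM build that support the FP8 path described above.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model choice
    quantization="fp8",
)

out = llm.generate(
    ["Explain the latency/accuracy trade-off of FP8 inference."],
    SamplingParams(max_tokens=48),
)
print(out[0].outputs[0].text)
```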
Red Hat AI (@redhat_ai):

Our latest vLLM Office Hours recording is ready! We dive deep into FP8 quantization with vLLM Committers from Anyscale and share the newest updates in vLLM v0.5.1. youtu.be/GLqsETc8aTc

Red Hat AI (@redhat_ai):

EXCITING NEWS: Neural Magic and Anyscale contributed FP8 quantization support to vLLM, making LLM inference more efficient.

FP8 reduces latency on NVIDIA GPUs by 2x with >99% accuracy preservation.

Cheers to NVIDIA AI Developer for validating our results.

1/6
vLLM (@vllm_project):

We are excited to invite everyone to our 5th meetup with AWS on July 24 in SF (next Wed!). The team will share recent progress in FP8, pipeline parallel, and various work on perf. The space is limited to 150 so plz register ASAP! lu.ma/lp0gyjqr

lmarena.ai (formerly lmsys.org) (@lmarena_ai):

We are thrilled to announce the milestone release of SGLang Runtime v0.2, featuring significant inference optimizations after months of hard work.

It achieves up to 2.1x higher throughput compared to TRT-LLM and up to 3.8x higher throughput compared to vLLM. It consistently
vLLM (@vllm_project):

A month ago, we announced our performance roadmap. Today, we are happy to share that the latest release achieves 🚀2.7x higher throughput and is 5x faster for output latency on Llama 8B, and 1.8x higher throughput and 2x faster on Llama 70B for H100s. blog.vllm.ai/2024/09/05/per…

vLLM (@vllm_project):

We are beyond excited to partner with @Anyscale in #RaySummit 2024 for a dedicated vLLM track featuring 10+ sessions about vLLM use cases and features! If you are an AI/ML developer exploring and/or using vLLM, we highly recommend attending these sessions.

raysummit.anyscale.com
Zihao Ye (@ye_combinator):

We are excited to announce FlashInfer v0.2!

Core contributions of this release include:
- Block/Vector Sparse (Paged) Attention on FlashAttention-3
- JIT compilation for customized attention variants
- Fused Multi-head Latent Attention (MLA) decoding kernel
- Lots of bugfixes and
Roger Wang (@rogerw0108):

Sneak peek of what I've been working on with Cody (Cody Yu), Alex (github.com/alexm-neuralma…) and a few others. Still a lot of room for improvement.

Easier, faster, cheaper :)
vLLM (@vllm_project):

🚀 With the v0.7.0 release today, we are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup! 
Clean code, optimized execution loop, zero-overhead prefix caching, enhanced multimodal support, and more.
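
For readers who want to try the alpha, a minimal sketch is below. It assumes the V1 engine is still opt-in via the `VLLM_USE_V1` environment variable (as it was around the v0.7.x series) and uses the `enable_prefix_caching` flag; the model name is illustrative.

```python
import os

# Assumption: the alpha V1 engine is gated behind this environment variable,
# which must be set before vLLM is imported.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model choice
    enable_prefix_caching=True,  # reuse KV cache across requests that share a prompt prefix
)

print(llm.generate(["Hello!"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```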
vLLM (@vllm_project):

We are excited to invite you to our Menlo Park meetup with Meta, evening of Thursday, February 27! Meta engineers will discuss the improvements on top of vLLM, and committer Cody Yu will share updates from the v0.7.x series of releases. lu.ma/h7g3kuj9

kourosh hakhamaneshi (@cyrushakha):

Announcing native LLM APIs in the Ray Data and Ray Serve libraries. These are experimental APIs we are announcing today that abstract two things:
1. Serve LLM: simplifies the deployment of LLM engines (e.g. vLLM) through Ray Serve APIs. Enables things like
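
The announcement is truncated above, so for context on what "Serve LLM" abstracts away, here is a hedged sketch of the manual equivalent: wrapping a vLLM engine in a plain Ray Serve deployment by hand. This is not the new `ray.serve.llm` API itself; the model name, resource settings, and request format are illustrative assumptions.

```python
from ray import serve
from vllm import LLM, SamplingParams

@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class VLLMDeployment:
    def __init__(self):
        # One vLLM engine per replica; the model loads when the replica starts.
        self.llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # illustrative

    async def __call__(self, request):
        # Expect a JSON body like {"prompt": "..."} (illustrative request format).
        prompt = (await request.json())["prompt"]
        out = self.llm.generate([prompt], SamplingParams(max_tokens=64))
        return {"text": out[0].outputs[0].text}

app = VLLMDeployment.bind()
# serve.run(app)  # exposes an HTTP endpoint, by default at http://localhost:8000/
```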