Cody Yu (@codyhaoyu)'s Twitter Profile
Cody Yu

@codyhaoyu

MLSys, LLM Serving, Deep Learning Compiler

ID: 836647857490804736

Link: https://www.linkedin.com/in/cody-hao-yu
Joined: 28-02-2017 18:42:48

106 Tweets

176 Followers

27 Following

Anyscale (@anyscalecompute):

🦙 We're excited to host Meta Llama-3 8b and 70b on Anyscale Endpoints!

➕ Fine-tuning, JSON mode and function calling support coming soon as well!

Pricing: 
- 8B: $0.15 / Million tokens
- 70B: $1.00 / Million tokens
Yangqing Jia (@jiayq):

I applaud OctoAI's newest achievement, but clear fact checking is really needed. The blog post's title figure shows "multi-user throughput" and then proceeds to say "concurrency=1". What? I know we are all competing on LLM inference speeds, but let me say this: vLLM is a solid

Anyscale (@anyscalecompute):

Recently, we've contributed chunked prefill to vLLM, leading to up to 2x speedup for higher QPS regimes!

In vLLM, prefilling, which fills the KV cache, and decoding, which outputs new tokens, can interfere with each other, resulting in latency degradation. 1/n
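
The thread names the feature but not how to turn it on; below is a minimal sketch (not from the thread, names and values are illustrative) of enabling chunked prefill through vLLM's offline `LLM` API, assuming a release where the `enable_chunked_prefill` and `max_num_batched_tokens` engine arguments are available.

```python
from vllm import LLM, SamplingParams

# Illustrative sketch: split long prompt prefills into chunks so they can be
# batched together with decode steps instead of stalling them.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model choice
    enable_chunked_prefill=True,                  # turn on chunked prefill
    max_num_batched_tokens=2048,                  # per-step token budget shared by prefill and decode
)

outputs = llm.generate(
    ["Summarize why chunked prefill helps at high QPS."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```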
Cade Daniel 🇺🇸 (@cdnamz):

We've got great projects at Anyscale, come work with us. We've shipped:
• Chunked prefill (Sang Cho)
• Multi-LoRA (Antoni Baum)
• Dynamic spec decode (Lily Liu)
• FP8 (Cody Yu)
• MoE optimization (Philipp Moritz)
• Ray, dist. compute framework used to train ChatGPT (Robert Nishihara et al)

Anyscale (@anyscalecompute):

There has been so much excitement and activity around this topic, that we are adding a vLLM track to the Ray Summit!

If you contribute to or use vLLM, we want to hear from you.

raysummit.anyscale.com
Anyscale (@anyscalecompute):

We've recently contributed FP8 support to vLLM in collaboration with @neuralmagic.

With this feature, you can see up to a 1.8x reduction in inter-token latency, with >99% accuracy preservation!

1/n
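
As a rough illustration of what this looks like from the user side, here is a hedged sketch (not from the thread) that requests vLLM's FP8 path via the `quantization` engine argument; it assumes an FP8-capable GPU such as an H100, and the model name is illustrative.

```python
from vllm import LLM, SamplingParams

# Illustrative sketch: ask vLLM to quantize weights/activations to FP8 at load
# time. Assumes a GPU and vLLM build that support the FP8 path described above.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model choice
    quantization="fp8",
)

out = llm.generate(
    ["Explain the latency/accuracy trade-off of FP8 inference."],
    SamplingParams(max_tokens=48),
)
print(out[0].outputs[0].text)
```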
Red Hat AI (@redhat_ai):

Our latest vLLM Office Hours recording is ready! We dive deep into FP8 quantization with vLLM Committers from Anyscale and share the newest updates in vLLM v0.5.1. youtu.be/GLqsETc8aTc

Red Hat AI (@redhat_ai):

EXCITING NEWS: Neural Magic and Anyscale contributed FP8 quantization support to vLLM, making LLM inference more efficient.

FP8 reduces latency on NVIDIA GPUs by 2x with >99% accuracy preservation.

Cheers to NVIDIA AI Developer for validating our results.

1/6
vLLM (@vllm_project):

We are excited to invite everyone to our 5th meetup with AWS on July 24 in SF (next Wed!). The team will share recent progress in FP8, pipeline parallel, and various work on perf. The space is limited to 150 so plz register ASAP! lu.ma/lp0gyjqr

lmarena.ai (formerly lmsys.org) (@lmarena_ai):

We are thrilled to announce the milestone release of SGLang Runtime v0.2, featuring significant inference optimizations after months of hard work.

It achieves up to 2.1x higher throughput compared to TRT-LLM and up to 3.8x higher throughput compared to vLLM. It consistently
vLLM (@vllm_project):

A month ago, we announced our performance roadmap. Today, we are happy to share that the latest release achieves 🚀2.7x higher throughput and is 5x faster for output latency on Llama 8B, and 1.8x higher throughput and 2x faster on Llama 70B for H100s. blog.vllm.ai/2024/09/05/per…

vLLM (@vllm_project):

We are beyond excited to partner with @Anyscale in #RaySummit 2024 for a dedicated vLLM track featuring 10+ sessions about vLLM use cases and features! If you are an AI/ML developer exploring and/or using vLLM, we highly recommend attending these sessions.

raysummit.anyscale.com
Zihao Ye (@ye_combinator):

We are excited to announce FlashInfer v0.2!

Core contributions of this release include:
- Block/Vector Sparse (Paged) Attention on FlashAttention-3
- JIT compilation for customized attention variants
- Fused Multi-head Latent Attention (MLA) decoding kernel
- Lots of bugfixes and
Roger Wang (@rogerw0108):

Sneak peek of what I've been working on with Cody (Cody Yu), Alex (github.com/alexm-neuralma…) and a few others. Still a lot of room for improvement.

Easier, faster, cheaper :)
vLLM (@vllm_project):

🚀 With the v0.7.0 release today, we are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup! 
Clean code, optimized execution loop, zero-overhead prefix caching, enhanced multimodal support, and more.
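
For readers who want to try the alpha, a minimal sketch is below. It assumes the V1 engine is still opt-in via the `VLLM_USE_V1` environment variable (as it was around the v0.7.x series) and uses the `enable_prefix_caching` flag; the model name is illustrative.

```python
import os

# Assumption: the alpha V1 engine is gated behind this environment variable,
# which must be set before vLLM is imported.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model choice
    enable_prefix_caching=True,  # reuse KV cache across requests that share a prompt prefix
)

print(llm.generate(["Hello!"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```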
vLLM (@vllm_project):

We are excited to invite you to our Menlo Park meetup with Meta, evening of Thursday, February 27! Meta engineers will discuss the improvements on top of vLLM, and committer Cody Yu will share updates from the v0.7.x series of releases. lu.ma/h7g3kuj9

kourosh hakhamaneshi (@cyrushakha):

Announcing native LLM APIs in the Ray Data and Ray Serve libraries. These are experimental APIs we are announcing today that abstract two things:
1. Serve LLM: simplifies the deployment of LLM engines (e.g. vLLM) through Ray Serve APIs. Enables things like
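
The announcement is truncated above, so for context on what "Serve LLM" abstracts away, here is a hedged sketch of the manual equivalent: wrapping a vLLM engine in a plain Ray Serve deployment by hand. This is not the new `ray.serve.llm` API itself; the model name, resource settings, and request format are illustrative assumptions.

```python
from ray import serve
from vllm import LLM, SamplingParams

@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class VLLMDeployment:
    def __init__(self):
        # One vLLM engine per replica; the model loads when the replica starts.
        self.llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # illustrative

    async def __call__(self, request):
        # Expect a JSON body like {"prompt": "..."} (illustrative request format).
        prompt = (await request.json())["prompt"]
        out = self.llm.generate([prompt], SamplingParams(max_tokens=64))
        return {"text": out[0].outputs[0].text}

app = VLLMDeployment.bind()
# serve.run(app)  # exposes an HTTP endpoint, by default at http://localhost:8000/
```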