Lequn Chen (@abcdabcd987)'s Twitter Profile
Lequn Chen

@abcdabcd987

Faster and cheaper LLM inference.

ID: 470747166

Link: https://abcdabcd987.com · Joined: 22-01-2012 03:26:52

25 Tweets

893 Followers

559 Following

Lequn Chen (@abcdabcd987)'s Twitter Profile Photo

Just made a demo: use Punica to serve multiple LoRA finetuned LLMs at the cost of one! Previously: x.com/abcdabcd987/st…
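The trick behind serving many LoRA-finetuned models for the cost of one base model can be sketched in a few lines. This is a toy numpy illustration of the idea (shared base GEMM plus per-request low-rank deltas), not Punica's actual SGMV kernel; all variable names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # hidden size, LoRA rank

W = rng.normal(size=(d, d))  # base weight, shared by every tenant
adapters = [(rng.normal(size=(d, r)), rng.normal(size=(r, d)))
            for _ in range(3)]  # per-tenant low-rank (A, B) pairs

x = rng.normal(size=(3, d))  # a batch with one token per tenant
base = x @ W                 # the expensive base GEMM runs once for the whole batch
lora = np.stack([x[i] @ A @ B for i, (A, B) in enumerate(adapters)])
y = base + lora              # each request still gets its own finetuned output
```

Each row of `y` equals what a dedicated copy of the merged model `W + A @ B` would produce, but the dominant cost (the base GEMM) is paid only once across all adapters.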

Lequn Chen (@abcdabcd987)'s Twitter Profile Photo

Really good observation from Tianle Cai and Junru Shao. I did a quick sanity check. Delta between Mixtral 8x7B MoE and Mistral 7B is NOT low-rank. SGMV is not applicable here. We need new research :)

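A "quick sanity check" for low-rankness of a weight delta can be done by looking at how much of the delta's energy its top singular values capture. A minimal sketch (my own illustration of the kind of check described, not the author's actual script):

```python
import numpy as np

def lowrank_energy(delta, k=8):
    """Fraction of squared Frobenius norm captured by the top-k singular values."""
    s = np.linalg.svd(delta, compute_uv=False)
    return float((s[:k] ** 2).sum() / (s ** 2).sum())

rng = np.random.default_rng(0)
d = 256

# A genuinely low-rank delta (the shape a LoRA update has): top-k captures ~all energy.
A, B = rng.normal(size=(d, 4)), rng.normal(size=(4, d))
print(lowrank_energy(A @ B, k=8))  # ~1.0

# A dense delta with no low-rank structure: the energy is spread across the spectrum.
print(lowrank_energy(rng.normal(size=(d, d)), k=8))  # far below 1.0
```

If the delta between two checkpoints behaved like the first case, a low-rank method such as SGMV would apply; the tweet's observation is that the Mixtral/Mistral delta behaves like the second.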
Baris Kasikci (@bariskasikci)'s Twitter Profile Photo

We just released the source code for Atom, an efficient and accurate quantization algorithm for Large Language Model serving: github.com/efeslab/Atom. Atom provides up to 7x throughput improvements while maintaining great accuracy.

Zihao Ye (@ye_combinator)'s Twitter Profile Photo

(1/3) Memory Bandwidth Efficient Shared Prefix Batch Decoding, brought to you by FlashInfer: blog: flashinfer.ai/2024/02/02/cas… Trying out our APIs: docs.flashinfer.ai/api/python/cas…

Lequn Chen (@abcdabcd987)'s Twitter Profile Photo

🚀FlashInfer: Highly optimized Attention kernel for {single, batch} x {prefill, decode, append} x {ragged tensor, paging} x {FP16, FP8, INT4} x {4090, Ada6000, A100, H100} 🔥Python Wheels available! Check it out!

Perplexity (@perplexity_ai)'s Twitter Profile Photo

We’re excited to announce an updated version of our Pro Search that can perform deeper research on more complex queries with multi-step reasoning, Wolfram|Alpha, and code execution.

Lequn Chen (@abcdabcd987)'s Twitter Profile Photo

We are building our in-house LLM inference stack. Join us if this excites you! And, I have a more in-depth tutorial about achieving 3200 Gbps here: le.qun.ch/en/blog/2024/1…

Lequn Chen (@abcdabcd987)'s Twitter Profile Photo

10x faster than PyTorch All-to-All. 2x faster than DeepEP on single node. Although 2x slower than DeepEP on 128 GPUs, our impl is less picky about hardware and software. Make your MoE go brrr github.com/ppl-ai/pplx-ke…
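The role all-to-all plays in MoE serving (the communication pattern the pplx kernels and DeepEP accelerate) can be sketched in a single process. This is a toy simulation of the collective for illustration only, not the actual kernels; names here are hypothetical:

```python
import numpy as np

def all_to_all(send):
    """Simulate an all-to-all collective: rank s sends send[s][d] to rank d."""
    n = len(send)
    return [[send[s][d] for s in range(n)] for d in range(n)]

rng = np.random.default_rng(0)
n_ranks, tokens_per_rank, d = 4, 8, 16

# Each rank owns a slab of tokens; a router assigns every token to an
# expert (one expert per rank in this toy setup).
tokens = [rng.normal(size=(tokens_per_rank, d)) for _ in range(n_ranks)]
routes = [rng.integers(0, n_ranks, size=tokens_per_rank) for _ in range(n_ranks)]

# Dispatch: bucket tokens by destination expert, then exchange the buckets.
send = [[tokens[s][routes[s] == dst] for dst in range(n_ranks)]
        for s in range(n_ranks)]
recv = all_to_all(send)

# After the exchange, each expert holds exactly the tokens routed to it.
gathered = [np.concatenate(bufs) for bufs in recv]
```

A second all-to-all with the buckets reversed would return the expert outputs to the ranks that own the tokens; making these two exchanges fast and hardware-tolerant is what the kernel work is about.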

NVIDIA AI Developer (@nvidiaaidev)'s Twitter Profile Photo

🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won best paper at #MLSys2025. 🏆 🙌 We are excited to share that we are now backing FlashInfer – a supporter and

Lequn Chen (@abcdabcd987)'s Twitter Profile Photo

I prefer this UI (win2003 even better) to today's UI. Today's UI feels inconsistent, whitespace is too big, and info is hidden in nested menus. Screens and resolutions get bigger and bigger, but information density gets lower and lower.