Lequn Chen (@abcdabcd987)'s Twitter Profile
Lequn Chen

@abcdabcd987

Faster and cheaper LLM inference.

ID: 470747166

Link: https://abcdabcd987.com · Joined: 22-01-2012 03:26:52

25 Tweets

893 Followers

559 Following

Lequn Chen (@abcdabcd987)'s Twitter Profile Photo

Just made a demo: use Punica to serve multiple LoRA finetuned LLMs at the cost of one! Previously: x.com/abcdabcd987/st…
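The trick behind serving many LoRA-finetuned models for the cost of one base model can be sketched in a few lines. This is a toy numpy illustration of the idea (shared base GEMM plus per-request low-rank deltas), not Punica's actual SGMV kernel; all variable names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # hidden size, LoRA rank

W = rng.normal(size=(d, d))  # base weight, shared by every tenant
adapters = [(rng.normal(size=(d, r)), rng.normal(size=(r, d)))
            for _ in range(3)]  # per-tenant low-rank (A, B) pairs

x = rng.normal(size=(3, d))  # a batch with one token per tenant
base = x @ W                 # the expensive base GEMM runs once for the whole batch
lora = np.stack([x[i] @ A @ B for i, (A, B) in enumerate(adapters)])
y = base + lora              # each request still gets its own finetuned output
```

Each row of `y` equals what a dedicated copy of the merged model `W + A @ B` would produce, but the dominant cost (the base GEMM) is paid only once across all adapters.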

Lequn Chen (@abcdabcd987)'s Twitter Profile Photo

Really good observation from Tianle Cai and Junru Shao. I did a quick sanity check. Delta between Mixtral 8x7B MoE and Mistral 7B is NOT low-rank. SGMV is not applicable here. We need new research :)

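A "quick sanity check" for low-rankness of a weight delta can be done by looking at how much of the delta's energy its top singular values capture. A minimal sketch (my own illustration of the kind of check described, not the author's actual script):

```python
import numpy as np

def lowrank_energy(delta, k=8):
    """Fraction of squared Frobenius norm captured by the top-k singular values."""
    s = np.linalg.svd(delta, compute_uv=False)
    return float((s[:k] ** 2).sum() / (s ** 2).sum())

rng = np.random.default_rng(0)
d = 256

# A genuinely low-rank delta (the shape a LoRA update has): top-k captures ~all energy.
A, B = rng.normal(size=(d, 4)), rng.normal(size=(4, d))
print(lowrank_energy(A @ B, k=8))  # ~1.0

# A dense delta with no low-rank structure: the energy is spread across the spectrum.
print(lowrank_energy(rng.normal(size=(d, d)), k=8))  # far below 1.0
```

If the delta between two checkpoints behaved like the first case, a low-rank method such as SGMV would apply; the tweet's observation is that the Mixtral/Mistral delta behaves like the second.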
Baris Kasikci (@bariskasikci)'s Twitter Profile Photo

We just released the source code for Atom, an efficient and accurate quantization algorithm for Large Language Model serving: github.com/efeslab/Atom. Atom provides up to 7x throughput improvements while maintaining great accuracy.

Zihao Ye (@ye_combinator)'s Twitter Profile Photo

(1/3) Memory Bandwidth Efficient Shared Prefix Batch Decoding, brought to you by FlashInfer: blog: flashinfer.ai/2024/02/02/cas… Trying out our APIs: docs.flashinfer.ai/api/python/cas…

Lequn Chen (@abcdabcd987)'s Twitter Profile Photo

🚀FlashInfer: Highly optimized Attention kernel for {single, batch} x {prefill, decode, append} x {ragged tensor, paging} x {FP16, FP8, INT4} x {4090, Ada6000, A100, H100} 🔥Python Wheels available! Check it out!

Perplexity (@perplexity_ai)'s Twitter Profile Photo

We’re excited to announce an updated version of our Pro Search that can perform deeper research on more complex queries with multi-step reasoning, Wolfram|Alpha, and code execution.

Lequn Chen (@abcdabcd987)'s Twitter Profile Photo

We are building our in-house LLM inference stack. Join us if this excites you! And, I have a more in-depth tutorial about achieving 3200 Gbps here: le.qun.ch/en/blog/2024/1…

Lequn Chen (@abcdabcd987)'s Twitter Profile Photo

10x faster than PyTorch All-to-All. 2x faster than DeepEP on single node. Although 2x slower than DeepEP on 128 GPUs, our impl is less picky about hardware and software. Make your MoE go brrr github.com/ppl-ai/pplx-ke…
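The role all-to-all plays in MoE serving (the communication pattern the pplx kernels and DeepEP accelerate) can be sketched in a single process. This is a toy simulation of the collective for illustration only, not the actual kernels; names here are hypothetical:

```python
import numpy as np

def all_to_all(send):
    """Simulate an all-to-all collective: rank s sends send[s][d] to rank d."""
    n = len(send)
    return [[send[s][d] for s in range(n)] for d in range(n)]

rng = np.random.default_rng(0)
n_ranks, tokens_per_rank, d = 4, 8, 16

# Each rank owns a slab of tokens; a router assigns every token to an
# expert (one expert per rank in this toy setup).
tokens = [rng.normal(size=(tokens_per_rank, d)) for _ in range(n_ranks)]
routes = [rng.integers(0, n_ranks, size=tokens_per_rank) for _ in range(n_ranks)]

# Dispatch: bucket tokens by destination expert, then exchange the buckets.
send = [[tokens[s][routes[s] == dst] for dst in range(n_ranks)]
        for s in range(n_ranks)]
recv = all_to_all(send)

# After the exchange, each expert holds exactly the tokens routed to it.
gathered = [np.concatenate(bufs) for bufs in recv]
```

A second all-to-all with the buckets reversed would return the expert outputs to the ranks that own the tokens; making these two exchanges fast and hardware-tolerant is what the kernel work is about.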

NVIDIA AI Developer (@nvidiaaidev)'s Twitter Profile Photo

🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won best paper at #MLSys2025. 🏆 🙌 We are excited to share that we are now backing FlashInfer – a supporter and

Lequn Chen (@abcdabcd987)'s Twitter Profile Photo

I prefer this UI (win2003 even better) to today's UI. Today's UI feels inconsistent, whitespace is too big, and info is hidden in nested menus. Screens and resolutions get bigger and bigger, but information density gets lower and lower.