Huiqiang Jiang (@iofu728) 's Twitter Profile
Huiqiang Jiang

@iofu728

RSDE @MSFTResearch Shanghai

ID: 1156745200770768896

Link: https://hqjiang.com/ · Joined: 01-08-2019 01:55:21

107 Tweets

248 Followers

547 Following

elvis (@omarsar0) 's Twitter Profile Photo


A KV Cache-Centric Analysis of Long-Context Methods

Evaluates long-context methods from a KV cache-centric perspective: 1) KV cache generation, 2) KV cache compression, 3) KV cache retrieval, 4) KV cache loading.

The paper reports some interesting findings. For instance, they
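The four lifecycle stages in the tweet's framing can be illustrated with a minimal decode loop that reuses cached keys/values. This is a hedged NumPy sketch: the shapes, the key-norm compression heuristic, and the similarity-based retrieval rule are illustrative assumptions, not the paper's actual methods.

```python
import numpy as np

def attend(q, K, V):
    # Single-query softmax attention over cached keys/values.
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

d = 8
rng = np.random.default_rng(0)
# 1) KV cache generation: the cache grows as tokens are processed.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for _ in range(16):
    k, v, q = rng.normal(size=(3, d))
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)

# 2) KV cache compression: keep only 4 entries (toy key-norm heuristic).
keep = np.argsort(np.linalg.norm(K_cache, axis=1))[-4:]
K_small, V_small = K_cache[keep], V_cache[keep]

# 3) KV cache retrieval: select entries most relevant to the current query.
retrieved = np.argsort(K_cache @ q)[-4:]

# 4) KV cache loading: in a real system the selected KV blocks would be
#    moved from CPU/disk into GPU memory before attention runs.
out_small = attend(q, K_small, V_small)
print(K_cache.shape, K_small.shape)  # (16, 8) (4, 8)
```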
Huiqiang Jiang (@iofu728) 's Twitter Profile Photo

SCBench has been accepted by #ICLR2025! Now you can evaluate your long-context methods across the full KV cache lifecycle. Congratulations to Yucheng Li and all my co-authors! Find more details at aka.ms/SCBench

Huiqiang Jiang (@iofu728) 's Twitter Profile Photo

Great work! 🚀 Exciting to see MInference deployed on servers, and the integration of chunked prefill, dynamic sparsity, and DCA open-sourced along with the vLLM implementation! Thank you for your incredible work 🥁

Huiqiang Jiang (@iofu728) 's Twitter Profile Photo

🔥 Excellent work on dynamic sparse attention, especially its performance improvement in long CoT! Looking forward to your next release!

Aran Komatsuzaki (@arankomatsuzaki) 's Twitter Profile Photo


Microsoft presents MMInference

- Accelerates pre-filling for long-context VLMs via modality-aware permutation
- Accelerates 8.3x at 1M tokens while maintaining accuracy
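The "modality-aware permutation" idea can be sketched in a few lines: reorder an interleaved text/vision token sequence so that same-modality tokens become contiguous, which lets a kernel operate on dense per-modality blocks, then invert the permutation afterward. This is a generic illustration under assumed toy labels, not MMInference's actual algorithm.

```python
import numpy as np

# Interleaved sequence: 0 = text token, 1 = vision token (toy labels).
modality = np.array([0, 1, 1, 0, 1, 0, 0, 1])
tokens = np.arange(len(modality))          # stand-ins for token embeddings

# Modality-aware permutation: group each modality into a contiguous block.
perm = np.argsort(modality, kind="stable")  # stable sort keeps within-modality order
permuted = tokens[perm]
print(permuted)  # [0 3 5 6 1 2 4 7] -> text block first, then vision block

# After (block-sparse) attention, the inverse permutation restores order.
inv = np.argsort(perm)
restored = permuted[inv]
assert np.array_equal(restored, tokens)
```

Grouping by modality matters because attention patterns within one modality tend to be more regular than across the interleaved sequence, so the permuted layout maps better onto hardware-friendly dense tiles.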
Huiqiang Jiang (@iofu728) 's Twitter Profile Photo


✈️to ICLR'25. Looking forward to meeting you all and discussing efficient LLMs. #ICLR25   

- (Apr. 24 10:00 #291) SCBench aka.ms/SCBench
- (Apr. 25 13:30-14 Microsoft Booth) Efficient Long-context Methods 
- (Apr. 26 15:00-17:30 #58) SeCom aka.ms/SeCom
Huiqiang Jiang (@iofu728) 's Twitter Profile Photo

Thanks Aran Komatsuzaki for the promotion! MMInference, a bottom-up system–algorithm co-designed sparse attention method, lets long-context VLMs process 1M-token videos 8.3x faster. We'll present it at the ICLR'25 Microsoft Booth (Apr. 25, 13:30). aka.ms/MMInference

Piotr Nawrot (@p_nawrot) 's Twitter Profile Photo


Sparse attention is one of the most promising strategies to unlock long-context processing and long generation reasoning in LLMs.

We performed the most comprehensive study on training-free sparse attention to date.

Here is what we found:
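One common training-free pattern in this space is to restrict each query to its top-k highest-scoring keys at inference time, with no retraining. The sketch below is a generic illustration of that pattern in NumPy, not any specific method from the study; the shapes and `k` value are assumptions.

```python
import numpy as np

def sparse_attention(Q, K, V, k=4):
    """Training-free top-k sparse attention: each query attends only to
    its k highest-scoring keys (generic sketch, not a specific paper)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                  # (n_q, n_k)
    # Per-row threshold: the k-th largest score in each query row.
    kth = np.partition(scores, -k, axis=-1)[:, -k:].min(axis=-1, keepdims=True)
    masked = np.where(scores >= kth, scores, -np.inf)        # drop the rest
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(1)
Q = rng.normal(size=(6, 8))    # 6 queries, head dim 8
K = rng.normal(size=(16, 8))   # 16 cached keys
V = rng.normal(size=(16, 8))
out = sparse_attention(Q, K, V, k=4)  # each query touches 4 of 16 keys
print(out.shape)  # (6, 8)
```

In practice such methods usually select contiguous blocks of keys rather than individual ones, so the saved work translates into actual kernel speedups.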
Hanshi Sun (@preminstrel) 's Twitter Profile Photo


🎉 Thrilled to announce our ShadowKV has been accepted to #ICML2025 as a ✨Spotlight Presentation❗️

❓Facing challenges with high-throughput long-context LLM serving? ShadowKV is here to help! 

🚀 Achieves memory-efficient & high-throughput inference via sparse attention.
Huiqiang Jiang (@iofu728) 's Twitter Profile Photo

MMInference has been accepted to #ICML2025! It uses permutation to address inductive-bias and modality-boundary issues in multi-modal inputs, and unifies dynamic sparse attention in a sparse-load + dense-tensor-core pipeline. Congratulations to Yucheng Li! Find more at aka.ms/MMInference

Cognition (@cognition_labs) 's Twitter Profile Photo


Our research interns present:
Kevin-32B = K(ernel D)evin

It's the first open model trained using RL for writing CUDA kernels. We implemented multi-turn RL using GRPO (based on QwQ-32B) on the KernelBench dataset.

It outperforms top reasoning models (o3 & o4-mini)! 🧵