Yixin Dong (@yi_xin_dong)'s Twitter Profile
Yixin Dong

@yi_xin_dong

Ph.D. student @SCSatCMU, prev @deepseek_ai, @uwcse, @sjtu1896. @ApacheTVM contributor. Working on ML and systems. All views are my own

ID: 1028438483365490688

Link: https://github.com/Ubospica
Joined: 12-08-2018 00:30:16

68 Tweets

400 Followers

544 Following

Xinyu Yang (@xinyu2ml)'s Twitter Profile Photo

We will be presenting "APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding", a novel encoding method that enables:
🚀Pre-caching Contexts for Fast Inference
🐍Re-using Positions for Long Context

Our poster session is located in Hall 3 and Hall 2B,
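
For context on the parallel-encoding idea, here is a toy numpy sketch of my reading of the abstract: each context is encoded independently from the same starting position, so its KV cache can be precomputed once and reused, and positions do not grow with the number of contexts. The `scale` knob is a made-up stand-in for APE's adaptive adjustments, not anything from the paper's code.

```python
import numpy as np

d = 16
rng = np.random.default_rng(0)

def encode_context(tokens, start_pos):
    """Stand-in encoder: returns (K, V) for one context, with position ids
    assigned starting at start_pos (a toy additive positional signal)."""
    n = len(tokens)
    pos = np.arange(start_pos, start_pos + n)
    K = rng.standard_normal((n, d)) + 0.01 * pos[:, None]
    V = rng.standard_normal((n, d))
    return K, V

contexts = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Parallel encoding: every context is encoded independently from position 0,
# so each (K, V) pair can be pre-cached once and reused across queries, and
# positions are shared instead of growing with each appended context.
kv_parallel = [encode_context(c, start_pos=0) for c in contexts]

K = np.concatenate([k for k, _ in kv_parallel])
V = np.concatenate([v for _, v in kv_parallel])

q = rng.standard_normal(d)
scale = 0.9  # hypothetical stand-in for APE's adaptive attention adjustment
logits = scale * (K @ q) / np.sqrt(d)
attn = np.exp(logits - logits.max())
attn /= attn.sum()
out = attn @ V
print(out.shape)  # (16,)
```
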
Muyang Li (@lmxyy1999)'s Twitter Profile Photo

🚀 How to run 12B FLUX.1 on your local laptop with 2-3× speedup? Come check out our #SVDQuant (#ICLR2025 Spotlight) poster session! 🎉 
🗓️ When: Friday, Apr 25, 10–12:30 (Singapore time)
📍 Where: Hall 3 + Hall 2B, Poster 169
📌 Poster: tinyurl.com/poster-svdquant
🎮 Demo:
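
A minimal numpy sketch of the core SVDQuant decomposition as I understand it: keep a small low-rank branch (from an SVD) in high precision and quantize the residual to 4 bits. The rank, the symmetric int4 scheme, and the omission of the paper's outlier-smoothing step are simplifications of mine, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))  # a stand-in weight matrix

# Low-rank branch kept in high precision (rank r is an assumed knob).
r = 16
U, S, Vt = np.linalg.svd(W, full_matrices=False)
L1 = U[:, :r] * S[:r]
L2 = Vt[:r]

# Quantize the residual to 4 bits (symmetric int4, values in [-7, 7]).
R = W - L1 @ L2
scale = np.abs(R).max() / 7.0
Rq = np.clip(np.round(R / scale), -7, 7).astype(np.int8)

# Inference-time reconstruction: low-rank branch + dequantized residual.
W_hat = L1 @ L2 + Rq.astype(np.float32) * scale
err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"relative reconstruction error: {err:.4f}")
```
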
Cognition (@cognition_labs)'s Twitter Profile Photo

Project DeepWiki: up-to-date documentation you can talk to, for every repo in the world. Think Deep Research for GitHub – powered by Devin. It’s free for open-source, no sign-up! Visit deepwiki.com or just swap github → deepwiki on any repo URL:
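
The swap the tweet describes, as a one-liner (the example repo is an arbitrary pick of mine):

```python
url = "https://github.com/mlc-ai/xgrammar"
wiki_url = url.replace("github.com", "deepwiki.com", 1)
print(wiki_url)  # https://deepwiki.com/mlc-ai/xgrammar
```
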

Saining Xie (@sainingxie)'s Twitter Profile Photo

Wow, Deeply Supervised Nets received the Test of Time Award at AISTATS 2025! It was the very first paper I submitted during my PhD. Fun fact: the paper was originally rejected by NeurIPS with scores of 8/8/7 (yes, that pain stuck with me... maybe now I can finally let it

zhyncs (@zhyncs42)'s Twitter Profile Photo

MLSys 2025 is coming up! Want to meet the developers behind FlashInfer, XGrammar, and SGLang from LMSYS Org in person? Join us for the Happy Hour on May 12—we’d love to see you there! lu.ma/dl99yjoe

Si-ze Zheng (@deeplyignorant)'s Twitter Profile Photo

🚀 We released Triton-distributed!
🌟 Build compute–communication overlapping kernels for GPUs—performance rivals optimized libraries
🔗 github.com/ByteDance-Seed…
👏 Shoutout to AMD for testing our work! Check their blog:
🔗 …rocm-blogs--981.com.readthedocs.build/projects/inter…

Yixin Dong (@yi_xin_dong)'s Twitter Profile Photo

We are hosting a happy hour with LMSYS Org at #mlsys2025! Join us for engaging talks on SGLang, the structured generation library XGrammar, and the high-performance kernel library FlashInfer. Enjoy great food, lively discussions, and connect with the community! Click to join 👉

NVIDIA AI Developer (@nvidiaaidev)'s Twitter Profile Photo

🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won best paper at #MLSys2025. 🏆
🙌 We are excited to share that we are now backing FlashInfer – a supporter and

Zihao Ye (@ye_combinator)'s Twitter Profile Photo

We’re thrilled that FlashInfer won a Best Paper Award at MLSys 2025! 🎉 This wouldn’t have been possible without the community — huge thanks to LMSYS Org’s sglang for deep co-design (which is critical for inference kernel evolution) and stress-testing over the years, and to

Xinyu Yang (@xinyu2ml)'s Twitter Profile Photo

🌟 Don't miss out! The paper submission deadline for the R2-FM workshop is May 30th (AoE). We welcome your related work contributions! ❤️‍🔥

𝚐𝔪𝟾𝚡𝚡𝟾 (@gm8xx8)'s Twitter Profile Photo

Hardware-Efficient Attention for Fast Decoding

Princeton optimizes decoding by maximizing arithmetic intensity (FLOPs/byte) for better memory–compute efficiency:

- GTA (Grouped-Tied Attention)
Ties key/value states + partial RoPE → 2× arithmetic intensity vs. GQA, ½ KV cache,
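
A back-of-envelope check of the tied-KV claims in the list above; the GQA configuration (8 KV heads, head dim 128, fp16) is an assumption of mine, not the paper's setup:

```python
# Assumed config: 8 KV heads, head dim 128, fp16 (2 bytes per element).
n_kv_heads, head_dim, bytes_per_elem = 8, 128, 2

def kv_bytes_per_token(tied: bool) -> int:
    # Untied (GQA): separate K and V states. Tied (GTA): one shared state
    # serves as both key and value, halving per-token cache traffic.
    n_states = 1 if tied else 2
    return n_states * n_kv_heads * head_dim * bytes_per_elem

gqa = kv_bytes_per_token(tied=False)
gta = kv_bytes_per_token(tied=True)
# Same attention FLOPs over half the bytes => ~2x arithmetic intensity.
print(gqa, gta, gqa / gta)  # 4096 2048 2.0
```
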
Intology (@intologyai)'s Twitter Profile Photo

The 1st fully AI-generated scientific discovery to pass the highest level of peer review – the main track of an A* conference (ACL 2025). Zochi, the 1st PhD-level agent. Beta open.

Enze Xie (@xieenze_jr)'s Twitter Profile Photo

🚀 Fast-dLLM: 27.6× Faster Diffusion LLMs with KV Cache & Parallel Decoding 💥

Key Features🌟  
- Block-Wise KV Cache  
  Reuses 90%+ attention activations via bidirectional caching (prefix/suffix), enabling 8.1×–27.6× throughput gains with <2% accuracy loss 🔄
-
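
A conceptual numpy sketch of the block-wise reuse described above (my simplification, not the Fast-dLLM code): KV for blocks outside the one being denoised is computed once and reused across denoising steps; only the active block is refreshed. The projection and update rules are placeholders.

```python
import numpy as np

d, block = 16, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((12, d))  # 3 blocks of 4 tokens

def kv(h):
    # Stand-in KV projection (real models use learned linear layers).
    return h * 0.5, h * 0.25

blocks = [x[i:i + block] for i in range(0, len(x), block)]
cache = [kv(b) for b in blocks]  # prefix/suffix KV computed once and reused

active = 1  # the block currently being denoised
for step in range(3):  # denoising steps touch only the active block
    blocks[active] = blocks[active] + 0.1 * rng.standard_normal((block, d))
    cache[active] = kv(blocks[active])  # refresh only the active block's KV

K = np.concatenate([k for k, _ in cache])
V = np.concatenate([v for _, v in cache])
print(K.shape, V.shape)  # (12, 16) (12, 16)
```
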
Hao Kang (@gt_haokang)'s Twitter Profile Photo

🚀📉 A new kind of efficiency challenge: "Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs"
We explore a new frontier: what if the reward doesn’t come from being right—but from being fast and right?
🔗 arxiv.org/abs/2505.19481 🛜

Databricks (@databricks)'s Twitter Profile Photo

Announcing Agent Bricks: auto-optimize agents for your domain tasks. Provide a high-level description of the agent’s task, and connect your enterprise data — Agent Bricks handles the rest. Agent Bricks builds out an agent system that automatically optimizes against your goals

Infini-AI-Lab (@infiniailab)'s Twitter Profile Photo

🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation.
🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46%
🌐 Website: multiverse4fm.github.io
🧵 1/n

Xinyu Yang (@xinyu2ml)'s Twitter Profile Photo

🚀 Super excited to share Multiverse! 🏃 It’s been a long journey exploring the space between model design and hardware efficiency. What excites me most is realizing that, beyond optimizing existing models, we can discover better model architectures by embracing system-level

Zhihao Jia (@jiazhihao)'s Twitter Profile Photo

One of the best ways to reduce LLM latency is by fusing all computation and communication into a single GPU megakernel. But writing megakernels by hand is extremely hard.

🚀Introducing Mirage Persistent Kernel (MPK), a compiler that automatically transforms LLMs into optimized
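
A conceptual sketch only, not MPK's API: the megakernel idea replaces many per-op kernel launches with one persistent loop that drains a task queue, so there is no launch gap between ops. Python threads stand in for persistent GPU workers, the task names are invented, and the inter-task dependencies MPK actually tracks are omitted here.

```python
import queue
import threading

# Invented task names standing in for an LLM's fused task graph.
tasks = queue.Queue()
for name in ["embed", "attn_layer0", "mlp_layer0", "attn_layer1", "mlp_layer1"]:
    tasks.put(name)

def persistent_worker(worker_id: int) -> None:
    # One long-lived "kernel" per worker: tasks are pulled from the queue
    # with no per-task launch, mimicking a megakernel's internal dispatch.
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return  # all work drained; the single "launch" ends
        print(f"worker {worker_id}: {task}")

workers = [threading.Thread(target=persistent_worker, args=(i,)) for i in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```
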