Charlie Ruan (@charlie_ruan)'s Twitter Profile
Charlie Ruan

@charlie_ruan

MSCS @CSDatCMU | prev @CornellCIS

ID: 2733746502

Link: https://www.charlieruan.com · Joined: 15-08-2014 05:49:35

125 Tweets

569 Followers

347 Following

Vaibhav (VB) Srivastav (@reach_vb)'s Twitter Profile Photo

Fuck it! Structured Generation w/ SmolLM2 running in browser & WebGPU 🔥 Powered by MLC Web-LLM & XGrammar ⚡ Define a JSON schema, input free text, get structured data right in your browser - profit!! To showcase how much you can do with just a 1.7B LLM, you pass free text, …

Simon Willison (@simonw)'s Twitter Profile Photo

Amazing demo by Vaibhav Srivastav of structured data extraction running on an LLM that executes entirely in the browser (Chrome only for the moment since it uses WebGPU). simonwillison.net/2024/Nov/29/st…

Zihao Ye (@ye_combinator)'s Twitter Profile Photo

We are excited to announce FlashInfer v0.2!

Core contributions of this release include:
- Block/Vector Sparse (Paged) Attention on FlashAttention-3
- JIT compilation for customized attention variants
- Fused Multi-head Latent Attention (MLA) decoding kernel
- Lots of bugfixes and …

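As a point of reference, here is a minimal sketch of calling a FlashInfer decode kernel from Python. The single_decode_with_kv_cache entry point and NHD tensor layout follow FlashInfer's documented API; the shapes and dtypes below are illustrative assumptions.

```python
# Sketch: fused decode attention for one request with FlashInfer.
# Layout assumption (NHD): q is [num_qo_heads, head_dim],
# k/v are [kv_len, num_kv_heads, head_dim].
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 8, 128, 4096

q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Grouped-query attention (32 query heads over 8 KV heads) is handled
# inside the kernel; the result is [num_qo_heads, head_dim].
out = flashinfer.single_decode_with_kv_cache(q, k, v)
```
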
Hongyi Jin (@hongyijin258)'s Twitter Profile Photo

🚀 Making cross-engine LLM serving programmable.

Introducing LLM Microserving: a new RISC-style approach to designing LLM serving APIs at the sub-request level. Scale LLM serving with programmable cross-engine serving patterns, all in a few lines of Python.

blog.mlc.ai/2025/01/07/mic…

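To make the idea concrete, here is a rough sketch of the prefill-decode disaggregation pattern the post describes. The three sub-request primitives (prep_recv, remote_send, start_generate) are the ones named in the blog; the engine handles and signatures below are hypothetical, for illustration only.

```python
# Hypothetical sketch of microserving-style prefill/decode disaggregation.
# Primitive names (prep_recv / remote_send / start_generate) are from the
# blog post; engine objects and signatures are illustrative assumptions.
async def disaggregated_generate(request, prefill_engine, decode_engine):
    # 1. Decode engine allocates KV-cache space and returns its address.
    kv_addr = await decode_engine.prep_recv(request)
    # 2. Prefill engine computes the prompt KV and ships it to that address.
    await prefill_engine.remote_send(request, kv_addr)
    # 3. Decode engine generates tokens on top of the transferred KV cache.
    return await decode_engine.start_generate(request)
```

Because the orchestration lives in ordinary Python, swapping in a different pattern (say, balanced replicas instead of disaggregation) means editing this one function rather than the engines.
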
Tianqi Chen (@tqchenml)'s Twitter Profile Photo

🚀 Future LLM serving moves towards multiple engines. Excited to introduce Microserving, a new LLM service API design to scale and disaggregate at the sub-request level. It enables programmable LLM serving orchestration patterns in a few lines of Python code. Check out the blog to learn more.

Zhihao Jia (@jiazhihao)'s Twitter Profile Photo

Introducing LLM Microserving. Accelerate LLM inference with our framework that allows fine-grained, sub-request orchestration. 🚀 Key idea: a new API design enables dynamic reconfiguration of LLM serving strategies using just a few lines of Python. Read our blog to learn more.

Chrome for Developers (@chromiumdev)'s Twitter Profile Photo

Build private web apps with WebLLM.

Google Developer Expert Christian Liebel (🦋 @christianliebel.com) walks you through adding WebLLM to a to-do list app, enabling local LLM inference with WebAssembly and WebGPU.

See how it works → goo.gle/40laHSa

Jeremy Tuloup (@jtpio)'s Twitter Profile Photo

What if we could use AI models like Llama 3.2 or Mistral 7B in the browser with JupyterLite? 🤯 Still at a very early stage of course, but making some good progress! Thanks to WebLLM, which brings hardware-accelerated language model inference to web browsers via WebGPU 🚀

Tianqi Chen (@tqchenml)'s Twitter Profile Photo

Happy to share our latest work at ASPLOS 2025! LLMs are dynamic, both in sequence and batches. Relax brings an ML compiler IR that globally tracks symbolic shapes across functions on multiple levels, enabling efficient and flexible LLM AOT compilation: arxiv.org/abs/2311.02103.
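To illustrate what "globally tracks symbolic shapes" means in practice, here is a toy Relax function in TVMScript where the sequence dimension n stays symbolic, so one compiled artifact serves any length. The syntax follows Relax's public examples; this is a sketch, not code from the paper.

```python
# Sketch: a Relax function with a symbolic first dimension "n".
# The compiler propagates the ("n", 4096) shape through the dataflow
# block, so nothing is fixed until runtime.
from tvm.script import ir as I
from tvm.script import relax as R

@I.ir_module
class ToyModule:
    @R.function
    def main(x: R.Tensor(("n", 4096), "float32")) -> R.Tensor(("n", 4096), "float32"):
        with R.dataflow():
            y = R.add(x, x)
            R.output(y)
        return y
```
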

Yixin Dong (@yi_xin_dong)'s Twitter Profile Photo

XGrammar is accepted to MLSys 2025 🎉🎉🎉

It is a widely adopted library for structured generation with LLMs: output clean JSON, function calling, custom grammars, and more, exactly as specified. It is now the default backend in MLC-LLM, SGLang, vLLM, and TRT-LLM, with over 5M downloads.

Check …
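For a sense of how it plugs into a decoding loop, here is a sketch following XGrammar's documented Python workflow (compile a grammar once, then mask logits token by token). Exact signatures can differ across versions, and the model choice is an illustrative assumption.

```python
# Sketch of XGrammar's compile-once, mask-per-token workflow.
# API names follow the project's docs; wiring is illustrative.
import torch
import xgrammar as xgr
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer)
compiler = xgr.GrammarCompiler(tokenizer_info)

# Compile once (here: the built-in "any valid JSON" grammar) and reuse.
compiled = compiler.compile_builtin_json_grammar()
matcher = xgr.GrammarMatcher(compiled)
bitmask = xgr.allocate_token_bitmask(1, tokenizer_info.vocab_size)

def constrain(logits: torch.Tensor) -> torch.Tensor:
    # Mask out every token that would violate the grammar at this position.
    matcher.fill_next_token_bitmask(bitmask)
    xgr.apply_token_bitmask_inplace(logits, bitmask.to(logits.device))
    return logits

# After sampling token_id from the masked logits, advance the matcher:
#   matcher.accept_token(token_id)
```
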
CMU School of Computer Science (@scsatcmu)'s Twitter Profile Photo

Huge thank you to NVIDIA Data Center for gifting a brand new #NVIDIADGX B200 to CMU’s Catalyst Research Group! This AI supercomputing system will afford Catalyst the ability to run and test their work on a world-class unified AI platform.

Tim Dettmers (@tim_dettmers)'s Twitter Profile Photo

Happy to announce that I joined the CMU Catalyst with three of my incoming students. Our research will bring the best models to consumer GPUs with a focus on agent systems and MoEs. It is amazing to see so many talented people at Catalyst -- a very exciting ecosystem!

Zhihao Jia (@jiazhihao)'s Twitter Profile Photo

Thank you to @NVIDIA for gifting our Catalyst Research Group the latest NVIDIA DGX B200! The B200 platform will greatly accelerate our research in building next-generation ML systems.🚀 #NVIDIADGX #DGXB200 NVIDIA Data Center

Tianqi Chen (@tqchenml)'s Twitter Profile Photo

Really thrilled to receive an #NVIDIADGX B200 from NVIDIA. Looking forward to cooking with the beast. Together with an amazing team at the CMU Catalyst group (Beidi Chen, Tim Dettmers, Zhihao Jia, Zico Kolter), we are looking to innovate across the entire stack, from models to instructions.

Zihao Ye (@ye_combinator)'s Twitter Profile Photo

We're thrilled that FlashInfer won a Best Paper Award at MLSys 2025! 🎉 This wouldn't have been possible without the community: huge thanks to LMSYS Org's sglang for deep co-design (which is critical for inference kernel evolution) and stress-testing over the years, and to …

uccl_project (@uccl_proj)'s Twitter Profile Photo

1/N 📢 Introducing UCCL (Ultra & Unified CCL), an efficient collective communication library for ML training and inference, outperforming NCCL by up to 2.5x 🚀

Code: github.com/uccl-project/u…
Blog: uccl-project.github.io/posts/about-uc…
Results: AllReduce on 6 HGX across 2 racks over RoCE RDMA

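For context, the AllReduce being benchmarked looks like ordinary torch.distributed code; UCCL positions itself as a drop-in replacement at the collective layer, so the application side stays unchanged. How UCCL is wired in underneath (plugin or environment configuration) is deployment-specific and omitted, so this is a sketch of the workload, not of UCCL's own API.

```python
# Sketch: the AllReduce pattern benchmarked above, via torch.distributed.
# The collective backend underneath (NCCL, or a drop-in like UCCL) is
# transparent to this code.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
tensor = torch.ones(1 << 20, device=f"cuda:{rank % torch.cuda.device_count()}")

dist.all_reduce(tensor, op=dist.ReduceOp.SUM)  # sum across all ranks
assert torch.allclose(tensor, torch.full_like(tensor, float(dist.get_world_size())))
```

Launched with e.g. torchrun --nproc_per_node=8 on each node, the same script exercises whichever collective library is installed.
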
Zhihao Jia (@jiazhihao)'s Twitter Profile Photo

One of the best ways to reduce LLM latency is by fusing all computation and communication into a single GPU megakernel. But writing megakernels by hand is extremely hard.

🚀 Introducing Mirage Persistent Kernel (MPK), a compiler that automatically transforms LLMs into optimized …

Chris Donahue (@chrisdonahuey)'s Twitter Profile Photo

Excited to announce 🎵 Magenta RealTime, the first open-weights music generation model capable of real-time audio generation with real-time control. 👋 Try Magenta RT on Colab TPUs: colab.research.google.com/github/magenta… 👀 Blog post: g.co/magenta/rt 🧵 below