Charlie Ruan (@charlie_ruan)'s Twitter Profile
Charlie Ruan

@charlie_ruan

MSCS @CSDatCMU | prev @CornellCIS

ID: 2733746502

Link: https://www.charlieruan.com · Joined: 15-08-2014 05:49:35

125 Tweets

569 Followers

347 Following

Vaibhav (VB) Srivastav (@reach_vb)'s Twitter Profile Photo

Fuck it! Structured Generation w/ SmolLM2 running in browser & WebGPU 🔥 Powered by MLC Web-LLM & XGrammar ⚡ Define a JSON schema, input free text, get structured data right in your browser - profit!! To showcase how much you can do with just a 1.7B LLM, you pass free text, …

Simon Willison (@simonw)'s Twitter Profile Photo

Amazing demo by Vaibhav Srivastav of structured data extraction running on an LLM that executes entirely in the browser (Chrome only for the moment since it uses WebGPU). simonwillison.net/2024/Nov/29/st…

Zihao Ye (@ye_combinator)'s Twitter Profile Photo

We are excited to announce FlashInfer v0.2!

Core contributions of this release include:
- Block/Vector Sparse (Paged) Attention on FlashAttention-3
- JIT compilation for customized attention variants
- Fused Multi-head Latent Attention (MLA) decoding kernel
- Lots of bugfixes and …

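As a point of reference, here is a minimal sketch of calling a FlashInfer decode kernel from Python. The single_decode_with_kv_cache entry point and NHD tensor layout follow FlashInfer's documented API; the shapes and dtypes below are illustrative assumptions.

```python
# Sketch: fused decode attention for one request with FlashInfer.
# Layout assumption (NHD): q is [num_qo_heads, head_dim],
# k/v are [kv_len, num_kv_heads, head_dim].
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 8, 128, 4096

q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Grouped-query attention (32 query heads over 8 KV heads) is handled
# inside the kernel; the result is [num_qo_heads, head_dim].
out = flashinfer.single_decode_with_kv_cache(q, k, v)
```
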
Hongyi Jin (@hongyijin258)'s Twitter Profile Photo

🚀 Making cross-engine LLM serving programmable.

Introducing LLM Microserving: a new RISC-style approach to designing LLM serving APIs at the sub-request level. Scale LLM serving with programmable cross-engine serving patterns, all in a few lines of Python.

blog.mlc.ai/2025/01/07/mic…

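To make the idea concrete, here is a rough sketch of the prefill-decode disaggregation pattern the post describes. The three sub-request primitives (prep_recv, remote_send, start_generate) are the ones named in the blog; the engine handles and signatures below are hypothetical, for illustration only.

```python
# Hypothetical sketch of microserving-style prefill/decode disaggregation.
# Primitive names (prep_recv / remote_send / start_generate) are from the
# blog post; engine objects and signatures are illustrative assumptions.
async def disaggregated_generate(request, prefill_engine, decode_engine):
    # 1. Decode engine allocates KV-cache space and returns its address.
    kv_addr = await decode_engine.prep_recv(request)
    # 2. Prefill engine computes the prompt KV and ships it to that address.
    await prefill_engine.remote_send(request, kv_addr)
    # 3. Decode engine generates tokens on top of the transferred KV cache.
    return await decode_engine.start_generate(request)
```

Because the orchestration lives in ordinary Python, swapping in a different pattern (say, balanced replicas instead of disaggregation) means editing this one function rather than the engines.
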
Tianqi Chen (@tqchenml)'s Twitter Profile Photo

🚀 Future LLM serving moves towards multiple engines. Excited to introduce Microserving, a new LLM service API design to scale and disaggregate at the sub-request level. It enables programmable LLM serving orchestration patterns in a few lines of Python code. Check out the blog to learn more.

Zhihao Jia (@jiazhihao)'s Twitter Profile Photo

Introducing LLM Microserving. Accelerate LLM inference with our framework that allows fine-grained, sub-request orchestration. 🚀 Key idea: a new API design enables dynamic reconfiguration of LLM serving strategies using just a few lines of Python. Read our blog to learn more.

Chrome for Developers (@chromiumdev)'s Twitter Profile Photo

Build private web apps with WebLLM.

Google Developer Expert Christian Liebel (🦋 @christianliebel.com) walks you through adding WebLLM to a to-do list app, enabling local LLM inference with WebAssembly and WebGPU.

See how it works → goo.gle/40laHSa

Jeremy Tuloup (@jtpio)'s Twitter Profile Photo

What if we could use AI models like Llama 3.2 or Mistral 7B in the browser with JupyterLite? 🤯 Still at a very early stage of course, but making some good progress! Thanks to WebLLM, which brings hardware-accelerated language model inference to web browsers via WebGPU 🚀

Tianqi Chen (@tqchenml)'s Twitter Profile Photo

Happy to share our latest work at ASPLOS 2025! LLMs are dynamic, both in sequence and batches. Relax brings an ML compiler IR that globally tracks symbolic shapes across functions on multiple levels, enabling efficient and flexible LLM AOT compilation: arxiv.org/abs/2311.02103.
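To illustrate what "globally tracks symbolic shapes" means in practice, here is a toy Relax function in TVMScript where the sequence dimension n stays symbolic, so one compiled artifact serves any length. The syntax follows Relax's public examples; this is a sketch, not code from the paper.

```python
# Sketch: a Relax function with a symbolic first dimension "n".
# The compiler propagates the ("n", 4096) shape through the dataflow
# block, so nothing is fixed until runtime.
from tvm.script import ir as I
from tvm.script import relax as R

@I.ir_module
class ToyModule:
    @R.function
    def main(x: R.Tensor(("n", 4096), "float32")) -> R.Tensor(("n", 4096), "float32"):
        with R.dataflow():
            y = R.add(x, x)
            R.output(y)
        return y
```
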

Yixin Dong (@yi_xin_dong)'s Twitter Profile Photo

XGrammar is accepted to MLSys 2025 🎉🎉🎉

It is a widely adopted library for structured generation with LLMs: output clean JSON, function calling, custom grammars, and more, exactly as specified. It is now the default backend in MLC-LLM, SGLang, vLLM, and TRT-LLM, with over 5M downloads.

Check …
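For a sense of how it plugs into a decoding loop, here is a sketch following XGrammar's documented Python workflow (compile a grammar once, then mask logits token by token). Exact signatures can differ across versions, and the model choice is an illustrative assumption.

```python
# Sketch of XGrammar's compile-once, mask-per-token workflow.
# API names follow the project's docs; wiring is illustrative.
import torch
import xgrammar as xgr
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer)
compiler = xgr.GrammarCompiler(tokenizer_info)

# Compile once (here: the built-in "any valid JSON" grammar) and reuse.
compiled = compiler.compile_builtin_json_grammar()
matcher = xgr.GrammarMatcher(compiled)
bitmask = xgr.allocate_token_bitmask(1, tokenizer_info.vocab_size)

def constrain(logits: torch.Tensor) -> torch.Tensor:
    # Mask out every token that would violate the grammar at this position.
    matcher.fill_next_token_bitmask(bitmask)
    xgr.apply_token_bitmask_inplace(logits, bitmask.to(logits.device))
    return logits

# After sampling token_id from the masked logits, advance the matcher:
#   matcher.accept_token(token_id)
```
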
CMU School of Computer Science (@scsatcmu)'s Twitter Profile Photo

Huge thank you to NVIDIA Data Center for gifting a brand new #NVIDIADGX B200 to CMU’s Catalyst Research Group! This AI supercomputing system will afford Catalyst the ability to run and test their work on a world-class unified AI platform.

Tim Dettmers (@tim_dettmers)'s Twitter Profile Photo

Happy to announce that I joined the CMU Catalyst with three of my incoming students. Our research will bring the best models to consumer GPUs with a focus on agent systems and MoEs. It is amazing to see so many talented people at Catalyst -- a very exciting ecosystem!

Zhihao Jia (@jiazhihao)'s Twitter Profile Photo

Thank you to @NVIDIA for gifting our Catalyst Research Group the latest NVIDIA DGX B200! The B200 platform will greatly accelerate our research in building next-generation ML systems.🚀 #NVIDIADGX #DGXB200 NVIDIA Data Center

Tianqi Chen (@tqchenml)'s Twitter Profile Photo

Really thrilled to receive an #NVIDIADGX B200 from NVIDIA. Looking forward to cooking with the beast. Together with an amazing team at the CMU Catalyst group (Beidi Chen, Tim Dettmers, Zhihao Jia, Zico Kolter), we are looking to innovate across the entire stack, from models to instructions.

Zihao Ye (@ye_combinator)'s Twitter Profile Photo

We're thrilled that FlashInfer won a Best Paper Award at MLSys 2025! 🎉 This wouldn't have been possible without the community: huge thanks to LMSYS Org's sglang for deep co-design (which is critical for inference kernel evolution) and stress-testing over the years, and to …

uccl_project (@uccl_proj)'s Twitter Profile Photo

1/N 📢 Introducing UCCL (Ultra & Unified CCL), an efficient collective communication library for ML training and inference, outperforming NCCL by up to 2.5x 🚀

Code: github.com/uccl-project/u…
Blog: uccl-project.github.io/posts/about-uc…
Results: AllReduce on 6 HGX across 2 racks over RoCE RDMA

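For context, the AllReduce being benchmarked looks like ordinary torch.distributed code; UCCL positions itself as a drop-in replacement at the collective layer, so the application side stays unchanged. How UCCL is wired in underneath (plugin or environment configuration) is deployment-specific and omitted, so this is a sketch of the workload, not of UCCL's own API.

```python
# Sketch: the AllReduce pattern benchmarked above, via torch.distributed.
# The collective backend underneath (NCCL, or a drop-in like UCCL) is
# transparent to this code.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
tensor = torch.ones(1 << 20, device=f"cuda:{rank % torch.cuda.device_count()}")

dist.all_reduce(tensor, op=dist.ReduceOp.SUM)  # sum across all ranks
assert torch.allclose(tensor, torch.full_like(tensor, float(dist.get_world_size())))
```

Launched with e.g. torchrun --nproc_per_node=8 on each node, the same script exercises whichever collective library is installed.
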
Zhihao Jia (@jiazhihao)'s Twitter Profile Photo

One of the best ways to reduce LLM latency is by fusing all computation and communication into a single GPU megakernel. But writing megakernels by hand is extremely hard.

🚀 Introducing Mirage Persistent Kernel (MPK), a compiler that automatically transforms LLMs into optimized …

Chris Donahue (@chrisdonahuey)'s Twitter Profile Photo

Excited to announce 🎵 Magenta RealTime, the first open-weights music generation model capable of real-time audio generation with real-time control. 👋 Try Magenta RT on Colab TPUs: colab.research.google.com/github/magenta… 👀 Blog post: g.co/magenta/rt 🧵 below