Junru Shao (@junrushao) Twitter Tweets • TwiCopy

BosonAI

@boson_ai

a year ago

Excited to share Higgs-V2, with improved general and roleplaying abilities. The performance boost comes from our in-house reward model. More at boson.ai/higgs-v2/

Hieu Pham

📚🧑‍🎓New tutorial on WGMMA (WarpGroup Matrix Multiplication and Accumulation) research.colfax-intl.com/cutlass-tutori… If you have run PyTorch, Jax, or FlashAttention-3 on an H100 GPU, you have used WGMMA. Arguably the most important primitive in the H100's Hopper architecture, WGMMA is the

Junru Shao

@junrushao

9 months ago

Novelty considered harmful in this case. PyTorch/numpy syntax is a proven de facto standard for general users, so there’s literally no reason to reinvent the wheel

Junru Shao

@junrushao

9 months ago

Always enjoy reading Yuchen Jin’s thread and thanks for the transparency from Hyperbolic

Yixin Dong

@yi_xin_dong

7 months ago

🚀✨Introducing XGrammar: a fast, flexible, and portable engine for structured generation! 🤖Accurate JSON/grammar generation ⚡️3-10x speedup in latency 🤝Easy LLM engine integration ✅ Now in MLC-LLM, SGLang, WebLLM; vLLM & HuggingFace coming soon! blog.mlc.ai/2024/11/22/ach…

Tianqi Chen

@tqchenml

7 months ago

🚀Future LLM agents speak JSON, Python, and other structures. Excited to announce XGrammar, a structured generation library that enables zero-overhead structure constraining. It brings a 2x-10x speedup in grammar-guided LLM serving. Check out the GitHub repo and blog to learn more 👉

Zihao Ye

@ye_combinator

6 months ago

We are excited to announce FlashInfer v0.2! Core contributions of this release include: - Block/Vector Sparse (Paged) Attention on FlashAttention-3 - JIT compilation for customized attention variants - Fused Multi-head Latent Attention (MLA) decoding kernel - Lots of bugfix and

Vinod Grover

@vinodg

6 months ago

Latest version of the FlashInfer paper, with some cool ideas!

Hongyi Jin

@hongyijin258

5 months ago

🚀Making cross-engine LLM serving programmable. Introducing LLM Microserving: a new RISC-style approach to designing LLM serving APIs at the sub-request level. Scale LLM serving with programmable cross-engine serving patterns, all in a few lines of Python. blog.mlc.ai/2025/01/07/mic…

Charlie Ruan

@charlie_ruan

5 months ago

DeepSeek R1 Distilled models now on #WebLLM — locally accelerated by WebGPU and counting "r"s in 🍓 Reasoning models join the edge regime; small models are increasingly capable—excited to see what value edge can bring in 2025. Try it w/ no setup at chat.webllm.ai

Lei Wang

@lei_wang_1999

4 months ago

Excited to release tilelang v0.1.0, another Pythonic DSL for writing AI kernels, with optional layout/pipeline annotations and an optional thread-level programming interface. If these features sound useful, please check it out and give it a try :) github.com/tile-ai/tilela…

Lei Wang

@lei_wang_1999

4 months ago

Building on top of TVM is powerful! 🙌 I was able to adapt WGSL (WebGPU codegen) from TVM to Tile language in just a few hours, and I believe adapting Hexagon, Metal, and other backends should be just as straightforward. Contributions are welcome! 🥰

Shiyi Cao

@shiyi_c98

4 months ago

Thanks AK for sharing our new work (a great effort led by Dacheng Li) in the coding domain NovaSky! S* extends parallel scaling with sequential refinement and enhances selection with adaptive input synthesis, achieving superior performance and great

Tianqi Chen

@tqchenml

4 months ago

Check out the Young Professional Symposium at #MLSys2025!

Zihao Ye

@ye_combinator

3 months ago

LLMs are not all about tensor cores. Categorical sampling under filters (top-p/top-k/min-p) is a critical operator in LLMs as vocabulary size grows. FlashInfer uses a sorting-free rejection sampling algorithm for efficient sampling. Check out this great blog post written by @0xsling0

Lei Wang

@lei_wang_1999

3 months ago

Happy to announce tilelang v0.1.3 🚀 Huge thanks to the contributors for bringing enhancements, optimizations, and bug fixes, including Cute upgrades ✨, new kernels and tutorials like DeepGEMM ⚡, autotuning and kernel caches 💾, and many more : ) github.com/tile-ai/tilela…

Tianqi Chen

@tqchenml

3 months ago

Happy to share our latest work at ASPLOS 2025! LLMs are dynamic, both in sequence length and in batch size. Relax is an ML compiler IR that globally tracks symbolic shapes across functions on multiple levels, bringing efficient and flexible AOT compilation to LLMs. arxiv.org/abs/2311.02103

Lequn Chen

@abcdabcd987

2 months ago

Lower latency and higher throughput -- get both with multi-node deployment for MoE models like DeepSeek-V3/R1.

Lei Wang

@lei_wang_1999

11 days ago

The DeepSeek team was audacious enough to try writing tilelang kernels 🥰, and luckily they're fast. Huge thanks for giving tilelang a try

Infini-AI-Lab

@infiniailab

4 days ago

🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation. 🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46% 🌐 Website: multiverse4fm.github.io 🧵 1/n
