Kimbo Chen (@kimbochen) 's Twitter Profile
Kimbo Chen

@kimbochen

High-performance ML algorithms, compilers, and systems

ID: 2870711864

https://github.com/kimbochen/md-blogs · Joined: 22-10-2014 10:53:31

1.1K Tweets

381 Followers

583 Following

stochasm (@stochasticchasm) 's Twitter Profile Photo

This paper answers a long-standing question I had and claims that decay + merging does not outperform merging alone, which simplifies things quite nicely

Adam Zweiger (@adamzweiger) 's Twitter Profile Photo

Here are all the architecture tricks used by gpt-oss:
- Attention sinks - for each attention head, have a learned scalar such that softmax(qk) becomes softmax over [a_1, a_2, ..., a_T, sink]. Tokens don't have to attend to anything if all the attention scores are low!
-
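
As a rough illustration of the sink mechanism described above, here is a minimal PyTorch sketch (the function name and shapes are made up for illustration, not gpt-oss's actual code): a learned per-head scalar is appended to the attention logits, softmaxed together with the real scores, and then dropped, so real tokens can collectively receive less than full attention mass.

```python
import torch

def softmax_with_sink(scores: torch.Tensor, sink_logit: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch: append a learned per-head sink logit to the attention
    scores before softmax, then drop the sink column.

    scores:     (batch, heads, q_len, kv_len) raw q.k scores
    sink_logit: (heads,) learned scalar per head
    """
    b, h, q, k = scores.shape
    # Broadcast the per-head sink logit to one extra "token" column.
    sink = sink_logit.view(1, h, 1, 1).expand(b, h, q, 1)
    logits = torch.cat([scores, sink], dim=-1)   # (b, h, q, k + 1)
    probs = torch.softmax(logits, dim=-1)
    # The sink has no value vector, so its probability mass is simply discarded:
    # real tokens can collectively receive < 1.0 attention when all scores are low.
    return probs[..., :k]

# Usage sketch
scores = torch.randn(2, 4, 8, 8)
sink_logit = torch.nn.Parameter(torch.zeros(4))
attn = softmax_with_sink(scores, sink_logit)
print(attn.sum(dim=-1))  # row sums are < 1, not exactly 1
```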

Xiangming Gu @ ICLR 2025 (@gu_xiangming) 's Twitter Profile Photo

I noticed that OpenAI added a learnable bias to the attention logits before softmax; after softmax, they drop the bias. This is similar to what I did in my ICLR 2025 paper: openreview.net/forum?id=78Nn4…. I used a learnable key bias and set the corresponding value bias to zero. In this way,

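A minimal sketch of the key-bias idea described above, assuming the mechanism exactly as stated in the tweet (a learnable key bias whose corresponding value bias is fixed to zero); this is illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class KeyBiasAttention(nn.Module):
    """Illustrative: append one learnable "bias" key per head whose value is zero.
    The extra slot absorbs probability mass before softmax, so real tokens can
    receive near-zero attention, and the zero value contributes nothing to the
    output (i.e. the bias is effectively removed after softmax)."""

    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        self.key_bias = nn.Parameter(torch.zeros(num_heads, 1, head_dim))

    def forward(self, q, k, v):
        # q, k, v: (batch, heads, seq, head_dim)
        b = q.shape[0]
        k_ext = torch.cat([k, self.key_bias.expand(b, -1, -1, -1)], dim=2)
        v_ext = torch.cat([v, torch.zeros_like(v[:, :, :1])], dim=2)  # zero value bias
        scores = q @ k_ext.transpose(-1, -2) / q.shape[-1] ** 0.5
        return torch.softmax(scores, dim=-1) @ v_ext

# Usage sketch
attn = KeyBiasAttention(num_heads=8, head_dim=64)
q = k = v = torch.randn(2, 8, 16, 64)
out = attn(q, k, v)  # (2, 8, 16, 64)
```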
Feng Yao (@fengyao1909) 's Twitter Profile Photo

Failing on 𝐥𝐚𝐫𝐠𝐞-𝐬𝐜𝐚𝐥𝐞 𝐑𝐋 with VeRL?

⚠️ Mixing inference backend (𝐯𝐋𝐋𝐌/𝐒𝐆𝐋𝐚𝐧𝐠) with training backends (𝐅𝐒𝐃𝐏/𝐌𝐞𝐠𝐚𝐭𝐫𝐨𝐧) 𝐬𝐞𝐜𝐫𝐞𝐭𝐥𝐲 𝐭𝐮𝐫𝐧𝐬 𝐲𝐨𝐮𝐫 𝐑𝐋 𝐢𝐧𝐭𝐨 𝐨𝐟𝐟-𝐩𝐨𝐥𝐢𝐜𝐲 — even if they share the same weights!

📉 Blog:
Jonathan Chang (@cccntu) 's Twitter Profile Photo

While we wait for gpt-5 to drop, here is a FlexAttention tutorial for building a <1000 LoC vLLM from scratch: jonathanc.net/blog/vllm-flex…
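
For context, here is a minimal FlexAttention call of the kind such a tutorial builds on, assuming a recent PyTorch with `torch.nn.attention.flex_attention` and a CUDA device; this is a generic sketch, not code from the linked post.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 1, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

# Causal masking expressed as a predicate over (batch, head, query idx, kv idx).
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

# Precompute a sparse block mask so fully masked tiles are skipped entirely.
block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S)

# In practice flex_attention is usually wrapped in torch.compile for speed.
out = flex_attention(q, k, v, block_mask=block_mask)  # (B, H, S, D)
```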

Guangxuan Xiao (@guangxuan_xiao) 's Twitter Profile Photo

I've written the full story of Attention Sinks — a technical deep-dive into how the mechanism was developed and how our research ended up being used in OpenAI's new OSS models.

For those interested in the details:
hanlab.mit.edu/blog/streaming…
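
For readers who haven't seen the mechanism, here is a minimal sketch of the StreamingLLM-style cache policy (keep the KV entries of the first few "sink" tokens plus a recent window); the function name and defaults are illustrative, not the blog's code.

```python
import torch

def evict_kv_cache(keys, values, num_sink: int = 4, window: int = 1024):
    """Illustrative eviction: keep the first `num_sink` tokens (attention sinks)
    plus the most recent `window` tokens, and drop everything in between.

    keys, values: (batch, heads, seq_len, head_dim)
    """
    seq_len = keys.shape[2]
    if seq_len <= num_sink + window:
        return keys, values  # nothing to evict yet
    keep = torch.cat([
        torch.arange(num_sink, device=keys.device),                   # sink tokens
        torch.arange(seq_len - window, seq_len, device=keys.device),  # recent window
    ])
    return keys[:, :, keep], values[:, :, keep]
```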
Jinjie Ni @ ICLR'25 🇸🇬 (@nijinjie) 's Twitter Profile Photo

Token crisis: solved. ✅

We pre-trained diffusion language models (DLMs) vs. autoregressive (AR) models from scratch — up to 8B params, 480B tokens, 480 epochs.

Findings:
> DLMs beat AR when tokens are limited, with >3× data potential.
> A 1B DLM trained on just 1B tokens
wh (@nrehiew_) 's Twitter Profile Photo

Let's talk about the GLM 4.5 models.

The latest frontier open weights model out of China (and possibly the best at the moment?) with quite a bit of details in the paper.
Dimitris Papailiopoulos (@dimitrispapail) 's Twitter Profile Photo

Another interesting observation: performing SGD on cross-entropy loss on your text corpus is equivalent to REINFORCE, i.e., on-policy policy gradient, with the binary reward "Did my model generate text from the corpus?"

Feng Yao (@fengyao1909) 's Twitter Profile Photo

Liyuan Liu (Lucas) Chengyu Dong Dinghuai Zhang 张鼎怀 Jingbo Shang Jianfeng Gao (2/4) What’s the 𝐬𝐞𝐜𝐫𝐞𝐭 𝐬𝐚𝐮𝐜𝐞?

We build on our previous 𝐭𝐫𝐮𝐧𝐜𝐚𝐭𝐞𝐝 𝐢𝐦𝐩𝐨𝐫𝐭𝐚𝐧𝐜𝐞 𝐬𝐚𝐦𝐩𝐥𝐢𝐧𝐠 (𝐓𝐈𝐒) blog (fengyao.notion.site/off-policy-rl) to address this issue. Here’s a quick summary of how it works.
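
A minimal sketch of what a truncated-importance-sampling correction can look like, in the spirit of the linked blog but not its exact formulation; the tensor names and the clip constant are assumptions.

```python
import torch

def tis_policy_loss(logp_train, logp_infer, advantages, clip_c: float = 2.0):
    """Sketch of truncated importance sampling (TIS): reweight each token's
    policy-gradient term by the ratio between the trainer's and the inference
    engine's token log-probs, truncated at clip_c to bound the variance.

    logp_train: (batch, seq) log-probs from the training backend (e.g. FSDP/Megatron)
    logp_infer: (batch, seq) log-probs recorded by the inference backend (e.g. vLLM/SGLang)
    advantages: (batch, seq) per-token advantages
    """
    ratio = torch.exp(logp_train - logp_infer)
    ratio = torch.clamp(ratio, max=clip_c).detach()  # truncate; no gradient through the weight
    return -(ratio * advantages * logp_train).mean()
```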
Mika Senghaas (@mikasenghaas) 's Twitter Profile Photo

moving from vllm v0 to v1 made our async rl training crash! read how we fixed it

we recently migrated from v0 to v1 as part of a larger refactor of prime-rl to make it easier-to-use, more performant and naturally async. we confirmed correct training dynamics on many
Si-ze Zheng (@deeplyignorant) 's Twitter Profile Photo

🎉 Excited to share: We’ve open-sourced Triton-distributed MegaKernel! A fresh, powerful take on MegaKernel for LLMs—built entirely on our Triton-distributed framework.
github.com/ByteDance-Seed…

Why it’s awesome?
🧩 Super programmable
⚡ Blazing performance
📊 Rock-solid precision
surya (@suryasure05) 's Twitter Profile Photo

I spent my summer building TinyTPU: an open-source ML inference and training chip. It can do end-to-end inference + training ENTIRELY on chip. Here's how I did it👇:

Stuart Sul (@stuart_sul) 's Twitter Profile Photo

MoE layers can be really slow. When training our coding models at Cursor, they ate up 27–53% of training time.

So we completely rebuilt it at the kernel level and transitioned to MXFP8. The result: 3.5x faster MoE layer and 1.5x end-to-end training speedup.

We believe our
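
For intuition about the format the tweet refers to, here is an emulated sketch of MXFP8-style block quantization (OCP microscaling: 32-element blocks sharing a power-of-two scale, values stored in FP8 E4M3) in plain PyTorch; this is a simulation for illustration, not Cursor's kernels.

```python
import torch

def mxfp8_quantize_sim(x: torch.Tensor, block: int = 32):
    """Emulated MXFP8 quantization: split the last dim into 32-element blocks,
    give each block a shared power-of-two scale (the MX E8M0 exponent), and
    cast the scaled values to FP8 E4M3. x: (rows, cols), cols % block == 0."""
    rows, cols = x.shape
    xb = x.reshape(rows, cols // block, block)
    amax = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    # Largest finite E4M3 magnitude is 448; pick a power-of-two scale so the
    # block's amax lands just inside that range.
    scale = torch.exp2(torch.ceil(torch.log2(amax / 448.0)))
    q = (xb / scale).to(torch.float8_e4m3fn)
    return q.reshape(rows, cols), scale.squeeze(-1)

def mxfp8_dequantize_sim(q: torch.Tensor, scale: torch.Tensor, block: int = 32):
    rows, cols = q.shape
    qb = q.reshape(rows, cols // block, block).to(torch.float32)
    return (qb * scale.unsqueeze(-1)).reshape(rows, cols)
```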
SemiAnalysis (@semianalysis_) 's Twitter Profile Photo

H100 vs GB200 NVL72 Training Benchmarks
Power, TCO, and Reliability Analysis, Software Improvement Over Time
Joules per Token, TCO Per Million Tokens, MFU, Tokens Per US Annual Household Energy Usage, DeepSeek 670B GB200 Unreliability, Backplane Downtime
semianalysis.com/2025/08/20/h10…

elie (@eliebakouch) 's Twitter Profile Photo

Motif 2.6B tech report is pretty insane, first time I see a model with differential attention and polynorm trained at scale!

> It's trained on 2.5T tokens, with a "data mixture schedule" to continuously adjust the mixture over training.
> They use WSD with a "Simple moving
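
For reference, a minimal sketch of the differential attention operation mentioned above, following the Differential Transformer formulation (Motif's exact variant may differ); the function name, shapes, and the scalar lambda are assumptions for illustration.

```python
import torch

def differential_attention(q1, k1, q2, k2, v, lam: float):
    """Two softmax attention maps computed from two query/key projections are
    subtracted, scaled by a learned lambda, to cancel common-mode attention noise.

    q1, q2, k1, k2: (batch, heads, seq, head_dim); v: (batch, heads, seq, v_dim)
    """
    d = q1.shape[-1]
    a1 = torch.softmax(q1 @ k1.transpose(-1, -2) / d ** 0.5, dim=-1)
    a2 = torch.softmax(q2 @ k2.transpose(-1, -2) / d ** 0.5, dim=-1)
    return (a1 - lam * a2) @ v
```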