Wenhao Chai (@wenhaocha1) 's Twitter Profile
Wenhao Chai

@wenhaocha1

Incoming CS Ph.D. Student @PrincetonCS. Prev @UW @Stanford @pika_labs @MSFTResearch @UofIllinois @ZJU_China. I work on computer vision, but it's not all I do.

ID: 1483945570595127298

Link: http://wenhaochai.com · Joined: 19-01-2022 23:32:58

493 Tweets

843 Followers

1.1K Following

Wenhao Chai (@wenhaocha1) 's Twitter Profile Photo

WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training arxiv.org/abs/2507.17634 Let's go back to the gradient view: model merging and data mixing are both forms of gradient merging, and a learning-rate schedule is a weighted gradient merge. The only concern is
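
A minimal numerical sketch of that gradient view (my own illustration, not code from the WSM paper): merging SGD checkpoints with weights a_k lands on exactly the point a re-weighted learning-rate schedule would reach, so checkpoint merging really is a gradient-weighted merge.

```python
import numpy as np

rng = np.random.default_rng(0)
steps, dim, lr = 10, 5, 0.1
grads = rng.normal(size=(steps, dim))          # stand-in gradients g_1 .. g_T

# Plain SGD trajectory: theta_t = theta_0 - lr * sum_{i <= t} g_i
theta0 = np.zeros(dim)
checkpoints = [theta0 - lr * grads[: t + 1].sum(axis=0) for t in range(steps)]

# Merge the last four checkpoints with uniform weights a_k.
merge_steps = [6, 7, 8, 9]
a = np.full(len(merge_steps), 1 / len(merge_steps))
merged = sum(w * checkpoints[t] for w, t in zip(a, merge_steps))

# Gradient view: the merged model equals theta_0 - lr * sum_i c_i * g_i, where the
# effective weight c_i sums a_k over every merged checkpoint taken at step t_k >= i.
# Here c = (1, ..., 1, 0.75, 0.5, 0.25): merging emulates an LR decay phase at the end.
c = np.array([sum(w for w, t in zip(a, merge_steps) if t >= i) for i in range(steps)])
equivalent = theta0 - lr * (c[:, None] * grads).sum(axis=0)

print(np.allclose(merged, equivalent))  # True: checkpoint merging == weighted gradient merge
```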

JingyuanLiu (@jingyuanliu123) 's Twitter Profile Photo

I am quickly going through the code: github.com/microsoft/dion and found some personally interesting parts; manually pinging You Jiacheng for any thoughts. Some good details about Muon: 1. Dion is using MuP + Muon (so it is correct to use Jeremy Bernstein's spectral norm control rather
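
For the spectral-norm control point above, here is a hedged sketch of the Newton-Schulz orthogonalization step at the core of Muon, written from memory of the public reference implementation rather than taken from microsoft/dion; the coefficients and edge-case handling in Dion's code may differ.

```python
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a 2D tensor to the nearest (semi-)orthogonal matrix, as Muon
    does to its momentum buffer before applying the update (spectral-norm control)."""
    a, b, c = 3.4445, -4.7750, 2.0315      # quintic coefficients from the reference Muon code
    X = G.float() / (G.norm() + 1e-7)      # normalize so the iteration converges
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# Quick check: singular values of the result are pushed toward ~1.
W = torch.randn(256, 512)
print(torch.linalg.svdvals(newton_schulz_orthogonalize(W))[:5])
```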

Ziqian Zhong (@fjzzq2002) 's Twitter Profile Photo

🤖 Some company just released a new set of open-weight LLMs well-suited for your production environment. However, you suspect that the models might be trained with backdoors or other hidden malicious behaviors. Is it still possible to deploy these models worry-free? (1/7)

Jiayi Weng (@trinkle23897) 's Twitter Profile Photo

Harmony format is finally open-sourced. I still remember 3 years ago (before the ChatGPT release) Shengjia Zhao, Daniel, and I were brainstorming about the right abstraction for RL training, and that was the starting point of the entire harmony library. github.com/openai/harmony
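
For anyone who wants to try the released library, a minimal usage sketch of the openai-harmony Python package; the class and function names below follow my reading of the repo's README and may not match the current release exactly.

```python
# Hedged sketch: render a chat in the harmony format and get token ids for a
# gpt-oss-style completion. Names are assumptions based on the repo's README.
from openai_harmony import (
    Conversation,
    HarmonyEncodingName,
    Message,
    Role,
    load_harmony_encoding,
)

enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

convo = Conversation.from_messages([
    Message.from_role_and_content(Role.SYSTEM, "You are a helpful assistant."),
    Message.from_role_and_content(Role.USER, "Explain attention sinks in one sentence."),
])

# Token ids ready to feed to the model, ending where the assistant should start writing.
tokens = enc.render_conversation_for_completion(convo, Role.ASSISTANT)
print(len(tokens), tokens[:10])
```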

Wenhao Chai (@wenhaocha1) 's Twitter Profile Photo

Deep dive into sink values in the GPT-OSS models!
Analyzed the 20B (24 layers) and 120B (36 layers) models and found (correct me if I'm wrong) these key findings:
1. The 20B model has a larger sink value (20B: mean=2.45; 120B: mean=1.93).
2. Clear SWA / full-attention layer alternation: full-attn layers
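
A hedged sketch of how one might reproduce per-layer numbers like these with Hugging Face transformers; it assumes the released GPT-OSS attention modules expose their learned sink logits as parameters whose names contain "sink" (one value per head), which is how I read the implementation but may not match exactly.

```python
from collections import defaultdict

from transformers import AutoModelForCausalLM

# Loads the full 20B checkpoint (needs substantial RAM); swap in openai/gpt-oss-120b to compare.
model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b")

per_layer = defaultdict(list)
for name, param in model.named_parameters():
    # Assumed naming, e.g. "model.layers.<i>.self_attn.sinks"; adjust if the checkpoint differs.
    if "sink" in name and ".layers." in name:
        layer = int(name.split(".layers.")[1].split(".")[0])
        per_layer[layer].append(param.detach().float().mean().item())

for layer in sorted(per_layer):
    vals = per_layer[layer]
    print(f"layer {layer:2d}: mean sink value = {sum(vals) / len(vals):.3f}")
```
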
Jiayi Weng (@trinkle23897) 's Twitter Profile Photo

As GPT-5 launches today, it's hard to forget the first ChatGPT-4 model, called 0915-gpt4 internally, in 2022. Shengjia Zhao did RM, I did PPO and deployed it that Friday for John Schulman to test. First prompt was tic-tac-toe, and it played surprisingly well. Time flies!

Tianfu Fu (@tianfuf) 's Twitter Profile Photo

GPT-5 is finally here! 🚀 Honored to be one of its core contributors. I designed, built, and trained the scalable, cost-efficient integration model that unifies reasoning and non-reasoning models, and drove extensive inference optimizations—making it possible to deliver GPT-5 at

Guangxuan Xiao (@guangxuan_xiao) 's Twitter Profile Photo

I've written the full story of Attention Sinks — a technical deep-dive into how the mechanism was developed and how our research ended up being used in OpenAI's new OSS models.

For those interested in the details:
hanlab.mit.edu/blog/streaming…
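
For readers of the post, a toy sketch of the cache-eviction policy StreamingLLM describes (my own illustration, not the blog's code): keep a few initial "sink" positions plus a sliding window of recent positions in the KV cache and evict everything in between.

```python
def streaming_keep_indices(seq_len: int, num_sinks: int = 4, window: int = 1024) -> list[int]:
    """Token positions whose keys/values stay in the KV cache under the sink + window policy."""
    sinks = list(range(min(num_sinks, seq_len)))
    recent = list(range(max(num_sinks, seq_len - window), seq_len))
    return sinks + recent

print(streaming_keep_indices(10, num_sinks=2, window=4))   # [0, 1, 6, 7, 8, 9]
print(len(streaming_keep_indices(100_000)))                # 1028 = 4 sinks + 1024 recent
```
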
Jinjie Ni @ ICLR'25 🇸🇬 (@nijinjie) 's Twitter Profile Photo

Token crisis: solved. ✅

We pre-trained diffusion language models (DLMs) vs. autoregressive (AR) models from scratch — up to 8B params, 480B tokens, 480 epochs.

Findings:
>  DLMs beat AR when tokens are limited, with >3× data potential.
>  A 1B DLM trained on just 1B tokens
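
A toy sketch (my own illustration, not the paper's code) of the two objectives being compared: autoregressive next-token prediction versus a masked-diffusion-style denoising loss that re-corrupts each sequence differently on every pass, which is one intuition for why DLMs may squeeze more out of a limited token budget.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, mask_id, seq_len, batch = 1000, 0, 64, 8
# Toy "model": embedding + linear head (a real AR model would also use a causal attention mask).
model = nn.Sequential(nn.Embedding(vocab_size, 128), nn.Linear(128, vocab_size))
tokens = torch.randint(1, vocab_size, (batch, seq_len))

# Autoregressive objective: predict token t+1 from position t.
logits = model(tokens)
ar_loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))

# Masked-diffusion-style objective: sample a corruption level per sequence, mask that
# fraction of positions, and score the model only on recovering the masked tokens.
t = torch.rand(batch, 1)                                   # per-sequence masking ratio
is_masked = torch.rand(batch, seq_len) < t
corrupted = torch.where(is_masked, torch.full_like(tokens, mask_id), tokens)
dlm_loss = F.cross_entropy(model(corrupted)[is_masked], tokens[is_masked])

print(f"AR loss: {ar_loss.item():.3f}  DLM-style loss: {dlm_loss.item():.3f}")
```
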
Sansa Gong (@sansa19739319) 's Twitter Profile Photo

1–2 years ago, when I first started training text diffusion models, I had this empirical feeling that they could handle more epochs of training data. It’s great to now see the community sharing experiment logs using the "LM as physics" research approach.🤗

Jiaxin Shi (@thjashin) 's Twitter Profile Photo

To be fair, I’m not saying there is no hope - it’s just that there is no evidence that the crossover point exists in the non-overfitting regime.

Peter Tong (@tongpetersb) 's Twitter Profile Photo

Want to add that even with language-assisted visual evaluations, we're seeing encouraging progress in vision-centric benchmarks like CV-Bench (arxiv.org/abs/2406.16860) and Blink (arxiv.org/abs/2404.12390), which repurpose core vision tasks into VQA format. These benchmarks do help

Wenhao Chai (@wenhaocha1) 's Twitter Profile Photo

GPT-5, think more.

In our latest LiveCodeBench Pro tests for Competitive Programming, GPT-5 Thinking hit a true 0→1 moment on the 2025 Q1 set, becoming the only model to crack the hard split, and this wasn’t even GPT-5 Thinking Pro. Average response length exceeded 100,000 tokens, which is
Wenhao Chai (@wenhaocha1) 's Twitter Profile Photo

This quite matches my expectations for the strongest model. GPT-5-Thinking (not Pro) is already the top model on LiveCodeBench Pro. And this one is the real GPT-5! Huge congratulations to your team! livecodebenchpro.com