Wenhao Chai (@wenhaocha1) 's Twitter Profile
Wenhao Chai

@wenhaocha1

Incoming CS Ph.D. Student @PrincetonCS. Prev @UW @Stanford @pika_labs @MSFTResearch @UofIllinois @ZJU_China. I work on computer vision, but it's not all I do.

ID: 1483945570595127298

Link: http://wenhaochai.com · Joined: 19-01-2022 23:32:58

493 Tweets

843 Followers

1.1K Following

Wenhao Chai (@wenhaocha1) 's Twitter Profile Photo

WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training arxiv.org/abs/2507.17634 Let's go back to the gradient view: model merging and data mixing are both forms of gradient merging, and a learning-rate schedule is a weighted gradient merge. The only concern is
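
A minimal numerical sketch of that gradient view (my own illustration, not code from the WSM paper): merging SGD checkpoints with weights a_k lands on exactly the point a re-weighted learning-rate schedule would reach, so checkpoint merging really is a gradient-weighted merge.

```python
import numpy as np

rng = np.random.default_rng(0)
steps, dim, lr = 10, 5, 0.1
grads = rng.normal(size=(steps, dim))          # stand-in gradients g_1 .. g_T

# Plain SGD trajectory: theta_t = theta_0 - lr * sum_{i <= t} g_i
theta0 = np.zeros(dim)
checkpoints = [theta0 - lr * grads[: t + 1].sum(axis=0) for t in range(steps)]

# Merge the last four checkpoints with uniform weights a_k.
merge_steps = [6, 7, 8, 9]
a = np.full(len(merge_steps), 1 / len(merge_steps))
merged = sum(w * checkpoints[t] for w, t in zip(a, merge_steps))

# Gradient view: the merged model equals theta_0 - lr * sum_i c_i * g_i, where the
# effective weight c_i sums a_k over every merged checkpoint taken at step t_k >= i.
# Here c = (1, ..., 1, 0.75, 0.5, 0.25): merging emulates an LR decay phase at the end.
c = np.array([sum(w for w, t in zip(a, merge_steps) if t >= i) for i in range(steps)])
equivalent = theta0 - lr * (c[:, None] * grads).sum(axis=0)

print(np.allclose(merged, equivalent))  # True: checkpoint merging == weighted gradient merge
```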

JingyuanLiu (@jingyuanliu123) 's Twitter Profile Photo

I am quickly going through the code: github.com/microsoft/dion and found some personally interesting parts; manually pinging You Jiacheng for any thoughts. Some good details about Muon: 1. Dion is using MuP + Muon (so it is correct to use Jeremy Bernstein's spectral norm control rather
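
For the spectral-norm control point above, here is a hedged sketch of the Newton-Schulz orthogonalization step at the core of Muon, written from memory of the public reference implementation rather than taken from microsoft/dion; the coefficients and edge-case handling in Dion's code may differ.

```python
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a 2D tensor to the nearest (semi-)orthogonal matrix, as Muon
    does to its momentum buffer before applying the update (spectral-norm control)."""
    a, b, c = 3.4445, -4.7750, 2.0315      # quintic coefficients from the reference Muon code
    X = G.float() / (G.norm() + 1e-7)      # normalize so the iteration converges
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# Quick check: singular values of the result are pushed toward ~1.
W = torch.randn(256, 512)
print(torch.linalg.svdvals(newton_schulz_orthogonalize(W))[:5])
```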

Ziqian Zhong (@fjzzq2002) 's Twitter Profile Photo

🤖 Some company just released a new set of open-weight LLMs well-suited for your production environment. However, you suspect that the models might be trained with backdoors or other hidden malicious behaviors. Is it still possible to deploy these models worry-free? (1/7)

Jiayi Weng (@trinkle23897) 's Twitter Profile Photo

Harmony format is finally open-sourced. I still remember 3 years ago (before the ChatGPT release) Shengjia Zhao, Daniel, and I were brainstorming about the right abstraction for RL training, and that was the starting point of the entire harmony library. github.com/openai/harmony
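
For anyone who wants to try the released library, a minimal usage sketch of the openai-harmony Python package; the class and function names below follow my reading of the repo's README and may not match the current release exactly.

```python
# Hedged sketch: render a chat in the harmony format and get token ids for a
# gpt-oss-style completion. Names are assumptions based on the repo's README.
from openai_harmony import (
    Conversation,
    HarmonyEncodingName,
    Message,
    Role,
    load_harmony_encoding,
)

enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

convo = Conversation.from_messages([
    Message.from_role_and_content(Role.SYSTEM, "You are a helpful assistant."),
    Message.from_role_and_content(Role.USER, "Explain attention sinks in one sentence."),
])

# Token ids ready to feed to the model, ending where the assistant should start writing.
tokens = enc.render_conversation_for_completion(convo, Role.ASSISTANT)
print(len(tokens), tokens[:10])
```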

Wenhao Chai (@wenhaocha1) 's Twitter Profile Photo

Deep dive into sink values in the GPT-OSS models!
Analyzed the 20B (24 layers) and 120B (36 layers) models and found (correct me if I'm wrong) these key findings:
1. The 20B model has a larger sink value (20B: mean=2.45; 120B: mean=1.93).
2. Clear SWA / full-attention layer alternation: full-attn layers
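
A hedged sketch of how one might reproduce per-layer numbers like these with Hugging Face transformers; it assumes the released GPT-OSS attention modules expose their learned sink logits as parameters whose names contain "sink" (one value per head), which is how I read the implementation but may not match exactly.

```python
from collections import defaultdict

from transformers import AutoModelForCausalLM

# Loads the full 20B checkpoint (needs substantial RAM); swap in openai/gpt-oss-120b to compare.
model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b")

per_layer = defaultdict(list)
for name, param in model.named_parameters():
    # Assumed naming, e.g. "model.layers.<i>.self_attn.sinks"; adjust if the checkpoint differs.
    if "sink" in name and ".layers." in name:
        layer = int(name.split(".layers.")[1].split(".")[0])
        per_layer[layer].append(param.detach().float().mean().item())

for layer in sorted(per_layer):
    vals = per_layer[layer]
    print(f"layer {layer:2d}: mean sink value = {sum(vals) / len(vals):.3f}")
```
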
Jiayi Weng (@trinkle23897) 's Twitter Profile Photo

As GPT-5 launches today, it's hard to forget the first ChatGPT-4 model, called 0915-gpt4 internally, in 2022. Shengjia Zhao did RM, I did PPO and deployed it that Friday for John Schulman to test. First prompt was tic-tac-toe, and it played surprisingly well. Time flies!

Tianfu Fu (@tianfuf) 's Twitter Profile Photo

GPT-5 is finally here! 🚀 Honored to be one of its core contributors. I designed, built, and trained the scalable, cost-efficient integration model that unifies reasoning and non-reasoning models, and drove extensive inference optimizations—making it possible to deliver GPT-5 at

Guangxuan Xiao (@guangxuan_xiao) 's Twitter Profile Photo

I've written the full story of Attention Sinks — a technical deep-dive into how the mechanism was developed and how our research ended up being used in OpenAI's new OSS models.

For those interested in the details:
hanlab.mit.edu/blog/streaming…
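
For readers of the post, a toy sketch of the cache-eviction policy StreamingLLM describes (my own illustration, not the blog's code): keep a few initial "sink" positions plus a sliding window of recent positions in the KV cache and evict everything in between.

```python
def streaming_keep_indices(seq_len: int, num_sinks: int = 4, window: int = 1024) -> list[int]:
    """Token positions whose keys/values stay in the KV cache under the sink + window policy."""
    sinks = list(range(min(num_sinks, seq_len)))
    recent = list(range(max(num_sinks, seq_len - window), seq_len))
    return sinks + recent

print(streaming_keep_indices(10, num_sinks=2, window=4))   # [0, 1, 6, 7, 8, 9]
print(len(streaming_keep_indices(100_000)))                # 1028 = 4 sinks + 1024 recent
```
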
Jinjie Ni @ ICLR'25 🇸🇬 (@nijinjie) 's Twitter Profile Photo

Token crisis: solved. ✅

We pre-trained diffusion language models (DLMs) vs. autoregressive (AR) models from scratch — up to 8B params, 480B tokens, 480 epochs.

Findings:
>  DLMs beat AR when tokens are limited, with >3× data potential.
>  A 1B DLM trained on just 1B tokens
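
A toy sketch (my own illustration, not the paper's code) of the two objectives being compared: autoregressive next-token prediction versus a masked-diffusion-style denoising loss that re-corrupts each sequence differently on every pass, which is one intuition for why DLMs may squeeze more out of a limited token budget.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, mask_id, seq_len, batch = 1000, 0, 64, 8
# Toy "model": embedding + linear head (a real AR model would also use a causal attention mask).
model = nn.Sequential(nn.Embedding(vocab_size, 128), nn.Linear(128, vocab_size))
tokens = torch.randint(1, vocab_size, (batch, seq_len))

# Autoregressive objective: predict token t+1 from position t.
logits = model(tokens)
ar_loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))

# Masked-diffusion-style objective: sample a corruption level per sequence, mask that
# fraction of positions, and score the model only on recovering the masked tokens.
t = torch.rand(batch, 1)                                   # per-sequence masking ratio
is_masked = torch.rand(batch, seq_len) < t
corrupted = torch.where(is_masked, torch.full_like(tokens, mask_id), tokens)
dlm_loss = F.cross_entropy(model(corrupted)[is_masked], tokens[is_masked])

print(f"AR loss: {ar_loss.item():.3f}  DLM-style loss: {dlm_loss.item():.3f}")
```
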
Sansa Gong (@sansa19739319) 's Twitter Profile Photo

1–2 years ago, when I first started training text diffusion models, I had this empirical feeling that they could handle more epochs of training data. It’s great to now see the community sharing experiment logs using the "LM as physics" research approach.🤗

Jiaxin Shi (@thjashin) 's Twitter Profile Photo

To be fair, I’m not saying there is no hope - it’s just that there is no evidence that the crossover point exists in the non-overfitting regime.

Peter Tong (@tongpetersb) 's Twitter Profile Photo

Want to add that even with language-assisted visual evaluations, we're seeing encouraging progress in vision-centric benchmarks like CV-Bench (arxiv.org/abs/2406.16860) and Blink (arxiv.org/abs/2404.12390), which repurpose core vision tasks into VQA format. These benchmarks do help

Wenhao Chai (@wenhaocha1) 's Twitter Profile Photo

GPT-5, think more.

In our latest LiveCodeBench Pro tests for Competitive Programming, GPT-5 Thinking hit a true 0→1 moment on the 2025 Q1 set, becoming the only model to crack the hard split, and this wasn’t even GPT-5 Thinking Pro. Average response length exceeded 100,000 tokens, which is
Wenhao Chai (@wenhaocha1) 's Twitter Profile Photo

This quite matches my expectations for the strongest model. GPT-5-Thinking (not Pro) is already the top model on LiveCodeBench Pro. And this one is the real GPT-5! Huge congratulations to your team! livecodebenchpro.com