Brian Huang ✈️ ICLR (@brianryhuang) 's Twitter Profile
Brian Huang ✈️ ICLR

@brianryhuang

@windsurf_ai
prev research at MIT madrylab & @haizelabs

ID: 1584559309828014080

Link: http://briteroses.github.io · Joined: 24-10-2022 14:56:06

2.2K Tweets

2.2K Followers

1.1K Following

David Bau (@davidbau) 's Twitter Profile Photo

ACADEMICS: it is time to get our heads out of our *sses. This is not the moment for personal ambition, for why your latest sophisticated widget beats a rival's intricate theorem. The scientific franchise is under attack. It is time to defend it to the public. x.com/davidbau/statu…

Anne Ouyang (@anneouyang) 's Twitter Profile Photo

✨ New blog post 👀: We have some very fast AI-generated kernels generated with a simple test-time only search. They are performing close to or in some cases even beating the standard expert-optimized production kernels shipped in PyTorch. (1/6)

[🔗 link in final post]
Zixuan Wang (@zzzixuanwang) 's Twitter Profile Photo

LLMs can solve complex tasks that require combining multiple reasoning steps. But when are such capabilities learnable via gradient-based training?

In our new COLT 2025 paper, we show that easy-to-hard data is necessary and sufficient!

arxiv.org/abs/2505.23683

🧵 below (1/10)
Brian Huang ✈️ ICLR (@brianryhuang) 's Twitter Profile Photo

Might be an obvious question, but I think it's important. Say you have N SFT pairs on the same input: (x, y_1), ..., (x, y_N). After SFT on these N pairs turns the model pi_0 into pi_SFT, does it always hold that pi_SFT(y_i|x) >= pi_0(y_i|x) for all i, 1 <= i <= N?
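For what it's worth, here is a minimal sketch of how one might probe the question empirically; the model (gpt2), the toy prompt, and the completions are arbitrary stand-ins for illustration, not anything from the thread:

```python
# Toy check: does SFT on N (x, y_i) pairs raise every pi(y_i|x)?
# Assumes a small Hugging Face causal LM; names and data are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

x = "Q: What is 2 + 2?\nA:"
ys = [" 4", " four", " 2 + 2 = 4"]  # N completions for the same prompt

def logprob_of_completion(model, x, y):
    """Sum of token log-probs of y given x (assumes prompt tokens are a prefix
    of the tokenization of x + y, which holds here since each y starts with a space)."""
    prompt_ids = tok(x, return_tensors="pt").input_ids
    full_ids = tok(x + y, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log-prob of predicting token t+1 from position t
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    per_token = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # keep only the tokens that belong to the completion y
    return per_token[:, prompt_ids.shape[1] - 1:].sum().item()

before = [logprob_of_completion(model, x, y) for y in ys]

# A few epochs of toy SFT on all N pairs (full-sequence loss on x + y).
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for _ in range(3):
    for y in ys:
        ids = tok(x + y, return_tensors="pt").input_ids
        loss = model(ids, labels=ids).loss
        loss.backward()
        opt.step()
        opt.zero_grad()

model.eval()
after = [logprob_of_completion(model, x, y) for y in ys]
for i, (b, a) in enumerate(zip(before, after)):
    print(f"y_{i+1}: log pi_0 = {b:.3f}, log pi_SFT = {a:.3f}, increased = {a >= b}")
# In general the answer can be "no": minimizing the average NLL over all pairs
# trades probability mass between the y_i, so some of them may go down.
```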

X. Dong (@simonxindong) 's Twitter Profile Photo

It does not saturate yet. At NVIDIA, we present "prolonged RL" where we significantly scale up RL training steps (+2k) and problems (+130k). The improvement from RL scaling is surprising and exciting. The RL-ed model makes great progress on some problems that the base model

Brian Huang ✈️ ICLR (@brianryhuang) 's Twitter Profile Photo

I remember my upperclassman undergrad years were so bad that everyone lost faith in me, and I basically spent the summer afterwards holed up, addicted to video games and feeling guilty. The way the ML industry job market opened up to young talent is super fortunate and basically saved me

Casper Hansen (@casper_hansen_) 's Twitter Profile Photo

Latent reasoning does not work. Evals don't reproduce, and the approach doesn't scale with parameters. $1200 spent trying to reproduce Quiet-STaR, $500 spent trying to reproduce COCONUT, graciously funded by my PI 6 months ago when we thought this was hot, before RLVR took off

Uzay @ paris (@uzpg_) 's Twitter Profile Photo

Kaiwan Turel, awzf, and I were researching long-horizon reasoning (with Jacob Andreas). We found that existing benchmarks' hard problems often featured tricky puzzles, not tests of system understanding. So we made Breakpoint: a SWE benchmark designed to disambiguate this capability.

<a href="/kaivu/">Kaiwan Turel</a>, <a href="/atticuswzf/">awzf</a> , and I were researching long horizon reasoning (with <a href="/jacobandreas/">Jacob Andreas</a>). We found existing benchmarks’ hard problems often featured tricky puzzles, not tests of system understanding. So we made Breakpoint: a SWE benchmark designed to disambiguate this capability.
Varun Mohan (@_mohansolo) 's Twitter Profile Photo

With less than five days of notice, Anthropic decided to cut off nearly all of our first-party capacity to all Claude 3.x models. Given the short notice, we may see some short-term Claude 3.x model availability issues as we have very quickly ramped up capacity on other inference

Brian Huang ✈️ ICLR (@brianryhuang) 's Twitter Profile Photo

I'm gonna take a step back and only stick to academic research tweeting -- I still haven't been feeling well lately and I know my tweets have been erratic, sorry for putting that on the timeline

Brian Huang ✈️ ICLR (@brianryhuang) 's Twitter Profile Photo

Windsurf, Cursor, and the large model labs have never built a piece of marketing around badmouthing a competitor, and you should think about why that is

Brian Huang ✈️ ICLR (@brianryhuang) 's Twitter Profile Photo

One bottleneck in frontier model training I haven't seen much talk about: we want a way to convert natural-language labels into weight updates that goes beyond the pretraining objective, the SFT objective, or policy gradients. The status quo is that you have your human or LLM
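Purely as an illustration of the status quo the post is pointing at (not anything from the tweet itself), the two standard routes it names look roughly like this as toy loss functions; the function names and tensor shapes are assumptions:

```python
# Sketch of the status-quo objectives named in the post; assumed shapes,
# not any particular lab's implementation.
import torch
import torch.nn.functional as F

def sft_or_pretraining_loss(logits, target_ids):
    # Cross-entropy on demonstrated tokens: natural-language feedback only
    # enters indirectly, by deciding which target tokens to imitate.
    return F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))

def policy_gradient_loss(sampled_token_logprobs, reward):
    # REINFORCE-style update: the natural-language label is first collapsed
    # into a scalar reward, discarding most of the information it carries.
    return -(reward * sampled_token_logprobs.sum())
```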