Matthew Yang (@matthewyryang) 's Twitter Profile
Matthew Yang

@matthewyryang

MSML student @ CMU

ID: 1820131135478808576

04-08-2024 16:14:41

9 Tweets

5 Followers

76 Following

Yuxiao Qu (@quyuxiao) 's Twitter Profile Photo

🚨 NEW PAPER: "Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning"!

🤔 With all these long-reasoning LLMs, what are we actually optimizing for? Length penalties? Token budgets? We needed a better way to think about it!

Website: cohenqu.github.io/mrt.github.io/

🧵[1/9]
Amrith Setlur (@setlur_amrith) 's Twitter Profile Photo

Scaling test-time compute is fine 😒 but are we making good use of it? 🤔
We try to answer this question in our new work: arxiv.org/pdf/2503.07572
TLDR;
🚀 *Optimizing* test-time compute  = RL with dense (progress) rewards = minimizing regret over long CoT episodes  😲
🧵⤵️
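
The TLDR above compresses the paper's framing into one line: optimizing test-time compute = RL with dense progress rewards = minimizing regret over long CoT episodes. As a rough, hypothetical illustration of the "dense progress reward" part only (not the authors' implementation), each chain-of-thought chunk could be scored by how much it changes an estimated probability of eventually answering correctly; `estimate_success_prob` below is an assumed helper, e.g. the empirical accuracy of answers sampled after the prefix.

```python
from typing import Callable, List


def progress_rewards(
    chunks: List[str],
    estimate_success_prob: Callable[[str], float],  # assumed helper: P(correct | CoT prefix)
) -> List[float]:
    """Dense reward for chunk i = change in estimated success probability
    after appending chunk i to the running chain-of-thought prefix."""
    rewards: List[float] = []
    prefix = ""
    prev = estimate_success_prob(prefix)  # success probability before any reasoning
    for chunk in chunks:
        prefix += chunk
        cur = estimate_success_prob(prefix)
        rewards.append(cur - prev)  # positive if this chunk made progress toward the answer
        prev = cur
    return rewards
```

Summing these per-chunk rewards telescopes to the overall gain from the full chain of thought, which is one hedged way to read the tweet's link between dense progress rewards and regret over the episode.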
Aviral Kumar (@aviral_kumar2) 's Twitter Profile Photo

A lot of work focuses on test-time scaling. But we aren't scaling it optimally: simply training a long CoT doesn't mean we use it well.

My students developed "v0" of a paradigm to do this optimally by running RL with dense rewards = minimizing regret over long CoT episodes. 🧵⬇️
Po-Shen Loh (@poshenloh) 's Twitter Profile Photo

Oh my goodness. GPT-o1 got a perfect score on my Carnegie Mellon University undergraduate #math exam, taking less than a minute to solve each problem. I freshly design non-standard problems for all of my exams, and they are open-book, open-notes. (Problems included below, with links to
vitrupo (@vitrupo) 's Twitter Profile Photo

"We weren’t born to do jobs." Bill Gates says jobs are a relic of human scarcity. In a world without shortages, society will be able to produce enough—food, healthcare, services—without everyone working. The real shift won’t be economic. It’ll be reprogramming how we think

Quentin Gallouédec (@qgallouedec) 's Twitter Profile Photo

🤔 How do you explain that when we apply RL to math problems, the incorrect answers become longer than the correct ones?

We had this discussion this morning, and I'm curious to know what the community thinks about it.
Amrith Setlur (@setlur_amrith) 's Twitter Profile Photo

Introducing e3 🔥 Best <2B model on math 💪
Are LLMs implementing algos ⚒️ OR is thinking an illusion 🎩? Is RL only sharpening the base LLM distrib. 🤔 OR discovering novel strategies outside base LLM 💡? We answer these ⤵️
🚨 arxiv.org/abs/2506.09026
🚨 matthewyryang.github.io/e3/
Aviral Kumar (@aviral_kumar2) 's Twitter Profile Photo

Our view on test-time scaling has been to train models to discover algos that enable them to solve harder problems.

<a href="/setlur_amrith/">Amrith Setlur</a> &amp; <a href="/matthewyryang/">Matthew Yang</a>'s new work e3 shows how RL done with this view produces best &lt;2B LLM on math that extrapolates beyond training budget. 🧵⬇️
Amrith Setlur (@setlur_amrith) 's Twitter Profile Photo

Since R1 there has been a lot of chatter 💬 on post-training LLMs with RL. Is RL only sharpening the distribution over correct responses sampled by the pretrained LLM OR is it exploring and discovering new strategies 🤔? Find answers in our latest post ⬇️ tinyurl.com/rlshadis

Alexandr Wang (@alexandr_wang) 's Twitter Profile Photo

I’m excited to be the Chief AI Officer of Meta, working alongside Nat Friedman, and thrilled to be accompanied by an incredible group of people joining on the same day.

Towards superintelligence 🚀
Wan (@alibaba_wan) 's Twitter Profile Photo

🚀 Introducing Wan2.2: The World's First Open-Source MoE-Architecture Video Generation Model with Cinematic Control!
🔥 Key Innovations:
ꔷ World's First Open-Source MoE Video Model: Our Mixture-of-Experts architecture scales model capacity without increasing computational
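
The truncated bullet above states the headline property of MoE layers: capacity grows with the number of experts while per-token compute stays roughly flat, because each token is routed to only a few experts. A generic sketch of that routing pattern (not Wan2.2's actual architecture; all names here are illustrative) might look like:

```python
import torch
import torch.nn as nn


class TinyMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router scores every expert for every token.
        self.router = nn.Linear(dim, num_experts)
        # Capacity grows with num_experts, but only top_k experts run per token.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (num_tokens, dim)
        weights, idx = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in idx[:, k].unique():
                mask = idx[:, k] == e  # tokens whose k-th routing choice is expert e
                out[mask] += weights[mask, k, None] * self.experts[int(e)](x[mask])
        return out
```

Doubling `num_experts` in this sketch doubles parameter count, but each token still passes through only `top_k` expert MLPs, which is the sense in which MoE scales capacity without a matching increase in compute.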