Rishabh Agarwal (@agarwl_) 's Twitter Profile
Rishabh Agarwal

@agarwl_

Research Scientist @AIatMeta, Adjunct Prof @McGillU. Prev at @GoogleDeepMind, Google Brain, Mila, IIT Bombay. Reinforcement Learner. NeurIPS Best Paper (RLiable)

ID: 726727268391878656

Link: https://agarwl.github.io · Joined: 01-05-2016 10:57:37

1.1K Tweets

9.9K Followers

722 Following

Arian Hosseini (@ariantbd) 's Twitter Profile Photo

New Paper! 📣 RL^V: a unified RL & generative verifier that boosts MATH accuracy by 20% and improves both sequential and parallel test-time scaling ☑️ improves out-of-domain and easy-to-hard generalization ☑️ allows dynamic allocation of compute for harder problems. How? 👇🏻

Rishabh Agarwal (@agarwl_) 's Twitter Profile Photo

Idea: Merging generative verification and solution generation during RL training of LLM reasoners. Why? This allows you to scale inference compute both sequentially (long CoT) and in parallel (Best-of-N, weighted majority voting). Next? Generation and verification to be trained end-to-end
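
Since the thread leans on Best-of-N and weighted majority voting for the parallel axis, here is a minimal sketch of verifier-weighted voting under stated assumptions: the function name, the (answer, score) pairs, and the scores are hypothetical illustrations, not taken from the RL^V paper.

```python
from collections import defaultdict

def weighted_majority_vote(samples):
    """Aggregate N sampled solutions by summing verifier scores per
    distinct final answer; return the answer with the most verifier mass.

    samples: list of (answer, verifier_score) pairs, where the score is
    the verifier's estimated probability that the solution is correct.
    """
    votes = defaultdict(float)
    for answer, score in samples:
        votes[answer] += score
    return max(votes, key=votes.get)

# Hypothetical example: 5 sampled solutions to one math problem.
samples = [("42", 0.9), ("42", 0.7), ("41", 0.8), ("42", 0.2), ("41", 0.3)]
print(weighted_majority_vote(samples))  # "42" (1.8 total mass vs 1.1 for "41")
```

Plain Best-of-N would instead return the single highest-scoring sample; weighted voting is more robust when the verifier is noisy.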

Prime Intellect (@primeintellect) 's Twitter Profile Photo

Releasing INTELLECT-2: We’re open-sourcing the first 32B parameter model trained via globally distributed reinforcement learning: • Detailed Technical Report • INTELLECT-2 model checkpoint primeintellect.ai/blog/intellect…

1a3orn (@1a3orn) 's Twitter Profile Photo

I suspect this paper's results have been oversold somewhat. As far as I can tell, nothing in the paper excludes the possibility that a quite large % of the "learning" here is just "learns to put answers in \boxed{...}" tags.

rohan anil (@_arohan_) 's Twitter Profile Photo

Someone passed this wisdom to me today. Deep learning techniques working vs not working comes down to two devils: your prior about the technique, and your attention to details in the implementation of the technique. Need both to make it work.

Rishabh Agarwal (@agarwl_) 's Twitter Profile Photo

All you often need is just one lucky break. For me, it was Geoffrey Hinton who took a bet on me about 7 years ago. He said something along the following lines that stuck with me: “You have tried a bunch of interesting research directions, and all of them failed — that’s what

Lior⚡ (@lioronai) 's Twitter Profile Photo

An undergrad student broke a 40-year-old belief in computer science. Since 1985, it was believed that hash tables, when nearly full, must check many spots to find or add data. Andrew Krapivin discovered a new way to organize data inside a hash table that avoids this slowdown.

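For context on the "must check many spots" claim, here is a small, hedged experiment with a classic linear-probing hash table (a standard baseline for illustration, not Krapivin's construction): it estimates how many slots an unsuccessful lookup touches as the table fills up.

```python
import random

def avg_probes_unsuccessful(load_factor, size=50_000, trials=2_000):
    """Build a linear-probing hash table at the given load factor, then
    measure the average number of slots examined before hitting an empty
    slot (the cost of an unsuccessful lookup, or of inserting a new key)."""
    table = [None] * size
    for key in random.sample(range(10 * size), int(size * load_factor)):
        i = hash(key) % size
        while table[i] is not None:   # linear probing: step to the next slot
            i = (i + 1) % size
        table[i] = key
    total = 0
    for _ in range(trials):
        i = random.randrange(size)    # probe from a random starting slot
        probes = 1
        while table[i] is not None:
            i = (i + 1) % size
            probes += 1
        total += probes
    return total / trials

for lf in (0.5, 0.9, 0.99):
    print(f"load {lf:.2f}: ~{avg_probes_unsuccessful(lf):.0f} probes")
# Probe counts blow up as the table fills; Krapivin's result shows this
# blow-up is not inevitable with a smarter way of organizing the table.
```
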
Nathan Lambert (@natolambert) 's Twitter Profile Photo

The reason recent RLVR papers show mostly formatting and not learning new skills is just because no one has scaled up enough. If RL compute is <0.1% of overall compute, of course not much changes. I bet o3 is closer to 5% of total compute. At 10-25%, I bet the models feel different again.

Pablo Samuel Castro (@pcastr) 's Twitter Profile Photo

Mind the GAP! we've had a few works proposing techniques for enabling scaling in deep rl, such as MoEs, tokenization, & sparse training. Ghada Sokar and i looked further & found a bit more clarity into *what* enables scaling, leading us to simpler solutions (see GAP in figure)! 1/

Rishabh Agarwal (@agarwl_) 's Twitter Profile Photo

For DeepSeek-R1, I am trying to estimate how much training compute was used for RL vs pre-training. My current estimate is that 12-18% of pre-training compute was used for RL training! Is this estimate off? Pre-training: 2788K H800 hours for DeepSeek-V3 Base. BS = 6144 with 4K seq
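
A hedged back-of-envelope version of that estimate: the RL run shape below (steps, prompts, rollouts, tokens per rollout) is assumed purely for illustration and is not DeepSeek's disclosed figure; the 37B activated parameters and 14.8T pre-training tokens are from the DeepSeek-V3 report, and the 2788K H800 hours is from the tweet.

```python
# Back-of-envelope in FLOPs, following the tweet's framing.
ACTIVE_PARAMS = 37e9        # DeepSeek-V3 activates ~37B params per token (MoE)
PRETRAIN_TOKENS = 14.8e12   # V3 pre-training corpus size

# Assumed RL run shape (illustrative, not DeepSeek's disclosed numbers):
steps, prompts, rollouts, toks_per_rollout = 8_000, 1_024, 16, 12_000

rl_tokens = steps * prompts * rollouts * toks_per_rollout
gen_flops = 2 * ACTIVE_PARAMS * rl_tokens    # ~2N FLOPs per generated token
train_flops = 6 * ACTIVE_PARAMS * rl_tokens  # fwd+bwd pass over the same tokens
pretrain_flops = 6 * ACTIVE_PARAMS * PRETRAIN_TOKENS

print(f"RL / pre-training: {(gen_flops + train_flops) / pretrain_flops:.0%}")
# -> about 14% with these assumptions, inside the tweet's 12-18% ballpark.
```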

Rulin Shao (@rulinshao) 's Twitter Profile Photo

100% agree! In our recent work, we show RLVR can even work with random rewards on Qwen2.5-Math. However, all these surprising phenomena are more of an artifact of certain models: not generalizable to models with a different prior, and also unlikely to hold at large scale 🤔 x.com/StellaLisy/sta…

Dimitris Papailiopoulos (@dimitrispapail) 's Twitter Profile Photo

Random labels helping makes sense when P(correct|random) > 0, since the exchange rate of accuracy is much higher for true-positive than for false-negative/false-positive examples. Wrong labels working makes little sense, and is a result of undertraining + trajectories being

Sinclair Wang (@sinclairwang1) 's Twitter Profile Photo

I believe that we need a deeper understanding of the relationship between pre-training and RL scaling. How can we perform pre-training better, making language models more suitable and smoother for RL scaling? That is to say, pre-training for RL. If you are interested in it, welcome to

Ganqu Cui (@charlesfornlp) 's Twitter Profile Photo

So many works talking about entropy, but what is the **mechanism** of entropy in RL for LLMs? 🤔 Our work gives a principled understanding, as well as two tricks that get entropy **controlled** 🧵

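To ground the entropy discussion, here is a generic sketch of what "policy entropy" means for an LLM and one standard way to keep it controlled, an entropy bonus in the policy-gradient loss. This is a common baseline trick, not necessarily one of the paper's two tricks, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits):
    """Entropy of the policy's next-token distribution at each position.
    logits: [batch, seq, vocab]. RL fine-tuning tends to drive this down
    (entropy collapse), which kills exploration."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)          # [batch, seq]

def pg_loss_with_entropy_bonus(logprobs, advantages, logits, beta=0.01):
    """REINFORCE-style loss plus an entropy bonus: minimizing this loss
    maximizes expected advantage while discouraging entropy collapse."""
    pg = -(logprobs * advantages).mean()
    return pg - beta * token_entropy(logits).mean()  # bonus favors entropy
```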