Stella Li (@stellalisy)'s Twitter Profile
Stella Li

@stellalisy

PhD student @uwnlp | visiting researcher @AIatMeta | undergrad @jhuclsp #NLProc

ID: 1517903076765495297

http://stellalisy.com · Joined 23-04-2022 16:28:14

139 Tweets

1.1K Followers

393 Following

Rulin Shao (@rulinshao)

100% agree! In our recent work, we show RLVR can even work with random rewards on Qwen2.5-Math. However, all these surprising phenomena are more of an artifact of certain models--not generalizable to models with different priors, and unlikely to hold at large scale🤔 x.com/StellaLisy/sta…
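As a rough illustration of what "random rewards" means here (a minimal hypothetical sketch, not the Spurious Rewards codebase), the verifier in an RLVR loop is replaced by a function that ignores the model's answer entirely:

```python
import random

def random_reward(prompt: str, completion: str, gold_answer: str) -> float:
    """Spurious 'verifier': ignores the completion and the gold answer,
    returning reward 1.0 with probability 0.5 and 0.0 otherwise."""
    return 1.0 if random.random() < 0.5 else 0.0

# In a GRPO/PPO-style RLVR trainer this would stand in for the real
# correctness check, e.g.:
# rewards = [random_reward(p, c, a) for p, c, a in rollouts]
```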

Kyle Corbitt (@corbtt)

"RL from a single example works" "RL with random rewards works" "Base model pass@256 can match RL model pass@1" "RL updates a small % of params" Recent papers all point in the same direction: RL is mostly just eliciting latent behavior already learned in pretraining, not

Rulin Shao (@rulinshao)

One more fun thing! RLVR can elicit existing behaviors like code reasoning. But what if your model is not good at code but thinks it is? - RLVR w/ spurious rewards made Olmo use more code, but perf decreased (Fig 6) - When we discourage it from using code, perf goes up!🤣 (Fig 9)

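One way to operationalize "discouraging" code reasoning is reward shaping that penalizes completions containing code. This is only a hypothetical sketch of the idea, not necessarily how the paper's Fig 9 intervention is implemented:

```python
def penalize_code_reward(verifier, penalty: float = 0.5):
    """Wrap a reward function so completions that fall back on code-style
    reasoning (crudely detected via a code fence or 'def ') lose reward,
    nudging the policy away from code solutions it executes poorly."""
    def shaped_reward(prompt: str, completion: str, gold_answer: str) -> float:
        reward = verifier(prompt, completion, gold_answer)
        if "```" in completion or "def " in completion:
            reward -= penalty
        return reward
    return shaped_reward
```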
Stella Li (@stellalisy)

This is super cool, looking forward to seeing the final technical report! And curious to see the training dynamics of OctoThinker when doing RLVR in the Spurious Rewards setup🤔

Rulin Shao (@rulinshao)

Probably augmented with a lot of code solutions + numerical perturbations. We have more examples in the appendix🙂 The code reasoning behavior generalizes when we change the numbers in the question, but the model may not use code when we change the narrative (acc still not bad tho)

Nathan Lambert (@natolambert)

Another paper examining incorrect or noisy rewards for RLVR! "We found that LLMs demonstrate strong robustness to substantial reward noise. For example, manually flipping 40% of the reward function’s outputs in math tasks still allows a Qwen-2.5-7B model to achieve rapid

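For concreteness, "manually flipping 40% of the reward function's outputs" can be read as wrapping a binary verifier like this (a hypothetical sketch, not the cited paper's code):

```python
import random

def with_flipped_rewards(verifier, flip_prob: float = 0.4):
    """Wrap a binary (0/1) verifier so each reward is inverted with
    probability `flip_prob`, simulating a noisy reward signal."""
    def noisy_verifier(prompt: str, completion: str, gold_answer: str) -> float:
        reward = verifier(prompt, completion, gold_answer)
        if random.random() < flip_prob:
            reward = 1.0 - reward  # correct becomes incorrect and vice versa
        return reward
    return noisy_verifier
```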
Qinan Yu (@qinan_yu)

🎀 fine-grained, interpretable representation steering for LMs! meet RePS — Reference-free Preference Steering! 1⃣ outperforms existing methods on 2B-27B LMs, nearly matching prompting 2⃣ supports both steering and suppression (beat system prompts!) 3⃣ jailbreak-proof (1/n)

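For readers unfamiliar with representation steering in general, here is a generic sketch of the underlying idea (this is not RePS itself, and it assumes a HuggingFace Llama/Gemma-style decoder layout with a `model.model.layers` stack): add a steering vector to one layer's hidden states at inference time.

```python
import torch

def add_steering_hook(model, layer_idx: int, steering_vec: torch.Tensor, alpha: float = 1.0):
    """Generic activation steering (not RePS): add `alpha * steering_vec`
    to the hidden states produced by one decoder layer via a forward hook."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * steering_vec.to(hidden.device, hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    layer = model.model.layers[layer_idx]  # assumed HF-style module path
    return layer.register_forward_hook(hook)  # call .remove() on the handle to stop steering
```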
Omar Khattab (@lateinteraction)

Sigh, it's a bit of a mess. Let me just give you guys the full nuance in one stream of consciousness since I think we'll continue to get partial interpretations that confuse everyone. All the little things I post need to always be put together in one place. First, I have long

Stella Li (@stellalisy)

Unified eval is a great direction for us all to push for. But I have questions about this particular study... Why is "Actual Pre-RL" compared to "Reported RL" to compute gains? Shouldn't we compare with "Actual RL"? afaik we didn't see -6.9 for 1-shot RL across exps. Read with caution

Yiping Wang (@ypwang61)

I agree that having a consistent evaluation pipeline and better illustrating format and non-format gains are important, as we recently updated (x.com/ypwang61/statu……). But I disagree with some points in the blog for 1-shot RLVR. 1. For Deepseek-R1-Distill-Qwen-1.5B, we set

CLS (@chengleisi)

This year, there have been various pieces of evidence that AI agents are starting to be able to conduct scientific research and produce papers end-to-end, at a level where some of these generated papers were already accepted by top-tier conferences/workshops. Intology’s

Zichen Liu @ ICLR2025 (@zzlccc)

We do appreciate their efforts in writing the criticisms, but "turns out that the results in this paper are misreported" is a strong claim without running the evaluation themselves. Such a claim was also generalized to many other papers in a more recent blog (safe-lip-9a8.notion.site/Incorrect-Base…),
