Stella Li (@stellalisy)'s Twitter Profile
Stella Li

@stellalisy

PhD student @uwnlp | visiting researcher @AIatMeta | undergrad @jhuclsp #NLProc

ID: 1517903076765495297

http://stellalisy.com · Joined 23-04-2022 16:28:14

139 Tweets

1.1K Followers

393 Following

Rulin Shao (@rulinshao)

100% agree! In our recent work, we show RLVR can even work with random rewards on Qwen2.5-Math. However, all these surprising phenomena are more of an artifact of certain models--not generalizable to models with different priors, and unlikely to hold at large scale🤔 x.com/StellaLisy/sta…
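As a rough illustration of what "random rewards" means here (a minimal hypothetical sketch, not the Spurious Rewards codebase), the verifier in an RLVR loop is replaced by a function that ignores the model's answer entirely:

```python
import random

def random_reward(prompt: str, completion: str, gold_answer: str) -> float:
    """Spurious 'verifier': ignores the completion and the gold answer,
    returning reward 1.0 with probability 0.5 and 0.0 otherwise."""
    return 1.0 if random.random() < 0.5 else 0.0

# In a GRPO/PPO-style RLVR trainer this would stand in for the real
# correctness check, e.g.:
# rewards = [random_reward(p, c, a) for p, c, a in rollouts]
```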

Kyle Corbitt (@corbtt)

"RL from a single example works" "RL with random rewards works" "Base model pass@256 can match RL model pass@1" "RL updates a small % of params" Recent papers all point in the same direction: RL is mostly just eliciting latent behavior already learned in pretraining, not

Rulin Shao (@rulinshao)

One more fun thing! RLVR can elicit existing behaviors like code reasoning. But what if your model is not good at code but thinks it is? - RLVR w/ spurious rewards made Olmo use more code, but perf decreased (Fig 6) - When we discourage it from using code, perf goes up!🤣 (Fig 9)

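One way to operationalize "discouraging" code reasoning is reward shaping that penalizes completions containing code. This is only a hypothetical sketch of the idea, not necessarily how the paper's Fig 9 intervention is implemented:

```python
def penalize_code_reward(verifier, penalty: float = 0.5):
    """Wrap a reward function so completions that fall back on code-style
    reasoning (crudely detected via a code fence or 'def ') lose reward,
    nudging the policy away from code solutions it executes poorly."""
    def shaped_reward(prompt: str, completion: str, gold_answer: str) -> float:
        reward = verifier(prompt, completion, gold_answer)
        if "```" in completion or "def " in completion:
            reward -= penalty
        return reward
    return shaped_reward
```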
Stella Li (@stellalisy)

This is super cool, looking forward to seeing the final technical report! And curious to see the training dynamics of OctoThinker when doing RLVR in the Spurious Rewards setup🤔

Rulin Shao (@rulinshao)

Probably augmented with a lot of code solutions + numerical perturbations. We have more examples in the appendix🙂 The code reasoning behavior generalizes when we change the numbers in the question, but the model may not use code when we change the narrative (acc still not bad tho)

Nathan Lambert (@natolambert)

Another paper examining incorrect or noisy rewards for RLVR! "We found that LLMs demonstrate strong robustness to substantial reward noise. For example, manually flipping 40% of the reward function’s outputs in math tasks still allows a Qwen-2.5-7B model to achieve rapid

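For concreteness, "manually flipping 40% of the reward function's outputs" can be read as wrapping a binary verifier like this (a hypothetical sketch, not the cited paper's code):

```python
import random

def with_flipped_rewards(verifier, flip_prob: float = 0.4):
    """Wrap a binary (0/1) verifier so each reward is inverted with
    probability `flip_prob`, simulating a noisy reward signal."""
    def noisy_verifier(prompt: str, completion: str, gold_answer: str) -> float:
        reward = verifier(prompt, completion, gold_answer)
        if random.random() < flip_prob:
            reward = 1.0 - reward  # correct becomes incorrect and vice versa
        return reward
    return noisy_verifier
```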
Qinan Yu (@qinan_yu)

🎀 fine-grained, interpretable representation steering for LMs! meet RePS — Reference-free Preference Steering! 1⃣ outperforms existing methods on 2B-27B LMs, nearly matching prompting 2⃣ supports both steering and suppression (beat system prompts!) 3⃣ jailbreak-proof (1/n)

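For readers unfamiliar with representation steering in general, here is a generic sketch of the underlying idea (this is not RePS itself, and it assumes a HuggingFace Llama/Gemma-style decoder layout with a `model.model.layers` stack): add a steering vector to one layer's hidden states at inference time.

```python
import torch

def add_steering_hook(model, layer_idx: int, steering_vec: torch.Tensor, alpha: float = 1.0):
    """Generic activation steering (not RePS): add `alpha * steering_vec`
    to the hidden states produced by one decoder layer via a forward hook."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * steering_vec.to(hidden.device, hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    layer = model.model.layers[layer_idx]  # assumed HF-style module path
    return layer.register_forward_hook(hook)  # call .remove() on the handle to stop steering
```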
Omar Khattab (@lateinteraction)

Sigh, it's a bit of a mess. Let me just give you guys the full nuance in one stream of consciousness since I think we'll continue to get partial interpretations that confuse everyone. All the little things I post need to always be put together in one place. First, I have long

Stella Li (@stellalisy)

Unified eval is a great direction for us all to push for. But I have questions about this particular study... Why is "Actual Pre-RL" compared to "Reported RL" to compute gains? Shouldn't we compare with "Actual RL"? afaik we didn't see -6.9 for 1-shot RL across exps. Read with caution

Yiping Wang (@ypwang61)

I agree that having a consistent evaluation pipeline and better illustrating format and non-format gains are important, as we recently updated (x.com/ypwang61/statu……). But I disagree with some points in the blog for 1-shot RLVR. 1. For Deepseek-R1-Distill-Qwen-1.5B, we set

CLS (@chengleisi)

This year, there have been various pieces of evidence that AI agents are starting to be able to conduct scientific research and produce papers end-to-end, at a level where some of these generated papers were already accepted by top-tier conferences/workshops. Intology’s

Zichen Liu @ ICLR2025 (@zzlccc)

We do appreciate their efforts in writing the criticisms, but "turns out that the results in this paper are misreported" is a strong claim without running the evaluation themselves. Such a claim was also generalized to many other papers in a more recent blog (safe-lip-9a8.notion.site/Incorrect-Base…),
