Changyu Chen (@cameron_chann)'s Twitter Profile
Changyu Chen

@cameron_chann

PhD student @sgSMU. RL x LLMs. Previously @NTUsg, @ZJU_China

Post-training for Sailor2

ID: 1266549323581411328

Link: https://cameron-chen.github.io/
Joined: 30-05-2020 01:58:14

95 Tweets

223 Followers

218 Following

Wenhu Chen (@wenhuchen)'s Twitter Profile Photo

🚀 General-Reasoner: Generalizing LLM Reasoning Across All Domains (Beyond Math)

Most recent RL/R1 works focus on math reasoning—but math-only tuning doesn't generalize to general reasoning (e.g. drop on MMLU-Pro and SuperGPQA). Why are we limited to math reasoning?

1. Existing
Zichen Liu @ ICLR2025 (@zzlccc)'s Twitter Profile Photo

I can feel #ICLR2025 has already started… welcome everyone to 🇸🇬 Singapore! Let's meet and chat about RL, LLMs, and reasoning :)

Jinjie Ni @ ICLR'25 🇸🇬 (@nijinjie)'s Twitter Profile Photo

Remember the NoisyStudent topping ImageNet back in 2019🏆? Was it the last dance of noisy training? 

🍻 Meet NoisyRollout, our new noisy training efforts in building stronger o1-like visual reasoners. 

✨ With only 2.1k training data and zero additional training cost, it hits
Zichen Liu @ ICLR2025 (@zzlccc)'s Twitter Profile Photo

🚨 RL x LLM folks at #ICLR2025 — come join us during the Friday lunch break!

If you haven’t RSVP’d on Whova, you can also register here: lu.ma/s8udv997?tk=B4…

Bo Liu (Benjamin Liu) and I will scout for a chill spot (likely a corner at the venue) and share the location tomorrow.
Hongfu Liu @ICLR 2025🇸🇬 (@waffle42567405)'s Twitter Profile Photo

I will attend #ICLR2025 🇸🇬 to present our work "On Calibration of LLM-based Guard Models for Reliable Content Moderation". We advocate reliability evaluation of LLM guardrail models as current ones are overconfident, miscalibrated, and brittle.
✨ Come see us at Hall 3 + Hall 2B
Fan Zhou✈️ICLR2025 (@fazhou_998)'s Twitter Profile Photo

Say hi to 🐙 OctoThinker — our new mid-training efforts for building strong reasoning base models tailored for the RL scaling era. Still a WIP, but we're excited to share our early insights into rethinking base model development.

📖 Blog: tinyurl.com/OctoThinker
🤗 Huggingface:
Zeyuan Allen-Zhu, Sc.D. (@zeyuanallenzhu)'s Twitter Profile Photo

(1/8)🍎A Galileo moment for LLM design🍎
As the Pisa Tower experiment sparked modern physics, our controlled synthetic pretraining playground reveals LLM architectures' true limits. A turning point that might divide LLM research into "before" and "after." physics.allen-zhu.com/part-4-archite…
John Yang (@jyangballin)'s Twitter Profile Photo

40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified.

We built it by synthesizing a ton of agentic training data from 100+ Python repos.

Today we’re open-sourcing the toolkit that made it happen: SWE-smith.
Zichen Liu @ ICLR2025 (@zzlccc)'s Twitter Profile Photo

Good catch! I ran into the same reusability issue ~1 year ago with OpenRLHF. That’s why I built oat🌾 (github.com/sail-sg/oat) — a modular RL LLM framework inspired by DeepMind’s ecosystem. Just define your actor, learner, and env in a single script — and you’re good to go :)
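The actor/learner/env split Zichen describes can be sketched generically. Note that this is a toy bandit illustration of the pattern, not oat's actual API; every class and function name below is hypothetical, and the "environment" is a trivial guessing game rather than an LLM rollout.

```python
import random

class Env:
    """Toy environment: reward 1.0 if the actor picks the hidden arm."""
    def __init__(self, seed=0):
        self.target = random.Random(seed).randint(0, 9)
    def step(self, action):
        return 1.0 if action == self.target else 0.0

class Actor:
    """Epsilon-greedy policy over 10 arms, scored by learned preferences."""
    def __init__(self):
        self.prefs = [0.0] * 10
    def act(self, rng, eps=0.1):
        if rng.random() < eps:                       # explore
            return rng.randrange(10)
        return max(range(10), key=lambda a: self.prefs[a])  # exploit

class Learner:
    """Moves the chosen arm's preference toward its observed reward."""
    def __init__(self, lr=0.5):
        self.lr = lr
    def update(self, actor, action, reward):
        actor.prefs[action] += self.lr * (reward - actor.prefs[action])

def train(steps=2000, seed=0):
    # The single-script loop: actor samples, env scores, learner updates.
    rng = random.Random(seed)
    env, actor, learner = Env(seed), Actor(), Learner()
    for _ in range(steps):
        action = actor.act(rng)
        reward = env.step(action)
        learner.update(actor, action, reward)
    return actor
```

Keeping the three roles behind small interfaces is what makes the loop swappable: a different env or learner drops in without touching the rest of the script.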

Changyu Chen (@cameron_chann)'s Twitter Profile Photo

Really interesting work on reasoning performance under any budget:
- Better than GRPO at any time, including in final performance
- Gives users the flexibility to trade off cost against performance
- Great engineering effort: code optimized for tree-like generation & training

Alexia Jolicoeur-Martineau (@jm_alexia)'s Twitter Profile Photo

I tried TRL again... I'm going back to OAT. Every time I try to use TRL, it's always a nightmare. OAT is plug and play. github.com/sail-sg/oat

Zichen Liu @ ICLR2025 (@zzlccc)'s Twitter Profile Photo

Reinforcing General Reasoning without Verifiers 🈚️

R1-Zero-like RL thrives in domains with verifiable rewards (code, math). But real-world reasoning (chem, bio, econ…) lacks easy rule-based verifiers — and model-based verifiers add complexity.

Introducing *VeriFree*:

⚡ Skip
Changyu Chen (@cameron_chann)'s Twitter Profile Photo

Highly agree! A strong prior is essential for the success of RL training in LLMs, as we show in the Llama experiments (arxiv.org/pdf/2503.20783); a strong prior also makes improvement so easy that it can create “RL just works” noise.

Jinjie Ni @ ICLR'25 🇸🇬 (@nijinjie)'s Twitter Profile Photo

Ready to supercharge your vision-language reasoners with scalable RL fuels? RL is promising, but good data is a bottleneck! 😤

🚀 Introducing SynthRL: a scalable pipeline with strong guarantees for synthesizing verifiable & progressively harder training data, tailor-made for RL in
Zichen Liu @ ICLR2025 (@zzlccc)'s Twitter Profile Photo

Nice follow-up! Spurious rewards and spurious prompts re-confirm the biases baked into Qwen base models. Revisiting our results from March (arxiv.org/pdf/2503.20783, Sections 2.2 & 3.3):
- Using no template works best
- Much of RL's gain comes from correcting model-template mismatch