Alexandre Ramé (@ramealexandre)'s Twitter Profile
Alexandre Ramé

@ramealexandre

Research scientist @GoogleDeepMind. Previously PhD @Sorbonne_Univ_.

Post-training Gemma LLMs: distillation, RL and merging.

ID: 300445195

Link: https://alexrame.github.io/ · Joined: 17-05-2011 19:37:18

662 Tweets

1.1K Followers

731 Following

Nathan Lambert (@natolambert)

The reason recent RLVR papers show mostly formatting and not learning new skills is just because no one has scaled up enough. If RL compute is <0.1% of overall compute, of course not much changes. I bet o3 is closer to 5% of total compute. At 10-25%, I bet the models feel different again.

AshutoshShrivastava (@ai_for_success)

WHY IS NO ONE TALKING ABOUT THIS?? The Gemma 3n model was one of the best surprises for me. The fact that you can run it on edge devices even with just 2GB of RAM is impressive. A few weeks back, I was on holiday and used the Gemini Live feature a lot. But I kept running into

YIFENG LIU (@yifengliu_ai)

1/6 We introduce RPG, a principled framework for deriving and analyzing KL-regularized policy gradient methods, unifying GRPO/k3-estimator and REINFORCE++ under this framework and discovering better RL objectives than GRPO:
Paper: arxiv.org/abs/2505.17508
Code:
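For readers who want the shape of the objective: a KL-regularized policy gradient adds a penalty toward a frozen reference policy to the usual advantage-weighted log-probability term, and GRPO's variant estimates that KL with the so-called k3 estimator. Below is a minimal PyTorch sketch of that family; the names and β value are illustrative, not the paper's code.

```python
import torch

def kl_regularized_pg_loss(logp, logp_ref, advantages, beta=0.04):
    """Per-token policy-gradient loss with a k3-estimator KL penalty.

    logp:       log pi_theta(token), requires grad
    logp_ref:   log pi_ref(token), frozen reference policy
    advantages: per-token advantages (detached inside)
    """
    # k3 estimator of KL(pi_theta || pi_ref): r - 1 - log r, with
    # r = pi_ref / pi_theta. Unbiased and always non-negative.
    log_ratio = logp_ref - logp
    k3 = torch.exp(log_ratio) - log_ratio - 1.0

    # REINFORCE-style term: detached advantages weight the log-probs.
    pg = -(advantages.detach() * logp)

    return (pg + beta * k3).mean()

# Toy usage with made-up per-token probabilities and advantages.
logp = torch.log(torch.tensor([0.3, 0.5, 0.2], requires_grad=True))
logp_ref = torch.log(torch.tensor([0.25, 0.55, 0.2]))
adv = torch.tensor([1.0, 1.0, -0.5])
loss = kl_regularized_pg_loss(logp, logp_ref, adv)
loss.backward()
```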
Tanishq Mathew Abraham, Ph.D. (@iscienceluvr)

Reinforcing General Reasoning without Verifiers

"we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and
Ning Ding (@stingning)

Language models are trading Entropy for Rewards in reinforcement learning, meaning that uncertainty is transformed into certainty. The trading is even quantitatively predictable: R = -a * exp(H) + b. In our latest paper, we find that we should, and we can scientifically
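Since the claimed law R = -a * exp(H) + b is linear in exp(H), the coefficients can be recovered from logged (entropy, reward) pairs with ordinary least squares. A toy sketch with made-up numbers:

```python
import numpy as np

# Fit R = -a * exp(H) + b by least squares on logged (entropy, reward) pairs.
H = np.array([2.1, 1.7, 1.3, 1.0, 0.8])      # policy entropy over training (made up)
R = np.array([0.22, 0.35, 0.47, 0.55, 0.60])  # validation reward (made up)

X = np.stack([-np.exp(H), np.ones_like(H)], axis=1)
(a, b), *_ = np.linalg.lstsq(X, R, rcond=None)
print(f"R ≈ -{a:.3f} * exp(H) + {b:.3f}")
```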

Omar Khattab (@lateinteraction)

Sigh, it's a bit of a mess. Let me just give you guys the full nuance in one stream of consciousness since I think we'll continue to get partial interpretations that confuse everyone. All the little things I post need to always be put together in one place. First, I have long

Mustafa Shukor (@mustafashukor1)

The Worldwide LeRobot hackathon is in 2 weeks, and we have been cooking something for you…
Introducing SmolVLA, a Vision-Language-Action model with a lightweight architecture, pretrained on community datasets, with an asynchronous inference stack, to control robots🧵
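The "asynchronous inference stack" decouples action prediction from execution: the robot plays out the current action chunk while the policy computes the next one in the background. A generic sketch of that pattern (illustrative names only, not LeRobot's actual API):

```python
import queue
import threading
import time

def policy_worker(policy, obs_q, chunk_q):
    """Background thread: turns the latest observation into the next action chunk."""
    while True:
        obs = obs_q.get()
        if obs is None:            # shutdown signal
            return
        chunk_q.put(policy(obs))   # e.g. a short list of low-level actions

def control_loop(policy, get_obs, execute, n_chunks=100, hz=30):
    obs_q, chunk_q = queue.Queue(maxsize=1), queue.Queue(maxsize=1)
    threading.Thread(target=policy_worker, args=(policy, obs_q, chunk_q),
                     daemon=True).start()
    obs_q.put(get_obs())
    chunk = chunk_q.get()          # only the first chunk requires waiting
    for _ in range(n_chunks):
        obs_q.put(get_obs())       # request the next chunk *before* executing,
        for action in chunk:       # so prediction overlaps execution
            execute(action)
            time.sleep(1 / hz)     # fixed-rate control
        chunk = chunk_q.get()      # usually ready by now: no robot stall
    obs_q.put(None)
```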
Andrei Bursuc (@abursuc)

CVPR@Paris: what a fantastic and refreshing event this was: awesome talks and posters, great researchers, amazing food and fantastic organizers. Let's do it again
Vicky Kalogeiton David Picard Matthieu Cord
#cvprinparis #CVPR2025
Qingxiu Dong (@qx_dong)

⏰ We introduce Reinforcement Pre-Training (RPT🍒)

— reframing next-token prediction as a reasoning task using RLVR

✅ General-purpose reasoning
📑 Scalable RL on web corpus
📈 Stronger pre-training + RLVR results
🚀 Allows allocating more compute to specific tokens
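The reframing is compact: before each next-token prediction the model samples a reasoning trace, and it is rewarded only when its final prediction matches the actual corpus token, so ordinary web text becomes a verifiable RL task. A sketch with hypothetical helpers:

```python
def rpt_reward(model, prefix_ids, true_next_id):
    """Reinforcement Pre-Training reward (sketch): think, then predict.

    The model first samples a chain-of-thought about the prefix, then
    commits to a next token; the corpus itself acts as the verifier.
    Both model methods below are hypothetical helpers, not a real API.
    """
    thought = model.sample_reasoning(prefix_ids)
    predicted = model.predict_next_token(prefix_ids, thought)
    return 1.0 if predicted == true_next_id else 0.0  # verifiable, for free
```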
Omar Sanseviero (@osanseviero)

Gemma 3n on desktop is here! 🚀
🤗 Desktop (Mac/Windows/Linux) + IoT
🔥 2B and 4B
🧠 Powered by the new LiteRT-LM library
Github: github.com/google-ai-edge…
Preview: huggingface.co/google/gemma-3…

Arcee.ai (@arcee_ai)

Another research report, this time from the Massachusetts Institute of Technology (MIT), confirms the unique benefits of model merging!

In a paper recently published on Nature.com (nature.com/articles/s4152…), researchers at MIT explore a wide range of strategies to adapt
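In its simplest form, model merging is parameter-space averaging of fine-tuned checkpoints that share an architecture ("model soups"); the paper surveys many richer variants. A minimal PyTorch sketch of the uniform/weighted average, not the paper's code:

```python
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Weighted parameter-space average of same-architecture checkpoints."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float()
                          for w, sd in zip(weights, state_dicts))
    return merged

# Usage: average two fine-tunes of the same base model.
# merged = merge_state_dicts([model_a.state_dict(), model_b.state_dict()])
# base_model.load_state_dict(merged)
```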
Omar Sanseviero (@osanseviero)

Gemma 3n is the first model with less than 10B parameters with an LMArena score above 1300 🔥

And yes, you can run it on your phone

Try it in AIS: aistudio.google.com/prompts/new_ch…
AI Edge: github.com/google-ai-edge…
Nathan Lambert (@natolambert)

A common trend across recent research in using reinforcement learning to train reasoning models is that the clipping operation within a trust region (core to PPO, adopted by GRPO) is squashing rare tokens that are key to clever behaviors like verification or backtracking. The
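The squashing is visible directly in the clipped surrogate objective: once a token's importance ratio leaves the [1-ε, 1+ε] trust region, the min selects the clipped (constant) branch and the gradient on that token vanishes, and a rare token being upweighted hits the ceiling quickly. A minimal sketch of the standard objective:

```python
import torch

def ppo_clip_loss(logp, logp_old, advantages, eps=0.2):
    """Standard clipped surrogate loss (PPO; GRPO reuses the same clipping)."""
    ratio = torch.exp(logp - logp_old)            # importance ratio per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Once ratio > 1+eps with advantage > 0 (e.g. a rare token being
    # upweighted), the clipped branch wins the min; it is constant in
    # logp, so the gradient on that token is zero.
    return -torch.min(unclipped, clipped).mean()
```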
Philipp Schmid (@_philschmid)

Gemini 2.5 is production ready! We just launched 3 new Gemini models, with 2.5 Pro and Flash now generally available and a new Gemini 2.5 Flash Lite preview! 🧠⚡️🔦

Here is all you need to know:
🔦 New Gemini 2.5 Flash Lite (Preview) with Thinking, 1M context, only
Google DeepMind (@googledeepmind)

Hot Gemini updates off the press. 🚀 Anyone can now use 2.5 Flash and Pro to build and scale production-ready AI applications. 🙌 We’re also launching 2.5 Flash-Lite in preview: the fastest model in the 2.5 family to respond to requests, with the lowest cost too. 🧵

Paul Couvert (@itspaulai)

Google has just released Gemini 2.5 Flash Lite

This is the cheapest and fastest model available:

You can literally:

- Process the entire Harry Potter series for $0.22
- Analyze a 3-hour video for less than $0.35

And you can also enable thinking mode to enhance its
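Those dollar figures are just tokens × price. A back-of-the-envelope check with assumed numbers (the price and token counts below are placeholders, not quoted pricing):

```python
# Back-of-the-envelope token cost, with assumed (not official) numbers.
PRICE_PER_M_INPUT = 0.10     # $/1M input tokens -- assumption
series_words = 1_100_000     # rough Harry Potter series length -- estimate
tokens = series_words * 1.3  # ~1.3 tokens per English word -- rule of thumb
print(f"~${tokens / 1e6 * PRICE_PER_M_INPUT:.2f} to process as input")
# -> ~$0.14 for input alone; the tweet's $0.22 is the same ballpark
#    once output tokens are counted.
```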