Doyoung Kim (@doyoungkim_ml)'s Twitter Profile
Doyoung Kim

@doyoungkim_ml

Incoming CS PhD @NYU_Courant; Previously MS & BS @kaist_ai; General intelligence in Language ∪ Robotics

ID: 1548921016562229248

https://doyoungkim-ml.github.io/ · Joined 18-07-2022 06:42:22

405 Tweets

313 Followers

529 Following

Yunhao (Robin) Tang (@robinphysics)'s Twitter Profile Photo

Maybe to one's surprise, taking KL estimates as `kl_loss` to minimize does *not* enforce the KL.

This implementation, however, is quite common in open source RL repos and recent research papers.

In short: grad of an unbiased KL estimate is not an unbiased estimate of KL grad.
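For concreteness, here is a minimal sketch (assuming PyTorch and a toy categorical policy; not taken from any particular repo) of the mismatch: backpropagating through the common k1 estimate `log π(x) - log π_ref(x)` yields a gradient whose expectation is zero, which is not the gradient of the KL.

```python
# Minimal sketch, assuming PyTorch and a toy categorical policy: the gradient
# obtained by backpropagating through an unbiased Monte-Carlo KL estimate (k1)
# is not the gradient of the KL; in fact its expectation is exactly zero.
import torch

torch.manual_seed(0)

logits = torch.randn(5, requires_grad=True)          # trainable policy
ref_logits = torch.randn(5)                          # frozen reference policy

pi = torch.distributions.Categorical(logits=logits)
ref = torch.distributions.Categorical(logits=ref_logits)

# Exact KL(pi || ref) and its exact gradient w.r.t. the policy logits.
exact_kl = torch.distributions.kl_divergence(pi, ref)
exact_grad, = torch.autograd.grad(exact_kl, logits, retain_graph=True)

# Common implementation: sample from pi, average the k1 estimate
# log pi(x) - log ref(x) as `kl_loss`, and backprop through it.
x = pi.sample((100_000,))
kl_loss = (pi.log_prob(x) - ref.log_prob(x)).mean()  # unbiased estimate of KL
mc_grad, = torch.autograd.grad(kl_loss, logits)

print("exact KL grad:       ", exact_grad)
print("grad of k1 `kl_loss`:", mc_grad)   # ~0: E[grad log pi(x)] = 0, so this
                                          # loss does not pull pi toward ref
```

A common workaround is to fold the KL penalty into the reward (or use an estimator whose backpropagated gradient is actually unbiased for the KL gradient) rather than differentiating the estimate directly.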
World Labs (@theworldlabs)'s Twitter Profile Photo

Generate persistent 3D worlds from a single image, bigger and better than ever! We’re excited to share our latest results and invite you to try out our world generation model in a limited beta preview.

Eric Pang (@_eric_pang_)'s Twitter Profile Photo

Here's how I (almost) got the high scores in ARC-AGI-1 and 2 (the honor goes to Jeremy Berman) while keeping the cost low. To put things into perspective: o3-preview scored 75.7% on ARC-AGI-1 last year while spending $200/task on low setting. My approach scores 77.1% while spending

Galaxea (@galaxea_x)'s Twitter Profile Photo

Today, we’re releasing G0’s base code—covering everything from data & training to deployment & evaluation. It closes the full loop for end-to-end robotic agent R&D.

Join Galaxea Dev Challenge!

🔗: github.com/OpenGalaxea/G0
Jean Kaddour (@jeankaddour)'s Twitter Profile Photo

Stop overfitting to GSM8K!

Reasoning Gym - 100+ RL envs for LLM RL - got accepted to NeurIPS as Spotlight! 

Frontier LLMs still struggle with many hard env configs.

arxiv.org/abs/2505.24760
github.com/open-thought/r…
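If you want to poke at the environments, here is a sketch of basic usage based on my reading of the project README; the `create_dataset` / `score_answer` names and the `leg_counting` task are assumptions to verify against the repo.

```python
# Sketch of basic reasoning-gym usage, assuming the create_dataset /
# score_answer API described in the project README; verify against the repo.
import reasoning_gym

data = reasoning_gym.create_dataset("leg_counting", size=10, seed=42)
for i, entry in enumerate(data):
    print(f'{i}: q="{entry["question"]}" a="{entry["answer"]}"')
    # Each dataset ships a programmatic verifier, so model outputs can be
    # scored automatically and used directly as an RL reward.
    assert data.score_answer(answer=entry["answer"], entry=entry) == 1.0
```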
Dimitris Papailiopoulos (@dimitrispapail)'s Twitter Profile Photo

Small models as the new frontier, and why this may be academia's LLM moment. Academia should reject the nihilism of "scale is all you need", i.e., that meaningful research requires frontier-scale compute. This mindset hurts basic research and what we can contribute to machine

Skild AI (@skildai)'s Twitter Profile Photo

We built a robot brain that nothing can stop. Shattered limbs? Jammed motors? If the bot can move, the Brain will move it— even if it’s an entirely new robot body. Meet the omni-bodied Skild Brain:

Sakana AI (@sakanaailabs)'s Twitter Profile Photo

We’re excited to introduce ShinkaEvolve: An open-source framework that evolves programs for scientific discovery with unprecedented sample-efficiency.

Blog: sakana.ai/shinka-evolve/
Code: github.com/SakanaAI/Shink…

Like AlphaEvolve and its variants, our framework leverages LLMs to

anandmaj (@almondgodd)'s Twitter Profile Photo

I spent the past month reimplementing DeepMind’s Genie 3 world model from scratch. Ended up making TinyWorlds, a 3M-parameter world model capable of generating playable game environments. Demo below + everything I learned in thread (full repo at the end) 👇🏼

hyunji amy lee (@hyunji_amy_lee)'s Twitter Profile Photo

🧐 LLMs aren’t great at judging their own correctness. ❗But history across models helps! We present Generalized Correctness Models (GCMs), which learn to predict correctness based on history, outperforming model-specific correctness and larger models' self-confidence.

Ernest Ryu (@ernestryu)'s Twitter Profile Photo

There’s chatter about frontier labs having a secret super-advanced-GRPO. But let me tell you something new about GRPO; the clipping mechanisms induce entropy biases:

- clip-low increases entropy
- clip-high decreases entropy

(1/5)
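For reference, here is a minimal sketch (PyTorch assumed, variable names mine) of the clipped surrogate being discussed, written with decoupled clip-low / clip-high thresholds so the two knobs in the thread are explicit.

```python
# Minimal sketch of a PPO/GRPO-style clipped surrogate with separate
# clip-low / clip-high thresholds (decoupled-epsilon form); illustrative only.
import torch

def clipped_surrogate(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.2):
    ratio = torch.exp(logp_new - logp_old)                  # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    # min(...) keeps the more pessimistic term: tokens whose ratio has already
    # moved past the clip boundary in the direction their advantage favors get
    # zero gradient, which is where the clipping-induced entropy effects arise.
    return -torch.minimum(unclipped, clipped).mean()

# Toy usage over four tokens; eps_high > eps_low mimics a "clip-higher" setup.
logp_new = torch.tensor([-1.0, -2.0, -0.5, -3.0], requires_grad=True)
logp_old = torch.tensor([-1.2, -1.8, -0.6, -2.5])
adv      = torch.tensor([ 1.0, -1.0,  0.5, -0.5])
loss = clipped_surrogate(logp_new, logp_old, adv, eps_low=0.2, eps_high=0.28)
loss.backward()
print(loss.item(), logp_new.grad)
```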

Jubayer Ibn Hamid (@jubayer_hamid)'s Twitter Profile Photo

Exploration is fundamental to RL. Yet policy gradient methods often collapse: during training they fail to explore broadly, and converge into narrow, easily exploitable behaviors. The result is poor generalization, limited gains from test-time scaling, and brittleness on tasks

Nouha Dziri (@nouhadziri)'s Twitter Profile Photo

🚀Ever wondered how to make RL work on impossibly hard tasks where pass@k = 0%? 🤔

In our new work, we share the RL Grokking Recipe: a training recipe that enables LLMs to solve previously unsolvable coding problems! I will be at #CoLM2025 next week so happy to chat about it!
Dmitry Rybin (@dmitryrybin1)'s Twitter Profile Photo

GRPO is not frontier and is broken in so many ways I don’t even know where to start. ~50% of GRPO budget is wasted on too-easy/too-difficult tasks (advantage = 0). This work fixes it:
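For context, here is a minimal numeric sketch (toy rewards, not from the paper) of why prompts whose rollouts all succeed or all fail contribute nothing under the standard group-relative advantage.

```python
# Minimal sketch: GRPO's group-relative advantage (r - mean) / std is
# identically zero when every rollout in the group gets the same reward,
# so "too easy" and "too hard" prompts generate no gradient signal.
# Reward values are made up for illustration.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    return (rewards - rewards.mean()) / (rewards.std() + eps)

mixed    = torch.tensor([1., 0., 1., 0.])   # informative prompt
too_easy = torch.tensor([1., 1., 1., 1.])   # every rollout correct
too_hard = torch.tensor([0., 0., 0., 0.])   # every rollout wrong

print(grpo_advantages(mixed))      # nonzero: useful learning signal
print(grpo_advantages(too_easy))   # all zeros: wasted rollouts
print(grpo_advantages(too_hard))   # all zeros: wasted rollouts
```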

Sherry Yang (@sherryyangml)'s Twitter Profile Photo

We have developed a suite of tasks for evaluating policies in a world model which we call WorldGym. We use WorldGym to compare a few different policies (OpenVLA, Octo, RT-1-X).

Kawin Ethayarajh (@ethayarajh)'s Twitter Profile Photo

Why do PPO and GRPO work so well?

Might they have deep connections to how humans perceive the world?

Yes! And by understanding these connections, we can help close the gap between online and offline alignment. 🧵
Daniel Khashabi 🕊️ (@danielkhashabi)'s Twitter Profile Photo

ICL and SFT are the two most studied ways to adapt LMs. We understand each in isolation — but far less about how they might 𝗰𝗼𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁 𝗼𝗻𝗲 𝗮𝗻𝗼𝘁𝗵𝗲𝗿. Our latest work asks two questions:

1️⃣ Do ICL and SFT operate differently?
2️⃣ And if so, can one

Yulu Gan (@yule_gan)'s Twitter Profile Photo

Reinforcement Learning (RL) has long been the dominant method for fine-tuning, powering many state-of-the-art LLMs. Methods like PPO and GRPO explore in action space. But can we instead explore directly in parameter space? YES we can. We propose a scalable framework for
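For readers who want the distinction made concrete, here is a minimal sketch of parameter-space exploration in the classic evolution-strategies style; it illustrates the general idea only and is not the framework announced above.

```python
# Minimal sketch of parameter-space exploration (evolution-strategies style),
# to contrast with action-space exploration in PPO/GRPO. Illustrative only;
# this is NOT the framework announced in the tweet above.
import numpy as np

def episode_return(theta: np.ndarray) -> float:
    # Toy stand-in for a rollout return, maximized at theta = [1, 2, 3].
    return -float(np.sum((theta - np.array([1.0, 2.0, 3.0])) ** 2))

rng = np.random.default_rng(0)
theta = np.zeros(3)                       # policy parameters
sigma, lr, n_perturb = 0.1, 0.05, 64

for _ in range(200):
    eps = rng.standard_normal((n_perturb, theta.size))
    returns = np.array([episode_return(theta + sigma * e) for e in eps])
    # Exploration happens by perturbing theta itself; the update follows the
    # return-weighted (baseline-subtracted) average of the perturbations.
    adv = returns - returns.mean()
    theta += lr / (n_perturb * sigma) * (adv @ eps)

print(theta)                              # approaches [1, 2, 3]
```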

Jackson Atkins (@jacksonatkinsx)'s Twitter Profile Photo

My brain broke when I read this paper.

A tiny 7-million-parameter model just beat DeepSeek-R1, Gemini 2.5 Pro, and o3-mini at reasoning on both ARC-AGI 1 and ARC-AGI 2.

It's called Tiny Recursive Model (TRM) from Samsung.

How can a model 10,000x smaller be smarter?

Here's how
Ankit Goyal (@imankitgoyal)'s Twitter Profile Photo

What's the right architecture for a VLA? VLM + custom action heads (π₀)? VLM with special discrete action tokens (OpenVLA)? Custom design on top of the VLM (OpenVLA-OFT)? Or... VLM with ZERO modifications? Just predict action as text. The results will surprise you. VLA-0:
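To make "just predict the action as text" concrete, here is a toy sketch of serializing a continuous action as an ordinary string and parsing it back; the exact output format VLA-0 uses is an assumption here.

```python
# Toy sketch of the "action as plain text" idea: the VLM emits an ordinary
# string and we parse it back into a continuous action vector. No special
# action tokens or extra heads are involved. The concrete format is an
# assumption for illustration, not necessarily what VLA-0 does.
def action_to_text(action, decimals=3):
    return " ".join(f"{a:.{decimals}f}" for a in action)

def text_to_action(text):
    return [float(tok) for tok in text.split()]

action = [0.125, -0.040, 0.733, 1.571]      # e.g. end-effector deltas + gripper
as_text = action_to_text(action)            # "0.125 -0.040 0.733 1.571"
recovered = text_to_action(as_text)
assert all(abs(a - b) < 1e-3 for a, b in zip(action, recovered))
print(as_text)
```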