Doyoung Kim (@doyoungkim_ml)'s Twitter Profile
Doyoung Kim

@doyoungkim_ml

Incoming CS PhD @NYU_Courant; Previously MS & BS @kaist_ai; General intelligence in Language ∪ Robotics

ID: 1548921016562229248

https://doyoungkim-ml.github.io/ · Joined 18-07-2022 06:42:22

405 Tweets

313 Followers

529 Following

Yunhao (Robin) Tang (@robinphysics)'s Twitter Profile Photo

Maybe to one's surprise, taking KL estimates as `kl_loss` to minimize does *not* enforce the KL.

This implementation, however, is quite common in open source RL repos and recent research papers.

In short: grad of an unbiased KL estimate is not an unbiased estimate of KL grad.
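For concreteness, here is a minimal sketch (assuming PyTorch and a toy categorical policy; not taken from any particular repo) of the mismatch: backpropagating through the common k1 estimate `log π(x) - log π_ref(x)` yields a gradient whose expectation is zero, which is not the gradient of the KL.

```python
# Minimal sketch, assuming PyTorch and a toy categorical policy: the gradient
# obtained by backpropagating through an unbiased Monte-Carlo KL estimate (k1)
# is not the gradient of the KL; in fact its expectation is exactly zero.
import torch

torch.manual_seed(0)

logits = torch.randn(5, requires_grad=True)          # trainable policy
ref_logits = torch.randn(5)                          # frozen reference policy

pi = torch.distributions.Categorical(logits=logits)
ref = torch.distributions.Categorical(logits=ref_logits)

# Exact KL(pi || ref) and its exact gradient w.r.t. the policy logits.
exact_kl = torch.distributions.kl_divergence(pi, ref)
exact_grad, = torch.autograd.grad(exact_kl, logits, retain_graph=True)

# Common implementation: sample from pi, average the k1 estimate
# log pi(x) - log ref(x) as `kl_loss`, and backprop through it.
x = pi.sample((100_000,))
kl_loss = (pi.log_prob(x) - ref.log_prob(x)).mean()  # unbiased estimate of KL
mc_grad, = torch.autograd.grad(kl_loss, logits)

print("exact KL grad:       ", exact_grad)
print("grad of k1 `kl_loss`:", mc_grad)   # ~0: E[grad log pi(x)] = 0, so this
                                          # loss does not pull pi toward ref
```

A common workaround is to fold the KL penalty into the reward (or use an estimator whose backpropagated gradient is actually unbiased for the KL gradient) rather than differentiating the estimate directly.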
World Labs (@theworldlabs)'s Twitter Profile Photo

Generate persistent 3D worlds from a single image, bigger and better than ever! We’re excited to share our latest results and invite you to try out our world generation model in a limited beta preview.

Eric Pang (@_eric_pang_)'s Twitter Profile Photo

Here's how I (almost) got the high scores in ARC-AGI-1 and 2 (the honor goes to Jeremy Berman) while keeping the cost low. To put things into perspective: o3-preview scored 75.7% on ARC-AGI-1 last year while spending $200/task on low setting. My approach scores 77.1% while spending

Galaxea (@galaxea_x)'s Twitter Profile Photo

Today, we’re releasing G0’s base code—covering everything from data & training to deployment & evaluation. It closes the full loop for end-to-end robotic agent R&D.

Join Galaxea Dev Challenge!

🔗: github.com/OpenGalaxea/G0
Jean Kaddour (@jeankaddour)'s Twitter Profile Photo

Stop overfitting to GSM8K!

Reasoning Gym - 100+ RL envs for LLM RL - got accepted to NeurIPS as Spotlight! 

Frontier LLMs still struggle with many hard env configs.

arxiv.org/abs/2505.24760
github.com/open-thought/r…
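If you want to poke at the environments, here is a sketch of basic usage based on my reading of the project README; the `create_dataset` / `score_answer` names and the `leg_counting` task are assumptions to verify against the repo.

```python
# Sketch of basic reasoning-gym usage, assuming the create_dataset /
# score_answer API described in the project README; verify against the repo.
import reasoning_gym

data = reasoning_gym.create_dataset("leg_counting", size=10, seed=42)
for i, entry in enumerate(data):
    print(f'{i}: q="{entry["question"]}" a="{entry["answer"]}"')
    # Each dataset ships a programmatic verifier, so model outputs can be
    # scored automatically and used directly as an RL reward.
    assert data.score_answer(answer=entry["answer"], entry=entry) == 1.0
```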
Dimitris Papailiopoulos (@dimitrispapail)'s Twitter Profile Photo

Small models as the new frontier, and why this may be academia's LLM moment. Academia should reject the nihilism of "scale is all you need", i.e., that meaningful research requires frontier-scale compute. This mindset hurts basic research and what we can contribute to machine

Skild AI (@skildai)'s Twitter Profile Photo

We built a robot brain that nothing can stop. Shattered limbs? Jammed motors? If the bot can move, the Brain will move it— even if it’s an entirely new robot body. Meet the omni-bodied Skild Brain:

Sakana AI (@sakanaailabs)'s Twitter Profile Photo

We’re excited to introduce ShinkaEvolve: An open-source framework that evolves programs for scientific discovery with unprecedented sample-efficiency.

Blog: sakana.ai/shinka-evolve/
Code: github.com/SakanaAI/Shink…

Like AlphaEvolve and its variants, our framework leverages LLMs to

anandmaj (@almondgodd)'s Twitter Profile Photo

I spent the past month reimplementing DeepMind’s Genie 3 world model from scratch. Ended up making TinyWorlds, a 3M-parameter world model capable of generating playable game environments. Demo below + everything I learned in thread (full repo at the end) 👇🏼

hyunji amy lee (@hyunji_amy_lee)'s Twitter Profile Photo

🧐 LLMs aren’t great at judging their own correctness. ❗But history across models helps! We present Generalized Correctness Models (GCMs), which learn to predict correctness based on history, outperforming model-specific correctness and larger models' self-confidence.

Ernest Ryu (@ernestryu)'s Twitter Profile Photo

There’s chatter about frontier labs having a secret super-advanced-GRPO. But let me tell you something new about GRPO; the clipping mechanisms induce entropy biases:

- clip-low increases entropy
- clip-high decreases entropy

(1/5)
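For reference, here is a minimal sketch (PyTorch assumed, variable names mine) of the clipped surrogate being discussed, written with decoupled clip-low / clip-high thresholds so the two knobs in the thread are explicit.

```python
# Minimal sketch of a PPO/GRPO-style clipped surrogate with separate
# clip-low / clip-high thresholds (decoupled-epsilon form); illustrative only.
import torch

def clipped_surrogate(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.2):
    ratio = torch.exp(logp_new - logp_old)                  # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    # min(...) keeps the more pessimistic term: tokens whose ratio has already
    # moved past the clip boundary in the direction their advantage favors get
    # zero gradient, which is where the clipping-induced entropy effects arise.
    return -torch.minimum(unclipped, clipped).mean()

# Toy usage over four tokens; eps_high > eps_low mimics a "clip-higher" setup.
logp_new = torch.tensor([-1.0, -2.0, -0.5, -3.0], requires_grad=True)
logp_old = torch.tensor([-1.2, -1.8, -0.6, -2.5])
adv      = torch.tensor([ 1.0, -1.0,  0.5, -0.5])
loss = clipped_surrogate(logp_new, logp_old, adv, eps_low=0.2, eps_high=0.28)
loss.backward()
print(loss.item(), logp_new.grad)
```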

Jubayer Ibn Hamid (@jubayer_hamid)'s Twitter Profile Photo

Exploration is fundamental to RL. Yet policy gradient methods often collapse: during training they fail to explore broadly, and converge into narrow, easily exploitable behaviors. The result is poor generalization, limited gains from test-time scaling, and brittleness on tasks

Nouha Dziri (@nouhadziri)'s Twitter Profile Photo

🚀Ever wondered how to make RL work on impossibly hard tasks where pass@k = 0%? 🤔

In our new work, we share the RL Grokking Recipe: a training recipe that enables LLMs to solve previously unsolvable coding problems! I will be at #CoLM2025 next week so happy to chat about it!
Dmitry Rybin (@dmitryrybin1)'s Twitter Profile Photo

GRPO is not frontier and is broken in so many ways I don’t even know where to start. ~50% of GRPO budget is wasted on too-easy/too-difficult tasks (advantage = 0). This work fixes it:
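For context, here is a minimal numeric sketch (toy rewards, not from the paper) of why prompts whose rollouts all succeed or all fail contribute nothing under the standard group-relative advantage.

```python
# Minimal sketch: GRPO's group-relative advantage (r - mean) / std is
# identically zero when every rollout in the group gets the same reward,
# so "too easy" and "too hard" prompts generate no gradient signal.
# Reward values are made up for illustration.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    return (rewards - rewards.mean()) / (rewards.std() + eps)

mixed    = torch.tensor([1., 0., 1., 0.])   # informative prompt
too_easy = torch.tensor([1., 1., 1., 1.])   # every rollout correct
too_hard = torch.tensor([0., 0., 0., 0.])   # every rollout wrong

print(grpo_advantages(mixed))      # nonzero: useful learning signal
print(grpo_advantages(too_easy))   # all zeros: wasted rollouts
print(grpo_advantages(too_hard))   # all zeros: wasted rollouts
```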

Sherry Yang (@sherryyangml)'s Twitter Profile Photo

We have developed a suite of tasks for evaluating policies in a world model which we call WorldGym. We use WorldGym to compare a few different policies (OpenVLA, Octo, RT-1-X).

Kawin Ethayarajh (@ethayarajh)'s Twitter Profile Photo

Why do PPO and GRPO work so well?

Might they have deep connections to how humans perceive the world?

Yes! And by understanding these connections, we can help close the gap between online and offline alignment. 🧵
Daniel Khashabi 🕊️ (@danielkhashabi)'s Twitter Profile Photo

ICL and SFT are the two most studied ways to adapt LMs. We understand each in isolation — but far less about how they might 𝗰𝗼𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁 𝗼𝗻𝗲 𝗮𝗻𝗼𝘁𝗵𝗲𝗿. Our latest work asks two questions:

1️⃣ Do ICL and SFT operate differently?
2️⃣ And if so, can one

Yulu Gan (@yule_gan)'s Twitter Profile Photo

Reinforcement Learning (RL) has long been the dominant method for fine-tuning, powering many state-of-the-art LLMs. Methods like PPO and GRPO explore in action space. But can we instead explore directly in parameter space? YES we can. We propose a scalable framework for
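For readers who want the distinction made concrete, here is a minimal sketch of parameter-space exploration in the classic evolution-strategies style; it illustrates the general idea only and is not the framework announced above.

```python
# Minimal sketch of parameter-space exploration (evolution-strategies style),
# to contrast with action-space exploration in PPO/GRPO. Illustrative only;
# this is NOT the framework announced in the tweet above.
import numpy as np

def episode_return(theta: np.ndarray) -> float:
    # Toy stand-in for a rollout return, maximized at theta = [1, 2, 3].
    return -float(np.sum((theta - np.array([1.0, 2.0, 3.0])) ** 2))

rng = np.random.default_rng(0)
theta = np.zeros(3)                       # policy parameters
sigma, lr, n_perturb = 0.1, 0.05, 64

for _ in range(200):
    eps = rng.standard_normal((n_perturb, theta.size))
    returns = np.array([episode_return(theta + sigma * e) for e in eps])
    # Exploration happens by perturbing theta itself; the update follows the
    # return-weighted (baseline-subtracted) average of the perturbations.
    adv = returns - returns.mean()
    theta += lr / (n_perturb * sigma) * (adv @ eps)

print(theta)                              # approaches [1, 2, 3]
```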

Jackson Atkins (@jacksonatkinsx)'s Twitter Profile Photo

My brain broke when I read this paper.

A tiny 7-million-parameter model just beat DeepSeek-R1, Gemini 2.5 Pro, and o3-mini at reasoning on both ARC-AGI 1 and ARC-AGI 2.

It's called Tiny Recursive Model (TRM) from Samsung.

How can a model 10,000x smaller be smarter?

Here's how
Ankit Goyal (@imankitgoyal)'s Twitter Profile Photo

What's the right architecture for a VLA? VLM + custom action heads (π₀)? VLM with special discrete action tokens (OpenVLA)? Custom design on top of the VLM (OpenVLA-OFT)? Or... VLM with ZERO modifications? Just predict action as text. The results will surprise you. VLA-0:
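To make "just predict the action as text" concrete, here is a toy sketch of serializing a continuous action as an ordinary string and parsing it back; the exact output format VLA-0 uses is an assumption here.

```python
# Toy sketch of the "action as plain text" idea: the VLM emits an ordinary
# string and we parse it back into a continuous action vector. No special
# action tokens or extra heads are involved. The concrete format is an
# assumption for illustration, not necessarily what VLA-0 does.
def action_to_text(action, decimals=3):
    return " ".join(f"{a:.{decimals}f}" for a in action)

def text_to_action(text):
    return [float(tok) for tok in text.split()]

action = [0.125, -0.040, 0.733, 1.571]      # e.g. end-effector deltas + gripper
as_text = action_to_text(action)            # "0.125 -0.040 0.733 1.571"
recovered = text_to_action(as_text)
assert all(abs(a - b) < 1e-3 for a, b in zip(action, recovered))
print(as_text)
```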