Dulhan Jayalath (@dulhanjay) 's Twitter Profile
Dulhan Jayalath

@dulhanjay

Reading brains with ML in PhD @UniofOxford. Formerly Student Researcher @GoogleDeepMind. All opinions stolen from more interesting people.

ID: 839537448233418757

https://dulhanjayalath.com · Joined 08-03-2017 18:05:00

9 Tweets

103 Followers

352 Following

Tanishq Mathew Abraham, Ph.D. (@iscienceluvr) 's Twitter Profile Photo

Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

"This paper asks a simple question: Can inference compute substitute for missing supervision?"

"the current policy produces a group of rollouts; a frozen anchor (the initial policy) reconciles
Shashwat Goel (@shashwatgoel7) 's Twitter Profile Photo

... and the CaT is out of the bag. Compute is all you need?

Our new paper shows how to convert inference-time compute into high quality supervision for RL training.

✅ Rubric RL on non-verifiable domains
✅ No costly reference answers or rubric annotations -- the model can just
fly51fly (@fly51fly) 's Twitter Profile Photo

[LG] Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision
D Jayalath, S Goel, T Foster, P Jain... [Meta Superintelligence Labs] (2025)
arxiv.org/abs/2509.14234
Yunzhen Feng (@feeelix_feng) 's Twitter Profile Photo

🔥 NEW PAPER: What makes reasoning traces effective in LLMs? Spoiler: It's NOT length or self-checking. We found a simple graph metric that predicts accuracy better than anything else—and proved it causally. 🧵[1/n]

Vals AI (@_valsai) 's Twitter Profile Photo

We are excited to have Shashwat Goel to discuss how AI evaluations need to change in tandem with LLM capabilities!

Join us on alphaXiv tomorrow, Oct 2nd at 11 am PT and ask him your burning questions about evals.

Link to sign up below!
(1/2)
Shashwat Goel (@shashwatgoel7) 's Twitter Profile Photo

There's been confusion on the importance of RL after John Schulman's excellent blog showing it learns surprisingly few bits of information. Here's my blog on what we might be missing:

Not all bits are made equal.

Some bits of information matter more than others. This
Tanishq Mathew Abraham, Ph.D. (@iscienceluvr) 's Twitter Profile Photo

h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning

"In this work, we introduce a scalable method to bootstrap long-horizon reasoning capabilities using only existing, abundant short-horizon data. Our approach synthetically composes simple problems
Jenny Zhang (@jennyzhangzt) 's Twitter Profile Photo

Great to see OMNI-EPIC and the Darwin Gödel Machine featured in this report! It also touches on many open-endedness topics, such as how to evaluate open-ended systems. OMNI-EPIC: arxiv.org/abs/2405.15568 Darwin Gödel Machine: arxiv.org/abs/2505.22954 (if you want to learn more)

Sumeet Motwani (@sumeetrm) 's Twitter Profile Photo

🚨How do we improve long-horizon reasoning capabilities by scaling RL with only existing data?

Introducing our new paper: "h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning"🫡

> RL on existing datasets saturates very quickly
> Reasoning over
Yunzhen Feng (@feeelix_feng) 's Twitter Profile Photo

Current GRPO wastes compute on negative groups — when all K samples are wrong, you get zero gradient despite full generation cost.

We propose a principled fix by bridging reward modeling and policy optimization:
👉 Penalize highly confident wrong answers more to create signal.🧵
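The zero-gradient problem is easy to see in vanilla GRPO: advantages are group-normalized rewards, so when all K samples earn the same reward (e.g. all wrong), normalization maps every advantage to zero. A small numeric sketch, with the confidence penalty written as a hypothetical shaping term in the spirit of the tweet (not the paper's actual objective):

```python
import statistics


def grpo_advantages(rewards):
    # Vanilla GRPO: center and scale rewards within the sampled group.
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards)
    if sd == 0:
        # All samples share one reward (e.g. all wrong): zero gradient,
        # despite paying the full generation cost for the group.
        return [0.0] * len(rewards)
    return [(r - mu) / sd for r in rewards]


def confidence_penalized(rewards, confidences, lam=1.0):
    # Hypothetical fix in the spirit of the tweet: subtract a penalty
    # proportional to the model's confidence when the answer is wrong,
    # so even an all-wrong group yields a nonzero, mean-centered signal.
    shaped = [r - lam * c * (r == 0) for r, c in zip(rewards, confidences)]
    mu = statistics.mean(shaped)
    return [s - mu for s in shaped]


rewards = [0, 0, 0, 0]          # all K samples wrong
confs = [0.9, 0.2, 0.5, 0.4]    # model's confidence per sample
print(grpo_advantages(rewards))  # [0.0, 0.0, 0.0, 0.0]
print(confidence_penalized(rewards, confs))
```

With the penalty, the most confident wrong answer gets the most negative advantage, recovering a gradient from a group that vanilla GRPO would discard.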
Rishabh Agarwal (@agarwl_) 's Twitter Profile Photo

Sneak peek from a paper about scaling RL compute for LLMs: probably the most compute-expensive paper I've worked on, but hoping that others can run experiments cheaply for the science of scaling RL.

Coincidentally, this is similar motivation to what we had for the NeurIPS best