Dulhan Jayalath (@dulhanjay) 's Twitter Profile
Dulhan Jayalath

@dulhanjay

Reading brains with ML in PhD @UniofOxford. Formerly Student Researcher @GoogleDeepMind. All opinions stolen from more interesting people.

ID: 839537448233418757

https://dulhanjayalath.com · Joined 08-03-2017 18:05:00

9 Tweets

103 Followers

352 Following

Tanishq Mathew Abraham, Ph.D. (@iscienceluvr) 's Twitter Profile Photo

Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

"This paper asks a simple question: Can inference compute substitute for missing supervision?"

"the current policy produces a group of rollouts; a frozen anchor (the initial policy) reconciles
Shashwat Goel (@shashwatgoel7) 's Twitter Profile Photo

... and the CaT is out of the bag. Compute is all you need?

Our new paper shows how to convert inference-time compute into high quality supervision for RL training.

✅ Rubric RL on non-verifiable domains
✅ No costly reference answers or rubric annotations -- the model can just
fly51fly (@fly51fly) 's Twitter Profile Photo

[LG] Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision
D Jayalath, S Goel, T Foster, P Jain... [Meta Superintelligence Labs] (2025)
arxiv.org/abs/2509.14234
Yunzhen Feng (@feeelix_feng) 's Twitter Profile Photo

🔥 NEW PAPER: What makes reasoning traces effective in LLMs? Spoiler: It's NOT length or self-checking. We found a simple graph metric that predicts accuracy better than anything else—and proved it causally. 🧵[1/n]

Vals AI (@_valsai) 's Twitter Profile Photo

We are excited to have Shashwat Goel to discuss how AI evaluations need to change in tandem with LLM capabilities!

Join us on alphaXiv tomorrow, Oct 2nd at 11 am PT and ask him your burning questions about evals.

Link to sign up below!
(1/2)
Shashwat Goel (@shashwatgoel7) 's Twitter Profile Photo

There's been confusion on the importance of RL after John Schulman's excellent blog showing it learns surprisingly few bits of information. Here's my blog on what we might be missing:

Not all bits are made equal.

Some bits of information matter more than others. This
Tanishq Mathew Abraham, Ph.D. (@iscienceluvr) 's Twitter Profile Photo

h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning

"In this work, we introduce a scalable method to bootstrap long-horizon reasoning capabilities using only existing, abundant short-horizon data. Our approach synthetically composes simple problems
Jenny Zhang (@jennyzhangzt) 's Twitter Profile Photo

Great to see OMNI-EPIC and the Darwin Gödel Machine featured in this report! It also touches on many open-endedness topics, such as how to evaluate open-ended systems. OMNI-EPIC: arxiv.org/abs/2405.15568 Darwin Gödel Machine: arxiv.org/abs/2505.22954 (if you want to learn more)

Sumeet Motwani (@sumeetrm) 's Twitter Profile Photo

🚨How do we improve long-horizon reasoning capabilities by scaling RL with only existing data?

Introducing our new paper: "h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning"🫡

> RL on existing datasets saturates very quickly
> Reasoning over
Yunzhen Feng (@feeelix_feng) 's Twitter Profile Photo

Current GRPO wastes compute on negative groups — when all K samples are wrong, you get zero gradient despite full generation cost.

We propose a principled fix by bridging reward modeling and policy optimization:
👉 Penalize highly confident wrong answers more to create signal.🧵
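The zero-gradient problem is easy to see in vanilla GRPO: advantages are group-normalized rewards, so when all K samples earn the same reward (e.g. all wrong), normalization maps every advantage to zero. A small numeric sketch, with the confidence penalty written as a hypothetical shaping term in the spirit of the tweet (not the paper's actual objective):

```python
import statistics


def grpo_advantages(rewards):
    # Vanilla GRPO: center and scale rewards within the sampled group.
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards)
    if sd == 0:
        # All samples share one reward (e.g. all wrong): zero gradient,
        # despite paying the full generation cost for the group.
        return [0.0] * len(rewards)
    return [(r - mu) / sd for r in rewards]


def confidence_penalized(rewards, confidences, lam=1.0):
    # Hypothetical fix in the spirit of the tweet: subtract a penalty
    # proportional to the model's confidence when the answer is wrong,
    # so even an all-wrong group yields a nonzero, mean-centered signal.
    shaped = [r - lam * c * (r == 0) for r, c in zip(rewards, confidences)]
    mu = statistics.mean(shaped)
    return [s - mu for s in shaped]


rewards = [0, 0, 0, 0]          # all K samples wrong
confs = [0.9, 0.2, 0.5, 0.4]    # model's confidence per sample
print(grpo_advantages(rewards))  # [0.0, 0.0, 0.0, 0.0]
print(confidence_penalized(rewards, confs))
```

With the penalty, the most confident wrong answer gets the most negative advantage, recovering a gradient from a group that vanilla GRPO would discard.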
Rishabh Agarwal (@agarwl_) 's Twitter Profile Photo

Sneak peek from a paper about scaling RL compute for LLMs: probably the most compute-expensive paper I've worked on, but hoping that others can run experiments cheaply for the science of scaling RL.

Coincidentally, this is similar motivation to what we had for the NeurIPS best