Arian Hosseini (@ariantbd)'s Twitter Profile
Arian Hosseini

@ariantbd

Research Scientist @GoogleDeepMind - LLM reasoning and alignment - prev: @Google @MSFTResearch

ID: 274357354

Link: https://arianhosseini.github.io/ · Joined: 30-03-2011 05:33:50

360 Tweets

1.1K Followers

313 Following

Arian Hosseini (@ariantbd)'s Twitter Profile Photo

Thrilled to share that I recently joined Google DeepMind! I’ll be working on LLM RL reasoning. Exciting times ahead, and I'm eager to collaborate with and learn from the brilliant minds here.

Mehran Kazemi (@kazemi_sm)'s Twitter Profile Photo

Is BIG-Bench Hard too easy for your LLM? We just unleashed BIG-Bench EXTRA Hard (BBEH)! 😈 Every task, harder! Every model, humbled! (Poem Credit: Gemini 2.0 Flash) Massive headroom for progress across various areas in general reasoning 🤯

Arian Hosseini (@ariantbd)'s Twitter Profile Photo

Happy to see a flavor 🧂 of Compositional GSM in this work.
Comp GSM: arxiv.org/abs/2410.01748
BBEH: arxiv.org/abs/2502.19187

Reyhane Askari (@reyhaneaskari)'s Twitter Profile Photo

🚀 New Paper Alert! Can we generate informative synthetic data that truly helps a downstream learner? Introducing Deliberate Practice for Synthetic Data (DP)—a dynamic framework that focuses on where the model struggles most to generate useful synthetic training examples. 🔥

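The selection idea described in the announcement — steer generation toward examples the downstream learner struggles with — can be sketched in a few lines. Everything below is illustrative, not the paper's actual method: `toy_learner`, the `difficulty` feature, and the entropy-based scoring are stand-ins chosen so the example is self-contained and runnable.

```python
import math
import random

random.seed(0)

def entropy(probs):
    """Shannon entropy (nats) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def toy_learner(example):
    """Stand-in for a downstream model: returns class probabilities.
    Confidence degrades with a synthetic 'difficulty' feature."""
    d = example["difficulty"]           # in [0, 1]
    p_correct = 1.0 - 0.5 * d           # harder -> closer to uniform
    return [p_correct, 1.0 - p_correct]

def deliberate_practice_round(generator, learner, n_candidates=1000, keep=100):
    """One selection round: over-generate candidate examples, then keep
    the ones the learner is most uncertain about (highest entropy)."""
    candidates = [generator() for _ in range(n_candidates)]
    scored = sorted(candidates, key=lambda ex: entropy(learner(ex)), reverse=True)
    return scored[:keep]

# Hypothetical generator producing examples with a difficulty knob.
gen = lambda: {"difficulty": random.random()}
batch = deliberate_practice_round(gen, toy_learner)
# The kept batch skews hard: its mean difficulty sits well above the
# prior mean of 0.5 for uniformly drawn candidates.
print(sum(ex["difficulty"] for ex in batch) / len(batch))
```

Since the toy learner's uncertainty is monotone in `difficulty`, keeping the top-100 of 1,000 uniform draws selects roughly the hardest decile — the mechanism the tweet describes, in miniature.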
Edward Grefenstette (@egrefen)'s Twitter Profile Photo

Our team in London is hiring! If you want to come work with a wonderful group of researchers on investigating the frontiers of autonomous open-ended agents that help humans be better at doing things we love, come have a look. Link in post below 👇

Arian Hosseini (@ariantbd)'s Twitter Profile Photo

Test-time scaling with GenRMs being “more efficient” than self-consistency (SC) is misleading 😶‍🌫️ In our new work, in a compute-matched setup, we show that SC outperforms GenRM across most budgets! 💰 Check out the details, code, and paper here: arxiv.org/abs/2504.01005
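For readers unfamiliar with the baseline: self-consistency just samples many solutions and majority-votes on the final answers, so its compute cost is simply the number of generations. A minimal sketch, with a toy sampler standing in for an LLM (the 60% accuracy and the answer strings are made up for illustration):

```python
import random
from collections import Counter

def self_consistency(sample_solution, n):
    """Self-consistency (SC): sample n solutions independently and
    majority-vote on their final answers. Compute cost: n generations."""
    answers = [sample_solution() for _ in range(n)]
    answer, _count = Counter(answers).most_common(1)[0]
    return answer

# Toy sampler: the correct answer "42" comes up 60% of the time and the
# errors are scattered over several wrong answers, so the vote
# concentrates on the correct one.
random.seed(1)
sampler = lambda: "42" if random.random() < 0.6 else random.choice(["7", "13", "99"])
print(self_consistency(sampler, n=64))  # the majority answer
```

A compute-matched comparison then gives GenRM the same total budget — e.g. fewer sampled solutions, with the remaining generations spent on verifications — rather than comparing at equal numbers of solutions.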

Arian Hosseini (@ariantbd)'s Twitter Profile Photo

At higher budgets GenRM beats SC, but a question remains: how should the budget be allocated between generating more solutions and more verifications? We derive scaling laws for the optimal # of solutions and verifications, showing that the # of solutions should grow 1.5-2x faster than the # of verifications.

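To make the allocation question concrete, here is a toy version of such a rule. It assumes each verification costs about one solution generation (so S solutions, each verified V times, cost S·(1+V)) and uses an illustrative 2:1 growth-rate ratio between solutions and verifications — not the paper's fitted scaling-law values.

```python
def split_budget(budget, s_exp=2.0, v_exp=1.0):
    """Toy budget split for GenRM-style test-time scaling.
    One budget unit per generated solution or verification, so the total
    cost of S solutions each verified V times is S * (1 + V).
    The exponent ratio s_exp/v_exp = 2 encodes 'solutions grow ~2x
    faster than verifications in log-budget' (illustrative only)."""
    best = (1, 1)
    s = 1
    while True:
        # Keep log V / log S = v_exp / s_exp, i.e. V = S ** (v_exp / s_exp).
        v = max(1, round(s ** (v_exp / s_exp)))
        if s * (1 + v) > budget:
            break
        best = (s, v)
        s += 1
    return best

for b in (128, 1024, 8192):
    s, v = split_budget(b)
    print(f"budget={b}: solutions={s}, verifications={v}")
```

Running this, each 8x increase in budget roughly doubles the verification count while the solution count grows about 4x — the qualitative shape of the claim that solutions should scale faster than verifications.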
Nishad Singhi (@nishadsinghi)'s Twitter Profile Photo

We've released the models and data for this paper on huggingface 🤗: huggingface.co/sc-genrm-scali…
👉 Fine-tuned generative verifiers
👉 GPT-4o generated synthetic data for fine-tuning your own verifiers
👉 Verifications from lots of models (incl. QwQ-32B) on lots of datasets!

Arian Hosseini (@ariantbd)'s Twitter Profile Photo

Is there a study on what sorts of concepts we should let LLMs learn by themselves rather than injecting our preferences or inductive biases? For instance, is CoT reasoning the kind of thing that needs (language) preference/bias?

Arian Hosseini (@ariantbd)'s Twitter Profile Photo

After a lovely week away from LLMs in Vietnam, I’ll be at ICLR 2025 🇸🇬 Will be at the posters for GenRM and Training LLM Reasoners via Compute-Optimal Sampling. Also at the Google DeepMind booth on the 25th (and more). Come chat about LLM reasoning and RL, or for Vietnam recommendations!

Mehran Kazemi (@kazemi_sm)'s Twitter Profile Photo

Following several requests, we now have BBEH Mini with 460 examples (20 per task) for faster and cheaper experimentation. The set can be downloaded from github.com/google-deepmin… and the results are reported in Table 3 of arxiv.org/pdf/2502.19187

Rishabh Agarwal (@agarwl_)'s Twitter Profile Photo

Idea: Merging generative verification and solution generation during RL training of LLM reasoners. Why? This allows you to scale inference compute both sequentially (long CoT) and in parallel (best-of-N, weighted majority voting). Next? Generation and verification to be trained end

Jason Weston (@jaseweston)'s Twitter Profile Photo

🚨Announcing RAM 2 workshop @ COLM25 - call for papers🚨 - 10 years on, we present the sequel to the classic RAM🐏 (Reasoning, Attention, Memory) workshop that took place in 2015 at the cusp of major change in the area. Now in 2025 we reflect on what's happened and discuss the
