Muhammad Khalifa (@mkhalifaaaa)'s Twitter Profile
Muhammad Khalifa

@mkhalifaaaa

Ph.D. student at @Umich, previously @cohere, @allenai and @AmazonScience, and @NaverLabsEurope. Reasoning with LLMs, attribution and other stuff.

ID: 1106539012079128577

Link: http://mukhal.github.io
Joined: 15-03-2019 12:53:53

298 Tweets

533 Followers

472 Following

Vishaal Udandarao (@vishaal_urao)'s Twitter Profile Photo

🚀New Paper!
arxiv.org/abs/2504.07086

Everyone’s celebrating rapid progress in math reasoning with RL/SFT. But how real is this progress?

We re-evaluated recently released popular reasoning models—and found reported gains often vanish under rigorous testing!! 👀

🧵👇
Muhammad Khalifa (@mkhalifaaaa)'s Twitter Profile Photo

I gave a talk last week at DLCT ML Collective on my work at Cohere on model merging of 100B LLMs. Thanks Rosanne Liu and Jason Yosinski for hosting me!

Here are the slides: docs.google.com/presentation/d…

Paper: arxiv.org/abs/2412.04144
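The paper above studies merging at 100B scale; as a toy illustration only, here is the simplest flavor of model merging (plain linear weight averaging, not necessarily the method in the paper), with parameters as plain Python floats standing in for tensors:

```python
def merge_models(state_dicts, weights=None):
    """Linearly average per-parameter values across checkpoints.

    state_dicts: list of {param_name: value} dicts with identical keys.
    weights: optional mixing coefficients; defaults to uniform.
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    assert abs(sum(weights) - 1.0) < 1e-9, "mixing weights should sum to 1"
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged

# Toy example: two "checkpoints" with two scalar parameters each.
a = {"layer.w": 1.0, "layer.b": 0.0}
b = {"layer.w": 3.0, "layer.b": 2.0}
print(merge_models([a, b]))  # → {'layer.w': 2.0, 'layer.b': 1.0}
```

With real models the same loop would run over `state_dict()` tensors instead of floats; non-uniform `weights` let you bias the merge toward one checkpoint.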
Yunxiang Zhang (@yunxiangzhang4)'s Twitter Profile Photo

🚨 New Benchmark Drop!
Can LLMs actually do ML research? Not toy problems, not Kaggle tweaks—but real, unsolved ML conference research competitions?
We built MLRC-BENCH to find out.
Paper: arxiv.org/abs/2504.09702
Leaderboard: huggingface.co/spaces/launch/…
Code: github.com/yunx-z/MLRC-Be…
Ayoung Lee (@o_cube01)'s Twitter Profile Photo

📢New benchmark out!

We introduce CLASH, a benchmark of 345💥high-stakes dilemmas and 3,795 perspectives to evaluate how well LLMs handle complex value reasoning.

GPT-4 and Claude? Not quite there.

📄 arxiv.org/pdf/2504.10823
🤗 huggingface.co/datasets/launc…
Honglak Lee (@honglaklee)'s Twitter Profile Photo

LLMs can reason step-by-step — but how do we verify each step?

Process Reward Models (PRMs) can do step-level verification, but training them requires expensive, large-scale supervision.

To improve efficiency, we introduce ThinkPRM:
- Verifies reasoning via long CoT

Rishabh Agarwal (@agarwl_)'s Twitter Profile Photo

Going beyond verifiable domains, we still need reward models, which will likely be generative verifiers! Recent papers along this direction: 

1. Scaling RL with RMs on "synthetic" prompts @ ICML25

2. Step by Step Verifiers That Think -- Better perf than PRM800K with 1K labels
Rohan Paul (@rohanpaul_ai)'s Twitter Profile Photo

Training Process Reward Models (PRMs) for step-by-step verification requires extensive, expensive step-level labels, while simple LLM-as-a-Judge approaches perform poorly.

This paper proposes THINKPRM, a PRM fine-tuned on minimal labels (8K), which verifies solution steps by
Rishabh Agarwal (@agarwl_)'s Twitter Profile Photo

Idea: Merging generative verification and solution generation during RL training of LLM reasoners.

Why? This allows you to scale inference compute both sequentially (long CoT) and in parallel (Best-of-N, weighted majority voting).

Next? Generation and verification to be trained end
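A minimal sketch of the weighted majority voting mentioned above, assuming N parallel completions have already been scored by some verifier (the scores here are made up for illustration):

```python
from collections import defaultdict

def weighted_majority_vote(samples):
    """Pick the final answer by summing verifier scores per distinct answer.

    samples: list of (answer, verifier_score) pairs, e.g. N parallel
    chain-of-thought completions each scored by a verifier model.
    """
    totals = defaultdict(float)
    for answer, score in samples:
        totals[answer] += score
    return max(totals, key=totals.get)

# "42" wins with total weight 0.9 + 0.7 = 1.6, beating "40" at 0.95.
samples = [("42", 0.9), ("41", 0.4), ("42", 0.7), ("40", 0.95)]
print(weighted_majority_vote(samples))  # → 42
```

Plain majority voting is the special case where every score is 1.0; Best-of-N instead takes the single highest-scoring sample.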

Muhammad Khalifa (@mkhalifaaaa)'s Twitter Profile Photo

RL for training generative judges is promising, but one issue I personally observed is that reward hacking is super common when the task is binary.

Why not train a really good reasoning model with RLVR, then minimally SFT it as a verifier?

ThinkPRM: arxiv.org/abs/2504.16828

Wenhu Chen (@wenhuchen)'s Twitter Profile Photo

We are super excited to announce Verl-Tool, a user-friendly framework that supports diverse types of agentic RL training.

github.com/TIGER-AI-Lab/v…

We now support Code-Interpreter, Pixel Operations, Browser, and Bash. If you need to support your tool or

Muhammad Khalifa (@mkhalifaaaa)'s Twitter Profile Photo

Simple yet cool idea. I find it interesting how the community now cares more about pass@k than pass@1 eval, which dominated the field over the last 5-6 months.
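For reference, pass@k is usually computed with the standard unbiased estimator: given n samples per problem of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems (pass@1 is the k=1 case):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator for one problem.

    n: total samples drawn, c: number of correct samples, k <= n.
    Equals the probability that at least one of k samples chosen
    uniformly without replacement from the n is correct.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=1, k=1))   # ≈ 0.1
print(pass_at_k(n=10, c=1, k=10))  # 1.0
```

Computing it this way, rather than literally drawing one batch of k samples, greatly reduces the variance of the estimate.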

Muhammad Khalifa (@mkhalifaaaa)'s Twitter Profile Photo

One cool use of generative step-by-step verifiers is analyzing your RL-trained model outputs

Having fun with LM Studio + ThinkPRM-14B 👇
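The analysis workflow above can be sketched as a loop that walks a solution's steps and reports the first one the verifier flags. The `step_scorer` here is a stand-in for a real step-level verifier such as ThinkPRM (whose actual API this does not reproduce), and the stub scorer is purely illustrative:

```python
def first_flagged_step(steps, step_scorer, threshold=0.5):
    """Return the index of the first step whose verifier score falls
    below `threshold`, or None if every step passes.

    step_scorer: callable mapping one reasoning step (str) to a score
    in [0, 1] -- in practice, a step-level verifier model.
    """
    for i, step in enumerate(steps):
        if step_scorer(step) < threshold:
            return i
    return None

# Stub scorer for illustration: flags the step with an arithmetic error.
steps = ["2 + 2 = 4", "4 * 3 = 12", "12 - 5 = 8"]
stub = lambda s: 0.1 if s == "12 - 5 = 8" else 0.9
print(first_flagged_step(steps, stub))  # → 2
```

Pointing this at an RL-trained model's chains of thought gives a quick, automated way to localize where its reasoning goes wrong.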