Muhammad Khalifa (@mkhalifaaaa)'s Twitter Profile
Muhammad Khalifa

@mkhalifaaaa

Ph.D. student at @Umich, previously @cohere, @allenai and @AmazonScience, and @NaverLabsEurope. Reasoning with LLMs, attribution and other stuff.

ID: 1106539012079128577

Link: http://mukhal.github.io
Joined: 15-03-2019 12:53:53

298 Tweets

533 Followers

472 Following

Vishaal Udandarao (@vishaal_urao)'s Twitter Profile Photo

🚀New Paper!
arxiv.org/abs/2504.07086

Everyone’s celebrating rapid progress in math reasoning with RL/SFT. But how real is this progress?

We re-evaluated recently released popular reasoning models—and found reported gains often vanish under rigorous testing!! 👀

🧵👇
Muhammad Khalifa (@mkhalifaaaa)'s Twitter Profile Photo

I gave a talk last week at DLCT ML Collective on my work at Cohere on model merging of 100B LLMs. Thanks Rosanne Liu and Jason Yosinski for hosting me!

Here are the slides: docs.google.com/presentation/d…

Paper: arxiv.org/abs/2412.04144
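The paper above studies merging at 100B scale; as a toy illustration only, here is the simplest flavor of model merging (plain linear weight averaging, not necessarily the method in the paper), with parameters as plain Python floats standing in for tensors:

```python
def merge_models(state_dicts, weights=None):
    """Linearly average per-parameter values across checkpoints.

    state_dicts: list of {param_name: value} dicts with identical keys.
    weights: optional mixing coefficients; defaults to uniform.
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    assert abs(sum(weights) - 1.0) < 1e-9, "mixing weights should sum to 1"
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged

# Toy example: two "checkpoints" with two scalar parameters each.
a = {"layer.w": 1.0, "layer.b": 0.0}
b = {"layer.w": 3.0, "layer.b": 2.0}
print(merge_models([a, b]))  # → {'layer.w': 2.0, 'layer.b': 1.0}
```

With real models the same loop would run over `state_dict()` tensors instead of floats; non-uniform `weights` let you bias the merge toward one checkpoint.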
Yunxiang Zhang (@yunxiangzhang4)'s Twitter Profile Photo

🚨 New Benchmark Drop!
Can LLMs actually do ML research? Not toy problems, not Kaggle tweaks—but real, unsolved ML conference research competitions?
We built MLRC-BENCH to find out.
Paper: arxiv.org/abs/2504.09702
Leaderboard: huggingface.co/spaces/launch/…
Code: github.com/yunx-z/MLRC-Be…
Ayoung Lee (@o_cube01)'s Twitter Profile Photo

📢New benchmark out!

We introduce CLASH, a benchmark of 345💥high-stakes dilemmas and 3,795 perspectives to evaluate how well LLMs handle complex value reasoning.

GPT-4 and Claude? Not quite there.

📄 arxiv.org/pdf/2504.10823
🤗 huggingface.co/datasets/launc…
Honglak Lee (@honglaklee)'s Twitter Profile Photo

LLMs can reason step-by-step — but how do we verify each step?

Process Reward Models (PRMs) can do step-level verification, but training them requires expensive, large-scale supervision.

To improve efficiency, we introduce ThinkPRM:
- Verifies reasoning via long CoT

Rishabh Agarwal (@agarwl_)'s Twitter Profile Photo

Going beyond verifiable domains, we still need reward models, which will likely be generative verifiers! Recent papers along this direction: 

1. Scaling RL with RMs on "synthetic" prompts @ ICML25

2. Step by Step Verifiers That Think -- Better perf than PRM800K with 1K labels
Rohan Paul (@rohanpaul_ai)'s Twitter Profile Photo

Training Process Reward Models (PRMs) for step-by-step verification requires extensive, expensive step-level labels, while simple LLM-as-a-Judge approaches perform poorly.

This paper proposes THINKPRM, a PRM fine-tuned on minimal labels (8K), which verifies solution steps by
Rishabh Agarwal (@agarwl_)'s Twitter Profile Photo

Idea: Merging generative verification and solution generation during RL training of LLM reasoners.

Why? This allows you to scale inference compute both sequentially (long CoT) and in parallel (Best-of-N, weighted majority voting).

Next? Generation and verification to be trained end
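A minimal sketch of the weighted majority voting mentioned above, assuming N parallel completions have already been scored by some verifier (the scores here are made up for illustration):

```python
from collections import defaultdict

def weighted_majority_vote(samples):
    """Pick the final answer by summing verifier scores per distinct answer.

    samples: list of (answer, verifier_score) pairs, e.g. N parallel
    chain-of-thought completions each scored by a verifier model.
    """
    totals = defaultdict(float)
    for answer, score in samples:
        totals[answer] += score
    return max(totals, key=totals.get)

# "42" wins with total weight 0.9 + 0.7 = 1.6, beating "40" at 0.95.
samples = [("42", 0.9), ("41", 0.4), ("42", 0.7), ("40", 0.95)]
print(weighted_majority_vote(samples))  # → 42
```

Plain majority voting is the special case where every score is 1.0; Best-of-N instead takes the single highest-scoring sample.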

Muhammad Khalifa (@mkhalifaaaa)'s Twitter Profile Photo

RL for training generative judges is promising, but one issue I personally observed is that reward hacking is super common when the task is binary.

Why not train a really good reasoning model with RLVR, then minimally SFT it as a verifier?

ThinkPRM: arxiv.org/abs/2504.16828

Wenhu Chen (@wenhuchen)'s Twitter Profile Photo

We are super excited to announce Verl-Tool, a user-friendly framework that supports diverse types of agentic RL training.

github.com/TIGER-AI-Lab/v…

We now support Code-Interpreter, Pixel Operations, Browser, and Bash. If you need to support your tool or

Muhammad Khalifa (@mkhalifaaaa)'s Twitter Profile Photo

Simple yet cool idea. I find it interesting how the community now cares more about pass@k than pass@1 eval, which dominated the field over the last 5-6 months.
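For reference, pass@k is usually computed with the standard unbiased estimator: given n samples per problem of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems (pass@1 is the k=1 case):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator for one problem.

    n: total samples drawn, c: number of correct samples, k <= n.
    Equals the probability that at least one of k samples chosen
    uniformly without replacement from the n is correct.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=1, k=1))   # ≈ 0.1
print(pass_at_k(n=10, c=1, k=10))  # 1.0
```

Computing it this way, rather than literally drawing one batch of k samples, greatly reduces the variance of the estimate.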

Muhammad Khalifa (@mkhalifaaaa)'s Twitter Profile Photo

One cool use of generative step-by-step verifiers is analyzing your RL-trained model outputs

Having fun with LM Studio + ThinkPRM-14B 👇
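The analysis workflow above can be sketched as a loop that walks a solution's steps and reports the first one the verifier flags. The `step_scorer` here is a stand-in for a real step-level verifier such as ThinkPRM (whose actual API this does not reproduce), and the stub scorer is purely illustrative:

```python
def first_flagged_step(steps, step_scorer, threshold=0.5):
    """Return the index of the first step whose verifier score falls
    below `threshold`, or None if every step passes.

    step_scorer: callable mapping one reasoning step (str) to a score
    in [0, 1] -- in practice, a step-level verifier model.
    """
    for i, step in enumerate(steps):
        if step_scorer(step) < threshold:
            return i
    return None

# Stub scorer for illustration: flags the step with an arithmetic error.
steps = ["2 + 2 = 4", "4 * 3 = 12", "12 - 5 = 8"]
stub = lambda s: 0.1 if s == "12 - 5 = 8" else 0.9
print(first_flagged_step(steps, stub))  # → 2
```

Pointing this at an RL-trained model's chains of thought gives a quick, automated way to localize where its reasoning goes wrong.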