Rohan Paul

@rohanpaul_ai

💼 Engineer.

📚 I write daily on actionable AI developments.

🗞️ Subscribe and instantly get a 1300+page Python book → rohan-paul.com

ID: 2588345408

Link: http://www.rohan-paul.com

Date: 25-06-2014 22:38:54

36,36K Tweets

63,63K Followers

780 Following

6 months ago

Qwen2.5-Math-7B-Instruct can scale to o1-level accuracy in only 32 rollouts. The paper's methods have a 4–16x better scaling rate than their deterministic search counterparts. Current inference-time scaling often relies on imperfect reward models, which leads to “reward hacking.”

151 likes

2 replies

34 retweets
