Ameya P. (@amyprb)'s Twitter Profile
Ameya P.

@amyprb

Exploring Science of Benchmarking and Economics of Transformative AI.
Postdoc @bethgelab @uni_tue;
Previously: @OxfordTVG, @intelailabs

RT != endorsement

ID: 1439850368607850496

Link: http://drimpossible.github.io · Joined: 20-09-2021 07:15:24

645 Tweets

237 Followers

287 Following

Shashwat Goel (@shashwatgoel7)'s Twitter Profile Photo

I realized much of the disagreement is about "what is the right baseline". If the claim is "<our RL> improves reasoning", the null hypothesis is that RL just tuned the model to your specific inference hparams, so it's important to show that changing hparams doesn't give the same gains.

Xinyu Zhu (@tianhongzxy)'s Twitter Profile Photo

🔥The debate’s been wild: How does the reward in RLVR actually improve LLM reasoning?🤔
🚀Introducing our new paper👇
💡TL;DR: Just penalizing incorrect rollouts❌ — no positive reward needed — can boost LLM reasoning, and sometimes better than PPO/GRPO!

🧵[1/n]
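
For intuition, here is a minimal sketch of the reward scheme the TL;DR describes: incorrect rollouts are penalized (-1) and correct ones get no positive reward (0). The GRPO-style group normalization and all names below are illustrative assumptions, not the paper's exact recipe.

```python
# Negative-only verifiable reward: penalize incorrect rollouts, give no
# positive reward for correct ones. Illustrative sketch, not the paper's code.

def negative_only_rewards(is_correct: list[bool]) -> list[float]:
    return [0.0 if ok else -1.0 for ok in is_correct]

def group_normalized_advantages(rewards: list[float]) -> list[float]:
    # GRPO-style baseline: subtract the mean reward of the rollout group.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Example: four rollouts for one prompt, three correct and one incorrect.
rewards = negative_only_rewards([True, True, False, True])
print(rewards)                               # [0.0, 0.0, -1.0, 0.0]
print(group_normalized_advantages(rewards))  # [0.25, 0.25, -0.75, 0.25]
```

Note that after group normalization the correct rollouts still end up with a positive advantage, which is one intuition for why penalizing errors alone can steer the policy.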
alz (@alz_zyd_)'s Twitter Profile Photo

IMO, a major driver of PhD attrition is: do you have the willpower to finish a paper? A full published econ/finance paper is well over 100 pages of work after revisions and such; it's longer than the longest thing many students have ever done at that point in their lives.

Alexander Doria (@dorialexander)'s Twitter Profile Photo

Announcing the release of the official Common Corpus paper: a 20-page report detailing how we collected, processed and published 2 trillion tokens of reusable data for LLM pretraining.

Timothy B. Lee (@binarybits)'s Twitter Profile Photo

Thinking that mechanistic interpretability is the key to understanding AI safety is like thinking that neuroscience is the key to understanding political science.

Ameya P. (@amyprb)'s Twitter Profile Photo

Yep, strong agree - 'having some way to consolidate experience is good. In the past, continual learning research never had a good response to "just retrain bro", but maybe this will change in the age of experience'

Ameya P. (@amyprb)'s Twitter Profile Photo

Forecasting has a lot of potential, but there are lots of open problems to tackle to get there. A really good reference on the critical problems it faces 👇

Chris (@chatgpt21)'s Twitter Profile Photo

Elon Musk’s recent Twitter meltdown over the $2.4 trillion infrastructure bill seems contradictory given his own AGI timeline. If Musk genuinely believes AI will surpass human intelligence by late 2025, then he must also acknowledge the unprecedented productivity boom AGI would

Rohan Paul (@rohanpaul_ai)'s Twitter Profile Photo

Qwen3 Embedding Paper

Embedding and reranking pipelines still struggle with task diversity and multilingual consistency.

This report introduces a multi-stage training framework that uses Qwen3 LLMs to synthesize rich, multilingual relevance data, followed by contrastive
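
The contrastive stage it refers to is, in embedding training generally, an InfoNCE-style objective over (query, positive) pairs with in-batch negatives. A minimal sketch of that generic objective follows; it is not necessarily the report's exact formulation, and the shapes are illustrative.

```python
# Generic InfoNCE contrastive loss with in-batch negatives, as commonly used
# to train text-embedding models. Illustrative sketch only.
import torch
import torch.nn.functional as F

def info_nce_loss(q_emb: torch.Tensor, p_emb: torch.Tensor, tau: float = 0.05):
    """q_emb, p_emb: (batch, dim) query / positive-passage embeddings."""
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    logits = q @ p.T / tau             # scaled cosine similarities
    labels = torch.arange(q.size(0))   # i-th positive matches i-th query
    return F.cross_entropy(logits, labels)

loss = info_nce_loss(torch.randn(8, 1024), torch.randn(8, 1024))
```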
Eric W. Tramel (@fujikanaeda)'s Twitter Profile Photo

If you need some synthetic personas/people for your data projects, my team just put out a CC-BY-4.0 open dataset on HF of 100k folks matching US-census distributions for your free use. 

nvidia/Nemotron-Personas
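
For anyone wanting a quick look, the dataset can be pulled with the Hugging Face datasets library. The repo id comes from the tweet; the split name and whatever fields each record carries are assumptions to verify against the dataset card.

```python
# Load the synthetic-persona dataset mentioned above. The repo id is from the
# tweet; the "train" split name is an assumption to check on the dataset card.
from datasets import load_dataset

ds = load_dataset("nvidia/Nemotron-Personas", split="train")
print(ds)     # number of rows and column names
print(ds[0])  # one synthetic persona record
```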
Dwarkesh Patel (@dwarkesh_sp)'s Twitter Profile Photo

New episode with Kenneth S Rogoff, former chief economist of the IMF. Ken predicts that, within the next decade, the US will have a debt-induced inflation crisis, but not a Japan-type financial crisis (the latter is much worse, and can make a country poorer for generations). Ken

Kording Lab 🦖 (@kordinglab)'s Twitter Profile Photo

I missed this when it came out, and I think it is important. It seems that AI systems are generally not stress-tested enough before publication/press releases!

Sebastian Dziadzio (@sbdzdz)'s Twitter Profile Photo

I'm in Nashville for CVPR and wow, the Music City name is not exaggerated. If you're around, we'll be presenting our work on temporal model merging with Vishaal Udandarao✈️CVPR'25, Karsten Roth, and Ameya P. (at CVPR) on Saturday 5-7 pm in ExHall D (poster #445). Come say hi!

Benjamin Todd (@ben_j_todd)'s Twitter Profile Photo

Why can AIs code for 1h but not 10h?

A simple explanation: if there's a 10% chance of error per 10min step (say), the success rate is:

1h: 53%
4h: 8%
10h: 0.2%

Toby Ord has tested this 'constant error rate' theory and shown it's a good fit for the data

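The arithmetic here is a constant-hazard model: with a 10% failure chance per 10-minute step, success over t hours is 0.9^(6t). A quick check (note 10h comes out to roughly 0.2%):

```python
# Constant error rate model: 10% failure chance per 10-minute step,
# so the chance of completing t hours of work is 0.9 ** (6 * t).
for hours in (1, 4, 10):
    print(f"{hours}h: {0.9 ** (6 * hours):.2%}")
# -> 1h: 53.14%, 4h: 7.98%, 10h: 0.18%
```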
anton (@atroyn)'s Twitter Profile Photo

my maybe most heretical startup opinion is that more founders should quit sooner and do something else. yeah, yc loves to talk about how long it took airbnb to take off, but we don't have the counterfactual of the better company chesky could have built instead if he had quit.