Allen Nie (🇺🇦☮️) (@allen_a_nie) Twitter Tweets • TwiCopy

John Yang

4 months ago

40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synthesizing a ton of agentic training data from 100+ Python repos. Today we’re open-sourcing the toolkit that made it happen: SWE-smith.

thumb_up_off_alt638

chat_bubble_outline25

repeat132

shareShare

Jiayi Pan

@jiayi_pirate

4 months ago

Hieu Pham My friend and lab mate Ruiqi Zhong has a great blog post reflecting many PhD students' thoughts on this: ruiqizhong.substack.com/p/is-a-phd-on-…

thumb_up_off_alt131

chat_bubble_outline1

repeat9

shareShare

Allen Nie (🇺🇦☮️)

@allen_a_nie

4 months ago

I think more and more people will realize RL is RL — Deep RL (gradient-based) is a particular solution for RL. Focus on the problem — solve it however you want.

thumb_up_off_alt34

chat_bubble_outline1

repeat3

shareShare

Joseph Suarez (e/🐡)

@jsuarez5341

3 months ago

There are no intuitions about what is going on here. MDPs are a bad model for real RL problems, and even for most toy ones. RL is hard to explain because your data comes from interacting with an environment. Non-stationary + hard to make fast!

thumb_up_off_alt256

chat_bubble_outline9

repeat12

shareShare

Ching-An Cheng (Hiring 2025 intern)

@chinganc_rl

3 months ago

We're organizing workshops on Programmatic Representation for Agent Learning at the upcoming #ICML2025 and #RLC2025. We welcome contributions using programs as policies, reward functions, skill libraries, task generators, environment models, etc., and more! See you soon!😀

thumb_up_off_alt7

chat_bubble_outline1

repeat2

shareShare

Brando Miranda

@brandohablando

3 months ago

One of our newest pre-training projects was built with Marin! Stay tuned for more soon! Thanks for Elyas Obbad & David Hall for being so fun to work with -- and Percy Liang help test Marin & Sanmi Koyejo really good kind advice. & Rylan Schaeffer for his very efficient feedback ;)

thumb_up_off_alt12

chat_bubble_outline0

repeat2

shareShare

Allan Zhou

@allanzhou17

3 months ago

How should we order training examples? In a new blogpost (w/ Yiding Jiang), we explore a compression-based perspective: order your dataset to minimize its prequential codelength.

thumb_up_off_alt28

chat_bubble_outline0

repeat2

shareShare

Anshul Kundaje (anshulkundaje@bluesky)

@anshulkundaje

3 months ago

Nice example of low hanging fruit connecting the dots. Some of the comments with links to papers suggest this is not a denovo discovery & may have been obvious in hindsight. But a lot of things are obvious in hindsight. 1/

thumb_up_off_alt199

chat_bubble_outline6

repeat14

shareShare

Luca Viano

@lucaviano4

3 months ago

Our new preprint is online ! Structural assumptions on the MDP helps in imitation learning, even if offline :) Joint work with Gergely Neu and Antoine Moulin 😎

thumb_up_off_alt22

chat_bubble_outline0

repeat4

shareShare

Csordás Róbert

@robert_csordas

3 months ago

Your language model is wasting half of its layers to just refine probability distributions rather than doing interesting computations. In our paper, we found that the second half of the layers of the Llama 3 models have minimal effect on future computations. 1/6

thumb_up_off_alt1,1K

chat_bubble_outline32

repeat139

shareShare

Allen Nie (🇺🇦☮️)

@allen_a_nie

3 months ago

I think this type of “tiny”-training is another viable path to make RL work on real world tasks.

thumb_up_off_alt4

chat_bubble_outline0

repeat0

shareShare

Shunyu Yao

@shunyuyao12

3 months ago

Tech is overestimated in the short term, (because infra is so damn harder than people realize) And underestimated in the long run. (becuase new tech becomes infra for new applications) Applies for computer, chip, internet, llm, rl, etc.

thumb_up_off_alt101

chat_bubble_outline4

repeat5

shareShare

Andrea Zanette

@zanette_ai

3 months ago

Can Large Reasoning Models Self-Train? We propose Self-Rewarded Training (SRT)—where LLMs generate their own supervision. Main findings: SRT initially matches RL on ground truth, but sustained training risks reward hacking. We also investigate mitigation strategies.

thumb_up_off_alt52

chat_bubble_outline1

repeat7

shareShare

Anne Ouyang

@anneouyang

3 months ago

✨ New blog post 👀: We have some very fast AI-generated kernels generated with a simple test-time only search. They are performing close to or in some cases even beating the standard expert-optimized production kernels shipped in PyTorch. (1/6) [🔗 link in final post]

thumb_up_off_alt776

chat_bubble_outline25

repeat95

shareShare

Allen Nie (🇺🇦☮️)

@allen_a_nie

3 months ago

I'm experiencing the 🤩 moment when an amazing company just built their new library on top of the framework I helped build! Trace is a library for creating **extremely flexible** LLM-based workflows. Syftr uses Trace to optimize their workflow and push the cost-accuracy Pareto

thumb_up_off_alt32

chat_bubble_outline1

repeat1

shareShare

Lucy Li

@lucy3_li

3 months ago

"Tell, Don't Show" was accepted to #ACL2025 Findings! Our simple approach for literary topic modeling combines the new (language models) with the old (classic LDA) to yield better topics. A possible addition to your CSS/DH research 🛠️ box ✨📚 arxiv.org/abs/2505.23166

thumb_up_off_alt126

chat_bubble_outline4

repeat19

shareShare

Allen Nie (🇺🇦☮️)

@allen_a_nie

3 months ago

I'm onboarding a research dev from France for Trace today with Ching-An and Adith. None of us knew him before. He just shipped, built, and impressed everyone 😅

thumb_up_off_alt14

chat_bubble_outline1

repeat0

shareShare

Alexander Terenin

@avt_im

3 months ago

We've got a major update to our preprint on adversarial regret guarantees for Thompson sampling! As before, I think this is one of the most important projects I've worked on due to new algorithmic primitives that it - in principle - unlocks. Thread below on what's new!

thumb_up_off_alt78

chat_bubble_outline1

repeat6

shareShare

Haitham Bou Ammar

@hbouammar

3 months ago

I read this paper in detail, and I am very sad! They literally re-do the optimal reward baseline work that we have known since forever, without even crediting the true authors in their derivations. The third screenshot is taken from: ieeexplore.ieee.org/stamp/stamp.js… As you see, they

thumb_up_off_alt99

chat_bubble_outline5

repeat10

shareShare

Nan Jiang

@nanjiang_cs

3 months ago

Given the sheer number of ppl interested in PG methods nowadays I'm sure innocent "rediscoveries" like this are happening everyday. Otoh, due diligence takes minimal effort today as you can just DeepResearch. All it takes is the sense/taste to ask "no way this is not done b4"...

thumb_up_off_alt33

chat_bubble_outline0

repeat4

shareShare