Aaron Defazio (@aaron_defazio) 's Twitter Profile
Aaron Defazio

@aaron_defazio

Research Scientist at Meta working on optimization. Fundamental AI Research (FAIR) team

ID: 1376951872356024325

Link: http://aarondefazio.com · Joined: 30-03-2021 17:38:10

907 Tweets

7.7K Followers

544 Following

Harsh Bhatt (@harshbhatt7585) 's Twitter Profile Photo


trained with the Schedulefree AdamW optimizer (right), and it is so much smoother than normal AdamW (left).

here's the PR, check out how it is being used and integrated:
github.com/Metta-AI/metta…
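A minimal usage sketch based on the facebookresearch/schedule_free package README (the exact Metta-AI integration is in the linked PR, which is truncated above); the model and hyperparameters below are placeholders:

```python
# pip install schedulefree
import torch
import schedulefree

model = torch.nn.Linear(128, 10)
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=2.5e-3, warmup_steps=100)

optimizer.train()                      # Schedule-Free keeps two weight sequences, so it
for _ in range(1000):                  # must know whether you are training or evaluating
    x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    torch.nn.functional.cross_entropy(model(x), y).backward()
    optimizer.step()

optimizer.eval()                       # switch to averaged weights before validation
```

No learning-rate scheduler is attached; replacing the schedule is the whole point of the method.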
Soumith Chintala (@soumithchintala) 's Twitter Profile Photo

Mac Studio, you ask? Apple Engineering's **actual** time spent on PyTorch support hasn't given me confidence that the PyTorch Mac experience will get anywhere close to NVIDIA's any time soon, if ever. The Meta engineers continue to do a huge amount of heavy lifting for improving the

Dwarkesh Patel (@dwarkesh_sp) 's Twitter Profile Photo

The Andrej Karpathy interview

0:00:00 – AGI is still a decade away
0:30:33 – LLM cognitive deficits
0:40:53 – RL is terrible
0:50:26 – How do humans learn?
1:07:13 – AGI will blend into 2% GDP growth
1:18:24 – ASI
1:33:38 – Evolution of intelligence & culture
1:43:43 – Why self

Peyman Milanfar (@docmilanfar) 's Twitter Profile Photo

PLEASE DON’T PLOT "SCALING LAW" THINGS ON LOG-LOG SCALE. EVERYTHING IS ~ LINEAR IN LOG-LOG SCALE. THANK YOU FOR YOUR ATTENTION TO THIS MATTER!

Sham Kakade (@shamkakade6) 's Twitter Profile Photo

1/6 Introducing Seesaw: a principled batch size scheduling algo. Seesaw achieves theoretically optimal serial run time given a fixed compute budget and also matches the performance of cosine annealing at fixed batch size.

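The thread does not spell out the schedule here, so the snippet below is only an illustrative "grow the batch size, shrink the learning rate" loop in PyTorch, not the actual Seesaw rule; stage lengths, growth factors, and all names are assumptions made for the sketch:

```python
# Illustrative only: a generic batch-size schedule, NOT the Seesaw algorithm.
import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(4096, 32), torch.randint(0, 2, (4096,)))
model = torch.nn.Linear(32, 2)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

batch_size = 64
for stage in range(3):                      # three toy stages
    loader = DataLoader(data, batch_size=batch_size, shuffle=True)
    for x, y in loader:                     # one pass over the data per stage
        opt.zero_grad()
        torch.nn.functional.cross_entropy(model(x), y).backward()
        opt.step()
    batch_size *= 2                         # larger batches -> fewer serial steps
    for g in opt.param_groups:
        g["lr"] /= 2 ** 0.5                 # trade part of the LR decay for batch growth
```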
Aaron Defazio (@aaron_defazio) 's Twitter Profile Photo

> So what do you do?
- I research the mathematical foundations of learning through optimization, seeking to understand the algorithmic keys to unlock super-intelligence.
> **blank stare**
- I work in tech
> Oh cool. One of my best friends is a coder

jack morris (@jxmnop) 's Twitter Profile Photo

# The Kolmogorov complexity of new research

every new research paper or blog post can be compressed, in its ‘essence’, to three things:
- code
- artifacts (outputs of code execution)
- math (novel abstractions)

one of the main hopes I have for near-term AI systems is that they will

Aaron Defazio (@aaron_defazio) 's Twitter Profile Photo

AI Researchers: Epsilon greedy is such a bad algorithm lol we have to invent something better

Also AI Researchers: So I spend 90% of my time on ideas that I’m pretty sure will work, publish or perish right? And maybe 10% on blue-sky stuff, you know? Really out there ideas.

Mathieu (@miniapeur) 's Twitter Profile Photo

> So what do you do?
- I do research and write scientific papers.
> **blank stare**
- I am a PhD student
> Oh cool. The soup kitchen is this way

Konstantin Mishchenko (@konstmish) 's Twitter Profile Photo

Using Muon on large sharded models creates extra communication overhead from gather/scatter operations on sharded matrices. Turns out you can fix this by doing full Muon updates periodically (but not skipping them entirely) and using local Muon computation the rest of the time.

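A rough sketch of the schedule described above: do the full (gathered) Muon update only every K steps and a purely local, per-shard update in between. The orthogonalization helper and the gather/scatter callables are placeholders standing in for the real Muon kernel and the sharding framework's collectives, not a real library API:

```python
import torch

def muon_direction(momentum: torch.Tensor) -> torch.Tensor:
    # Stand-in for Muon's Newton-Schulz orthogonalization: return the orthogonal
    # polar factor of the momentum matrix (via SVD here for clarity, not speed).
    u, _, vh = torch.linalg.svd(momentum, full_matrices=False)
    return u @ vh

def sharded_muon_update(local_momentum, step, full_every_k, lr,
                        all_gather_matrix, scatter_shard):
    # all_gather_matrix / scatter_shard are hypothetical collectives from whatever
    # sharding framework is in use; only the periodic-full vs. local schedule is
    # the point of this sketch.
    if step % full_every_k == 0:
        full = all_gather_matrix(local_momentum)        # communication-heavy full update
        return -lr * scatter_shard(muon_direction(full))
    # cheap local update: orthogonalize the local shard only, no cross-device traffic
    return -lr * muon_direction(local_momentum)
```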
Konstantin Mishchenko (@konstmish) 's Twitter Profile Photo

I find it fascinating that momentum in standard convex optimization is just about making convergence faster, but in nonconvex problems, it's sometimes the only way a method can work at all. Just saw a new example of this phenomenon in the case of difference-of-convex functions.

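For reference, the classical heavy-ball form of the momentum update being referred to (standard notation, not taken from the paper under discussion):

$$
x_{t+1} = x_t - \eta \nabla f(x_t) + \beta\,(x_t - x_{t-1}), \qquad 0 \le \beta < 1.
$$

In smooth (strongly) convex problems the $\beta$ term only speeds up convergence, whereas the observation above is that in some nonconvex settings, such as difference-of-convex objectives, it can be what makes the method converge at all.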
Vinay S Rao (@vinaysrao) 's Twitter Profile Photo

While at Meta, I worked on this optimizer-wrapper (outer step lookahead momentum) we're calling Snoo (arxiv.org/abs/2510.15830). You can use it with AdamW or Muon and see really strong scaling. Here's a plot where we ran it against (tuned) AdamW up to 1e23 training flop scales.

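An illustrative sketch of an "outer step lookahead momentum" wrapper, guessed from the one-line description above; this is not the Snoo implementation from arxiv.org/abs/2510.15830, and all names and defaults here are assumptions:

```python
import torch

class OuterMomentumWrapper:
    """Hypothetical lookahead-style wrapper: every k inner steps, take one outer
    step on slow weights with momentum, then restart the fast weights from them."""

    def __init__(self, params, inner_optimizer, k=10, outer_lr=1.0, outer_momentum=0.9):
        self.params = [p for p in params]
        self.inner = inner_optimizer          # e.g. AdamW or a Muon implementation
        self.k, self.outer_lr, self.mu = k, outer_lr, outer_momentum
        self.slow = [p.detach().clone() for p in self.params]   # outer ("slow") weights
        self.buf = [torch.zeros_like(p) for p in self.params]   # outer momentum buffers
        self.t = 0

    def zero_grad(self):
        self.inner.zero_grad()

    @torch.no_grad()
    def step(self):
        self.inner.step()                     # ordinary inner step on the fast weights
        self.t += 1
        if self.t % self.k == 0:              # every k steps, take one outer step
            for p, s, b in zip(self.params, self.slow, self.buf):
                d = p - s                     # net movement over the last k inner steps
                b.mul_(self.mu).add_(d)       # momentum on the outer (lookahead) step
                s.add_(b, alpha=self.outer_lr)
                p.copy_(s)                    # restart fast weights from slow weights
```

Usage would look like wrapping an existing optimizer, e.g. `OuterMomentumWrapper(model.parameters(), torch.optim.AdamW(model.parameters(), lr=3e-4))`, then calling `zero_grad()` and `step()` as usual.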
Sham Kakade (@shamkakade6) 's Twitter Profile Photo

(1/9) Diagonal preconditioners such as Adam typically use empirical gradient information rather than true second-order curvature. Is this merely a computational compromise or can it be advantageous? Our work confirms the latter: Adam can outperform Gauss-Newton in certain cases.

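For context, Adam's diagonal preconditioner in standard notation (bias correction omitted; this is the textbook form, not the paper's):

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^{\odot 2}, \qquad
\theta_{t+1} = \theta_t - \eta\,\frac{m_t}{\sqrt{v_t}+\epsilon}.
$$

The preconditioner $\sqrt{v_t}$ is built from squared empirical gradients, whereas a diagonal Gauss-Newton method would use the diagonal of the Gauss-Newton matrix formed from the model Jacobian; the thread asks whether the former is merely a cheap stand-in or can actually be better.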
Aaron Defazio (@aaron_defazio) 's Twitter Profile Photo

Very cool, read the tech report! But... will locally connected Ising-like models work at larger scales? Do they need to scale connectivity, or just grid size? It's not clear to me, but exciting if it works.

Quanquan Gu (@quanquangu) 's Twitter Profile Photo

No joke. Most people haven’t yet realized how powerful machine learning theory actually is. I’m speaking from the perspective of someone directly building AGI: it stabilizes both pretraining and RL, and it provides the blueprint for scaling all the way to AGI.