Aaron Defazio (@aaron_defazio) 's Twitter Profile
Aaron Defazio

@aaron_defazio

Research Scientist at Meta working on optimization. Fundamental AI Research (FAIR) team

ID: 1376951872356024325

Link: http://aarondefazio.com · Joined: 30-03-2021 17:38:10

907 Tweets

7.7K Followers

544 Following

Harsh Bhatt (@harshbhatt7585) 's Twitter Profile Photo


trained with the Schedulefree AdamW optimizer (right), and it is so much smoother than normal AdamW (left).

here's the PR, check out how it is being used and integrated:
github.com/Metta-AI/metta…
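A minimal usage sketch based on the facebookresearch/schedule_free package README (the exact Metta-AI integration is in the linked PR, which is truncated above); the model and hyperparameters below are placeholders:

```python
# pip install schedulefree
import torch
import schedulefree

model = torch.nn.Linear(128, 10)
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=2.5e-3, warmup_steps=100)

optimizer.train()                      # Schedule-Free keeps two weight sequences, so it
for _ in range(1000):                  # must know whether you are training or evaluating
    x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    torch.nn.functional.cross_entropy(model(x), y).backward()
    optimizer.step()

optimizer.eval()                       # switch to averaged weights before validation
```

No learning-rate scheduler is attached; replacing the schedule is the whole point of the method.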
Soumith Chintala (@soumithchintala) 's Twitter Profile Photo

Mac Studio, you ask? Apple Engineering's **actual** time spent on PyTorch support hasn't given me confidence that the PyTorch Mac experience will get anywhere close to NVIDIA's any time soon, if ever. The Meta engineers continue to do a huge amount of heavy lifting for improving the

Dwarkesh Patel (@dwarkesh_sp) 's Twitter Profile Photo

The Andrej Karpathy interview

0:00:00 – AGI is still a decade away
0:30:33 – LLM cognitive deficits
0:40:53 – RL is terrible
0:50:26 – How do humans learn?
1:07:13 – AGI will blend into 2% GDP growth
1:18:24 – ASI
1:33:38 – Evolution of intelligence & culture
1:43:43 – Why self

Peyman Milanfar (@docmilanfar) 's Twitter Profile Photo

PLEASE DON’T PLOT "SCALING LAW" THINGS ON LOG-LOG SCALE. EVERYTHING IS ~ LINEAR IN LOG-LOG SCALE. THANK YOU FOR YOUR ATTENTION TO THIS MATTER!

Sham Kakade (@shamkakade6) 's Twitter Profile Photo

1/6 Introducing Seesaw: a principled batch size scheduling algo. Seesaw achieves theoretically optimal serial run time given a fixed compute budget and also matches the performance of cosine annealing at fixed batch size.

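The thread does not spell out the schedule here, so the snippet below is only an illustrative "grow the batch size, shrink the learning rate" loop in PyTorch, not the actual Seesaw rule; stage lengths, growth factors, and all names are assumptions made for the sketch:

```python
# Illustrative only: a generic batch-size schedule, NOT the Seesaw algorithm.
import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(4096, 32), torch.randint(0, 2, (4096,)))
model = torch.nn.Linear(32, 2)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

batch_size = 64
for stage in range(3):                      # three toy stages
    loader = DataLoader(data, batch_size=batch_size, shuffle=True)
    for x, y in loader:                     # one pass over the data per stage
        opt.zero_grad()
        torch.nn.functional.cross_entropy(model(x), y).backward()
        opt.step()
    batch_size *= 2                         # larger batches -> fewer serial steps
    for g in opt.param_groups:
        g["lr"] /= 2 ** 0.5                 # trade part of the LR decay for batch growth
```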
Aaron Defazio (@aaron_defazio) 's Twitter Profile Photo

> So what do you do?
- I research the mathematical foundations of learning through optimization, seeking to understand the algorithmic keys to unlock super-intelligence.
> **blank stare**
- I work in tech
> Oh cool. One of my best friends is a coder

jack morris (@jxmnop) 's Twitter Profile Photo

# The Kolmogorov complexity of new research

every new research paper or blog post can be compressed, in its ‘essence’, to three things:
- code
- artifacts (outputs of code execution)
- math (novel abstractions)

one of the main hopes I have for near-term AI systems is that they will

Aaron Defazio (@aaron_defazio) 's Twitter Profile Photo

AI Researchers: Epsilon greedy is such a bad algorithm lol we have to invent something better

Also AI Researchers: So I spend 90% of my time on ideas that I’m pretty sure will work, publish or perish right? And maybe 10% on blue-sky stuff, you know? Really out there ideas.

Mathieu (@miniapeur) 's Twitter Profile Photo

> So what do you do?
- I do research and write scientific papers.
> **blank stare**
- I am a PhD student
> Oh cool. The soup kitchen is this way

Konstantin Mishchenko (@konstmish) 's Twitter Profile Photo

Using Muon on large sharded models creates extra communication overhead from gather/scatter operations on sharded matrices. Turns out you can fix this by doing full Muon updates periodically (but not skipping them entirely) and using local Muon computation the rest of the time.

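A rough sketch of the schedule described above: do the full (gathered) Muon update only every K steps and a purely local, per-shard update in between. The orthogonalization helper and the gather/scatter callables are placeholders standing in for the real Muon kernel and the sharding framework's collectives, not a real library API:

```python
import torch

def muon_direction(momentum: torch.Tensor) -> torch.Tensor:
    # Stand-in for Muon's Newton-Schulz orthogonalization: return the orthogonal
    # polar factor of the momentum matrix (via SVD here for clarity, not speed).
    u, _, vh = torch.linalg.svd(momentum, full_matrices=False)
    return u @ vh

def sharded_muon_update(local_momentum, step, full_every_k, lr,
                        all_gather_matrix, scatter_shard):
    # all_gather_matrix / scatter_shard are hypothetical collectives from whatever
    # sharding framework is in use; only the periodic-full vs. local schedule is
    # the point of this sketch.
    if step % full_every_k == 0:
        full = all_gather_matrix(local_momentum)        # communication-heavy full update
        return -lr * scatter_shard(muon_direction(full))
    # cheap local update: orthogonalize the local shard only, no cross-device traffic
    return -lr * muon_direction(local_momentum)
```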
Konstantin Mishchenko (@konstmish) 's Twitter Profile Photo

I find it fascinating that momentum in standard convex optimization is just about making convergence faster, but in nonconvex problems, it's sometimes the only way a method can work at all. Just saw a new example of this phenomenon in the case of difference-of-convex functions.

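For reference, the classical heavy-ball form of the momentum update being referred to (standard notation, not taken from the paper under discussion):

$$
x_{t+1} = x_t - \eta \nabla f(x_t) + \beta\,(x_t - x_{t-1}), \qquad 0 \le \beta < 1.
$$

In smooth (strongly) convex problems the $\beta$ term only speeds up convergence, whereas the observation above is that in some nonconvex settings, such as difference-of-convex objectives, it can be what makes the method converge at all.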
Vinay S Rao (@vinaysrao) 's Twitter Profile Photo

While at Meta, I worked on this optimizer-wrapper (outer step lookahead momentum) we're calling Snoo (arxiv.org/abs/2510.15830). You can use it with AdamW or Muon and see really strong scaling. Here's a plot where we ran it against (tuned) AdamW up to 1e23 training flop scales.

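An illustrative sketch of an "outer step lookahead momentum" wrapper, guessed from the one-line description above; this is not the Snoo implementation from arxiv.org/abs/2510.15830, and all names and defaults here are assumptions:

```python
import torch

class OuterMomentumWrapper:
    """Hypothetical lookahead-style wrapper: every k inner steps, take one outer
    step on slow weights with momentum, then restart the fast weights from them."""

    def __init__(self, params, inner_optimizer, k=10, outer_lr=1.0, outer_momentum=0.9):
        self.params = [p for p in params]
        self.inner = inner_optimizer          # e.g. AdamW or a Muon implementation
        self.k, self.outer_lr, self.mu = k, outer_lr, outer_momentum
        self.slow = [p.detach().clone() for p in self.params]   # outer ("slow") weights
        self.buf = [torch.zeros_like(p) for p in self.params]   # outer momentum buffers
        self.t = 0

    def zero_grad(self):
        self.inner.zero_grad()

    @torch.no_grad()
    def step(self):
        self.inner.step()                     # ordinary inner step on the fast weights
        self.t += 1
        if self.t % self.k == 0:              # every k steps, take one outer step
            for p, s, b in zip(self.params, self.slow, self.buf):
                d = p - s                     # net movement over the last k inner steps
                b.mul_(self.mu).add_(d)       # momentum on the outer (lookahead) step
                s.add_(b, alpha=self.outer_lr)
                p.copy_(s)                    # restart fast weights from slow weights
```

Usage would look like wrapping an existing optimizer, e.g. `OuterMomentumWrapper(model.parameters(), torch.optim.AdamW(model.parameters(), lr=3e-4))`, then calling `zero_grad()` and `step()` as usual.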
Sham Kakade (@shamkakade6) 's Twitter Profile Photo

(1/9) Diagonal preconditioners such as Adam typically use empirical gradient information rather than true second-order curvature. Is this merely a computational compromise or can it be advantageous? Our work confirms the latter: Adam can outperform Gauss-Newton in certain cases.

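For context, Adam's diagonal preconditioner in standard notation (bias correction omitted; this is the textbook form, not the paper's):

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^{\odot 2}, \qquad
\theta_{t+1} = \theta_t - \eta\,\frac{m_t}{\sqrt{v_t}+\epsilon}.
$$

The preconditioner $\sqrt{v_t}$ is built from squared empirical gradients, whereas a diagonal Gauss-Newton method would use the diagonal of the Gauss-Newton matrix formed from the model Jacobian; the thread asks whether the former is merely a cheap stand-in or can actually be better.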
Aaron Defazio (@aaron_defazio) 's Twitter Profile Photo

Very cool, read the tech report! But... will locally connected Ising-like models work at larger scales? Do they need to scale connectivity, or just grid size? It's not clear to me, but exciting if it works.

Quanquan Gu (@quanquangu) 's Twitter Profile Photo

No joke. Most people haven’t yet realized how powerful machine learning theory actually is. I’m speaking from the perspective of someone directly building AGI: it stabilizes both pretraining and RL, and it provides the blueprint for scaling all the way to AGI.