Kwangjun Ahn (@kwangjuna) 's Twitter Profile
Kwangjun Ahn

@kwangjuna

Senior Researcher at Microsoft Research // PhD from MIT EECS

ID: 1229766355622055936

Link: http://kjahn.mit.edu/ · Joined: 18-02-2020 13:56:04

40 Tweets

550 Followers

266 Following

Sebastien Bubeck (@sebastienbubeck) 's Twitter Profile Photo

To gain insights we study the simplest possible toy model, a baby sparse coding problem: Covariate x \in R^d is white noise plus a spike y \in R in a random coordinate. Goal: predict y given x. To solve the task a neural net has to learn threshold units for each coordinate. 3/8

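A minimal sketch of a data model matching this description (the dimension, noise scale, and Gaussian spike distribution below are illustrative assumptions, not necessarily the paper's exact setup):

import numpy as np

def sample_baby_sparse_coding(d, noise_std=1.0, rng=None):
    # Covariate: white noise in R^d ...
    rng = np.random.default_rng() if rng is None else rng
    x = noise_std * rng.standard_normal(d)
    # ... plus a spike of size y placed in one uniformly random coordinate.
    y = rng.standard_normal()       # the label to predict from x
    spike_coord = rng.integers(d)   # which coordinate carries the spike
    x[spike_coord] += y
    return x, y

x, y = sample_baby_sparse_coding(d=100)
print(x.shape, y)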
Sebastien Bubeck (@sebastienbubeck) 's Twitter Profile Photo

Experimental results for this baby sparse coding problem are striking: bias term starts moving ONLY at large learning rate! Moreover, something else happens at large lr: suddenly the training loss starts to oscillate! (Note: oscillations are well-documented in empirical DL.) 4/8

Sebastien Bubeck (@sebastienbubeck) 's Twitter Profile Photo

Could it be that the emergence of the threshold units is related to these oscillations? Oscillations themselves have recently been under intense scrutiny by theoreticians under the name "Edge of Stability", a beautiful phenomenon discovered by Jeremy Cohen and co-authors. 5/8

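For intuition only, here is the textbook one-dimensional picture behind such oscillations, an illustration rather than the paper's model: gradient descent on f(w) = lam * w^2 / 2 multiplies the iterate by (1 - lr*lam) each step, so it decays monotonically for lr*lam < 1, oscillates while still shrinking for 1 < lr*lam < 2, and diverges past lr*lam = 2, the classical stability threshold that Edge of Stability analyses revolve around.

import numpy as np

def gd_on_quadratic(lr, lam=1.0, w0=1.0, steps=8):
    # Gradient descent on f(w) = lam * w**2 / 2, whose gradient is lam * w.
    w = w0
    traj = [w]
    for _ in range(steps):
        w -= lr * lam * w          # equivalently: w *= (1 - lr * lam)
        traj.append(w)
    return np.array(traj)

print(gd_on_quadratic(lr=0.5))   # monotone decay        (lr * lam < 1)
print(gd_on_quadratic(lr=1.9))   # oscillates, shrinks   (1 < lr * lam < 2)
print(gd_on_quadratic(lr=2.1))   # oscillates, diverges  (lr * lam > 2)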
Sebastien Bubeck (@sebastienbubeck) 's Twitter Profile Photo

Indeed, this is exactly what our paper does: we directly connect the emergence of threshold units to the Edge of Stability phenomenon. What comes next in this story does not fit well in tweet format; I guess that's why there is a paper :-). 6/8

Sebastien Bubeck (@sebastienbubeck) 's Twitter Profile Photo

Key highlight of our story: we discover a phase transition for neural network learning at lr 8π/d^2. Emergence (for baby sparse coding) happens iff lr > 8π/d^2 ... Of course, the bigger story about general-purpose circuits remains fully open. We just made a tiny step. 7/8

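To make the stated condition concrete, a tiny sketch that only evaluates the threshold as written in the tweet (the constant 8π and its derivation come from the paper; the dimension below is arbitrary):

import math

def above_emergence_threshold(lr, d):
    # Phase-transition condition as stated above: emergence iff lr > 8*pi / d**2.
    return lr > 8 * math.pi / d ** 2

d = 100
print(8 * math.pi / d ** 2)                  # critical learning rate, about 0.0025
print(above_emergence_threshold(0.01, d))    # True: above the threshold
print(above_emergence_threshold(0.001, d))   # False: below the threshold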
Sebastien Bubeck (@sebastienbubeck) 's Twitter Profile Photo

Project was led by three incredible MIT students, Kwangjun Ahn, Sinho Chewi and Felipe Suárez Colmenares. I cannot recommend them strongly enough. Project went so far beyond what I expected to be true at the beginning, let alone what would be *provable*. Such a pleasure to work with them. 8/8

Francesco Orabona (@bremen79) 's Twitter Profile Photo

Parameter-free optimization: 11 years of research and almost no code...

So, I wrote a PyTorch library and plan to add all the parameter-free algos I know!

Currently with COCOB and KT: old but sometimes even better than some newer variants 😉

github.com/bremen79/param…

Please retweet!
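
For readers who have not seen a parameter-free method, below is a minimal one-dimensional sketch of the KT coin-betting idea, a simplified version of the standard construction with (sub)gradients assumed to lie in [-1, 1]; it is not the API of the library linked above:

def kt_coin_betting(grad_fn, steps=2000, initial_wealth=1.0):
    # KT bettor: at each round, bet a fraction of current wealth equal to the
    # running average of past "coin outcomes" c_t = -g_t. No learning rate anywhere.
    wealth = initial_wealth
    sum_c = 0.0
    avg_x = 0.0
    for t in range(1, steps + 1):
        x = (sum_c / t) * wealth       # prediction (bet) for round t
        g = grad_fn(x)                 # (sub)gradient feedback at x, assumed in [-1, 1]
        c = -g
        wealth += c * x                # wealth after settling this round's bet
        sum_c += c
        avg_x += (x - avg_x) / t       # running average of iterates
    return avg_x

# Example: minimize f(x) = |x - 0.3| with no step size to tune.
print(kt_coin_betting(lambda x: 1.0 if x > 0.3 else -1.0))  # approximately 0.3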
Sebastien Bubeck (@sebastienbubeck) 's Twitter Profile Photo

My group is hiring a large cohort of interns for the summer of 2024 to work on the Foundations of Large Language Models! Come help us uncover the new physics of A.I. to improve LLM building practices! (Pic below from our NeurIPS 2023 paper w. interns) jobs.careers.microsoft.com/global/en/job/…

Kwangjun Ahn (@kwangjuna) 's Twitter Profile Photo

Excited to share our NeurIPS paper on (theoretically) understanding in-context learning based on linear transformers! Please check out the details in arxiv.org/abs/2306.00297

Kwangjun Ahn (@kwangjuna) 's Twitter Profile Photo

Excited to share our NeurIPS paper that Sebastien Bubeck mentioned in his post: arxiv.org/abs/2212.07469

Also check out a NeurIPS paper on understanding SAM (a companion paper!): arxiv.org/abs/2305.15287

My talk video from INFORMS about these works: youtu.be/TMmpeVBbD7o?si…
Ahmad Beirami (@abeirami) 's Twitter Profile Photo

If you're at #NeurIPS2023, Kwangjun Ahn will be presenting his work on SpecTr++ at the Optimal Transport workshop, where he discusses improved transport plans for speculative decoding.

Aaron Defazio (@aaron_defazio) 's Twitter Profile Photo

Exciting new paper by Kwangjun Ahn and Ashok Cutkosky: Adam with model exponential moving average is effective for nonconvex optimization (arxiv.org/pdf/2405.18199). This approach to analyzing Adam is extremely promising IMHO.
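
For readers unfamiliar with the model-EMA ingredient, a minimal sketch of keeping an exponential moving average of the iterates alongside training; plain SGD stands in for Adam here, and the toy objective and decay value are illustrative choices, not the paper's analysis:

import numpy as np

def train_with_model_ema(grad_fn, w0, lr=0.05, decay=0.99, steps=500):
    # Run a base optimizer on w while keeping an exponential moving average
    # of the parameters; the EMA weights are what you evaluate/deploy.
    w = np.asarray(w0, dtype=float)
    w_ema = w.copy()
    for _ in range(steps):
        w = w - lr * grad_fn(w)                    # base optimizer step (plain SGD here)
        w_ema = decay * w_ema + (1 - decay) * w    # model EMA update
    return w, w_ema

# Toy quadratic: minimize ||w||^2 / 2, whose gradient is w.
last_w, ema_w = train_with_model_ema(lambda w: w, w0=np.ones(5))
print(last_w, ema_w)   # both near the origin; the EMA trails the last iterate smoothly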

Kwangjun Ahn (@kwangjuna) 's Twitter Profile Photo

In our ICML 2024 paper, joint w/ Zhiyu Zhang, Yunbum Kook, and Yan Dai, we provide a new perspective on the Adam optimizer based on online learning. In particular, our perspective shows the importance of Adam's key components. (video: youtu.be/AU39SNkkIsA)

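As a reminder of which components are meant, here is the standard Adam update with its pieces labeled (textbook form with common default hyperparameters; this is not the paper's online-learning derivation):

import numpy as np

def adam_step(w, g, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # The key components of Adam:
    m = beta1 * m + (1 - beta1) * g               # 1) momentum: EMA of gradients
    v = beta2 * v + (1 - beta2) * g ** 2          # 2) EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # 3) bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # 4) per-coordinate adaptive step
    return w, m, v

# Toy run on f(w) = ||w||^2 / 2 (gradient is w itself).
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 501):
    w, m, v = adam_step(w, w, m, v, t)
print(w)   # has moved close to the minimizer at the origin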
Kwangjun Ahn (@kwangjuna) 's Twitter Profile Photo

Come to my presentation of our ICML 2024 paper tmrw at 1:30–3 pm!
We provide a new perspective on the Adam optimizer based on online learning. In particular, our perspective shows the importance of Adam's key components. (video: youtu.be/AU39SNkkIsA)