Alex Damian (@alex_damian_)'s Twitter Profile
Alex Damian

@alex_damian_

ID: 1374506243612536841

Joined: 23-03-2021 23:40:10

13 Tweets

290 Followers

84 Following

Sepp Hochreiter (@hochreitersepp)

ArXiv arxiv.org/abs/2209.15594: analysis of SGD. Sharpness (the largest eigenvalue of the Hessian) steadily increases during training until the instability cutoff 2/η, then hovers around 2/η. The training loss still decreases. Reason: self-stabilization via the cubic term in the Taylor expansion.
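
A minimal sketch of how one could watch this happen (not the paper's code; the toy model, data, step size, and iteration counts below are illustrative assumptions): run full-batch gradient descent while estimating the sharpness, i.e. the top Hessian eigenvalue, with power iteration on Hessian-vector products, and compare it to the 2/η threshold.

```python
import torch

torch.manual_seed(0)
X, y = torch.randn(64, 4), torch.randn(64, 1)
model = torch.nn.Sequential(torch.nn.Linear(4, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
params = list(model.parameters())
eta = 0.1                                      # step size; stability threshold is 2/eta

def loss_fn():
    return torch.nn.functional.mse_loss(model(X), y)

def sharpness(n_iters=20):
    """Estimate lambda_max of the loss Hessian by power iteration on
    Hessian-vector products (Pearlmutter's trick)."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(n_iters):
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]             # re-normalize the candidate eigenvector
    hv = torch.autograd.grad(grads, params, grad_outputs=v)
    return sum((h * u).sum() for h, u in zip(hv, v)).item()   # Rayleigh quotient v^T H v

for step in range(501):
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= eta * g                       # full-batch gradient descent step
    if step % 50 == 0:
        print(f"step {step:3d}  loss {loss.item():.4f}  "
              f"sharpness {sharpness():.2f}  (2/eta = {2 / eta:.1f})")
```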

Eshaan Nichani (@eshaannichani)

New paper with Alex Damian and Jason Lee! We identify a new implicit bias of GD: Self-Stabilization. When the loss is too sharp and iterates begin to diverge, self-stabilization decreases sharpness until GD is stable. This explains the “Edge of Stability” phenomenon! (1/3)
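
A hedged sketch of the mechanism as described in the thread and the paper's abstract (my paraphrase of arxiv.org/abs/2209.15594; the notation is illustrative, not necessarily the paper's):

```latex
% Let S(\theta) = \lambda_{\max}(\nabla^2 L(\theta)) be the sharpness and u the
% corresponding top eigenvector. Write the iterate as \theta + x, where x is the
% displacement that blows up along u once S(\theta) > 2/\eta. Expanding the
% gradient to second order in x:
\[
  \nabla L(\theta + x) \;\approx\; \nabla L(\theta) \;+\; \nabla^2 L(\theta)\,x
  \;+\; \tfrac{1}{2}\,\nabla^3 L(\theta)[x, x].
\]
% With x \approx \alpha u, the cubic term can be read via the identity
\[
  \nabla^3 L(\theta)[u, u] \;=\; \nabla\big(u^\top \nabla^2 L(\theta)\,u\big)
  \;\approx\; \nabla S(\theta),
\]
% so each gradient step picks up an extra component of roughly
% -\tfrac{\eta\,\alpha^2}{2}\,\nabla S(\theta): implicit descent on the sharpness
% itself, which pushes S(\theta) back down toward 2/\eta.
```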

Zhiyuan Li (@zhiyuanli_)

🚨💡We are organizing a workshop on Mathematics of Modern Machine Learning (M3L) at #NeurIPS2023!

🚀Join us if you are interested in exploring theories for understanding and advancing modern ML practice. sites.google.com/view/m3l-2023

Submission deadline: October 2, 2023
M3L Workshop @ NeurIPS 2024
M3L Workshop @ NeurIPS 2024 (@m3lworkshop)

Hope everyone had a great time at M3L today! Many thanks to the speakers, authors, reviewers, participants and volunteers for all your contributions that made this workshop fun and successful, we hope to see you again next year! 😃✨

fly51fly (@fly51fly)

[LG] How Transformers Learn Causal Structure with Gradient Descent
E. Nichani, A. Damian, J. D. Lee [Princeton University] (2024)
arxiv.org/abs/2402.14735

- The paper studies how transformers learn causal structure through gradient descent when trained on a novel in-context learning …
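
Not the paper's model, just a hedged sketch of the basic building block involved: a single causal self-attention head, where the causal mask lets position t attend only to positions ≤ t, so the learned attention pattern describes which earlier tokens each position treats as its "parents". The class name and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

class CausalSelfAttention(torch.nn.Module):
    """A generic single-head causal self-attention layer (illustrative only)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = torch.nn.Linear(d_model, d_model, bias=False)
        self.k = torch.nn.Linear(d_model, d_model, bias=False)
        self.v = torch.nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                                  # x: (batch, seq, d_model)
        T = x.size(1)
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / x.size(-1) ** 0.5
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))   # block attention to future tokens
        attn = scores.softmax(dim=-1)                      # rows: distributions over past positions
        return attn @ self.v(x), attn

# The returned lower-triangular `attn` matrix is the attention pattern whose
# evolution under gradient descent is the object of study.
layer = CausalSelfAttention(d_model=8)
out, attn = layer(torch.randn(1, 5, 8))
print(attn[0])
```
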
Eshaan Nichani (@eshaannichani)

Causal self-attention encodes causal structure between tokens (e.g. induction heads, learning a function class in-context, n-grams). But how do transformers learn this causal structure via gradient descent?

New paper with Alex Damian and Jason Lee!

arxiv.org/abs/2402.14735

(1/10)
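
One concrete example of "causal structure between tokens" named above is the induction head. A hedged toy version of the rule it implements (my own illustration, unrelated to the paper's code; the function name is hypothetical): to predict the next token, find the most recent earlier occurrence of the current token and copy whatever followed it. A trained causal-attention head implementing this rule attends from the current position back to that earlier position.

```python
def induction_head_predict(tokens):
    """Return the induction-head prediction for the token after tokens[-1], or None."""
    current = tokens[-1]
    # scan the past (most recent first) for a previous occurrence of `current`
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]          # copy the token that followed it
    return None

# "A B C A B C A" -> the last "A" was previously followed by "B"
print(induction_head_predict(list("ABCABCA")))   # prints: B
```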