Edward Milsom (@edward_milsom)'s Twitter Profile
Edward Milsom

@edward_milsom

Machine learning PhD student working on deep learning and deep kernel methods. Compass CDT, University of Bristol.

ID: 1506159823565561861

Created: 22-03-2022 06:45:11

388 Tweets

371 Followers

321 Following

Jeremy Bernstein (@jxbz)'s Twitter Profile Photo

I just wrote my first blog post in four years! It is called "Deriving Muon". It covers the theory that led to Muon and how, for me, Muon is a meaningful example of theory leading practice in deep learning (1/11)
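
A minimal sketch of the kind of update the post derives, under assumptions: Muon takes the momentum buffer of each 2-D weight matrix and approximately orthogonalises it with a Newton-Schulz iteration before applying it. The iteration coefficients and step structure below follow the public Muon implementation and are assumptions here, not details quoted from this thread.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Approximately map G to the nearest semi-orthogonal matrix (U V^T from its SVD)
    # using an odd-polynomial Newton-Schulz iteration.
    X = G / (G.norm() + eps)  # normalise so the iteration converges
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public Muon code (assumption)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    # One sketched Muon update for a single 2-D weight matrix.
    momentum_buf.mul_(beta).add_(grad)                  # momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)  # orthogonalised momentum
    weight.data.add_(update, alpha=-lr)
    return weight, momentum_buf
```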

Tim Lawson ✈️ ICLR (@tslwn)'s Twitter Profile Photo

There's a lot to process here, but I was pleased to see that Anthropic's 'Circuit Tracing' paper cites three of our recent contributions to the interpretability literature! 1/

Edward Milsom (@edward_milsom)'s Twitter Profile Photo

Easy (but informative) exercise: show by induction that an exponential moving average is distributive, i.e. EMA(\sum_i X_i)_t = \sum_i EMA(X_i)_t. Which EMA initialisation strategies make the base case hold?
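
A quick numerical check of the claim, assuming the usual recursion EMA(X)_t = β·EMA(X)_{t-1} + (1−β)·X_t with every EMA initialised at zero (one initialisation that makes the base case hold):

```python
import numpy as np

def ema(x, beta, init=0.0):
    # Exponential moving average: m_t = beta * m_{t-1} + (1 - beta) * x_t.
    m, out = init, []
    for x_t in x:
        m = beta * m + (1 - beta) * x_t
        out.append(m)
    return np.array(out)

rng = np.random.default_rng(0)
beta = 0.9
X = rng.normal(size=(3, 100))  # three signals X_i, 100 time steps each

lhs = ema(X.sum(axis=0), beta)                # EMA of the sum
rhs = sum(ema(X[i], beta) for i in range(3))  # sum of the EMAs
print(np.allclose(lhs, rhs))  # True: both sides satisfy the same recursion and base case
```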

Seunghyun Seo (@seunghyunseo7)'s Twitter Profile Photo

Wow, didn't know cs336 covers scaling things: scaling laws, critical batch size, muP, and so on. (This lecture slide screenshot is from 2024.)

Xidulu (@xidulu)'s Twitter Profile Photo

I talked to a lot of people at ICLR about "a weight decay paper from Wang and Aitchison", which has officially been accepted at #ICML2025. Laurence summarised the stuff in our paper in the post; here I'll talk about the connection with a *broad* collection of existing works. 1/

Edward Milsom (@edward_milsom)'s Twitter Profile Photo

It seems none of the big open-source models are using mu-P yet (correct me if I'm wrong!). According to this it should be quite easy: cerebras.ai/blog/the-pract… Are there any major drawbacks to using mu-P? (I'd be very surprised if Grok weren't using it, given Greg Yang.)
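
For context, a minimal sketch of the hidden-layer part of the mu-P recipe (an illustration only, not the full recipe from the Cerebras post, which also covers input/output layers and attention scaling): initialise hidden weights with variance proportional to 1/fan_in and shrink their Adam learning rate by base_width / width, so a learning rate tuned at a small width transfers when the model is widened. The widths and learning rate below are made-up values.

```python
import torch
import torch.nn as nn

base_width = 256   # width at which hyperparameters were tuned (assumed)
width = 1024       # width of the scaled-up model (assumed)
base_lr = 3e-4     # learning rate tuned at base_width (assumed)

model = nn.Sequential(
    nn.Linear(128, width),    # input layer
    nn.ReLU(),
    nn.Linear(width, width),  # hidden layer: gets the width-scaled learning rate
    nn.ReLU(),
    nn.Linear(width, 10),     # output layer (full mu-P also rescales this layer)
)

hidden = model[2]
nn.init.normal_(hidden.weight, std=hidden.in_features ** -0.5)  # variance ~ 1/fan_in

param_groups = [
    {"params": [p for n, p in model.named_parameters() if not n.startswith("2.")],
     "lr": base_lr},
    {"params": hidden.parameters(), "lr": base_lr * base_width / width},
]
optimizer = torch.optim.Adam(param_groups)
```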

Xidulu (@xidulu)'s Twitter Profile Photo

This is really a beautiful idea: autodiff alleviates graduate students' pain of manually deriving gradients, but MuP-ish work brings that pain back! This work provides a way that lets you simply SHUT OFF your brain and get hparam transfer.

Ben Anson (@benaibean)'s Twitter Profile Photo

Is it possible to _derive_ an attention scheme with effective zero-shot generalisation? The answer turns out to be yes! To achieve this, we began by thinking about desirable properties for attention over long contexts, and we distilled 2 key conditions:

Edward Milsom (@edward_milsom)'s Twitter Profile Photo

Me: asks literally any question
LLM: Excellent! You're really getting to the heart of computer architecture / electrical infrastructure / the history of Barcelona.
Don't flatter me, LLM, I am aware of my own limitations, even if you are not.

Edward Milsom (@edward_milsom)'s Twitter Profile Photo

What's some "must read" literature on generalisation in neural networks? I keep thinking about this paper and it really makes me want to understand better the link between optimisation and generalisation. arxiv.org/abs/2302.12091