Edward Milsom (@edward_milsom)'s Twitter Profile
Edward Milsom

@edward_milsom

Machine learning PhD student working on deep learning and deep kernel methods. Compass CDT, University of Bristol.

ID: 1506159823565561861

Created: 22-03-2022 06:45:11

388 Tweets

371 Followers

321 Following

Jeremy Bernstein (@jxbz)'s Twitter Profile Photo

I just wrote my first blog post in four years! It is called "Deriving Muon". It covers the theory that led to Muon and how, for me, Muon is a meaningful example of theory leading practice in deep learning (1/11)
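
A minimal sketch of the kind of update the post derives, under assumptions: Muon takes the momentum buffer of each 2-D weight matrix and approximately orthogonalises it with a Newton-Schulz iteration before applying it. The iteration coefficients and step structure below follow the public Muon implementation and are assumptions here, not details quoted from this thread.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Approximately map G to the nearest semi-orthogonal matrix (U V^T from its SVD)
    # using an odd-polynomial Newton-Schulz iteration.
    X = G / (G.norm() + eps)  # normalise so the iteration converges
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public Muon code (assumption)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    # One sketched Muon update for a single 2-D weight matrix.
    momentum_buf.mul_(beta).add_(grad)                  # momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)  # orthogonalised momentum
    weight.data.add_(update, alpha=-lr)
    return weight, momentum_buf
```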

Tim Lawson ✈️ ICLR (@tslwn)'s Twitter Profile Photo

There's a lot to process here, but I was pleased to see that Anthropic's 'Circuit Tracing' paper cites three of our recent contributions to the interpretability literature! 1/

Edward Milsom (@edward_milsom)'s Twitter Profile Photo

Easy (but informative) exercise: show by induction that an exponential moving average is distributive, i.e. EMA(\sum_i X_i)_t = \sum_i EMA(X_i)_t. Which EMA initialisation strategies make the base case hold?
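
A quick numerical check of the claim, assuming the usual recursion EMA(X)_t = β·EMA(X)_{t-1} + (1−β)·X_t with every EMA initialised at zero (one initialisation that makes the base case hold):

```python
import numpy as np

def ema(x, beta, init=0.0):
    # Exponential moving average: m_t = beta * m_{t-1} + (1 - beta) * x_t.
    m, out = init, []
    for x_t in x:
        m = beta * m + (1 - beta) * x_t
        out.append(m)
    return np.array(out)

rng = np.random.default_rng(0)
beta = 0.9
X = rng.normal(size=(3, 100))  # three signals X_i, 100 time steps each

lhs = ema(X.sum(axis=0), beta)                # EMA of the sum
rhs = sum(ema(X[i], beta) for i in range(3))  # sum of the EMAs
print(np.allclose(lhs, rhs))  # True: both sides satisfy the same recursion and base case
```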

Seunghyun Seo (@seunghyunseo7)'s Twitter Profile Photo

Wow, didn't know cs336 covers scaling things: scaling laws, critical batch size, muP, and so on. (This lecture slide screenshot is from 2024.)

Xidulu (@xidulu)'s Twitter Profile Photo

I talked to a lot of people at ICLR about "a weight decay paper from Wang and Aitchison", which has officially been accepted at #ICML2025. Laurence summarised the stuff in our paper in the post; here I'll talk about the connection with a *broad* collection of existing works. 1/

Edward Milsom (@edward_milsom)'s Twitter Profile Photo

It seems none of the big open-source models are using mu-P yet (correct me if I'm wrong!). According to this it should be quite easy: cerebras.ai/blog/the-pract… Are there any major drawbacks to using mu-P? (I'd be very surprised if Grok weren't using it, given Greg Yang.)
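
For context, a minimal sketch of the hidden-layer part of the mu-P recipe (an illustration only, not the full recipe from the Cerebras post, which also covers input/output layers and attention scaling): initialise hidden weights with variance proportional to 1/fan_in and shrink their Adam learning rate by base_width / width, so a learning rate tuned at a small width transfers when the model is widened. The widths and learning rate below are made-up values.

```python
import torch
import torch.nn as nn

base_width = 256   # width at which hyperparameters were tuned (assumed)
width = 1024       # width of the scaled-up model (assumed)
base_lr = 3e-4     # learning rate tuned at base_width (assumed)

model = nn.Sequential(
    nn.Linear(128, width),    # input layer
    nn.ReLU(),
    nn.Linear(width, width),  # hidden layer: gets the width-scaled learning rate
    nn.ReLU(),
    nn.Linear(width, 10),     # output layer (full mu-P also rescales this layer)
)

hidden = model[2]
nn.init.normal_(hidden.weight, std=hidden.in_features ** -0.5)  # variance ~ 1/fan_in

param_groups = [
    {"params": [p for n, p in model.named_parameters() if not n.startswith("2.")],
     "lr": base_lr},
    {"params": hidden.parameters(), "lr": base_lr * base_width / width},
]
optimizer = torch.optim.Adam(param_groups)
```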

Xidulu (@xidulu)'s Twitter Profile Photo

This is really a beautiful idea: autodiff alleviates graduate students' pain of manually deriving gradients, but MuP-ish work brings that pain back! This work provides a way that lets you simply SHUT OFF your brain and get hparam transfer.

Ben Anson (@benaibean)'s Twitter Profile Photo

Is it possible to _derive_ an attention scheme with effective zero-shot generalisation? The answer turns out to be yes! To achieve this, we began by thinking about desirable properties for attention over long contexts, and we distilled 2 key conditions:

Edward Milsom (@edward_milsom)'s Twitter Profile Photo

Me: asks literally any question
LLM: Excellent! You're really getting to the heart of computer architecture / electrical infrastructure / the history of Barcelona.
Don't flatter me, LLM, I am aware of my own limitations, even if you are not.

Edward Milsom (@edward_milsom)'s Twitter Profile Photo

What's some "must read" literature on generalisation in neural networks? I keep thinking about this paper and it really makes me want to understand better the link between optimisation and generalisation. arxiv.org/abs/2302.12091