Aleksandr Dremov (@alexdremov_me) Twitter Tweets • TwiCopy

Aleksandr Dremov

@alexdremov_me

ML Engineer | Student at EPFL

ID: 817763527679062016

Link: https://alexdremov.me

Joined: 07-01-2017 16:03:13

36 Tweets

37 Followers

115 Following

Daniel Paleka

@dpaleka

a year ago

It has not been reported much, but I believe ETH Zurich has, as of last week, banned new Master and PhD students who attended a long list of universities in China, Russia, and Iran. 🧵

Likes: 453

Replies: 25

Retweets: 75

Collection of insane and fun facts about SQLite. Let's go! SQLite is the most deployed and most used database. There are over one trillion (1000000000000 or a million million) SQLite databases in active use. It is maintained by three people. They don't allow outside

Likes: 11.11K

Replies: 131

Retweets: 1.1K

Aleksandr Dremov

@alexdremov_me

a year ago

Why is Flash Attention so fast? Find out how Flash Attention works. Afterward, we'll polish our understanding by writing a GPU kernel of the algorithm in Triton. alexdremov.me/understanding-… #MachineLearning

Likes: 2

Replies: 0

Retweets: 0

Alex Hägele

@haeggee

4 months ago

New TMLR paper by Master (!) student @alexdremov_me: Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler We finally understand the negative square root (1-sqrt) cooldown. TL;DR: It gets you best bias-variance tradeoff :) w/ Atli Kosson, M Jaggi

Likes: 131

Replies: 3

Retweets: 17

Alex Hägele

@haeggee

4 months ago

New work from our MLO lab EPFL: Benchmarking the variety of different proposed LLM optimizers: Muon, AdEMAMix, ... all in the same setting, tuned, with varying model size, batch size, and training duration! Huge sweep of experiments by Andrei Semenov Matteo Pagliardini M Jaggi

Likes: 331

Replies: 5

Retweets: 65

Awni Hannun

@awnihannun

3 months ago

This is one of my favorite results from our work on scaling laws for QAT. It helps answer the question: should you train an 8-bit model or a 4 bit model that has twice as many parameters? Both are the same size in RAM which is highly correlated with generation latency. Or even

Likes: 123

Replies: 7

Retweets: 15

Atli Kosson

@atlikosson

2 months ago

The Maximal Update Parameterization (µP) allows LR transfer from small to large models, saving costly tuning. But why is independent weight decay (IWD) essential for it to work? We find µP stabilizes early training (like an LR warmup), but IWD takes over in the long term! 🧵

Likes: 289

Replies: 11

Retweets: 41

Aleksandr Dremov

@alexdremov_me

2 months ago

My favorite beatbox machine is my Mac struggling to play Spotify under 100% system load

Likes: 1

Replies: 0

Retweets: 0

Aleksandr Dremov

@alexdremov_me

a month ago

8 8 6 2 And 2 being something like “the writing is not good enough”. Well, thanks for the valuable advice I guess

Likes: 3

Replies: 1

Retweets: 0

Aleksandr Dremov

@alexdremov_me

a month ago

Well, we developed an intuitive framework for comparing different cooldown shapes (ps. there’s nothing special about the sqrt shape) arxiv.org/abs/2508.01483

Likes: 2

Replies: 1

Retweets: 1

Aleksandr Dremov

@alexdremov_me

a month ago

there’s more and more reasons to submit your paper to TMLR 🙂

Likes: 2

Replies: 0

Retweets: 0