Aleksandr Dremov (@alexdremov_me)'s Twitter Profile
Aleksandr Dremov

@alexdremov_me

ML Engineer | Student at EPFL

ID: 817763527679062016

Link: https://alexdremov.me · Joined: 07-01-2017 16:03:13

36 Tweets

37 Followers

115 Following

Daniel Paleka (@dpaleka)'s Twitter Profile Photo

It has not been reported much, but I believe ETH Zurich has, as of last week, banned new Master and PhD students who attended a long list of universities in China, Russia, and Iran. 🧵

v (@iavins)'s Twitter Profile Photo

Collection of insane and fun facts about SQLite. Let's go! SQLite is the most deployed and most used database. There are over one trillion (1,000,000,000,000, or a million million) SQLite databases in active use. It is maintained by three people. They don't allow outside…

Aleksandr Dremov (@alexdremov_me)'s Twitter Profile Photo

Why is Flash Attention so fast? Find out how Flash Attention works. Afterward, we'll polish our understanding by writing a GPU kernel of the algorithm in Triton. alexdremov.me/understanding-… #MachineLearning
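
The linked post explains the tiling and online-softmax trick behind Flash Attention before building a Triton kernel. As a rough, self-contained illustration of that streaming-softmax idea (plain NumPy, single query vector, not the Triton kernel from the post):

```python
import numpy as np

def naive_attention(q, K, V):
    # Reference: softmax(q K^T / sqrt(d)) V for a single query vector.
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def online_attention(q, K, V, block=4):
    # Flash-Attention-style streaming softmax: process K/V in tiles,
    # keeping a running max (m), running normalizer (l), and running
    # output (o), so the full score vector is never materialized.
    d = q.shape[0]
    m = -np.inf                    # running max of scores seen so far
    l = 0.0                        # running sum of exp(score - m)
    o = np.zeros(V.shape[1])
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Kb @ q / np.sqrt(d)    # scores for this tile
        m_new = max(m, s.max())    # updated running max
        scale = np.exp(m - m_new)  # rescale old accumulators
        p = np.exp(s - m_new)      # unnormalized tile probabilities
        l = l * scale + p.sum()
        o = o * scale + p @ Vb
        m = m_new
    return o / l

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
assert np.allclose(naive_attention(q, K, V), online_attention(q, K, V))
```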

Alex Hägele (@haeggee)'s Twitter Profile Photo

New TMLR paper by Master (!) student @alexdremov_me: Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler. We finally understand the negative square root (1-sqrt) cooldown. TL;DR: It gets you the best bias-variance tradeoff :) w/ Atli Kosson, M Jaggi

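For context, the "1-sqrt" cooldown referenced here decays the learning rate as peak_lr · (1 − √(t / T_cooldown)) over the final phase of a Warmup-Stable-Decay run. A minimal sketch, with illustrative phase fractions that are not taken from the paper:

```python
import math

def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.1, cooldown_frac=0.2):
    # Warmup-Stable-Decay schedule with a (1 - sqrt) cooldown shape.
    # The phase fractions here are illustrative, not the paper's settings.
    warmup_steps = int(warmup_frac * total_steps)
    cooldown_steps = max(int(cooldown_frac * total_steps), 1)
    cooldown_start = total_steps - cooldown_steps
    if step < warmup_steps:                        # linear warmup
        return peak_lr * step / max(warmup_steps, 1)
    if step < cooldown_start:                      # stable phase
        return peak_lr
    frac = (step - cooldown_start) / cooldown_steps
    return peak_lr * (1 - math.sqrt(frac))         # 1-sqrt cooldown to zero
```
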
Alex Hägele (@haeggee)'s Twitter Profile Photo

New work from our MLO lab EPFL: benchmarking a variety of proposed LLM optimizers: Muon, AdEMAMix, ... all in the same setting, tuned, with varying model size, batch size, and training duration! Huge sweep of experiments by Andrei Semenov, Matteo Pagliardini, M Jaggi

Awni Hannun (@awnihannun)'s Twitter Profile Photo

This is one of my favorite results from our work on scaling laws for QAT. It helps answer the question: should you train an 8-bit model or a 4-bit model that has twice as many parameters? Both are the same size in RAM, which is highly correlated with generation latency. Or even…
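
The "same size in RAM" point is simple arithmetic: halving the bits per weight cancels doubling the parameter count. A back-of-the-envelope sketch (the 7B and 14B sizes are hypothetical, chosen only to illustrate the comparison):

```python
def model_bytes(num_params: float, bits_per_param: int) -> float:
    # Rough weight-memory estimate, ignoring activations, KV cache, and
    # quantization metadata (scales/zero-points), which add a small overhead.
    return num_params * bits_per_param / 8

# Hypothetical sizes for illustration: an 8-bit 7B model and a 4-bit 14B model
# occupy roughly the same RAM, which is the tradeoff the tweet asks about.
print(model_bytes(7e9, 8) / 1e9)    # ~7.0 GB of weights
print(model_bytes(14e9, 4) / 1e9)   # ~7.0 GB of weights
```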

Atli Kosson (@atlikosson)'s Twitter Profile Photo

The Maximal Update Parameterization (µP) allows LR transfer from small to large models, saving costly tuning. But why is independent weight decay (IWD) essential for it to work? We find µP stabilizes early training (like an LR warmup), but IWD takes over in the long term! 🧵

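For readers unfamiliar with the term, "independent weight decay" applies the decay term with its own coefficient rather than scaling it by the learning rate. A toy SGD sketch of the distinction (not the authors' implementation):

```python
import torch

def sgd_step_independent_wd(params, lr, wd):
    # One SGD step where weight decay is "independent": the decay term uses
    # its own coefficient instead of being multiplied by the learning rate
    # (contrast with the usual coupled form  p -= lr * (grad + wd * p)).
    # Toy illustration of the idea the tweet discusses.
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            p -= lr * p.grad   # gradient step, scaled by the learning rate
            p -= wd * p        # decay step, NOT scaled by the learning rate
```
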
Aleksandr Dremov (@alexdremov_me)'s Twitter Profile Photo

Well, we developed an intuitive framework for comparing different cooldown shapes (ps. there’s nothing special about the sqrt shape) arxiv.org/abs/2508.01483