Konstantin Mishchenko (@konstmish) 's Twitter Profile
Konstantin Mishchenko

@konstmish

Research Scientist @AIatMeta
Previously Researcher @ Samsung AI
Outstanding Paper Award @icmlconf 2023
Action Editor @TmlrOrg
I tweet about ML papers and math

ID: 1272954231721426945

Link: http://konstmish.com/
Joined: 16-06-2020 18:08:40

606 Tweets

5.5K Followers

605 Following

Konstantin Mishchenko (@konstmish) 's Twitter Profile Photo

OpenReview's LaTeX parser seems to be quite bad and it makes it very painful to be a reviewer sometimes. For example:
"Assume $\hat{f}_n$ is smooth over $​ C_{n}$"
can be parsed only if it's split into two paragraphs, which makes no sense.
Can you please fix this? Open Review
Ernest Ryu (@ernestryu) 's Twitter Profile Photo

<a href="/shuvo_das_gupta/">Shuvomoy Das Gupta</a> (Columbia) and I (UCLA) are starting an optimization seminar series! Our first speaker, <a href="/aaron_defazio/">Aaron Defazio</a> (Meta), presented Schedules &amp; Schedule-Free Learning.

Aaron will give his NeurIPS 2024 Oral work next week. This is a longer version. (Video link in reply)
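
As background (not part of the announcement): the schedule-free idea, as I understand it from the paper, replaces the learning-rate schedule with an interpolation/averaging scheme. A rough sketch of the SGD variant, notation mine:

```latex
% Schedule-Free SGD, sketch only; see the paper for the exact method.
y_t     = (1-\beta)\, z_t + \beta\, x_t            % point where the gradient is evaluated
z_{t+1} = z_t - \gamma\, \nabla f(y_t)             % base SGD sequence with a constant step size
x_{t+1} = (1 - c_{t+1})\, x_t + c_{t+1}\, z_{t+1}  % running average with c_{t+1} = 1/(t+1), returned at the end
```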
Alex Shtoff (@alexshtf) 's Twitter Profile Photo

I'm happy / thrilled / delighted / excited / passionate to share that our paper "A Stochastic Approach to the Subset Selection Problem via Mirror Descent" will appear in ICLR 2025 this year.

Selecting a good subset of objects is fundamental not only in CS, but also in ML, i.e.
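
The thread is cut off here, but as a toy illustration of subset selection via mirror descent in general (my own sketch, not the authors' algorithm): keep a weight vector on the simplex, sample subsets for stochastic feedback, and update with an exponentiated-gradient step.

```python
import numpy as np

def entropic_md_subset(score_fn, n_items, k, steps=200, lr=0.5, seed=0):
    """Toy sketch (not the paper's algorithm): choose a size-k subset by running
    entropic mirror descent (exponentiated gradient) on item weights, using sampled
    subsets to get stochastic feedback from score_fn(subset) -> float."""
    rng = np.random.default_rng(seed)
    w = np.full(n_items, 1.0 / n_items)    # weights on the probability simplex
    for _ in range(steps):
        subset = rng.choice(n_items, size=k, replace=False, p=w)
        reward = score_fn(subset)
        grad = np.zeros(n_items)
        grad[subset] = reward              # crude score-based gradient estimate
        w = w * np.exp(lr * grad)          # multiplicative (mirror) update
        w /= w.sum()                       # Bregman projection back onto the simplex
    return np.argsort(-w)[:k]

# usage: prefer items with large entries of some value vector v (made-up example)
v = np.linspace(0.0, 1.0, 20)
print(entropic_md_subset(lambda s: v[s].sum(), n_items=20, k=5))
```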
Simo Ryu (@cloneofsimo) 's Twitter Profile Photo

What students expect from ML job:

analysis of sharpness affecting generalization bounds
analysis of NTK parameterization on convergence
sub-quadratic attention
novel convolution architecture
novel SDE solver for fast sampling
Rao-Blackwellized gradient estimator for faster
Sham Kakade (@shamkakade6) 's Twitter Profile Photo

1/n In new work, we draw connections between accelerated SGD and various recent optimizers, including AdEMAMix, Schedule-Free optimizers, and MARS, and use them to design ‘Simplified-AdEMAMix’, which matches the performance of AdEMAMix without any extra momentum buffer.
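
For readers who want the shapes of the objects being connected (background only, not the paper's derivation): accelerated/heavy-ball SGD keeps one momentum buffer, while AdEMAMix, as I recall it, adds a second, much slower EMA of the gradients and mixes both into the Adam-style step.

```latex
% Heavy-ball / accelerated SGD (one buffer):
m_t = \beta\, m_{t-1} + g_t, \qquad x_{t+1} = x_t - \gamma\, m_t
% AdEMAMix-style step (sketch from memory; hats denote bias-corrected EMAs):
m^{\mathrm{slow}}_t = \beta_3\, m^{\mathrm{slow}}_{t-1} + (1-\beta_3)\, g_t, \qquad
x_{t+1} = x_t - \gamma\, \frac{\hat m_t + \alpha\, m^{\mathrm{slow}}_t}{\sqrt{\hat v_t} + \varepsilon}
```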

Corca (@corca_math) 's Twitter Profile Photo

From the very beginning, one of our main goals was: Typing math should be as natural as typing plain text. No syntax, no coding—just start typing, no learning required

Jeremy Bernstein (@jxbz) 's Twitter Profile Photo

I just wrote my first blog post in four years! It is called "Deriving Muon". It covers the theory that led to Muon and how, for me, Muon is a meaningful example of theory leading practice in deep learning

(1/11)
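
For anyone who has not seen Muon yet, here is a minimal sketch of its core step as I understand it from the public write-ups (the Newton-Schulz coefficients and hyperparameters are quoted from memory and only illustrative; the blog post has the actual derivation):

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately map G = U S V^T to U V^T with a few Newton-Schulz iterations,
    avoiding an explicit SVD. Coefficients as in the public Muon code, from memory."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)                  # normalize so singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # odd polynomial pushing singular values toward 1
    return X.T if transposed else X

def muon_step(weight, grad, momentum_buf, beta=0.95, lr=0.02):
    """One Muon-style update for a single 2-D weight matrix (sketch, hyperparameters illustrative)."""
    with torch.no_grad():
        momentum_buf.mul_(beta).add_(grad)    # heavy-ball momentum buffer
        weight.add_(newton_schulz_orthogonalize(momentum_buf), alpha=-lr)
```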
AlgoPerf (@algoperf) 's Twitter Profile Photo

And the winner in the self-tuning ruleset, based on Schedule-Free AdamW, demonstrated a new level of effectiveness for completely hyperparameter-free neural network training: roughly 10% faster training compared to a NadamW baseline with well-tuned default hyperparameters.
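
If you want to try it, the reference implementation is published as the schedulefree package; a minimal usage sketch, assuming I remember the API correctly (the train/eval switching is the part that is easy to forget):

```python
import torch
import schedulefree  # pip install schedulefree

model = torch.nn.Linear(10, 1)
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=2.5e-3, warmup_steps=100)

optimizer.train()                           # schedule-free optimizers track a train/eval mode
for x, y in [(torch.randn(8, 10), torch.randn(8, 1))] * 10:
    optimizer.zero_grad()
    torch.nn.functional.mse_loss(model(x), y).backward()
    optimizer.step()

optimizer.eval()                            # switch to the averaged weights before evaluating
```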

Konstantin Mishchenko (@konstmish) 's Twitter Profile Photo

It's crazy how good DiLoCo is. It's currently the main optimizer for distributed optimization, essentially a free lunch for increasing the critical batch size (beyond which the final loss starts degrading) by a factor of 50.
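
For context, DiLoCo's structure is roughly an inner/outer loop: each worker takes many local AdamW steps, and the averaged parameter delta is then applied as a pseudo-gradient by an outer momentum step. A single-process sketch of one round (my reading of the paper; hyperparameters illustrative, and the paper uses Nesterov momentum in the outer step where I use plain heavy-ball):

```python
import copy
import torch

def diloco_round(global_model, worker_loaders, outer_buf, inner_steps=100,
                 inner_lr=1e-4, outer_lr=0.7, outer_momentum=0.9):
    """One DiLoCo round, simulated in a single process (sketch, not the reference code)."""
    start = [p.detach().clone() for p in global_model.parameters()]
    delta = [torch.zeros_like(p) for p in start]

    for loader in worker_loaders:                         # each loader stands in for one worker
        local = copy.deepcopy(global_model)
        opt = torch.optim.AdamW(local.parameters(), lr=inner_lr)
        for _, (x, y) in zip(range(inner_steps), loader):
            opt.zero_grad()
            torch.nn.functional.mse_loss(local(x), y).backward()
            opt.step()
        for d, p0, p in zip(delta, start, local.parameters()):
            d += (p0 - p.detach()) / len(worker_loaders)  # averaged pseudo-gradient

    with torch.no_grad():                                 # outer momentum step on the global model
        for p, d, m in zip(global_model.parameters(), delta, outer_buf):
            m.mul_(outer_momentum).add_(d)
            p.sub_(outer_lr * m)
```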

Konstantin Mishchenko (@konstmish) 's Twitter Profile Photo

Apparently, just making phrases such as "by the way" their own tokens can give +4% on lots of different tasks. Tokenizers still seem to be pretty disconnected from how we use them in LLMs, I wonder if we can use evolutionary algorithms or something like that to improve them.
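
(Not the paper's recipe, just the mechanism it builds on: with the Hugging Face tokenizers you can register multi-word strings as single tokens and grow the embedding table to match, something like this.)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

num_added = tokenizer.add_tokens(["by the way", "as a matter of fact"])
model.resize_token_embeddings(len(tokenizer))   # new rows start from a random initialization

print(tokenizer.tokenize("by the way, this phrase is now a single token"))
```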

Joelle Pineau (@jpineau1) 's Twitter Profile Photo

We're shipping the first models in the Llama 4 herd! Llama 4 Scout and Llama 4 Maverick are our first MoE models, natively multimodal and our most advanced yet — open-sourced as always. ai.meta.com/blog/llama-4-m…

Seunghyun Seo (@seunghyunseo7) 's Twitter Profile Photo

arxiv.org/abs/2504.05295
Muon, Scion, and now Dion.
Quick skimming... the authors propose a low-rank-update version of Muon for faster training in a distributed setup.
The key ideas are a low-rank approximation (LRA) and an error-feedback mechanism (for the extremely low-rank scenario).
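
Error feedback with low-rank compression is a standard trick, so here is a generic sketch of the pattern being referred to (my illustration, not the paper's algorithm; I use an SVD where the paper presumably uses something cheaper): compress what you can this round and carry the residual forward so nothing is permanently dropped.

```python
import torch

def low_rank_with_error_feedback(update, error_buf, rank=8):
    """Generic error-feedback compression sketch: communicate a rank-`rank`
    approximation of (update + residual) and keep the leftover locally."""
    corrected = update + error_buf                                 # add back what was dropped before
    U, S, Vh = torch.linalg.svd(corrected, full_matrices=False)
    low_rank = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]   # what actually gets communicated
    new_error = corrected - low_rank                               # residual carried to the next round
    return low_rank, new_error

# usage sketch
W_update = torch.randn(256, 128)
err = torch.zeros_like(W_update)
compressed, err = low_rank_with_error_feedback(W_update, err, rank=8)
```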
Nando de Freitas (@nandodf) 's Twitter Profile Photo

RL is not all you need, nor attention nor Bayesianism nor free energy minimisation, nor an age of first person experience. Such statements are propaganda. You need thousands of people working hard on data pipelines, scaling infrastructure, HPC, apps with feedback to drive