Konstantin Mishchenko (@konstmish) 's Twitter Profile
Konstantin Mishchenko

@konstmish

Research Scientist @AIatMeta
Previously Researcher @ Samsung AI
Outstanding Paper Award @icmlconf 2023
Action Editor @TmlrOrg
I tweet about ML papers and math

ID: 1272954231721426945

Link: http://konstmish.com/
Joined: 16-06-2020 18:08:40

606 Tweets

5.5K Followers

605 Following

Konstantin Mishchenko (@konstmish) 's Twitter Profile Photo

OpenReview's LaTeX parser seems to be quite bad and it makes it very painful to be a reviewer sometimes. For example:
"Assume $\hat{f}_n$ is smooth over $​ C_{n}$"
can be parsed only if it's split into two paragraphs, which makes no sense.
Can you please fix this? Open Review
Ernest Ryu (@ernestryu) 's Twitter Profile Photo

<a href="/shuvo_das_gupta/">Shuvomoy Das Gupta</a> (Columbia) and I (UCLA) are starting an optimization seminar series! Our first speaker, <a href="/aaron_defazio/">Aaron Defazio</a> (Meta), presented Schedules &amp; Schedule-Free Learning.

Aaron will give his NeurIPS 2024 Oral work next week. This is a longer version. (Video link in reply)
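
As background (not part of the announcement): the schedule-free idea, as I understand it from the paper, replaces the learning-rate schedule with an interpolation/averaging scheme. A rough sketch of the SGD variant, notation mine:

```latex
% Schedule-Free SGD, sketch only; see the paper for the exact method.
y_t     = (1-\beta)\, z_t + \beta\, x_t            % point where the gradient is evaluated
z_{t+1} = z_t - \gamma\, \nabla f(y_t)             % base SGD sequence with a constant step size
x_{t+1} = (1 - c_{t+1})\, x_t + c_{t+1}\, z_{t+1}  % running average with c_{t+1} = 1/(t+1), returned at the end
```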
Alex Shtoff (@alexshtf) 's Twitter Profile Photo

I'm happy / thrilled / delighted / excited / passionate to share that our paper "A Stochastic Approach to the Subset Selection Problem via Mirror Descent" will appear in ICLR 2025 this year.

Selecting a good subset of objects is fundamental not only in CS, but also in ML, i.e.
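
The thread is cut off here, but as a toy illustration of subset selection via mirror descent in general (my own sketch, not the authors' algorithm): keep a weight vector on the simplex, sample subsets for stochastic feedback, and update with an exponentiated-gradient step.

```python
import numpy as np

def entropic_md_subset(score_fn, n_items, k, steps=200, lr=0.5, seed=0):
    """Toy sketch (not the paper's algorithm): choose a size-k subset by running
    entropic mirror descent (exponentiated gradient) on item weights, using sampled
    subsets to get stochastic feedback from score_fn(subset) -> float."""
    rng = np.random.default_rng(seed)
    w = np.full(n_items, 1.0 / n_items)    # weights on the probability simplex
    for _ in range(steps):
        subset = rng.choice(n_items, size=k, replace=False, p=w)
        reward = score_fn(subset)
        grad = np.zeros(n_items)
        grad[subset] = reward              # crude score-based gradient estimate
        w = w * np.exp(lr * grad)          # multiplicative (mirror) update
        w /= w.sum()                       # Bregman projection back onto the simplex
    return np.argsort(-w)[:k]

# usage: prefer items with large entries of some value vector v (made-up example)
v = np.linspace(0.0, 1.0, 20)
print(entropic_md_subset(lambda s: v[s].sum(), n_items=20, k=5))
```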
Simo Ryu (@cloneofsimo) 's Twitter Profile Photo

What students expect from ML job:

analysis of sharpness affecting generalization bounds
analysis of NTK parameterization on convergence
sub-quadratic attention
novel convolution architecture
novel SDE solver for fast sampling
Rao-Blackwellized gradient estimator for faster
Sham Kakade (@shamkakade6) 's Twitter Profile Photo

1/n In new work, we draw connections between accelerated SGD and various recent optimizers, including AdEMAMix, Schedule-Free optimizers, and MARS, and use them to design ‘Simplified-AdEMAMix’, which matches the performance of AdEMAMix without any extra momentum buffer.
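
For readers who want the shapes of the objects being connected (background only, not the paper's derivation): accelerated/heavy-ball SGD keeps one momentum buffer, while AdEMAMix, as I recall it, adds a second, much slower EMA of the gradients and mixes both into the Adam-style step.

```latex
% Heavy-ball / accelerated SGD (one buffer):
m_t = \beta\, m_{t-1} + g_t, \qquad x_{t+1} = x_t - \gamma\, m_t
% AdEMAMix-style step (sketch from memory; hats denote bias-corrected EMAs):
m^{\mathrm{slow}}_t = \beta_3\, m^{\mathrm{slow}}_{t-1} + (1-\beta_3)\, g_t, \qquad
x_{t+1} = x_t - \gamma\, \frac{\hat m_t + \alpha\, m^{\mathrm{slow}}_t}{\sqrt{\hat v_t} + \varepsilon}
```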

Corca (@corca_math) 's Twitter Profile Photo

From the very beginning, one of our main goals was: Typing math should be as natural as typing plain text. No syntax, no coding—just start typing, no learning required

Jeremy Bernstein (@jxbz) 's Twitter Profile Photo

I just wrote my first blog post in four years! It is called "Deriving Muon". It covers the theory that led to Muon and how, for me, Muon is a meaningful example of theory leading practice in deep learning

(1/11)
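
For anyone who has not seen Muon yet, here is a minimal sketch of its core step as I understand it from the public write-ups (the Newton-Schulz coefficients and hyperparameters are quoted from memory and only illustrative; the blog post has the actual derivation):

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately map G = U S V^T to U V^T with a few Newton-Schulz iterations,
    avoiding an explicit SVD. Coefficients as in the public Muon code, from memory."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)                  # normalize so singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # odd polynomial pushing singular values toward 1
    return X.T if transposed else X

def muon_step(weight, grad, momentum_buf, beta=0.95, lr=0.02):
    """One Muon-style update for a single 2-D weight matrix (sketch, hyperparameters illustrative)."""
    with torch.no_grad():
        momentum_buf.mul_(beta).add_(grad)    # heavy-ball momentum buffer
        weight.add_(newton_schulz_orthogonalize(momentum_buf), alpha=-lr)
```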
AlgoPerf (@algoperf) 's Twitter Profile Photo

And the winner in the self-tuning ruleset, based on Schedule-Free AdamW, demonstrated a new level of effectiveness for completely hyperparameter-free neural network training: roughly 10% faster training compared to a NadamW baseline with well-tuned default hyperparameters.
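
If you want to try it, the reference implementation is published as the schedulefree package; a minimal usage sketch, assuming I remember the API correctly (the train/eval switching is the part that is easy to forget):

```python
import torch
import schedulefree  # pip install schedulefree

model = torch.nn.Linear(10, 1)
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=2.5e-3, warmup_steps=100)

optimizer.train()                           # schedule-free optimizers track a train/eval mode
for x, y in [(torch.randn(8, 10), torch.randn(8, 1))] * 10:
    optimizer.zero_grad()
    torch.nn.functional.mse_loss(model(x), y).backward()
    optimizer.step()

optimizer.eval()                            # switch to the averaged weights before evaluating
```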

Konstantin Mishchenko (@konstmish) 's Twitter Profile Photo

It's crazy how good DiLoCo is. It's currently the main optimizer for distributed optimization, essentially a free lunch for increasing the critical batch size (beyond which the final loss starts degrading) by a factor of 50.
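
For context, DiLoCo's structure is roughly an inner/outer loop: each worker takes many local AdamW steps, and the averaged parameter delta is then applied as a pseudo-gradient by an outer momentum step. A single-process sketch of one round (my reading of the paper; hyperparameters illustrative, and the paper uses Nesterov momentum in the outer step where I use plain heavy-ball):

```python
import copy
import torch

def diloco_round(global_model, worker_loaders, outer_buf, inner_steps=100,
                 inner_lr=1e-4, outer_lr=0.7, outer_momentum=0.9):
    """One DiLoCo round, simulated in a single process (sketch, not the reference code)."""
    start = [p.detach().clone() for p in global_model.parameters()]
    delta = [torch.zeros_like(p) for p in start]

    for loader in worker_loaders:                         # each loader stands in for one worker
        local = copy.deepcopy(global_model)
        opt = torch.optim.AdamW(local.parameters(), lr=inner_lr)
        for _, (x, y) in zip(range(inner_steps), loader):
            opt.zero_grad()
            torch.nn.functional.mse_loss(local(x), y).backward()
            opt.step()
        for d, p0, p in zip(delta, start, local.parameters()):
            d += (p0 - p.detach()) / len(worker_loaders)  # averaged pseudo-gradient

    with torch.no_grad():                                 # outer momentum step on the global model
        for p, d, m in zip(global_model.parameters(), delta, outer_buf):
            m.mul_(outer_momentum).add_(d)
            p.sub_(outer_lr * m)
```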

Konstantin Mishchenko (@konstmish) 's Twitter Profile Photo

Apparently, just making phrases such as "by the way" their own tokens can give +4% on lots of different tasks. Tokenizers still seem to be pretty disconnected from how we use them in LLMs, I wonder if we can use evolutionary algorithms or something like that to improve them.
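
(Not the paper's recipe, just the mechanism it builds on: with the Hugging Face tokenizers you can register multi-word strings as single tokens and grow the embedding table to match, something like this.)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

num_added = tokenizer.add_tokens(["by the way", "as a matter of fact"])
model.resize_token_embeddings(len(tokenizer))   # new rows start from a random initialization

print(tokenizer.tokenize("by the way, this phrase is now a single token"))
```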

Joelle Pineau (@jpineau1) 's Twitter Profile Photo

We're shipping the first models in the Llama 4 herd! Llama 4 Scout and Llama 4 Maverick are our first MoE models, natively multimodal and our most advanced yet — open-sourced as always. ai.meta.com/blog/llama-4-m…

Seunghyun Seo (@seunghyunseo7) 's Twitter Profile Photo

arxiv.org/abs/2504.05295
Muon, Scion, and now Dion.
Quick skimming... the authors propose a low-rank-update version of Muon for faster training in a distributed setup.
The key ideas are a low-rank approximation (LRA) and an error-feedback mechanism (for the extremely low-rank scenario).
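
Error feedback with low-rank compression is a standard trick, so here is a generic sketch of the pattern being referred to (my illustration, not the paper's algorithm; I use an SVD where the paper presumably uses something cheaper): compress what you can this round and carry the residual forward so nothing is permanently dropped.

```python
import torch

def low_rank_with_error_feedback(update, error_buf, rank=8):
    """Generic error-feedback compression sketch: communicate a rank-`rank`
    approximation of (update + residual) and keep the leftover locally."""
    corrected = update + error_buf                                 # add back what was dropped before
    U, S, Vh = torch.linalg.svd(corrected, full_matrices=False)
    low_rank = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]   # what actually gets communicated
    new_error = corrected - low_rank                               # residual carried to the next round
    return low_rank, new_error

# usage sketch
W_update = torch.randn(256, 128)
err = torch.zeros_like(W_update)
compressed, err = low_rank_with_error_feedback(W_update, err, rank=8)
```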
Nando de Freitas (@nandodf) 's Twitter Profile Photo

RL is not all you need, nor attention nor Bayesianism nor free energy minimisation, nor an age of first person experience. Such statements are propaganda. You need thousands of people working hard on data pipelines, scaling infrastructure, HPC, apps with feedback to drive