Amirkeivan Mohtashami (@akmohtashami_a) 's Twitter Profile
Amirkeivan Mohtashami

@akmohtashami_a

PhD - EPFL

ID: 1661997220768432128

Joined: 26-05-2023 07:28:55

31 Tweets

183 Followers

88 Following

Maksym Andriushchenko @ ICLR (@maksym_andr) 's Twitter Profile Photo

🚨Excited to share our new work “Sharpness-Aware Minimization Leads to Low-Rank Features” arxiv.org/abs/2305.16292!

❓We know SAM improves generalization, but can we better understand the structure of features learned by SAM?

(with Dara Bahri, Hossein Mobahi, N. Flammarion)
🧵1/n
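
For context on what SAM actually does, here is a minimal PyTorch sketch of the standard two-step SAM update (an illustrative sketch, not the paper's code; the `sam_step` helper and the `rho=0.05` default are assumptions):

```python
import torch

def sam_step(model, loss_fn, batch, base_optimizer, rho=0.05):
    """One SAM update: perturb the weights towards higher loss, then step
    using the gradient computed at the perturbed point."""
    x, y = batch

    # 1) Gradient at the current weights.
    loss_fn(model(x), y).backward()

    # 2) Ascend to the (approximate) worst-case point within an L2 ball of radius rho.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                perturbations.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            perturbations.append(e)
    model.zero_grad()

    # 3) Gradient at the perturbed weights, then undo the perturbation and step.
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), perturbations):
            if e is not None:
                p.sub_(e)
    base_optimizer.step()
    model.zero_grad()
```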
Amirkeivan Mohtashami (@akmohtashami_a) 's Twitter Profile Photo

📢🚀 It's here! 💥👏 Just released the code for landmark attention! 🔗 Check it out on GitHub: github.com/epfml/landmark… #Transformers #LLM #GPT4

Konstantin Mishchenko (@konstmish) 's Twitter Profile Photo

SGD in practice usually doesn't sample data uniformly and instead goes over the dataset in epochs, which is called Random Reshuffling. We've known for some time that RR is better than SGD for convex functions and now it's been proven for nonconvex:
arxiv.org/abs/2305.19259
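
For anyone unfamiliar with the terminology, the difference between the two sampling schemes is easy to state in code; an illustrative sketch (the helper names are made up, not from the paper):

```python
import random

def sgd_uniform_indices(n, num_steps, seed=0):
    """Plain SGD sampling: at every step, draw an index uniformly with replacement."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(num_steps)]

def random_reshuffling_indices(n, num_epochs, seed=0):
    """Random Reshuffling (RR): shuffle once per epoch and sweep through the data,
    so every example is visited exactly once per epoch."""
    rng = random.Random(seed)
    order = []
    for _ in range(num_epochs):
        perm = list(range(n))
        rng.shuffle(perm)
        order.extend(perm)
    return order
```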
Matteo Pagliardini (@matpagliardini) 's Twitter Profile Photo

How to speed up the training of transformers over large sequences? Many methods sparsify the attention matrix with static patterns.
Could we use dynamic (e.g. adaptive) patterns?
A thread! Joint work with Daniele Paliotta (equal contribution), François Fleuret, and Martin Jaggi
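
To make the static-vs-dynamic distinction concrete, here is an illustrative PyTorch sketch (this is not the method from the thread; a fixed local window stands in for a static pattern and a per-query top-k selection for a data-dependent one):

```python
import torch

def masked_attention(q, k, v, mask):
    """Standard attention restricted to the positions allowed by a boolean mask."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def static_local_mask(seq_len, window):
    """Static pattern: each query attends to a fixed local window, independent of the data."""
    idx = torch.arange(seq_len)
    return (idx[:, None] - idx[None, :]).abs() <= window

def dynamic_topk_mask(q, k, topk):
    """Dynamic pattern: each query keeps only its top-k highest-scoring keys."""
    scores = q @ k.transpose(-2, -1)
    kept = scores.topk(topk, dim=-1).indices
    return torch.zeros_like(scores, dtype=torch.bool).scatter(-1, kept, True)
```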
Sadegh Farhadkhani (@sadegh_farhad) 's Twitter Profile Photo

ICML 22&23, and NeurIPS 22&23, all have been/will be held in the US. I know it's not easy to organize a conference of this size. Yet I am really curious to know whether people who have difficulties traveling to the US were part of the equation for these decisions or not.

Ahmad Beirami @ ICLR 2025 (@abeirami) 's Twitter Profile Photo

Both ICML and NeurIPS were held in the US in 2022-23! The US is one of the most visa-unfriendly countries (appointment wait times 6+ months, processing another 6+ months), and this is significantly hurting diversity & inclusion. We should strive to do better! #ICML2023 #NeurIPS2023

Yann LeCun (@ylecun) 's Twitter Profile Photo

The US is notorious for its mass shootings, but its immigration policies (or lack thereof) set a new standard for the art of shooting yourself in the foot with a bazooka. Advice to graduate students from countries the US doesn't like: just go to Europe.

Maksym Andriushchenko @ ICLR (@maksym_andr) 's Twitter Profile Photo

🚨 I'm looking for a postdoc position to start in Fall 2024! My most recent research interests are related to understanding foundation models (especially LLMs!), making them more reliable, and developing principled methods for deep learning. More info: andriushchenko.me

Helia (Helyaneh) Ziaei Jam (@heliazj) 's Twitter Profile Photo

Now out in Nature Communications! We developed EnsembleTR, an ensemble method to combine genotypes from 4 major tandem repeat callers and generated a genome-wide catalog of ~1.7 million TRs from 3550 samples in the 1000 Genomes and H3Africa cohorts. doi.org/10.1038/s41467…

Amirkeivan Mohtashami (@akmohtashami_a) 's Twitter Profile Photo

I am presenting Landmark Attention today at 17:15 at #NeurIPS2023. I will also present CoTFormer (arxiv.org/pdf/2310.10845) at the WANT workshop on Saturday. Excited to meet some of you at either.

Jeremy Howard (@jeremyphoward) 's Twitter Profile Photo

If you're a Python programmer looking to get started with CUDA, this weekend I'll be doing a free 1 hour tutorial on the absolute basics. Thanks to Andreas Köpf, Mark Saroufim, and dfsgdf for hosting this on the CUDA MODE server. :D Click here: discord.gg/6z79K5Yh?event…

Atli Kosson (@atlikosson) 's Twitter Profile Photo

Why does AdamW outperform Adam with L2-regularization?

Its effectiveness seems to stem from how it affects the angular update size of weight vectors!

This may also be the case for Weight Standardization, lr warmup and weight decay in general!
🧵 for arxiv.org/abs/2305.17212 1/10
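
For reference, the distinction the thread builds on, Adam with L2 regularization versus AdamW's decoupled weight decay, can be written out directly. An illustrative single-tensor sketch (not the paper's code; all inputs are torch tensors):

```python
import torch

def adam_l2_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """Adam + L2: the decay term is added to the gradient, so it gets rescaled by
    Adam's per-coordinate adaptive step size."""
    grad = grad + wd * p
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (v_hat.sqrt() + eps)
    return p, m, v

def adamw_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """AdamW: the decay is applied directly to the weights, decoupled from the
    adaptive scaling of the gradient."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * (m_hat / (v_hat.sqrt() + eps) + wd * p)
    return p, m, v
```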
Google AI (@googleai) 's Twitter Profile Photo

People often teach one another by simply explaining a problem using natural language. Today we introduce an approach for model training wherein a teacher #LLM generates natural language instructions to train a student model with improved privacy. goo.gle/3P7SrXx

Amirkeivan Mohtashami (@akmohtashami_a) 's Twitter Profile Photo

Skip connections are not enough! We show that providing the individual outputs of previous layers to each Transformer layer significantly boosts its performance. See the thread for more! Had an amazing time collaborating with Matteo Pagliardini, François Fleuret, and Martin Jaggi.
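
A hedged sketch of the idea (not the exact architecture from the paper): give each block a learned weighted combination of the outputs of all earlier blocks, rather than only the output of the block directly before it. The class name and the softmax-normalized mixing weights below are assumptions for illustration:

```python
import torch
import torch.nn as nn

class DenselyConnectedStack(nn.Module):
    """Stack of Transformer blocks where block k sees a learned mixture of the
    embedding output and the outputs of blocks 0..k-1, not just block k-1."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        # One learnable mixing weight per (block, earlier output) pair.
        self.mix = nn.ParameterList(
            [nn.Parameter(torch.zeros(i + 1)) for i in range(len(blocks))]
        )

    def forward(self, x):
        outputs = [x]  # embedding output, then each block's output
        for block, w in zip(self.blocks, self.mix):
            coeffs = torch.softmax(w, dim=0)
            mixed = sum(c * h for c, h in zip(coeffs, outputs))
            outputs.append(block(mixed))
        return outputs[-1]
```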

Saleh Ashkboos (@ashkboossaleh) 's Twitter Profile Photo

[1/7] Happy to release 🥕QuaRot, a post-training quantization scheme that enables 4-bit inference of LLMs by removing the outlier features. 
With Amirkeivan Mohtashami, @max_croci, Dan Alistarh, Torsten Hoefler 🇨🇭, James Hensman, and others

Paper: arxiv.org/abs/2404.00456
Code: github.com/spcl/QuaRot
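
To see why removing outlier features helps 4-bit inference, here is a small, hedged illustration (not the QuaRot implementation): rotating activations with an orthogonal matrix spreads outlier mass across coordinates, which shrinks the range a low-bit quantizer has to cover, and the rotation can be undone (or folded into adjacent weight matrices) because it is orthogonal.

```python
import torch

def random_orthogonal(d, seed=0):
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(d, d, generator=g))
    return q

def quantize_int4(x):
    """Simple symmetric per-tensor 4-bit quantization (illustrative only)."""
    scale = x.abs().max() / 7.0
    return torch.clamp((x / scale).round(), -8, 7) * scale

d = 256
x = torch.randn(1024, d)
x[:, 0] *= 50.0                       # inject an "outlier feature"
Q = random_orthogonal(d)
x_rot = x @ Q                         # rotate activations before quantizing

err_plain = (quantize_int4(x) - x).pow(2).mean()
err_rot = (quantize_int4(x_rot) @ Q.T - x).pow(2).mean()   # rotate back, compare in the original space
print(f"plain: {err_plain.item():.4f}  rotated: {err_rot.item():.4f}")  # rotated typically quantizes better
```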
Alex Hägele (@haeggee) 's Twitter Profile Photo

If you didn't see it yesterday: Mixture-of-Depths is a really nice idea for dynamic compute.
I decided to quickly code up an MoD block in a small GPT and try it out -- if you want to play with it too (and check correctness pls!), the code is here:
github.com/epfml/llm-base…
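
For anyone who wants the gist before opening the repo, here is a hedged sketch of the Mixture-of-Depths routing idea (not the code from the repository linked above; the class name, sigmoid gating, and 50% capacity default are assumptions):

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """A router scores every token; only the top-k tokens per sequence go through
    the block's computation, the rest pass through unchanged."""

    def __init__(self, block, d_model, capacity=0.5):
        super().__init__()
        self.block = block        # computes the residual update, e.g. attention + MLP without the skip
        self.router = nn.Linear(d_model, 1)
        self.capacity = capacity  # fraction of tokens processed by the block

    def forward(self, x):
        B, T, d = x.shape
        k = max(1, int(self.capacity * T))
        scores = self.router(x).squeeze(-1)            # (B, T) router logits
        top = scores.topk(k, dim=-1).indices           # (B, k) positions of tokens to process
        idx = top.unsqueeze(-1).expand(B, k, d)
        selected = torch.gather(x, 1, idx)             # (B, k, d)
        # Gate the update by the router score so routing stays differentiable,
        # then scatter the processed tokens back into place.
        gate = torch.sigmoid(torch.gather(scores, 1, top)).unsqueeze(-1)
        update = gate * self.block(selected)
        return x + torch.zeros_like(x).scatter(1, idx, update)
```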
Saleh Ashkboos (@ashkboossaleh) 's Twitter Profile Photo

We are working on quantizing #Llama3 using #QuaRot. I got some interesting results on WikiText PPL (FP16 model):
(Seq Len=2048) LLaMa2-7B: 5.47, LLaMa3-8B: 6.14
(Seq Len=4096) LLaMa2-7B: 5.11, LLaMa3-8B: 5.75
Maybe WikiText PPL is not a great metric to report anymore!
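
As a point of reference for the numbers above, WikiText perplexity for a causal LM is typically computed by chunking the test set into fixed-length windows, averaging the per-token negative log-likelihood, and exponentiating. A hedged sketch (assuming a Hugging Face-style model whose output exposes `.logits`):

```python
import math
import torch

def perplexity(model, token_ids, seq_len=2048, device="cuda"):
    """Perplexity of a causal LM over a 1-D tensor of token ids."""
    nll_sum, n_tokens = 0.0, 0
    for start in range(0, token_ids.numel() - 1, seq_len):
        chunk = token_ids[start:start + seq_len + 1].to(device)
        if chunk.numel() < 2:
            break
        with torch.no_grad():
            logits = model(chunk[:-1].unsqueeze(0)).logits[0]   # (T, vocab)
        nll_sum += torch.nn.functional.cross_entropy(
            logits, chunk[1:], reduction="sum"
        ).item()
        n_tokens += chunk.numel() - 1
    return math.exp(nll_sum / n_tokens)
```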