Amirkeivan Mohtashami (@akmohtashami_a) 's Twitter Profile
Amirkeivan Mohtashami

@akmohtashami_a

PhD - EPFL

ID: 1661997220768432128

Joined: 26-05-2023 07:28:55

31 Tweets

183 Followers

88 Following

Maksym Andriushchenko @ ICLR (@maksym_andr) 's Twitter Profile Photo

🚨Excited to share our new work “Sharpness-Aware Minimization Leads to Low-Rank Features” arxiv.org/abs/2305.16292!

❓We know SAM improves generalization, but can we better understand the structure of features learned by SAM?

(with Dara Bahri, Hossein Mobahi, N. Flammarion)
🧵1/n
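
For context on what SAM actually does, here is a minimal PyTorch sketch of the standard two-step SAM update (an illustrative sketch, not the paper's code; the `sam_step` helper and the `rho=0.05` default are assumptions):

```python
import torch

def sam_step(model, loss_fn, batch, base_optimizer, rho=0.05):
    """One SAM update: perturb the weights towards higher loss, then step
    using the gradient computed at the perturbed point."""
    x, y = batch

    # 1) Gradient at the current weights.
    loss_fn(model(x), y).backward()

    # 2) Ascend to the (approximate) worst-case point within an L2 ball of radius rho.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                perturbations.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            perturbations.append(e)
    model.zero_grad()

    # 3) Gradient at the perturbed weights, then undo the perturbation and step.
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), perturbations):
            if e is not None:
                p.sub_(e)
    base_optimizer.step()
    model.zero_grad()
```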
Amirkeivan Mohtashami (@akmohtashami_a) 's Twitter Profile Photo

📢🚀 It's here! 💥👏 Just released the code for landmark attention! 🔗 Check it out on GitHub: github.com/epfml/landmark… #Transformers #LLM #GPT4

Konstantin Mishchenko (@konstmish) 's Twitter Profile Photo

SGD in practice usually doesn't sample data uniformly and instead goes over the dataset in epochs, which is called Random Reshuffling. We've known for some time that RR is better than SGD for convex functions and now it's been proven for nonconvex:
arxiv.org/abs/2305.19259
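
For anyone unfamiliar with the terminology, the difference between the two sampling schemes is easy to state in code; an illustrative sketch (the helper names are made up, not from the paper):

```python
import random

def sgd_uniform_indices(n, num_steps, seed=0):
    """Plain SGD sampling: at every step, draw an index uniformly with replacement."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(num_steps)]

def random_reshuffling_indices(n, num_epochs, seed=0):
    """Random Reshuffling (RR): shuffle once per epoch and sweep through the data,
    so every example is visited exactly once per epoch."""
    rng = random.Random(seed)
    order = []
    for _ in range(num_epochs):
        perm = list(range(n))
        rng.shuffle(perm)
        order.extend(perm)
    return order
```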
Matteo Pagliardini (@matpagliardini) 's Twitter Profile Photo

How to speed up the training of transformers over large sequences? Many methods sparsify the attention matrix with static patterns.
Could we use dynamic (e.g. adaptive) patterns?
A thread! Joint work with Daniele Paliotta (equal contribution), François Fleuret, and Martin Jaggi
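
To make the static-vs-dynamic distinction concrete, here is an illustrative PyTorch sketch (this is not the method from the thread; a fixed local window stands in for a static pattern and a per-query top-k selection for a data-dependent one):

```python
import torch

def masked_attention(q, k, v, mask):
    """Standard attention restricted to the positions allowed by a boolean mask."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def static_local_mask(seq_len, window):
    """Static pattern: each query attends to a fixed local window, independent of the data."""
    idx = torch.arange(seq_len)
    return (idx[:, None] - idx[None, :]).abs() <= window

def dynamic_topk_mask(q, k, topk):
    """Dynamic pattern: each query keeps only its top-k highest-scoring keys."""
    scores = q @ k.transpose(-2, -1)
    kept = scores.topk(topk, dim=-1).indices
    return torch.zeros_like(scores, dtype=torch.bool).scatter(-1, kept, True)
```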
Sadegh Farhadkhani (@sadegh_farhad) 's Twitter Profile Photo

ICML 22&23, and NeurIPS 22&23, all have been/will be held in the US. I know it's not easy to organize a conference of this size. Yet I am really curious to know whether people who have difficulties traveling to the US were part of the equation for these decisions or not.

Ahmad Beirami @ ICLR 2025 (@abeirami) 's Twitter Profile Photo

Both ICML and NeurIPS were held in the US in 2022-23! The US is one of the most visa-unfriendly countries (appointment wait times 6+ months, processing another 6+ months), and this is significantly hurting diversity & inclusion. We should strive to do better! #ICML2023 #NeurIPS2023

Yann LeCun (@ylecun) 's Twitter Profile Photo

The US is notorious for its mass shootings, but its immigration policies (or lack thereof) set a new standard for the art of shooting yourself in the foot with a bazooka. Advice to graduate students from countries the US doesn't like: just go to Europe.

Maksym Andriushchenko @ ICLR (@maksym_andr) 's Twitter Profile Photo

🚨 I'm looking for a postdoc position to start in Fall 2024! My most recent research interests are related to understanding foundation models (especially LLMs!), making them more reliable, and developing principled methods for deep learning. More info: andriushchenko.me

Helia (Helyaneh) Ziaei Jam (@heliazj) 's Twitter Profile Photo

Now out in Nature Communications! We developed EnsembleTR, an ensemble method to combine genotypes from 4 major tandem repeat callers and generated a genome-wide catalog of ~1.7 million TRs from 3550 samples in the 1000 Genomes and H3Africa cohorts. doi.org/10.1038/s41467…

Amirkeivan Mohtashami (@akmohtashami_a) 's Twitter Profile Photo

I am presenting Landmark Attention today at 17:15 at #NeurIPS2023. I will also present CoTFormer (arxiv.org/pdf/2310.10845) at the WANT workshop on Saturday. Excited to meet some of you at either.

Jeremy Howard (@jeremyphoward) 's Twitter Profile Photo

If you're a Python programmer looking to get started with CUDA, this weekend I'll be doing a free 1 hour tutorial on the absolute basics. Thanks to Andreas Köpf, Mark Saroufim, and dfsgdf for hosting this on the CUDA MODE server. :D Click here: discord.gg/6z79K5Yh?event…

Atli Kosson (@atlikosson) 's Twitter Profile Photo

Why does AdamW outperform Adam with L2-regularization?

Its effectiveness seems to stem from how it affects the angular update size of weight vectors!

This may also be the case for Weight Standardization, lr warmup and weight decay in general!
🧵 for arxiv.org/abs/2305.17212 1/10
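
For reference, the distinction the thread builds on, Adam with L2 regularization versus AdamW's decoupled weight decay, can be written out directly. An illustrative single-tensor sketch (not the paper's code; all inputs are torch tensors):

```python
import torch

def adam_l2_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """Adam + L2: the decay term is added to the gradient, so it gets rescaled by
    Adam's per-coordinate adaptive step size."""
    grad = grad + wd * p
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (v_hat.sqrt() + eps)
    return p, m, v

def adamw_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """AdamW: the decay is applied directly to the weights, decoupled from the
    adaptive scaling of the gradient."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * (m_hat / (v_hat.sqrt() + eps) + wd * p)
    return p, m, v
```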
Google AI (@googleai) 's Twitter Profile Photo

People often teach one another by simply explaining a problem using natural language. Today we introduce an approach for model training wherein a teacher #LLM generates natural language instructions to train a student model with improved privacy. goo.gle/3P7SrXx

Amirkeivan Mohtashami (@akmohtashami_a) 's Twitter Profile Photo

Skip connections are not enough! We show that providing the individual outputs of previous layers to each Transformer layer significantly boosts its performance. See the thread for more! Had an amazing time collaborating with Matteo Pagliardini, François Fleuret, and Martin Jaggi.
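
A hedged sketch of the idea (not the exact architecture from the paper): give each block a learned weighted combination of the outputs of all earlier blocks, rather than only the output of the block directly before it. The class name and the softmax-normalized mixing weights below are assumptions for illustration:

```python
import torch
import torch.nn as nn

class DenselyConnectedStack(nn.Module):
    """Stack of Transformer blocks where block k sees a learned mixture of the
    embedding output and the outputs of blocks 0..k-1, not just block k-1."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        # One learnable mixing weight per (block, earlier output) pair.
        self.mix = nn.ParameterList(
            [nn.Parameter(torch.zeros(i + 1)) for i in range(len(blocks))]
        )

    def forward(self, x):
        outputs = [x]  # embedding output, then each block's output
        for block, w in zip(self.blocks, self.mix):
            coeffs = torch.softmax(w, dim=0)
            mixed = sum(c * h for c, h in zip(coeffs, outputs))
            outputs.append(block(mixed))
        return outputs[-1]
```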

Saleh Ashkboos (@ashkboossaleh) 's Twitter Profile Photo

[1/7] Happy to release 🥕QuaRot, a post-training quantization scheme that enables 4-bit inference of LLMs by removing the outlier features. 
With Amirkeivan Mohtashami, @max_croci, Dan Alistarh, Torsten Hoefler 🇨🇭, James Hensman, and others

Paper: arxiv.org/abs/2404.00456
Code: github.com/spcl/QuaRot
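
To see why removing outlier features helps 4-bit inference, here is a small, hedged illustration (not the QuaRot implementation): rotating activations with an orthogonal matrix spreads outlier mass across coordinates, which shrinks the range a low-bit quantizer has to cover, and the rotation can be undone (or folded into adjacent weight matrices) because it is orthogonal.

```python
import torch

def random_orthogonal(d, seed=0):
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(d, d, generator=g))
    return q

def quantize_int4(x):
    """Simple symmetric per-tensor 4-bit quantization (illustrative only)."""
    scale = x.abs().max() / 7.0
    return torch.clamp((x / scale).round(), -8, 7) * scale

d = 256
x = torch.randn(1024, d)
x[:, 0] *= 50.0                       # inject an "outlier feature"
Q = random_orthogonal(d)
x_rot = x @ Q                         # rotate activations before quantizing

err_plain = (quantize_int4(x) - x).pow(2).mean()
err_rot = (quantize_int4(x_rot) @ Q.T - x).pow(2).mean()   # rotate back, compare in the original space
print(f"plain: {err_plain.item():.4f}  rotated: {err_rot.item():.4f}")  # rotated typically quantizes better
```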
Alex Hägele (@haeggee) 's Twitter Profile Photo

If you didn't see it yesterday: Mixture-of-Depths is a really nice idea for dynamic compute.
I decided to quickly code up an MoD block in a small GPT and try it out -- if you want to play with it too (and check correctness pls!), the code is here:
github.com/epfml/llm-base…
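
For anyone who wants the gist before opening the repo, here is a hedged sketch of the Mixture-of-Depths routing idea (not the code from the repository linked above; the class name, sigmoid gating, and 50% capacity default are assumptions):

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """A router scores every token; only the top-k tokens per sequence go through
    the block's computation, the rest pass through unchanged."""

    def __init__(self, block, d_model, capacity=0.5):
        super().__init__()
        self.block = block        # computes the residual update, e.g. attention + MLP without the skip
        self.router = nn.Linear(d_model, 1)
        self.capacity = capacity  # fraction of tokens processed by the block

    def forward(self, x):
        B, T, d = x.shape
        k = max(1, int(self.capacity * T))
        scores = self.router(x).squeeze(-1)            # (B, T) router logits
        top = scores.topk(k, dim=-1).indices           # (B, k) positions of tokens to process
        idx = top.unsqueeze(-1).expand(B, k, d)
        selected = torch.gather(x, 1, idx)             # (B, k, d)
        # Gate the update by the router score so routing stays differentiable,
        # then scatter the processed tokens back into place.
        gate = torch.sigmoid(torch.gather(scores, 1, top)).unsqueeze(-1)
        update = gate * self.block(selected)
        return x + torch.zeros_like(x).scatter(1, idx, update)
```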
Saleh Ashkboos (@ashkboossaleh) 's Twitter Profile Photo

We are working on quantizing #Llama3 using #QuaRot. I got some interesting results on WikiText PPL (FP16 model):
(Seq Len=2048) LLaMa2-7B: 5.47, LLaMa3-8B: 6.14
(Seq Len=4096) LLaMa2-7B: 5.11, LLaMa3-8B: 5.75
Maybe WikiText PPL is not a great metric to report anymore!
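
As a point of reference for the numbers above, WikiText perplexity for a causal LM is typically computed by chunking the test set into fixed-length windows, averaging the per-token negative log-likelihood, and exponentiating. A hedged sketch (assuming a Hugging Face-style model whose output exposes `.logits`):

```python
import math
import torch

def perplexity(model, token_ids, seq_len=2048, device="cuda"):
    """Perplexity of a causal LM over a 1-D tensor of token ids."""
    nll_sum, n_tokens = 0.0, 0
    for start in range(0, token_ids.numel() - 1, seq_len):
        chunk = token_ids[start:start + seq_len + 1].to(device)
        if chunk.numel() < 2:
            break
        with torch.no_grad():
            logits = model(chunk[:-1].unsqueeze(0)).logits[0]   # (T, vocab)
        nll_sum += torch.nn.functional.cross_entropy(
            logits, chunk[1:], reduction="sum"
        ).item()
        n_tokens += chunk.numel() - 1
    return math.exp(nll_sum / n_tokens)
```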