Simone Scardapane (@s_scardapane)'s Twitter Profile
Simone Scardapane

@s_scardapane

I fall in love with a new #machinelearning topic every month 🙄
Tenure-track Asst. Prof. @SapienzaRoma | Previously @iaml_it @SmarterPodcast | @GoogleDevExpert

ID:1235205731747540993

https://www.sscardapane.it/ · Joined 04-03-2020 14:09:51

1.4K Tweets

8.2K Followers

672 Following

*A Primer on the Inner Workings of Transformer LMs*
by Javier Ferrando, Gabriele Sarti, Arianna Bisazza, and Marta R. Costa-jussa

I was waiting for this! Cool comprehensive survey on interpretability methods for LLMs, with a focus on recent techniques (e.g., logit lens).

arxiv.org/abs/2405.00208
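
As a rough illustration of the logit-lens technique the survey covers: project each intermediate hidden state through the final layer norm and the unembedding matrix and read off the predicted token. A minimal sketch, assuming GPT-2 via Hugging Face transformers (prompt and model choice are arbitrary):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Logit lens: map each layer's residual stream at the last position through the
# final LayerNorm and the unembedding matrix, then look at the top token.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    print(layer, tok.decode(logits.argmax(-1)))
```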


*Kolmogorov-Arnold Networks (KANs)* by Ziming Liu et al.

Since everyone is talking about KANs, I wrote some notes on Notion with a few research questions I find interesting.

First time I've done something like this; give me some feedback. 🙃

sscardapane.notion.site/Kolmogorov-Arn…
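
For readers who want the gist of a KAN layer (learnable univariate functions on the edges, summed at the nodes), here is a toy sketch using Gaussian RBF basis functions; it only illustrates the idea and is not the authors' spline-based implementation:

```python
import torch
import torch.nn as nn

class ToyKANLayer(nn.Module):
    def __init__(self, in_dim, out_dim, n_basis=8):
        super().__init__()
        # Fixed RBF grid shared by all edges; each edge learns its own mixing
        # coefficients, i.e., its own univariate activation function.
        self.register_buffer("centers", torch.linspace(-2, 2, n_basis))
        self.coef = nn.Parameter(0.1 * torch.randn(in_dim, out_dim, n_basis))

    def forward(self, x):                                           # x: (batch, in_dim)
        basis = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)   # (batch, in_dim, n_basis)
        return torch.einsum("bik,iok->bo", basis, self.coef)        # sum edge functions per node

layer = ToyKANLayer(4, 3)
print(layer(torch.randn(5, 4)).shape)   # torch.Size([5, 3])
```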


*Decomposing and Editing Predictions by Modeling Model Computation*
by Harshay Shah, Aleksander Madry, and Andrew Ilyas

Learning to predict the effect of ablating a model component (e.g., an attention head) is helpful both for understanding model behavior and for editing it.

arxiv.org/abs/2404.11534
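
The quantity being modeled is the effect of a single ablation on the output. A hedged sketch of that measurement (zeroing one GPT-2 attention head via a hook and checking the change in a log-probability); the paper then learns to *predict* such effects, which this snippet does not do:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
layer, head, d_head = 5, 3, 64   # arbitrary head; GPT-2 small has 12 heads of size 64

def ablate_head(module, args):
    x = args[0].clone()
    x[..., head * d_head:(head + 1) * d_head] = 0.0   # zero this head's contribution
    return (x,) + args[1:]

inputs = tok("The Eiffel Tower is in", return_tensors="pt")
with torch.no_grad():
    base = model(**inputs).logits[0, -1].log_softmax(-1)
    hook = model.transformer.h[layer].attn.c_proj.register_forward_pre_hook(ablate_head)
    ablated = model(**inputs).logits[0, -1].log_softmax(-1)
    hook.remove()

paris = tok(" Paris")["input_ids"][0]
print("Δ log-prob of ' Paris':", (ablated[paris] - base[paris]).item())
```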


*REPAIR: REnormalizing Permuted Activations for Interpolation Repair*
by Keller Jordan, Hanie Sedghi, Olga Saukh, Rahim Entezari, and Behnam Neyshabur

Correcting the statistics of a layer significantly improves model fusion based on permutations of units.

arxiv.org/abs/2211.08403
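
The core correction is simple to sketch: rescale each unit of the interpolated model so that its activation statistics match the interpolation of the endpoint models' statistics. A minimal, self-contained illustration on stand-in activation tensors (the paper folds the correction into BatchNorm parameters):

```python
import torch

def repair(acts_merged, acts_a, acts_b, alpha=0.5, eps=1e-5):
    """acts_*: (batch, units) activations of the same layer on the same batch."""
    target_mean = (1 - alpha) * acts_a.mean(0) + alpha * acts_b.mean(0)
    target_std = (1 - alpha) * acts_a.std(0) + alpha * acts_b.std(0)
    normed = (acts_merged - acts_merged.mean(0)) / (acts_merged.std(0) + eps)
    return normed * target_std + target_mean     # renormalized activations

a, b = torch.randn(256, 64), torch.randn(256, 64) + 1.0
merged = 0.5 * (a + b)                           # stand-in for the merged model's activations
fixed = repair(merged, a, b)
print(fixed.mean(0)[:3], fixed.std(0)[:3])       # now matches the interpolated statistics
```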


*A Multimodal Automated Interpretability Agent*
by Tamar Rott Shaham, Sarah Schwettmann, Franklin Wang, Achyuta Rajaram, Evan Hernandez, and Jacob Andreas

An experiment in using a multimodal VLM to generate hypotheses that explain the behavior of a given neuron.

arxiv.org/abs/2404.14394


*Patchscopes: Inspecting Hidden Representations of LLMs*
by Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, and Mor Geva

A framework for explaining LLMs via 'patching', where a separate LLM is used to translate the internal embeddings into an explanation.

arxiv.org/abs/2401.06102
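
Roughly, a patchscope takes a hidden state from a source forward pass and injects it into a different prompt's forward pass, letting the model verbalize what the representation encodes. A sketch with GPT-2 and made-up prompts, layers, and positions (not the paper's setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
src_layer, tgt_layer = 6, 2

# 1) Source pass: hidden state of the last token of the source prompt.
src = tok("Diana, Princess of Wales", return_tensors="pt")
with torch.no_grad():
    src_hidden = model(**src, output_hidden_states=True).hidden_states[src_layer][0, -1]

# 2) Target pass: overwrite the last position of a generic "inspection" prompt.
tgt = tok("Syria: country in the Middle East. X", return_tensors="pt")
pos = tgt["input_ids"].shape[1] - 1

def patch(module, args, output):
    if output[0].shape[1] > pos:          # only patch the prefill pass
        output[0][:, pos] = src_hidden
    return output

hook = model.transformer.h[tgt_layer].register_forward_hook(patch)
with torch.no_grad():
    out = model.generate(**tgt, max_new_tokens=10, do_sample=False)
hook.remove()
print(tok.decode(out[0, tgt["input_ids"].shape[1]:]))
```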


*An Embarrassingly Simple Approach for LLM with Strong ASR Capacity*
by Ziyang Ma et al.

You can get SOTA performance on speech recognition by (essentially) asking a pre-trained LLM to transcribe the audio.

arxiv.org/abs/2402.08846


*Transformer Feed-Forward Layers Are Key-Value Memories*
by Mor Geva, Roei Schuster, and Omer Levy

A lesser-known explainability result: the rows and columns of the MLPs in transformers can be visualized to reveal human-understandable patterns.

aclanthology.org/2021.emnlp-mai…
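
The reading is easy to reproduce: treat the output weights of each FFN hidden unit as a "value vector" and project it onto the vocabulary to see which tokens it promotes. A short sketch on GPT-2 (layer and unit picked arbitrarily):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

layer, unit = 10, 42
# GPT-2 stores the second MLP matrix as a Conv1D with weight (n_inner, n_embd),
# so each row is the value vector written by one FFN unit.
value_vec = model.transformer.h[layer].mlp.c_proj.weight[unit]

with torch.no_grad():
    scores = model.lm_head(value_vec)        # project onto the vocabulary
print([tok.decode(t) for t in scores.topk(10).indices])
```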


*RHO-1: Not All Tokens Are What You Need*
by Zhibin Gou, Weizhu Chen, et al.

A small training phase on curated data helps separate useful tokens from harmful ones, so that language-model training can focus on the former.

arxiv.org/abs/2404.07965
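
A hedged sketch of the selective language-modeling loss this points at: score each token by its excess loss relative to a reference model trained on curated data, then backpropagate only through the top-scoring fraction (the tensors below are random stand-ins and the keep ratio is arbitrary):

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(logits, ref_logits, targets, keep_ratio=0.6):
    # Per-token cross-entropy under the trained model and under the reference model.
    loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten(), reduction="none")
    ref = F.cross_entropy(ref_logits.flatten(0, 1), targets.flatten(), reduction="none")
    excess = loss - ref.detach()                 # how much worse than the reference
    k = max(1, int(keep_ratio * excess.numel()))
    keep = torch.zeros_like(loss)
    keep[excess.topk(k).indices] = 1.0           # train only on the selected tokens
    return (loss * keep).sum() / k

logits = torch.randn(2, 8, 100, requires_grad=True)
ref_logits = torch.randn(2, 8, 100)
targets = torch.randint(0, 100, (2, 8))
print(selective_lm_loss(logits, ref_logits, targets))
```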


*DiPaCo: Distributed Path Composition*
by Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Ionel Gog, and Marc'Aurelio Ranzato

MoE-like models may be fundamental for transcontinental training of large models, by sharding data *and model paths* across locations.

arxiv.org/abs/2403.10616


Need some arctic in your life?

We have open PhD/postdoc positions on relational graph and temporal ML for energy analytics! 🔥

Top-tier research w/ competitive salaries, hosted at the beautiful UiT in Norway & supervised by Filippo Maria Bianchi.

All details here -> en.uit.no/project/relay


*Mixture-of-Depths: Dynamically allocating compute in transformer-based LMs*
by Sam Ritter, Blake Richards, and Adam Santoro

A variant of MoE with a single expert per block, which can be either skipped or applied per token, up to a given capacity.

arxiv.org/abs/2404.02258
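
A toy version of the routing idea, to make it concrete: a router scores the tokens, only the top-capacity fraction is processed by the block, and the rest flow through the residual stream untouched. The block here is a plain MLP stand-in rather than a full transformer block, so this is only an illustration:

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    def __init__(self, dim, capacity=0.25):
        super().__init__()
        self.router = nn.Linear(dim, 1)
        self.block = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.capacity = capacity

    def forward(self, x):                        # x: (batch, seq, dim)
        scores = self.router(x).squeeze(-1)      # (batch, seq)
        k = max(1, int(self.capacity * x.shape[1]))
        idx = scores.topk(k, dim=1).indices      # tokens routed through the block
        picked = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.shape[-1]))
        # Scale the update by the router score to keep routing differentiable.
        gate = torch.sigmoid(torch.gather(scores, 1, idx)).unsqueeze(-1)
        update = self.block(picked) * gate
        return x.scatter_add(1, idx.unsqueeze(-1).expand_as(update), update)

x = torch.randn(2, 16, 32)
print(MoDBlock(32)(x).shape)   # torch.Size([2, 16, 32])
```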


*Equivariant Adaptation of Large Pretrained Models*
by Arnab Mondal, Siba Smarak Panigrahi, Oumar Kaba, and Sai Rajeswar

A technique to make pre-trained models equivariant to symmetries by combining them with a learned 'canonicalization' network.

arxiv.org/abs/2310.01647
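
The mechanism is easy to sketch: a small canonicalization network maps each input to a canonical pose, and only then is the frozen pretrained model applied. Below is a rough 2-D point-cloud illustration with a plain (non-equivariant) canonicalizer, so it only conveys the structure, not the paper's architecture:

```python
import torch
import torch.nn as nn

class Canonicalizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, x):                         # x: (batch, n_points, 2)
        v = self.net(x).mean(1)                   # predicted 2-D orientation per cloud
        v = v / (v.norm(dim=-1, keepdim=True) + 1e-8)
        cos, sin = v[:, 0], v[:, 1]
        # Rotation aligning the predicted orientation with the x-axis.
        R = torch.stack([torch.stack([cos, sin], -1),
                         torch.stack([-sin, cos], -1)], -2)
        return x @ R.transpose(-1, -2)

frozen_model = nn.Sequential(nn.Flatten(), nn.Linear(2 * 16, 10))  # stand-in, kept frozen
canon = Canonicalizer()                                            # only this part is trained
x = torch.randn(4, 16, 2)
print(frozen_model(canon(x)).shape)    # torch.Size([4, 10])
```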


*Variational Learning is Effective for Large Deep Networks*
by Nico Daheim, Emtiyaz Khan, Gian Maria Marconi, Peter Nickl, Rio Yokota, and Thomas Möllenhoff

A variant of Adam provides a scalable algorithm to train networks via variational inference.

arxiv.org/abs/2402.17641
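
For context, the objective being optimized is the usual mean-field variational one; the paper's contribution (IVON) is an Adam-like optimizer that makes it scale. A generic reparameterization-based sketch of that objective on a toy regression problem, not the paper's update rule:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

w_mu = nn.Parameter(torch.zeros(10))
w_rho = nn.Parameter(torch.full((10,), -3.0))       # softplus(rho) = posterior std
opt = torch.optim.Adam([w_mu, w_rho], lr=1e-2)

x = torch.randn(128, 10)
y = x @ torch.randn(10) + 0.1 * torch.randn(128)    # synthetic regression data

for step in range(200):
    std = F.softplus(w_rho)
    w = w_mu + std * torch.randn_like(std)          # reparameterized weight sample
    nll = ((x @ w - y) ** 2).mean()
    kl = 0.5 * (w_mu ** 2 + std ** 2 - 2 * torch.log(std) - 1).sum() / len(x)
    loss = nll + kl                                  # negative ELBO (up to constants)
    opt.zero_grad(); loss.backward(); opt.step()

print("posterior std (first 3 weights):", F.softplus(w_rho)[:3].data)
```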


*Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference*
by Piotr Nawrot, Adrian Lancucki, and Edoardo Ponti

A dynamic KV cache for LLM generation that can be trained to satisfy a given memory budget.

arxiv.org/abs/2403.09636
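
The cache mechanics can be sketched in a few lines: at each decoding step a decision either appends the new key/value pair or merges it into the last cache slot, keeping the cache under a budget. In the paper this decision is learned end-to-end; here it is random and the merge is a plain average, purely for illustration:

```python
import torch

def update_cache(keys, values, new_k, new_v, merge: bool):
    if merge and keys:
        keys[-1] = 0.5 * (keys[-1] + new_k)      # averaging stands in for the
        values[-1] = 0.5 * (values[-1] + new_v)  # learned accumulation
    else:
        keys.append(new_k)
        values.append(new_v)
    return keys, values

keys, values = [], []
for t in range(10):
    k, v = torch.randn(64), torch.randn(64)
    keys, values = update_cache(keys, values, k, v, merge=bool(torch.rand(()) < 0.5))
print("cache length after 10 steps:", len(keys))
```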


*Vision Transformer (ViT) Prisma Library*
by Sonia Joseph, Praneet Suresh, and Yash Vadi

A simple library for basic 'mechanistic interpretability' visualizations, such as the logit lens, on vision models (ViTs, CLIP).

github.com/soniajoseph/Vi…
