 
                                Michal Golovanevsky
@michalgolov
CS PhD student @BrownCSDept | Multimodal Learning | Mechanistic Interpretability | Clinical Deep Learning.
ID: 1573399875278049280
https://github.com/michalg04 23-09-2022 19:52:33
22 Tweets
32 Followers
42 Following
 
The finding that important attention heads implement one of a small set of interpretable functions boosts transparency and trust in VLMs. Michal Golovanevsky Vedant Palit #nlp #mechinterp Paper: export.arxiv.org/pdf/2406.16320 GitHub: github.com/wrudman/NOTICE… [5/5]
 
How do VLMs balance visual information presented in-context with linguistic priors encoded in-weights? In this project, Michal Golovanevsky and William Rudman find out! My favorite result: you can find a vector that shifts attention to image tokens and changes the VLM's response!
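
The tweet above describes this intervention vector but not how it is applied; below is a minimal sketch, assuming the standard steering-vector setup: a fixed direction added to a layer's hidden states at inference time via a forward hook so the model leans more on image tokens. The toy layer, the scale `alpha`, and the way `steer` is obtained are placeholders, not the authors' released implementation.

```python
# Minimal steering-vector sketch (PyTorch). A toy linear layer stands in for
# one transformer block of a VLM; in practice the hook would be registered on
# a real language-model layer and `steer` would be a direction found to shift
# attention mass toward image tokens.
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(16, 16)            # placeholder for one VLM block
hidden = torch.randn(1, 8, 16)       # (batch, tokens, d_model)

steer = torch.randn(16)              # hypothetical steering direction
steer = steer / steer.norm()

def add_steering(module, inputs, output, alpha=4.0):
    # Add the scaled steering vector to every token's hidden state.
    return output + alpha * steer

baseline = layer(hidden)
handle = layer.register_forward_hook(add_steering)
steered = layer(hidden)              # same input, intervened forward pass
handle.remove()

print("mean |delta hidden|:", (steered - baseline).abs().mean().item())
```

In a real VLM the comparison of interest would be the generated answer with and without the hook, not the raw hidden-state difference printed here.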
 
![William Rudman (@williamrudmanjr) on Twitter photo By visualizing cross-attention patterns, we've discovered that these universal heads fall into three functional categories: implicit image segmentation, object inhibition, and outlier inhibition [4/5]](https://pbs.twimg.com/media/GRbNUQBaYAAf7RL.jpg)
![William Rudman (@williamrudmanjr) on Twitter photo How do VLMs like BLIP and LLaVA differ in how they process visual information? Using our mech-interp pipeline for VLMs, NOTICE, we first show important cross-attention heads in BLIP can perform image grounding, whereas important self-attention heads in LLaVA do not. [1/5]](https://pbs.twimg.com/media/Gamh0H3WEAA-Tw0.jpg)
![William Rudman (@williamrudmanjr) on Twitter photo Instead, LLaVA relies on self-attention heads to manage “outlier” attention patterns in the image, focusing on regulating these outliers. Interestingly, some of BLIP's attention heads are also dedicated to reducing attention to outlier features. [2/5]](https://pbs.twimg.com/media/Gamh981X0AA_Vl_.jpg)
![William Rudman (@williamrudmanjr) on Twitter photo NOTICE uses Symmetric Token Replacement for text corruption and Semantic Image Pairs (SIP) for image corruption. SIP replaces clean images with ones differing in a single semantic property, such as object or emotion, enabling meaningful causal mediation analysis of VLMs. [3/5]](https://pbs.twimg.com/media/GamiFHGXwAAYrUk.jpg)
![William Rudman (@williamrudmanjr) on Twitter photo We extend the generalizability of NOTICE by using Stable-Diffusion to generate semantic image pairs and find results are nearly identical to curated semantic image pairs. [4/5]](https://pbs.twimg.com/media/GamiMKEX0AA_5AR.jpg)
![William Rudman (@williamrudmanjr) on Twitter photo When vision-language models answer questions, are they truly analyzing the image or relying on memorized facts? We introduce Pixels vs. Priors (PvP), a method to control whether VLMs respond based on input pixels or world knowledge priors. [1/5]](https://pbs.twimg.com/media/Gsh4zUCXkAATdnQ.jpg)
![William Rudman (@williamrudmanjr) on Twitter photo Models rely on memorized priors early in their processing but shift toward visual evidence in mid-to-late layers. This shows a competition between visual input and stored knowledge, with pixels often overriding priors at the final prediction. [3/5]](https://pbs.twimg.com/media/Gsh5z7cW4AAgqSW.png)
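
The embedded NOTICE thread names its two corruption schemes (Symmetric Token Replacement for text, Semantic Image Pairs for images) but not the patching loop itself. Below is a minimal sketch assuming the standard activation-patching form of causal mediation analysis: cache a head's activation on the clean input, rerun on the SIP-corrupted input with that activation patched in, and score how much of the clean answer is restored. The toy model, `answer_id`, and input tensors are placeholders, not the paper's architecture or data.

```python
# Activation-patching sketch (PyTorch): indirect effect of one "head" on the
# clean-answer logit, using a clean / corrupted input pair.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyModel(nn.Module):
    def __init__(self, d=16, vocab=10):
        super().__init__()
        self.head = nn.Linear(d, d)   # stand-in for one attention head's output
        self.out = nn.Linear(d, vocab)

    def forward(self, x):
        return self.out(self.head(x))

model = ToyModel()
clean = torch.randn(1, 16)     # features for the clean image + question
corrupt = torch.randn(1, 16)   # features for the SIP-corrupted counterpart
answer_id = 3                  # token id of the clean answer (placeholder)

cache = {}

def save_clean(module, inputs, output):
    cache["clean"] = output.detach()   # 1) cache the head's clean activation

def patch_clean(module, inputs, output):
    return cache["clean"]              # 3) overwrite with the clean activation

h = model.head.register_forward_hook(save_clean)
clean_logit = model(clean)[0, answer_id].item()
h.remove()

corrupt_logit = model(corrupt)[0, answer_id].item()   # 2) corrupted baseline

h = model.head.register_forward_hook(patch_clean)
patched_logit = model(corrupt)[0, answer_id].item()
h.remove()

# Fraction of the clean-vs-corrupt gap this head restores when patched.
effect = (patched_logit - corrupt_logit) / (clean_logit - corrupt_logit + 1e-8)
print(f"patching effect for this head: {effect:.2f}")
```

Heads whose patched effect is consistently large across examples would correspond to the "important" heads the thread refers to; the exact scoring used in the paper may differ.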