🔍 New paper: How do vision-language models actually align visual and language representations?
We used sparse autoencoders to peek inside VLMs and found something surprising about when and where cross-modal alignment happens!
Presented at XAI4CV Workshop @ CVPR
🧵 (1/6)