Tony Wu
@tonywu_71
Using my TFLOPS for RAG in Vision Space | ColPali co-first author 📝 | @centralesupelec 🇫🇷 x @Cambridge_Uni 🇬🇧 | @illuintech 🧑🏻💻
ID: 1492547583830634510
https://tonywu71.notion.site/Hi-I-m-Tony-e937d2baf5ab4669904b04fd24513499?pvs=74 12-02-2022 17:15:02
231 Tweets
1.1K Followers
270 Following
Twitter! I can't believe no one told me that people are using vision encoders to retrieve document information these days. Based on a tip from Nadav Timor I read the "ColPali: Efficient Document Retrieval with Vision Language Models" paper, and it is very cool.
I vibe coded a visual PDF search app with ColQwen2. This is how it works: - Store PDF pages as images in a Weaviate vector database - Embed images and text with a multimodal late-interaction model (ColQwen2) - Generate token-wise (and summed) similarity maps to highlight the most relevant regions
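The retrieval step above boils down to late-interaction (MaxSim) scoring: every query token is matched against every image-patch embedding, and the similarity maps come from the same token-wise matrix. Here is a minimal NumPy sketch of that scoring; the function names, shapes, and the assumption of L2-normalized embeddings are mine, not the app's actual code.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) score, ColPali-style.

    query_emb: (n_query_tokens, dim), doc_emb: (n_doc_patches, dim),
    both assumed L2-normalized per token.
    """
    # Token-wise similarity matrix: (n_query_tokens, n_doc_patches)
    sim = query_emb @ doc_emb.T
    # Each query token keeps its best-matching patch; sum over query tokens.
    return float(sim.max(axis=1).sum())

def similarity_map(query_emb: np.ndarray, patch_emb: np.ndarray,
                   grid_hw: tuple) -> np.ndarray:
    """Per-patch relevance: max similarity over query tokens,
    reshaped to the image's patch grid (h, w) for heatmap overlay."""
    sim = query_emb @ patch_emb.T          # (n_query_tokens, n_patches)
    return sim.max(axis=0).reshape(grid_hw)
```

Ranking all stored pages by `maxsim_score` and overlaying `similarity_map` on the page image gives exactly the "retrieve, then highlight" behavior described above.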
Wanna upgrade your agent game? With AI at Meta , we're releasing 2 incredibly cool artefacts: - GAIA 2: assistant evaluation with a twist (new: adaptability, robustness to failure & time sensitivity) - ARE, an agent research environment to empower all! huggingface.co/blog/gaia2
MetaEmbed is a cool new paper by Zilin Xiao in which extra writeable "memory tokens" are appended at the end of the ColPali tokens, and only those are stored and used for Late Interaction. This reduces the memory footprint yet retains rich, granular query/doc interaction that scales well
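The memory-token trick above can be sketched in a few lines: since the memory tokens sit at the end of the input sequence, indexing only needs the trailing slice of the encoder output, which shrinks storage from all token embeddings to a handful. This is a toy NumPy illustration under my own assumptions (sizes, names, random vectors standing in for real encoder outputs), not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_MEM = 128, 4  # hypothetical embedding dim and memory-token count

def l2norm(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def compress_for_index(encoder_output: np.ndarray, n_mem: int) -> np.ndarray:
    """MetaEmbed-style compression: the input sequence had n_mem learnable
    memory tokens appended, so only their output embeddings are stored."""
    return encoder_output[-n_mem:]

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late interaction over the compressed representation."""
    return float((query_emb @ doc_emb.T).max(axis=1).sum())
```

With, say, ~1000 patch tokens per page, storing 4 memory vectors instead of the full sequence is a ~250x reduction in index size, while queries still interact token-by-token with the stored vectors.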
alrighty, publicly sharing my slide deck for multimodal AI, covering ⤵️ > trends & uses > cool open-source models > tools to customize/deploy multimodal models > further resources all models in this presentation are on Hugging Face, easy load with 2 LoC!