Xiaohan Wang (@xiaohanwang96)'s Twitter Profile
Xiaohan Wang

@xiaohanwang96

Postdoc @Stanford. Video Understanding, Multimodal Learning, and AI for Healthcare

ID: 794974781435244544

Website: https://wxh1996.github.io/
Joined: 05-11-2016 18:48:52

49 Tweets

227 Followers

304 Following

Wenhu Chen (@wenhuchen)'s Twitter Profile Photo

Another Microsoft paper revealing the size of GPT-4, GPT-o1 and Claude Sonnet. 
I'm not sure how trustworthy these numbers are, but they do make a lot of sense to me.
Source: arxiv.org/pdf/2412.19260
Yuhui Zhang (@zhang_yu_hui)'s Twitter Profile Photo

🔍 Vision language models are getting better - but how do we evaluate them reliably? Introducing AutoConverter: transforming open-ended VQA into challenging multiple-choice questions!

Key findings:

1️⃣ Current open-ended VQA eval methods are flawed: rule-based metrics correlate
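A minimal sketch of the idea behind AutoConverter as described above (turning an open-ended VQA pair into a multiple-choice question by generating distractors with an LLM). The client, model name, and prompt below are illustrative assumptions for the sketch, not the authors' implementation.

```python
# Illustrative only: one way to turn an open-ended VQA pair into a multiple-choice
# question by asking an LLM for plausible distractors. The OpenAI client, model
# name, and prompt are assumptions, not the AutoConverter code.
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def to_multiple_choice(question: str, correct_answer: str, n_distractors: int = 3) -> dict:
    """Build a multiple-choice item from an open-ended question/answer pair."""
    prompt = (
        f"Question: {question}\n"
        f"Correct answer: {correct_answer}\n"
        f"Write {n_distractors} plausible but incorrect answers, one per line."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    distractors = [l.strip("- ").strip() for l in lines if l.strip()][:n_distractors]
    options = distractors + [correct_answer]
    random.shuffle(options)
    return {"question": question, "options": options, "answer": correct_answer}
```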
Alejandro Lozano (@ale9806_)'s Twitter Profile Photo

Biomedical datasets are often confined to specific domains, missing valuable insights from adjacent fields. To bridge this gap, we present BIOMEDICA: an open-source framework to extract and serialize PMC-OA.

📄Paper: lnkd.in/dUUgA6rR 
🌐Website: lnkd.in/dnqZZW4M
Junyang Lin (@justinlin610)'s Twitter Profile Photo

Qwen2.5-VL! Qwen2.5-VL! Qwen2.5-VL! Try our new Qwen2.5-VL in Qwen Chat: chat.qwenlm.ai
Finally, after months, we release the new version of our vision language model, Qwen2.5-VL! This time, we focus on more essential problems. Notably, we highlight the importance of

Christopher Manning (@chrmanning)'s Twitter Profile Photo

Re: “Every major breakthrough in AI has been American”: America does itself no favors when it overestimates its specialness. Yes, the center of the AI industry is the US (California!), but many of the breakthroughs of (neural, gradient-based) AI happened elsewhere:
• LSTMs,

Orr Zohar @ ICLR’25 (@orr_zohar)'s Twitter Profile Photo

🚨🚨🚨SmolVLM2 is here - and it's a tiny titan! This nano-sized model crushes image and video perception👁️🧠, all while being small enough to run on your iPhone, bringing cutting-edge multimodal AI to every device📲. No more cloud dependence! Your data is yours! #MobileAI

Serena Yeung-Levy (@yeung_levy)'s Twitter Profile Photo

Just published in Science Advances, our work demonstrating the ability of AI and 3D computer vision to produce automated measurement of human interactions in video data from early child development research -- providing over 100x time savings compared to human annotation and

James Burgess (at ICLR 2025) (@jmhb0)'s Twitter Profile Photo

🚨 Large video-language models like LLaVA-Video can do single-video tasks. But can they compare videos? Imagine you’re learning a sports skill like kicking: can an AI tell how your kick differs from an expert video?
🚀 Introducing "Video Action Differencing" (VidDiff), ICLR 2025 🧵

Yuhui Zhang (@zhang_yu_hui)'s Twitter Profile Photo

Excited to announce that AutoConverter has been accepted to #CVPR2025 and VMCBench is now supported by both VLMEvalKit and lmms-eval! 🎉

Try our tools: 
▪️ AutoConverter demo: yuhui-zh15.github.io/AutoConverter-…
▪️ VMCBench: huggingface.co/datasets/suyc2… (supported by VLMEvalKit and lmms-eval)
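For readers who want to inspect VMCBench directly, here is a minimal sketch using the standard Hugging Face `datasets` API. The dataset ID is a placeholder, since the huggingface.co link above is truncated; the split and column names are not specified in the tweet either.

```python
# Minimal sketch, assuming the standard Hugging Face `datasets` API.
# The dataset ID is a placeholder: the huggingface.co link in the tweet is truncated,
# so substitute the real VMCBench repository name before running.
from datasets import load_dataset

VMCBENCH_ID = "<org>/<VMCBench-dataset-name>"  # placeholder, see the link above

dataset = load_dataset(VMCBENCH_ID)
for split_name, split in dataset.items():
    print(split_name, len(split), split.column_names)
```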
James Burgess (at ICLR 2025) (@jmhb0)'s Twitter Profile Photo

Introducing MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research #CVPR2025 

✅ 1k multimodal reasoning VQAs testing MLLMs for science
🧑‍🔬 Biology researchers manually created the questions 
🤖 RefineBot: a method for fixing QA language shortcuts
🧵
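The tweet names RefineBot as a method for fixing QA language shortcuts. A common generic diagnostic for such shortcuts, sketched below, is a "blind" text-only pass: if a model picks the right option without seeing the image, the question can likely be answered from language priors alone. This is not the RefineBot method itself, and `ask_text_llm` is a hypothetical callable.

```python
# Generic diagnostic for "language shortcuts" in multiple-choice VQA (not the
# RefineBot method): if a text-only model answers correctly without the image,
# the question is probably solvable from language priors alone.
# `ask_text_llm` is a hypothetical callable returning the chosen option letter.
from typing import Callable, Sequence

def has_language_shortcut(
    ask_text_llm: Callable[[str], str],
    question: str,
    options: Sequence[str],
    answer_index: int,
) -> bool:
    letters = "ABCDEFGH"
    prompt = (
        question
        + "\n"
        + "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
        + "\nAnswer with a single letter."
    )
    choice = ask_text_llm(prompt).strip().upper()[:1]
    return choice == letters[answer_index]
```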
Xiaohan Wang (@xiaohanwang96)'s Twitter Profile Photo

🚨 Excited to co-organize our #CVPR2025 workshop on "Multimodal Foundation Models for Biomedicine: Challenges and Opportunities" — where vision, language, and health intersect!
We’re bringing together experts from #CV, #NLP, and #healthcare to explore:
🧠 Technical challenges (e.g.
Orr Zohar @ ICLR’25 (@orr_zohar)'s Twitter Profile Photo

🤗The SmolVLM report is out, with all the experiments, findings, and insights that led to high performance at tiny sizes🤏. 
📱These models can run on most mobile/edge devices. 
📖Give it a look!
Orr Zohar @ ICLR’25 (@orr_zohar)'s Twitter Profile Photo

Excited to present Video-STaR at #ICLR2025’s poster session tomorrow!
🗓️ Visit me at Poster 91, 10:00 AM–12:30 PM
🚀 Dive into our work on advancing video reasoning using self-training:
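Video-STaR is described here as advancing video reasoning through self-training. The loop below is only a paraphrase of the generic self-training recipe (generate, verify against existing labels, keep, fine-tune); every callable is a placeholder, and this is not the Video-STaR implementation.

```python
# Generic self-training loop in the STaR spirit: the model labels videos itself,
# only outputs consistent with existing supervision are kept, and the model is
# fine-tuned on its own verified outputs. All callables are placeholders.
from typing import Any, Callable, Iterable, Tuple

def self_training_round(
    model: Any,
    labeled_videos: Iterable[Tuple[Any, str]],
    generate: Callable[[Any, Any], str],
    verify: Callable[[str, str], bool],
    finetune: Callable[[Any, list], Any],
) -> Any:
    kept = []
    for video, label in labeled_videos:
        candidate = generate(model, video)   # model proposes an answer / rationale
        if verify(candidate, label):         # keep only outputs consistent with the label
            kept.append((video, candidate))
    return finetune(model, kept)             # train on the verified self-generated data
```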
Yuhui Zhang (@zhang_yu_hui)'s Twitter Profile Photo

📢 The First Workshop on Multimodal Foundation Models for Biomedicine (MMFM-BIOMED) at #CVPR2025 is still accepting submissions until May 7, 11:59 PM PT! 
Join speakers from Stanford, Google, MIT & more exploring the intersection of #CV, #NLP & #healthcare.
Submit your 4-page
Benjamin Feuer (@feuerbenjamin)'s Twitter Profile Photo

So excited to announce the DCVLR (Data Curation for Vision-Language Reasoning) competition at NeurIPS 2025, led by Oumi and sponsored by Lambda!
🌟 open-data 🌟
🤖 open-models 🤖
💻 open-source 💻
💪 anyone can compete for free 💪
dcvlr-neurips.github.io
🧵 1 / n

Xiaohan Wang (@xiaohanwang96)'s Twitter Profile Photo

🧠 How can we truly test long-context video understanding in video-LMMs?
⏱️ TimeScope benchmarks models from 1 min to 8 hours using “needle-in-a-haystack” probes.
🚀 Gemini 2.5-Pro leads the pack—but even it struggles as context length grows. Long-range memory is still a
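The tweet is cut off, but the “needle-in-a-haystack” protocol it names can be sketched generically: hide a short “needle” clip at a random point inside a long distractor video, then score whether the model can answer a question that depends only on the needle. Everything below (the frame lists and `ask_video_llm`) is a hypothetical placeholder, not the TimeScope code.

```python
# Conceptual sketch of a needle-in-a-haystack probe for long-video evaluation:
# splice a short "needle" clip into a long distractor video, then check whether
# the model answers a question that depends only on the needle.
# `ask_video_llm` and the frame lists are hypothetical placeholders.
import random
from typing import Callable, List, Tuple

def build_probe(haystack: List, needle: List) -> Tuple[List, int]:
    """Insert the needle clip at a random position; return the probe and the index."""
    pos = random.randint(0, len(haystack))
    return haystack[:pos] + needle + haystack[pos:], pos

def run_probe(
    ask_video_llm: Callable[[List, str], str],
    haystack: List,
    needle: List,
    question: str,
    expected: str,
) -> dict:
    probe, pos = build_probe(haystack, needle)
    answer = ask_video_llm(probe, question)
    return {"needle_position": pos, "correct": expected.lower() in answer.lower()}
```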