Pooyan Rahmanzadehgervi (@pooyanrg) Twitter Tweets • TwiCopy

Michael Nielsen

@michael_nielsen

2 years ago

Fun: very simple problems that visual models like GPT-4o (currently) fail on: vlmsareblind.github.io

👋 Excited to share our BlindTest benchmark, a suite of 7 **ridiculously simple**, low-level visual tasks that 4 SOTA vision-language models (GPT-4o, Claude 3/3.5 Sonnet, Gemini 1.5 Pro) can't perform very well ‼️ Paper, code & data: vlmsareblind.github.io

Anh Nguyen (Totti)

@anh_ng8

2 years ago

Are ‘visual’ AI models actually blind? techcrunch.com/2024/07/11/are… via @techcrunch

Ethan Mollick

@emollick

2 years ago

There are a lot of papers suggesting that GPT-4V is not good enough for many tasks, like reading charts or offering diagnoses or reading maps. But it does well navigating smart phone UIs & doing OCR, high-value near-term use cases. The real question is how much vision improves.

Gary Marcus

@garymarcus

2 years ago

In 1969, Minsky & Papert showed that two-layered neural nets couldn’t learn to determine whether a figure was connected or not. Fast forward to 2024, with vastly more data, layers & compute, and neural nets (as we now know how to make them) still fail at some really basic stuff:

Anh Nguyen (Totti)

@anh_ng8

a year ago

BlindTest (vlmsareblind.github.io) not so easy for O1 🍓

Aritra R G

@arig23498

a year ago

Counting the number of intersections in this simple image turns out to be surprisingly challenging for Vision-Language Models (VLMs). This task was introduced in the 'VLMs are blind' paper, and I discovered it through an insightful blog post by Lucas Beyer (bl16). [1/N]

Anh Nguyen (Totti)

@anh_ng8

9 months ago

2 weeks ago OpenAI reported o3's huge improvement on VLMsAreBlind (from 50.4-> 90.1%). 🌟openai.com/index/thinking… ➡️ My take: o3 vision may not be better than o1, but o3 knows how to use tools/code well to aid solving many visual tasks! This is a great progress by itself! BUT...

Anh Nguyen (Totti)

@anh_ng8

7 months ago

How do best AI image editors 🤖 GPT-4o, Gemini 2.0, SeedEdit, HF 🤗 fare ⚔️ human Photoshop wizards 🧙‍♀️ on text-based 🏞️ image editing? Logan Bolton and Brandon Collins shared some answers at our poster today! #CVPR2025 psrdataset.github.io A few insights 👇

Anh Nguyen (Totti)

@anh_ng8

7 months ago

Pooyan Rahmanzadehgervi presenting our Transformer Attention Bottleneck paper at #CVPR2025 💡 We **simplify** MHSA (e.g. 12 heads -> 1 head) to create an attention **bottleneck** where users can debug Vision Language Models by editing the bottleneck and observing the expected VLM text outputs.

Hokin Deng

@denghokin

6 months ago

#embodied All forms of biological intelligence are grounded movements🏃‍♂️ muscles & motor neurons 🧠 emerge before visual cortex & rods & cones in eyes 👁️ Building monocular better-than-mocap-studio #video2motion is our critical step towards human embodied intelligence.

Pooyan Rahmanzadehgervi

@pooyanrg

2 months ago

At this moment, ICLR rebuttal vibe be like 🤣
