Pooyan Rahmanzadehgervi (@pooyanrg) 's Twitter Profile
Pooyan Rahmanzadehgervi

@pooyanrg

CS PhD student @AuburnU

ID: 1732457691434360832

Joined: 06-12-2023 17:51:38

14 Tweets

25 Followers

82 Following

Anh Nguyen (Totti) (@anh_ng8) 's Twitter Profile Photo

👋 Excited to share our BlindTest benchmark, a suite of 7 **ridiculously simple**, low-level visual tasks that 4 SOTA vision-language models (GPT-4o, Claude 3/3.5 Sonnet, Gemini 1.5 Pro) can't perform very well ‼️

Paper, code & data: vlmsareblind.github.io

Ethan Mollick (@emollick) 's Twitter Profile Photo

There are a lot of papers suggesting that GPT-4V is not good enough for many tasks, like reading charts or offering diagnoses or reading maps.

But it does well navigating smart phone UIs & doing OCR, high-value near-term use cases.

The real question is how much vision improves.
Gary Marcus (@garymarcus) 's Twitter Profile Photo

In 1969, Minsky & Papert showed that two-layered neural nets couldn’t learn to determine whether a figure was connected or not.

Fast forward to 2024, with vastly more data, layers & compute, and neural nets (as we now know how to make them) still fail at some really basic stuff:
Aritra R G (@arig23498) 's Twitter Profile Photo

Counting the number of intersections in this simple image turns out to be surprisingly challenging for Vision-Language Models (VLMs).

This task was introduced in the 'VLMs are blind' paper, and I discovered it through an insightful blog post by Lucas Beyer (bl16).

[1/N]
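The intersection-counting task is trivial to solve programmatically, which is part of what makes the VLM failures striking. Below is a minimal, hypothetical Python sketch (the function names and example coordinates are illustrative, not taken from the BlindTest code) that counts crossings between two polylines by testing every pair of segments:

```python
# Illustrative sketch only; not the BlindTest evaluation code.

def ccw(a, b, c):
    """True if points a, b, c are in counter-clockwise order."""
    return (c[1] - a[1]) * (b[0] - a[0]) > (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(p1, p2, q1, q2):
    """True if segment p1-p2 properly crosses segment q1-q2."""
    return ccw(p1, q1, q2) != ccw(p2, q1, q2) and ccw(p1, p2, q1) != ccw(p1, p2, q2)

def count_intersections(poly_a, poly_b):
    """Count segment crossings between two polylines given as point lists."""
    return sum(
        segments_cross(poly_a[i], poly_a[i + 1], poly_b[j], poly_b[j + 1])
        for i in range(len(poly_a) - 1)
        for j in range(len(poly_b) - 1)
    )

if __name__ == "__main__":
    # Made-up example: a bent red line and a horizontal blue line that cross twice.
    red = [(0, 0), (2, 2), (4, 0)]
    blue = [(0, 1), (4, 1)]
    print(count_intersections(red, blue))  # -> 2
```

A few lines of geometry suffice here, yet the same question posed as an image stumps models that do well on far harder benchmarks.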
Anh Nguyen (Totti) (@anh_ng8) 's Twitter Profile Photo

2 weeks ago OpenAI reported o3's huge improvement on VLMsAreBlind (from 50.4% -> 90.1%). 🌟 openai.com/index/thinking…

➡️ My take: o3's vision may not be better than o1's, but o3 knows how to use tools/code well to help solve many visual tasks!
This is great progress by itself!
BUT...
Anh Nguyen (Totti) (@anh_ng8) 's Twitter Profile Photo

How do the best AI image editors 🤖 GPT-4o, Gemini 2.0, SeedEdit, HF 🤗 fare ⚔️ human Photoshop wizards 🧙‍♀️ on text-based 🏞️ image editing?

Logan Bolton and Brandon Collins shared some answers at our poster today! #CVPR2025 psrdataset.github.io

A few insights 👇

Anh Nguyen (Totti) (@anh_ng8) 's Twitter Profile Photo

Pooyan Rahmanzadehgervi presenting our Transformer Attention Bottleneck paper at #CVPR2026 💡

We **simplify** MHSA (e.g. 12 heads -> 1 head) to create an attention **bottleneck** where users can debug Vision Language Models by editing the bottleneck and observing the expected VLM text outputs.
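For readers curious what such a bottleneck might look like in code, here is a minimal, hypothetical PyTorch sketch of a single-head attention layer whose attention map can be read out and overwritten; it is my own illustration of the idea, not the paper's TAB implementation:

```python
# Hypothetical illustration of a single-head attention bottleneck;
# not the actual TAB code from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttentionBottleneck(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)
        self.last_attn = None  # the single attention map, exposed for inspection

    def forward(self, x, attn_override=None):
        # x: (batch, tokens, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = F.softmax(q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)
        if attn_override is not None:
            # A user edits the one attention map and observes how the
            # downstream output changes.
            attn = attn_override
        self.last_attn = attn.detach()
        return self.out(attn @ v)

# Usage: run once, zero out attention to one token, renormalize, run again.
layer = SingleHeadAttentionBottleneck(dim=64)
x = torch.randn(1, 16, 64)
y = layer(x)
edited = layer.last_attn.clone()
edited[:, :, 5] = 0.0                            # mask attention to token 5
edited = edited / edited.sum(-1, keepdim=True)   # keep rows summing to 1
y_edited = layer(x, attn_override=edited)
```

Because there is only one head, there is a single attention map to edit, which is what makes the intervention easy to inspect.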

Hokin Deng (@denghokin) 's Twitter Profile Photo

#embodied All forms of biological intelligence are grounded in movement 🏃‍♂️ Muscles & motor neurons 🧠 emerge before the visual cortex & the rods & cones in our eyes 👁️ Building monocular, better-than-mocap-studio #video2motion is our critical step towards human embodied intelligence.