Phillip (Yuseung) Lee (@yuseungleee) 's Twitter Profile
Phillip (Yuseung) Lee

@yuseungleee

PhD student @KAIST_AI | Computer Vision, Generative Models

ID: 1540674379247685633

Link: https://phillipinseoul.github.io | Joined: 25-06-2022 12:33:00

944 Tweets

290 Followers

379 Following

Kosta Derpanis (@csprofkgd) 's Twitter Profile Photo

Term’s over, finally time to binge-watch lectures! Not just for the content, but to pick up new ways to present material too. Always learning.

Minhyuk Sung has an amazing set of videos of his various courses posted on YouTube (very brave 💪). Check them out!
Shiqi Chen (@shiqi_chen17) 's Twitter Profile Photo

🚀🔥 Thrilled to announce our ICML25 paper: "Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas"! We dive into the core reasons behind spatial reasoning difficulties for Vision-Language Models from an attention mechanism view. 🌍🔍 Paper:

Xindi Wu (@cindy_x_wu) 's Twitter Profile Photo

Introducing COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning, a data-efficient approach to improve multimodal models on complex visual tasks without scaling data volume. 📦

arxiv.org/abs/2504.21850

1/10
Fang Jiading (@jiading_fang) 's Twitter Profile Photo

After a lengthy 8m13s thinking process, OpenAI's powerful visual reasoning model today still failed at one of the most elementary visual reasoning tasks (I bet you can do it right).
Chun-Hsiao (Daniel) Yeh (@danielyehhh) 's Twitter Profile Photo

❗️❗️ Can MLLMs understand scenes from multiple camera viewpoints — like humans?

🧭 We introduce All-Angles Bench — 2,100+ QA pairs on multi-view scenes.

📊 We evaluate 27 top MLLMs, including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o.

🌐 Project: danielchyeh.github.io/All-Angles-Ben…
Cristóbal Valenzuela (@c_valenzuelab) 's Twitter Profile Photo

This update brings a completely new set of creative capabilities and improvements to References. An interesting emergent property is the ability of the model to precisely place objects in your scene using a layout you can provide. 

If you find new use cases, please share them.
Yi Xu (@_yixu) 's Twitter Profile Photo

🚀Let’s Think Only with Images.

No language and No verbal thought.🤔 

Let’s think through a sequence of images💭, like how humans picture steps in their minds🎨. 

We propose Visual Planning, a novel reasoning paradigm that enables models to reason purely through images.
Iacopo Masi (@_iac) 's Twitter Profile Photo

🔥 News - 2nd Unlearning and Model Editing Workshop and Challenge at #ICCV2025 

📃 Call for papers ready and OpenReview accepting submissions: bit.ly/4knWGv2

🧩 New challenge on #Unlearning: bit.ly/43GkzbK / unlearning.iab-rubric.org

Best performers in paper!
Yue Fan (@yfan_ucsc) 's Twitter Profile Photo

Before o3 impressed everyone with 🔥visual reasoning🔥, we already had faith in and were exploring models that can think with images. 🚀

Here’s our shot, GRIT: Grounded Reasoning with Images & Texts that trains MLLMs to think while performing visual grounding. It is done via RL
Wenhu Chen (@wenhuchen) 's Twitter Profile Photo

🚀 New Paper: Pixel Reasoner 🧠🖼️ How can Vision-Language Models (VLMs) perform chain-of-thought reasoning within the image itself? We introduce Pixel Reasoner, the first open-source framework that enables VLMs to “think in pixel space” through curiosity-driven reinforcement

Demis Hassabis (@demishassabis) 's Twitter Profile Photo

It’s kind of mindblowing how good Veo 3 is at modeling intuitive physics. Our world models are getting pretty good, & in my view this has important implications regarding the computational complexity of the world - the last line of my bio has always been the ultimate quest for me ⬆️

Sayak Paul (@risingsayak) 's Twitter Profile Photo

Open-sourcing nanoDiT -- an educational repository to show rectified-flow training of class-conditional DiTs for image generation (~600 LoC). Hope that helps: github.com/sayakpaul/nano…
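The rectified-flow training mentioned above can be illustrated with a minimal, framework-free sketch (a toy under stated assumptions, not code from the nanoDiT repository): sample a point on the straight line between a data sample and a noise sample, and regress the model's predicted velocity onto the constant target velocity between them.

```python
def rectified_flow_pair(x0, x1, t):
    """Interpolate between a data sample x0 and a noise sample x1 at time
    t in [0, 1]; the regression target is the constant velocity x1 - x0."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return xt, v_target

def velocity_loss(pred_v, target_v):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred_v, target_v)) / len(pred_v)
```

In an actual class-conditional DiT, the predicted velocity would come from the transformer conditioned on xt, t, and the class label; here that model is left abstract.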

Zhengzhong Tu (@_vztu) 's Twitter Profile Photo

🔥🔥 Introducing 𝗩𝗟𝗠-𝟯𝗥: 𝗩𝗶𝘀𝗶𝗼𝗻-𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝘀 with Instruction-Aligned 𝟯𝗗 𝗥econstruction 📡 Monocular videos are everywhere, yet current VLMs struggle to extract deep 🛰️ 𝗦𝗽𝗮𝘁𝗶𝗮𝗹 𝗜𝗻𝘁𝗲𝗹𝗹𝗶𝗴𝗲𝗻𝗰𝗲 from them. Existing methods often rely

Fangfu Liu (@fangfu0830) 's Twitter Profile Photo

Elevate Visual-Spatial Intelligence with Spatial-MLLM! 🚀🚀🚀 Discover how we incorporate 3D information to help MLLMs better think in space in our work: Spatial-MLLM. 🔗Code: github.com/diankun-wu/Spa… 🌐Project Page: diankun-wu.github.io/Spatial-MLLM/ 📄Paper: arxiv.org/abs/2505.23747

Gabriel Sarch (@gabrielsarch) 's Twitter Profile Photo

How can we get VLMs to move their eyes—and reason step-by-step in visually grounded ways? 👀 We introduce ViGoRL, an RL method that anchors reasoning to image regions. 🎯 It outperforms vanilla GRPO and SFT across grounding, spatial tasks, and visual search (86.4% on V*). 👇🧵

Chengzhi Liu (@liuchen02938149) 's Twitter Profile Photo

🧠 More Thinking, Less Seeing? 👀 Exploring the Balance Between Reasoning and Hallucination in Multimodal Reasoning Models!

Currently, many multimodal reasoning models, while striving for enhanced reasoning capabilities, often neglect the issue of visual hallucinations. While
Tanishq Mathew Abraham, Ph.D. (@iscienceluvr) 's Twitter Profile Photo

Hidden in plain sight: VLMs overlook their visual representations

"Across a series of vision-centric benchmarks (e.g., depth estimation,  correspondence), we find that VLMs perform substantially worse than  their visual encoders, dropping to near-chance performance."

"VLMs are
Yunzhi Zhang (@zhang_yunzhi) 's Twitter Profile Photo

(1/n) Time to unify your favorite visual generative models, VLMs, and simulators for controllable visual generation—Introducing a Product of Experts (PoE) framework for inference-time knowledge composition from heterogeneous models.
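The Product of Experts composition described above can be sketched for the discrete case (a toy illustration I am assuming, not the paper's actual framework): each expert supplies a probability distribution over the same outcomes, and the composition multiplies them pointwise and renormalizes, so only outcomes every expert assigns mass to survive.

```python
def product_of_experts(dists):
    """Combine expert distributions over a shared discrete support by
    multiplying probabilities element-wise and renormalizing."""
    combined = [1.0] * len(dists[0])
    for dist in dists:
        for i, p in enumerate(dist):
            combined[i] *= p
    z = sum(combined)  # normalizing constant
    return [p / z for p in combined]
```

For example, composing experts [0.6, 0.3, 0.1] and [0.2, 0.5, 0.3] concentrates mass on the outcomes both experts favor, yielding roughly [0.4, 0.5, 0.1].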

Constantin Venhoff (@cvenhoff00) 's Twitter Profile Photo

🔍 New paper: How do vision-language models actually align visual- and language representations?

We used sparse autoencoders to peek inside VLMs and found something surprising about when and where cross-modal alignment happens!

Presented at XAI4CV Workshop @ CVPR
 
🧵 (1/6)