Phillip (Yuseung) Lee (@yuseungleee) 's Twitter Profile
Phillip (Yuseung) Lee

@yuseungleee

PhD student @KAIST_AI | Computer Vision, Generative Models

ID: 1540674379247685633

Link: https://phillipinseoul.github.io | Joined: 25-06-2022 12:33:00

944 Tweets

290 Followers

379 Following

Kosta Derpanis (@csprofkgd) 's Twitter Profile Photo

Term’s over, finally time to binge-watch lectures! Not just for the content, but to pick up new ways to present material too. Always learning.

Minhyuk Sung has an amazing set of videos of his various courses posted on YouTube (very brave 💪). Check them out!
Shiqi Chen (@shiqi_chen17) 's Twitter Profile Photo

🚀🔥 Thrilled to announce our ICML25 paper: "Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas"! We dive into the core reasons behind spatial reasoning difficulties for Vision-Language Models from an attention mechanism view. 🌍🔍 Paper:

Xindi Wu (@cindy_x_wu) 's Twitter Profile Photo

Introducing COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning, a data-efficient approach to improve multimodal models on complex visual tasks without scaling data volume. 📦

arxiv.org/abs/2504.21850

1/10
Fang Jiading (@jiading_fang) 's Twitter Profile Photo

After a lengthy 8m13s thinking process, OpenAI's powerful visual reasoning model today still failed at one of the most elementary visual reasoning tasks (I bet you can do it right).
Chun-Hsiao (Daniel) Yeh (@danielyehhh) 's Twitter Profile Photo

❗️❗️ Can MLLMs understand scenes from multiple camera viewpoints — like humans?

🧭 We introduce All-Angles Bench — 2,100+ QA pairs on multi-view scenes.

📊 We evaluate 27 top MLLMs, including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o.

🌐 Project: danielchyeh.github.io/All-Angles-Ben…
Cristóbal Valenzuela (@c_valenzuelab) 's Twitter Profile Photo

This update brings a completely new set of creative capabilities and improvements to References. An interesting emergent property is the ability of the model to precisely place objects in your scene using a layout you can provide. 

If you find new use cases, please share them.
Yi Xu (@_yixu) 's Twitter Profile Photo

🚀Let’s Think Only with Images.

No language and No verbal thought.🤔 

Let’s think through a sequence of images💭, like how humans picture steps in their minds🎨. 

We propose Visual Planning, a novel reasoning paradigm that enables models to reason purely through images.
Iacopo Masi (@_iac) 's Twitter Profile Photo

🔥 News - 2nd Unlearning and Model Editing Workshop and Challenge at #ICCV2025 

📃 Call for papers ready and OpenReview accepting submissions: bit.ly/4knWGv2

🧩 New challenge on #Unlearning: bit.ly/43GkzbK / unlearning.iab-rubric.org

Best performers in paper!
Yue Fan (@yfan_ucsc) 's Twitter Profile Photo

Before o3 impressed everyone with 🔥visual reasoning🔥, we already had faith in and were exploring models that can think with images. 🚀

Here’s our shot, GRIT: Grounded Reasoning with Images & Texts that trains MLLMs to think while performing visual grounding. It is done via RL
Wenhu Chen (@wenhuchen) 's Twitter Profile Photo

🚀 New Paper: Pixel Reasoner 🧠🖼️ How can Vision-Language Models (VLMs) perform chain-of-thought reasoning within the image itself? We introduce Pixel Reasoner, the first open-source framework that enables VLMs to “think in pixel space” through curiosity-driven reinforcement

Demis Hassabis (@demishassabis) 's Twitter Profile Photo

It’s kind of mindblowing how good Veo 3 is at modeling intuitive physics. Our world models are getting pretty good, & in my view this has important implications regarding the computational complexity of the world - the last line of my bio has always been the ultimate quest for me ⬆️

Sayak Paul (@risingsayak) 's Twitter Profile Photo

Open-sourcing nanoDiT -- an educational repository to show rectified-flow training of class-conditional DiTs for image generation (~600 LoC). Hope that helps: github.com/sayakpaul/nano…
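The rectified-flow training mentioned above can be illustrated with a minimal, framework-free sketch (a toy under stated assumptions, not code from the nanoDiT repository): sample a point on the straight line between a data sample and a noise sample, and regress the model's predicted velocity onto the constant target velocity between them.

```python
def rectified_flow_pair(x0, x1, t):
    """Interpolate between a data sample x0 and a noise sample x1 at time
    t in [0, 1]; the regression target is the constant velocity x1 - x0."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return xt, v_target

def velocity_loss(pred_v, target_v):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred_v, target_v)) / len(pred_v)
```

In an actual class-conditional DiT, the predicted velocity would come from the transformer conditioned on xt, t, and the class label; here that model is left abstract.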

Zhengzhong Tu (@_vztu) 's Twitter Profile Photo

🔥🔥 Introducing 𝗩𝗟𝗠-𝟯𝗥: 𝗩𝗶𝘀𝗶𝗼𝗻-𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝘀 with Instruction-Aligned 𝟯𝗗 𝗥econstruction 📡 Monocular videos are everywhere, yet current VLMs struggle to extract deep 🛰️ 𝗦𝗽𝗮𝘁𝗶𝗮𝗹 𝗜𝗻𝘁𝗲𝗹𝗹𝗶𝗴𝗲𝗻𝗰𝗲 from them. Existing methods often rely

Fangfu Liu (@fangfu0830) 's Twitter Profile Photo

Elevate Visual-Spatial Intelligence with Spatial-MLLM! 🚀🚀🚀 Discover how we incorporate 3D information to help MLLMs better think in space in our work: Spatial-MLLM. 🔗Code: github.com/diankun-wu/Spa… 🌐Project Page: diankun-wu.github.io/Spatial-MLLM/ 📄Paper: arxiv.org/abs/2505.23747

Gabriel Sarch (@gabrielsarch) 's Twitter Profile Photo

How can we get VLMs to move their eyes—and reason step-by-step in visually grounded ways? 👀 We introduce ViGoRL, an RL method that anchors reasoning to image regions. 🎯 It outperforms vanilla GRPO and SFT across grounding, spatial tasks, and visual search (86.4% on V*). 👇🧵

Chengzhi Liu (@liuchen02938149) 's Twitter Profile Photo

🧠 More Thinking, Less Seeing? 👀 Exploring the Balance Between Reasoning and Hallucination in Multimodal Reasoning Models!

Currently, many multimodal reasoning models, while striving for enhanced reasoning capabilities, often neglect the issue of visual hallucinations. While
Tanishq Mathew Abraham, Ph.D. (@iscienceluvr) 's Twitter Profile Photo

Hidden in plain sight: VLMs overlook their visual representations

"Across a series of vision-centric benchmarks (e.g., depth estimation,  correspondence), we find that VLMs perform substantially worse than  their visual encoders, dropping to near-chance performance."

"VLMs are
Yunzhi Zhang (@zhang_yunzhi) 's Twitter Profile Photo

(1/n) Time to unify your favorite visual generative models, VLMs, and simulators for controllable visual generation—Introducing a Product of Experts (PoE) framework for inference-time knowledge composition from heterogeneous models.
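The Product of Experts composition described above can be sketched for the discrete case (a toy illustration I am assuming, not the paper's actual framework): each expert supplies a probability distribution over the same outcomes, and the composition multiplies them pointwise and renormalizes, so only outcomes every expert assigns mass to survive.

```python
def product_of_experts(dists):
    """Combine expert distributions over a shared discrete support by
    multiplying probabilities element-wise and renormalizing."""
    combined = [1.0] * len(dists[0])
    for dist in dists:
        for i, p in enumerate(dist):
            combined[i] *= p
    z = sum(combined)  # normalizing constant
    return [p / z for p in combined]
```

For example, composing experts [0.6, 0.3, 0.1] and [0.2, 0.5, 0.3] concentrates mass on the outcomes both experts favor, yielding roughly [0.4, 0.5, 0.1].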

Constantin Venhoff (@cvenhoff00) 's Twitter Profile Photo

🔍 New paper: How do vision-language models actually align visual- and language representations?

We used sparse autoencoders to peek inside VLMs and found something surprising about when and where cross-modal alignment happens!

Presented at XAI4CV Workshop @ CVPR
 
🧵 (1/6)