Academic Mercenary (@tarrlab)'s Twitter Profile
Academic Mercenary

@tarrlab

Michael Tarr's lab @CarnegieMellon

ID: 1445464455865716742

Link: http://tarrlab.org
Joined: 05-10-2021 19:03:18

115 Tweets

222 Followers

139 Following

Academic Mercenary (@tarrlab):

With Nadine Chang we introduce a new task, Labeling Instruction Generation, to address the lack of publicly available labeling instructions for large-scale visual datasets.

Timely given the new article from Josh Dzieza in The Verge and New York Magazine.

arXiv preprint: arxiv.org/abs/2306.14035
AuditoryLab (@auditorylabinfo):

Our lab at Carnegie Mellon University is recruiting 18+ #misophonics in the Pittsburgh area for an in-person study on #misophonia!
🔊 1.5-2hrs at $12/hr
🔊 Listen to everyday sounds while being monitored by our team
Please share with your networks! More info below
Gabriel Sarch (@gabrielsarch):

1/ “Prepare my evening tea”
How can we enable embodied agents to execute open-domain instructions and personalized requests?
Excited to share our work HELPER, an open-ended, instructable agent with memory-augmented LLMs, accepted at #EMNLP2023! 🤖
helper-agent-llm.github.io
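
As a rough illustration of the memory-augmented idea (not HELPER's actual API; the memory list, the llm callable, and the word-overlap scoring below are all placeholder assumptions), retrieval-augmented prompting over past requests can look like this:

# Hypothetical sketch of memory-augmented prompting; names and retrieval are illustrative only.
from typing import Callable, List, Tuple

def plan_with_memory(request: str,
                     memory: List[Tuple[str, str]],   # past (request, plan) pairs
                     llm: Callable[[str], str],
                     k: int = 3) -> str:
    """Prepend the k most similar past (request, plan) pairs to the prompt, then query the LLM."""
    def overlap(past_request: str) -> int:
        # Toy similarity: shared-word count; a real agent would use embedding retrieval.
        return len(set(past_request.lower().split()) & set(request.lower().split()))
    retrieved = sorted(memory, key=lambda pair: overlap(pair[0]), reverse=True)[:k]
    examples = "\n\n".join(f"Request: {r}\nPlan: {p}" for r, p in retrieved)
    return llm(f"{examples}\n\nRequest: {request}\nPlan:")

# Example usage with a trivial stand-in LLM:
print(plan_with_memory("prepare my evening tea",
                       [("make morning coffee", "1. boil water 2. brew coffee"),
                        ("water the plants", "1. fill can 2. water pots")],
                       llm=lambda prompt: "1. boil water 2. steep tea 3. serve"))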

Gabriel Sarch (@gabrielsarch):

I’ll be at #NeurIPS all week (11-17) and will be presenting Brain Dissection ⬇️ Poster #418 Thursday!
Also am very excited to chat about:
- LLM/VLM/MLLM for decision making and embodied agents
- personalized and memory-augmented agents
- neuroAI
Reach out if you want to chat!

Gabriel Sarch (@gabrielsarch):

How can we get VLMs to move their eyes—and reason step-by-step in visually grounded ways? 👀
We introduce ViGoRL, an RL method that anchors reasoning to image regions. 🎯
It outperforms vanilla GRPO and SFT across grounding, spatial tasks, and visual search (86.4% on V*). 👇🧵
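
To make "anchors reasoning to image regions" concrete, here is an illustrative example (not ViGoRL's actual tagging scheme) of a reasoning trace whose steps cite (x, y) image coordinates, plus a parser for those anchors:

import re

# Illustrative grounded trace: each step cites the (x, y) point it is "looking at".
grounded_trace = (
    "I see a red octagonal sign near (412, 88). "
    "Zooming toward (405, 92), the text reads 'STOP'. "
    "So the answer is: stop sign."
)

def extract_anchors(trace: str) -> list[tuple[int, int]]:
    """Pull every (x, y) coordinate cited in a grounded reasoning trace."""
    return [(int(x), int(y)) for x, y in re.findall(r"\((\d+),\s*(\d+)\)", trace)]

print(extract_anchors(grounded_trace))  # [(412, 88), (405, 92)]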

Gabriel Sarch (@gabrielsarch):

We ran a behavioral analysis of vanilla RL on small base VLMs and found:
– Minimal reference to fine-grained visual cues
– Poor exploration of visual regions
– Over-reliance on abstract, ungrounded heuristics—“look-once & guess” behavior ❌

Gabriel Sarch (@gabrielsarch):

Why does vanilla RL fail here? RL can only amplify base behaviors:
- Pretrained VLMs bias toward abstract scene references, not region analysis
- Accuracy-only rewards reinforce this
We argue: grounding each thought shifts models toward iterative, perceptually guided reasoning.
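
A minimal sketch of the reward intuition, assuming a simple exact-match accuracy term and an in-bounds check on cited anchors (the actual ViGoRL reward may be defined differently):

def accuracy_reward(answer: str, gold: str) -> float:
    # Accuracy-only reward: 1 if the final answer matches, else 0.
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0

def grounded_reward(answer: str, gold: str, anchors: list,
                    image_size=(1024, 1024), bonus: float = 0.2) -> float:
    """Accuracy plus a small bonus when the trace cites valid in-image (x, y) anchors."""
    w, h = image_size
    in_bounds = all(0 <= x < w and 0 <= y < h for x, y in anchors)
    return accuracy_reward(answer, gold) + (bonus if anchors and in_bounds else 0.0)

print(grounded_reward("stop sign", "stop sign", [(412, 88), (405, 92)]))  # 1.2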

Gabriel Sarch (@gabrielsarch):

Our grounded RL recipe:

– Warm-start on MCTS-generated (x, y)-anchored chains
– GRPO RL to reinforce correct + grounded steps
– Multi-turn RL for zooming into visual detail

Result? A model that iteratively attends, verifies, and self-corrects, grounding thoughts in the scene.
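
A minimal sketch of the group-relative step at the heart of GRPO, assuming scalar per-rollout rewards (the MCTS warm-start and multi-turn zoom stages above are omitted, and this is not the authors' training code):

import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standardize rewards within one prompt's group of rollouts to get per-rollout advantages."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Example: four rollouts for one prompt, scored by correctness plus a grounding bonus.
rewards = np.array([1.2, 0.0, 1.0, 0.2])
print(grpo_advantages(rewards))  # rollouts above the group mean get positive advantage
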
Gabriel Sarch (@gabrielsarch):

ViGoRL delivers gains across tasks:
– V*Bench (Search): 86.4%, outperforms GPT-4o & tool-use
– SAT-2: +12.9 pts, BLINK: +2.0 pts vs GRPO
– ScreenSpot-Pro: +3.3 pts, VisualWebArena: +3.0 pts (vision-only)
– Removing grounding hurts
VLMs that ground their reasoning see better.

Gabriel Sarch (@gabrielsarch):

Why does grounding help? Explicit grounding boosts helpful visual behaviors lacking in vanilla RL:

- More visual region exploration (3.5 vs 1.8)
- Frequent visual verification (0.39 vs 0.14)
- Backtracking (0.47)

RL + Grounding = richer reasoning 🧠
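
One way such per-trace counts might be computed (a guess at the methodology, not the paper's exact protocol) is to scan reasoning traces for cited regions and for verification or backtracking phrases:

import re

def behavior_stats(trace: str) -> dict:
    """Heuristic counts of visual behaviors in a single reasoning trace."""
    regions = set(re.findall(r"\(\d+,\s*\d+\)", trace))   # distinct (x, y) regions referenced
    verifications = len(re.findall(r"\b(verify|double-check|confirm)\b", trace, re.I))
    backtracks = len(re.findall(r"\b(actually|wait|on second thought)\b", trace, re.I))
    return {"regions_explored": len(regions),
            "verifications": verifications,
            "backtracks": backtracks}
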
Gabriel Sarch (@gabrielsarch):

We show spatial grounding in reasoning is accurate and aids human interpretability!

Human evaluations confirm:
- High accuracy in visual grounding of thoughts
- Grounded reasoning steps significantly help human understanding of reasoning (especially when grounding is correct!)🧐
Gabriel Sarch (@gabrielsarch):

Cognitive scientists have long studied deictic reference and feature binding: how humans use eye movements as spatial pointers to bind abstract variables to perceptual content, enabling compositional and goal-directed reasoning. ViGoRL provides evidence for this in VLMs!

Gabriel Sarch (@gabrielsarch):

👉 Paper: arxiv.org/abs/2505.23678
👉 Project page: visually-grounded-rl.github.io

This was an amazing collaboration with co-authors Snigdha Saha, Naitik Khandelwal, and Ayush Jain, together with faculty Katerina Fragkiadaki, Aviral Kumar, and Academic Mercenary.

Jacob Yeung (@jacobyeung):

1/6 🚀 Excited to share that BrainNRDS has been accepted as an oral at #CVPR2025! We decode motion from fMRI activity and use it to generate realistic reconstructions of videos people watched, outperforming strong existing baselines like MindVideo and Stable Video Diffusion.🧠🎥

Jacob Yeung (@jacobyeung):

2/6 Instead of directly generating the video from fMRI data, we first decode the motion, then use the motion to generate the video.
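
In sketch form, the two-stage idea could look like the following, with placeholder arrays and a stub motion-conditioned generator (this is not the BrainNRDS code; the real motion features and video model will differ):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
fmri_train, fmri_test = rng.normal(size=(200, 5000)), rng.normal(size=(20, 5000))
motion_train = rng.normal(size=(200, 64))        # e.g., per-clip optical-flow summaries

# Stage 1: linearly decode motion features from fMRI responses.
decoder = Ridge(alpha=10.0).fit(fmri_train, motion_train)
motion_pred = decoder.predict(fmri_test)

# Stage 2: condition a video generator on the decoded motion rather than on raw fMRI.
def motion_conditioned_generator(motion_vec):
    """Stand-in for a motion-conditioned video model (hypothetical)."""
    return f"<video conditioned on a {motion_vec.shape[0]}-dim motion code>"

videos = [motion_conditioned_generator(m) for m in motion_pred]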

Jacob Yeung (@jacobyeung):

3/6 Our method leads to more accurate object-level motion decoding.

We achieve significantly lower end-point error than baselines that try to generate videos directly (e.g., MindVideo, Stable Video Diffusion).
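
For reference, end-point error is the mean Euclidean distance between predicted and ground-truth motion vectors; a minimal version with toy arrays (not the paper's evaluation code):

import numpy as np

def end_point_error(flow_pred: np.ndarray, flow_gt: np.ndarray) -> float:
    """flow_pred, flow_gt: (H, W, 2) arrays of per-pixel (dx, dy) motion vectors."""
    return float(np.linalg.norm(flow_pred - flow_gt, axis=-1).mean())

pred, gt = np.zeros((4, 4, 2)), np.ones((4, 4, 2))
print(end_point_error(pred, gt))  # sqrt(2) ≈ 1.414
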
Jacob Yeung (@jacobyeung):

5/6 We also show that dynamic information isn’t just useful for generation. It is key to understanding brain activity as well. 

We observe that video models predict brain responses to dynamic scenes better than image models, especially in visual and somatosensory cortices.
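
The comparison implied here is an encoding-model analysis: regress voxel responses on features from a video model versus an image model and compare held-out prediction accuracy. A self-contained sketch with synthetic stand-ins for the real features and fMRI data (not the paper's pipeline):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
video_feats = rng.normal(size=(300, 512))    # stand-in for video-model features per stimulus
image_feats = rng.normal(size=(300, 512))    # stand-in for image-model features per stimulus
voxels = video_feats @ rng.normal(size=(512, 100)) + rng.normal(size=(300, 100))

def heldout_corr(features, responses, n_train=240):
    """Mean per-voxel correlation between ridge predictions and held-out responses."""
    model = Ridge(alpha=100.0).fit(features[:n_train], responses[:n_train])
    pred, true = model.predict(features[n_train:]), responses[n_train:]
    return float(np.mean([np.corrcoef(pred[:, v], true[:, v])[0, 1]
                          for v in range(responses.shape[1])]))

# In this toy data the voxels are built from video features, so the first score is higher by construction.
print(heldout_corr(video_feats, voxels), heldout_corr(image_feats, voxels))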