Academic Mercenary (@tarrlab)'s Twitter Profile
Academic Mercenary

@tarrlab

Michael Tarr's lab @CarnegieMellon

ID: 1445464455865716742

Link: http://tarrlab.org
Joined: 05-10-2021 19:03:18

115 Tweets

222 Followers

139 Following

Academic Mercenary (@tarrlab):

With Nadine Chang we introduce a new task, Labeling Instruction Generation, to address the lack of publicly available labeling instructions for large-scale visual datasets.

Timely given the new article from Josh Dzieza in The Verge and New York Magazine.

arXiv preprint: arxiv.org/abs/2306.14035
AuditoryLab (@auditorylabinfo):

Our lab at Carnegie Mellon University is recruiting 18+ #misophonics in the Pittsburgh area for an in-person study on #misophonia!
🔊 1.5-2hrs at $12/hr
🔊 Listen to everyday sounds while being monitored by our team
Please share with your networks! More info below
Gabriel Sarch (@gabrielsarch):

1/ “Prepare my evening tea”
How can we enable embodied agents to execute open-domain instructions and personalized requests?
Excited to share our work HELPER, an open-ended, instructable agent with memory-augmented LLMs, accepted at #EMNLP2023! 🤖
helper-agent-llm.github.io
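
As a rough illustration of the memory-augmented idea (not HELPER's actual API; the memory list, the llm callable, and the word-overlap scoring below are all placeholder assumptions), retrieval-augmented prompting over past requests can look like this:

# Hypothetical sketch of memory-augmented prompting; names and retrieval are illustrative only.
from typing import Callable, List, Tuple

def plan_with_memory(request: str,
                     memory: List[Tuple[str, str]],   # past (request, plan) pairs
                     llm: Callable[[str], str],
                     k: int = 3) -> str:
    """Prepend the k most similar past (request, plan) pairs to the prompt, then query the LLM."""
    def overlap(past_request: str) -> int:
        # Toy similarity: shared-word count; a real agent would use embedding retrieval.
        return len(set(past_request.lower().split()) & set(request.lower().split()))
    retrieved = sorted(memory, key=lambda pair: overlap(pair[0]), reverse=True)[:k]
    examples = "\n\n".join(f"Request: {r}\nPlan: {p}" for r, p in retrieved)
    return llm(f"{examples}\n\nRequest: {request}\nPlan:")

# Example usage with a trivial stand-in LLM:
print(plan_with_memory("prepare my evening tea",
                       [("make morning coffee", "1. boil water 2. brew coffee"),
                        ("water the plants", "1. fill can 2. water pots")],
                       llm=lambda prompt: "1. boil water 2. steep tea 3. serve"))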

Gabriel Sarch (@gabrielsarch):

I’ll be at #NeurIPS all week (11-17) and will be presenting Brain Dissection ⬇️ Poster #418 Thursday!
Also am very excited to chat about:
- LLM/VLM/MLLM for decision making and embodied agents
- personalized and memory-augmented agents
- neuroAI
Reach out if you want to chat!

Gabriel Sarch (@gabrielsarch):

How can we get VLMs to move their eyes—and reason step-by-step in visually grounded ways? 👀
We introduce ViGoRL, an RL method that anchors reasoning to image regions. 🎯
It outperforms vanilla GRPO and SFT across grounding, spatial tasks, and visual search (86.4% on V*). 👇🧵
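
To make "anchors reasoning to image regions" concrete, here is an illustrative example (not ViGoRL's actual tagging scheme) of a reasoning trace whose steps cite (x, y) image coordinates, plus a parser for those anchors:

import re

# Illustrative grounded trace: each step cites the (x, y) point it is "looking at".
grounded_trace = (
    "I see a red octagonal sign near (412, 88). "
    "Zooming toward (405, 92), the text reads 'STOP'. "
    "So the answer is: stop sign."
)

def extract_anchors(trace: str) -> list[tuple[int, int]]:
    """Pull every (x, y) coordinate cited in a grounded reasoning trace."""
    return [(int(x), int(y)) for x, y in re.findall(r"\((\d+),\s*(\d+)\)", trace)]

print(extract_anchors(grounded_trace))  # [(412, 88), (405, 92)]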

Gabriel Sarch (@gabrielsarch):

We ran a behavioral analysis of vanilla RL on small base VLMs and found:
– Minimal reference to fine-grained visual cues
– Poor exploration of visual regions
– Over-reliance on abstract, ungrounded heuristics—“look-once & guess” behavior ❌

Gabriel Sarch (@gabrielsarch):

Why does vanilla RL fail here? RL can only amplify base behaviors:
- Pretrained VLMs bias toward abstract scene references, not region analysis
- Accuracy-only rewards reinforce this
We argue: grounding each thought shifts models toward iterative, perceptually guided reasoning.
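
A minimal sketch of the reward intuition, assuming a simple exact-match accuracy term and an in-bounds check on cited anchors (the actual ViGoRL reward may be defined differently):

def accuracy_reward(answer: str, gold: str) -> float:
    # Accuracy-only reward: 1 if the final answer matches, else 0.
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0

def grounded_reward(answer: str, gold: str, anchors: list,
                    image_size=(1024, 1024), bonus: float = 0.2) -> float:
    """Accuracy plus a small bonus when the trace cites valid in-image (x, y) anchors."""
    w, h = image_size
    in_bounds = all(0 <= x < w and 0 <= y < h for x, y in anchors)
    return accuracy_reward(answer, gold) + (bonus if anchors and in_bounds else 0.0)

print(grounded_reward("stop sign", "stop sign", [(412, 88), (405, 92)]))  # 1.2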

Gabriel Sarch (@gabrielsarch):

Our grounded RL recipe:

– Warm-start on MCTS-generated (x, y)-anchored chains
– GRPO RL to reinforce correct + grounded steps
– Multi-turn RL for zooming into visual detail

Result? A model that iteratively attends, verifies, and self-corrects, grounding thoughts in the scene.
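
A minimal sketch of the group-relative step at the heart of GRPO, assuming scalar per-rollout rewards (the MCTS warm-start and multi-turn zoom stages above are omitted, and this is not the authors' training code):

import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standardize rewards within one prompt's group of rollouts to get per-rollout advantages."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Example: four rollouts for one prompt, scored by correctness plus a grounding bonus.
rewards = np.array([1.2, 0.0, 1.0, 0.2])
print(grpo_advantages(rewards))  # rollouts above the group mean get positive advantage
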
Gabriel Sarch (@gabrielsarch):

ViGoRL delivers gains across tasks:
– V*Bench (Search): 86.4%, outperforms GPT-4o & tool-use
– SAT-2: +12.9 pts, BLINK: +2.0 pts vs GRPO
– ScreenSpot-Pro: +3.3 pts, VisualWebArena: +3.0 pts (vision-only)
– Removing grounding hurts
VLMs that ground their reasoning see better.

Gabriel Sarch (@gabrielsarch):

Why does grounding help? Explicit grounding boosts helpful visual behaviors lacking in vanilla RL:

- More visual region exploration (3.5 vs 1.8)
- Frequent visual verification (0.39 vs 0.14)
- Backtracking (0.47)

RL + Grounding = richer reasoning 🧠
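
One way such per-trace counts might be computed (a guess at the methodology, not the paper's exact protocol) is to scan reasoning traces for cited regions and for verification or backtracking phrases:

import re

def behavior_stats(trace: str) -> dict:
    """Heuristic counts of visual behaviors in a single reasoning trace."""
    regions = set(re.findall(r"\(\d+,\s*\d+\)", trace))   # distinct (x, y) regions referenced
    verifications = len(re.findall(r"\b(verify|double-check|confirm)\b", trace, re.I))
    backtracks = len(re.findall(r"\b(actually|wait|on second thought)\b", trace, re.I))
    return {"regions_explored": len(regions),
            "verifications": verifications,
            "backtracks": backtracks}
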
Gabriel Sarch (@gabrielsarch):

We show spatial grounding in reasoning is accurate and aids human interpretability!

Human evaluations confirm:
- High accuracy in visual grounding of thoughts
- Grounded reasoning steps significantly help human understanding of reasoning (especially when grounding is correct!)🧐
Gabriel Sarch (@gabrielsarch):

Cognitive scientists have long studied deictic reference and feature binding: how humans use eye movements as spatial pointers to bind abstract variables to perceptual content, enabling compositional and goal-directed reasoning. ViGoRL provides evidence for this in VLMs!

Gabriel Sarch (@gabrielsarch):

👉 Paper: arxiv.org/abs/2505.23678
👉 Project page: visually-grounded-rl.github.io

This was an amazing collaboration with co-authors Snigdha Saha, Naitik Khandelwal, and Ayush Jain, together with faculty Katerina Fragkiadaki, Aviral Kumar, and Academic Mercenary.

Jacob Yeung (@jacobyeung):

1/6 🚀 Excited to share that BrainNRDS has been accepted as an oral at #CVPR2025! We decode motion from fMRI activity and use it to generate realistic reconstructions of videos people watched, outperforming strong existing baselines like MindVideo and Stable Video Diffusion.🧠🎥

Jacob Yeung (@jacobyeung):

2/6 Instead of directly generating the video from fMRI data, we first decode the motion, then use the motion to generate the video.
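
In sketch form, the two-stage idea could look like the following, with placeholder arrays and a stub motion-conditioned generator (this is not the BrainNRDS code; the real motion features and video model will differ):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
fmri_train, fmri_test = rng.normal(size=(200, 5000)), rng.normal(size=(20, 5000))
motion_train = rng.normal(size=(200, 64))        # e.g., per-clip optical-flow summaries

# Stage 1: linearly decode motion features from fMRI responses.
decoder = Ridge(alpha=10.0).fit(fmri_train, motion_train)
motion_pred = decoder.predict(fmri_test)

# Stage 2: condition a video generator on the decoded motion rather than on raw fMRI.
def motion_conditioned_generator(motion_vec):
    """Stand-in for a motion-conditioned video model (hypothetical)."""
    return f"<video conditioned on a {motion_vec.shape[0]}-dim motion code>"

videos = [motion_conditioned_generator(m) for m in motion_pred]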

Jacob Yeung (@jacobyeung):

3/6 Our method leads to more accurate object-level motion decoding.

We achieve significantly lower end-point error than baselines that try to generate videos directly (e.g., MindVideo, Stable Video Diffusion).
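
For reference, end-point error is the mean Euclidean distance between predicted and ground-truth motion vectors; a minimal version with toy arrays (not the paper's evaluation code):

import numpy as np

def end_point_error(flow_pred: np.ndarray, flow_gt: np.ndarray) -> float:
    """flow_pred, flow_gt: (H, W, 2) arrays of per-pixel (dx, dy) motion vectors."""
    return float(np.linalg.norm(flow_pred - flow_gt, axis=-1).mean())

pred, gt = np.zeros((4, 4, 2)), np.ones((4, 4, 2))
print(end_point_error(pred, gt))  # sqrt(2) ≈ 1.414
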
Jacob Yeung (@jacobyeung):

5/6 We also show that dynamic information isn’t just useful for generation. It is key to understanding brain activity as well. 

We observe that video models predict brain responses to dynamic scenes better than image models, especially in visual and somatosensory cortices.
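
The comparison implied here is an encoding-model analysis: regress voxel responses on features from a video model versus an image model and compare held-out prediction accuracy. A self-contained sketch with synthetic stand-ins for the real features and fMRI data (not the paper's pipeline):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
video_feats = rng.normal(size=(300, 512))    # stand-in for video-model features per stimulus
image_feats = rng.normal(size=(300, 512))    # stand-in for image-model features per stimulus
voxels = video_feats @ rng.normal(size=(512, 100)) + rng.normal(size=(300, 100))

def heldout_corr(features, responses, n_train=240):
    """Mean per-voxel correlation between ridge predictions and held-out responses."""
    model = Ridge(alpha=100.0).fit(features[:n_train], responses[:n_train])
    pred, true = model.predict(features[n_train:]), responses[n_train:]
    return float(np.mean([np.corrcoef(pred[:, v], true[:, v])[0, 1]
                          for v in range(responses.shape[1])]))

# In this toy data the voxels are built from video features, so the first score is higher by construction.
print(heldout_corr(video_feats, voxels), heldout_corr(image_feats, voxels))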