Peter Tong (@tongpetersb)'s Twitter Profile
Peter Tong

@tongpetersb

Berkeley '23, CS PhD student at NYU Courant, advised by Professor @ylecun and Professor @sainingxie

ID: 1551975677552902152

Link: https://tsb0601.github.io/petertongsb/ · Joined: 26-07-2022 17:00:19

92 Tweets

1.1K Followers

113 Following

David Fan (@davidjfan)'s Twitter Profile Photo

Can visual SSL match CLIP on VQA? Yes! We show with controlled experiments that visual SSL can be competitive even on OCR/Chart VQA, as demonstrated by our new Web-SSL model family (1B-7B params) which is trained purely on web images – without any language supervision.

David Fan (@davidjfan)'s Twitter Profile Photo

[7/8] This side project started in October when Peter Tong, Amir Bar, and I were thinking about the rise of CLIP as a popular vision encoder for MLLMs. The community often assumes that language supervision is the primary reason for CLIP's strong performance. However, we…

Saining Xie (@sainingxie)'s Twitter Profile Photo

In Cambrian-1, we found that vision SSL representations usually lagged behind language-supervised ones -- but once the data gap is closed and scaling kicks in, performance catches up. We’ve tried scaling SSL before, but this is the first time I’ve seen real signal: SSL adapts to…

Yann LeCun (@ylecun)'s Twitter Profile Photo

New paper from FAIR+NYU:
Q: Is language supervision required to learn effective visual representations for multimodal tasks?
A: No. ⬇️⬇️⬇️

David Fan (@davidjfan)'s Twitter Profile Photo

Excited to release the training code for MetaMorph! MetaMorph offers a simple yet effective way to convert LLMs into a multimodal LLM that not only takes multimodal inputs, but also generates multimodal outputs via AR prediction. This confers the ability to “think visually”, and…
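To make the AR-prediction idea concrete, here is a minimal PyTorch sketch, not MetaMorph's actual code: a decoder-only LLM backbone gets an extra regression head so one interleaved sequence can be trained to predict both the next text token (cross-entropy) and the next visual embedding (regression). The names ToyMultimodalAR and multimodal_ar_loss, and the HuggingFace-style backbone call, are assumptions for illustration; see the released training code for the real recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMultimodalAR(nn.Module):
    """Hypothetical sketch: an LLM backbone with two output heads, one for
    next-text-token prediction and one for next-visual-embedding regression."""

    def __init__(self, llm: nn.Module, d_model: int, vocab_size: int, d_visual: int):
        super().__init__()
        self.llm = llm                                    # decoder-only transformer backbone
        self.visual_proj = nn.Linear(d_visual, d_model)   # map vision-encoder features into LLM space
        self.text_head = nn.Linear(d_model, vocab_size)   # logits over the text vocabulary
        self.visual_head = nn.Linear(d_model, d_visual)   # regress the next visual embedding

    def forward(self, text_embeds, visual_feats):
        # Interleaving of text and image positions is simplified to a concat here;
        # assumes a HuggingFace-style backbone that accepts inputs_embeds.
        seq = torch.cat([text_embeds, self.visual_proj(visual_feats)], dim=1)
        hidden = self.llm(inputs_embeds=seq).last_hidden_state
        return self.text_head(hidden), self.visual_head(hidden)


def multimodal_ar_loss(text_logits, text_targets, visual_preds, visual_targets):
    # Cross-entropy on text positions plus a regression loss on visual positions
    # (position selection and target shifting are omitted for brevity).
    ce = F.cross_entropy(text_logits.transpose(1, 2), text_targets)
    reg = F.mse_loss(visual_preds, visual_targets)
    return ce + reg
```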

Saining Xie (@sainingxie)'s Twitter Profile Photo

Recently open-sourced projects from Peter Tong, David Fan, and the team at Meta FAIR:
- MetaMorph (training code and model weights): github.com/facebookresear…
- Web-SSL (model weights for Web-DINO and Web-MAE): github.com/facebookresear…
FAIR's still leading the way in open research.

David Fan (@davidjfan)'s Twitter Profile Photo

Web-SSL model weights are now available on GitHub and HuggingFace! You can load the models with your favorite Transformers library API calls or with native PyTorch, whichever you prefer. For more usage details, please see github.com/facebookresear… HuggingFace collection: …
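As a rough illustration of the Transformers route, the snippet below loads a checkpoint with AutoModel and AutoImageProcessor and extracts patch-level features from an image. The repository ID and the image file name are assumptions for illustration; check the linked GitHub repo and HuggingFace collection for the exact model names.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Assumed repository ID; substitute a real name from the Web-SSL collection.
repo_id = "facebook/webssl-dino1b-full2b-224"

processor = AutoImageProcessor.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id)
model.eval()

# Placeholder image path.
image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    features = model(**inputs).last_hidden_state  # per-patch visual features

print(features.shape)
```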

Xindi Wu (@cindy_x_wu)'s Twitter Profile Photo

Introducing COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning, a data-efficient approach to improve multimodal models on complex visual tasks without scaling data volume. 📦 arxiv.org/abs/2504.21850 1/10
