Peter Tong (@tongpetersb)'s Twitter Profile
Peter Tong

@tongpetersb

Berkeley '23, CS PhD student at NYU Courant, advised by Professor @ylecun and Professor @sainingxie

ID: 1551975677552902152

Link: https://tsb0601.github.io/petertongsb/ · Joined: 26-07-2022 17:00:19

92 Tweets

1.1K Followers

113 Following

David Fan (@davidjfan)'s Twitter Profile Photo

Can visual SSL match CLIP on VQA? Yes! We show with controlled experiments that visual SSL can be competitive even on OCR/Chart VQA, as demonstrated by our new Web-SSL model family (1B-7B params) which is trained purely on web images – without any language supervision.

David Fan (@davidjfan)'s Twitter Profile Photo

[7/8] This side project started in October when Peter Tong, Amir Bar, and I were thinking about the rise of CLIP as a popular vision encoder for MLLMs. The community often assumes that language supervision is the primary reason for CLIP's strong performance. However, we…

Saining Xie (@sainingxie)'s Twitter Profile Photo

In Cambrian-1, we found that vision SSL representations usually lagged behind language-supervised ones -- but once the data gap is closed and scaling kicks in, performance catches up. We’ve tried scaling SSL before, but this is the first time I’ve seen real signal: SSL adapts to…

Yann LeCun (@ylecun)'s Twitter Profile Photo

New paper from FAIR+NYU:
Q: Is language supervision required to learn effective visual representations for multimodal tasks?
A: No. ⬇️⬇️⬇️

David Fan (@davidjfan)'s Twitter Profile Photo

Excited to release the training code for MetaMorph! MetaMorph offers a simple yet effective way to convert LLMs into a multimodal LLM that not only takes multimodal inputs, but also generates multimodal outputs via AR prediction. This confers the ability to “think visually”, and…
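To make the AR-prediction idea concrete, here is a minimal PyTorch sketch, not MetaMorph's actual code: a decoder-only LLM backbone gets an extra regression head so one interleaved sequence can be trained to predict both the next text token (cross-entropy) and the next visual embedding (regression). The names ToyMultimodalAR and multimodal_ar_loss, and the HuggingFace-style backbone call, are assumptions for illustration; see the released training code for the real recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMultimodalAR(nn.Module):
    """Hypothetical sketch: an LLM backbone with two output heads, one for
    next-text-token prediction and one for next-visual-embedding regression."""

    def __init__(self, llm: nn.Module, d_model: int, vocab_size: int, d_visual: int):
        super().__init__()
        self.llm = llm                                    # decoder-only transformer backbone
        self.visual_proj = nn.Linear(d_visual, d_model)   # map vision-encoder features into LLM space
        self.text_head = nn.Linear(d_model, vocab_size)   # logits over the text vocabulary
        self.visual_head = nn.Linear(d_model, d_visual)   # regress the next visual embedding

    def forward(self, text_embeds, visual_feats):
        # Interleaving of text and image positions is simplified to a concat here;
        # assumes a HuggingFace-style backbone that accepts inputs_embeds.
        seq = torch.cat([text_embeds, self.visual_proj(visual_feats)], dim=1)
        hidden = self.llm(inputs_embeds=seq).last_hidden_state
        return self.text_head(hidden), self.visual_head(hidden)


def multimodal_ar_loss(text_logits, text_targets, visual_preds, visual_targets):
    # Cross-entropy on text positions plus a regression loss on visual positions
    # (position selection and target shifting are omitted for brevity).
    ce = F.cross_entropy(text_logits.transpose(1, 2), text_targets)
    reg = F.mse_loss(visual_preds, visual_targets)
    return ce + reg
```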

Saining Xie (@sainingxie)'s Twitter Profile Photo

Recently open-sourced projects from Peter Tong, David Fan, and the team at Meta FAIR:
- MetaMorph (training code and model weights): github.com/facebookresear…
- Web-SSL (model weights for Web-DINO and Web-MAE): github.com/facebookresear…
FAIR's still leading the way in open research.

David Fan (@davidjfan)'s Twitter Profile Photo

Web-SSL model weights are now available on GitHub and HuggingFace! You can load the models with your favorite Transformers library API calls or with native PyTorch, whichever you prefer. For more usage details, please see github.com/facebookresear… HuggingFace collection: …
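As a rough illustration of the Transformers route, the snippet below loads a checkpoint with AutoModel and AutoImageProcessor and extracts patch-level features from an image. The repository ID and the image file name are assumptions for illustration; check the linked GitHub repo and HuggingFace collection for the exact model names.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Assumed repository ID; substitute a real name from the Web-SSL collection.
repo_id = "facebook/webssl-dino1b-full2b-224"

processor = AutoImageProcessor.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id)
model.eval()

# Placeholder image path.
image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    features = model(**inputs).last_hidden_state  # per-patch visual features

print(features.shape)
```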

Xindi Wu (@cindy_x_wu)'s Twitter Profile Photo

Introducing COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning, a data-efficient approach to improve multimodal models on complex visual tasks without scaling data volume. 📦 arxiv.org/abs/2504.21850 1/10
