Xinlei Chen (@endernewton)'s Twitter Profile
Xinlei Chen

@endernewton

Research Scientist at FAIR

ID: 334448097

Website: http://xinleic.xyz/ · Joined: 13-07-2011 03:26:58

48 Tweets

2.2K Followers

827 Following

Xinlei Chen (@endernewton):

Very happy to see the TTT-series reaching yet another milestone! This time it serves as an inspiration for next-generation architecture post-Transformer, and by connecting TTT to Transformer, it can explain why (autoregressive) Transformers are so good at in-context learning!
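
The tweet refers to the test-time-training (TTT) idea of treating a layer's hidden state as a small model that is updated by gradient steps as tokens arrive. Below is a minimal toy sketch of that idea; the function name, the corruption scheme, and the update rule are illustrative choices for this sketch, not the TTT papers' actual layer.

```python
# Toy illustration (not the papers' implementation): a "TTT-style" layer whose
# hidden state is the weight matrix W of a tiny linear model. For every token,
# W takes one gradient step on a self-supervised reconstruction loss, then the
# updated model produces the output.
import torch

def ttt_linear_toy(tokens: torch.Tensor, lr: float = 0.1) -> torch.Tensor:
    """tokens: (seq_len, dim). Returns outputs of the continually updated model."""
    seq_len, dim = tokens.shape
    W = torch.zeros(dim, dim)                 # hidden state = weights of f(x) = W @ x
    outputs = []
    for x in tokens:
        # Inner self-supervised task: reconstruct the token from a corrupted view.
        x_corrupt = x + 0.1 * torch.randn_like(x)
        err = W @ x_corrupt - x               # reconstruction residual
        grad_W = torch.outer(err, x_corrupt)  # d/dW of 0.5 * ||W x_c - x||^2
        W = W - lr * grad_W                   # one "test-time training" step per token
        outputs.append(W @ x)                 # output of the freshly updated model
    return torch.stack(outputs)

if __name__ == "__main__":
    out = ttt_linear_toy(torch.randn(16, 8))
    print(out.shape)  # torch.Size([16, 8])
```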

Leroy Wang (@liruiwang1):

Excited to share Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers (HPT)! We explore the challenging problem of data heterogeneity across embodiments in robotics and investigate the scaling behavior of HPT. Accepted to #NeurIPS2024 as a Spotlight.

Quinn McIntyre (@qamcintyre):

So excited to share what I have been working on at Etched. It was a great honor to work with Julian Quevedo, Spruce, Xinlei Chen, and Robert Wachen, and to have the chance to collaborate with Decart. Interactive video models will be the most impactful interface in the next decade.

Julian Quevedo (@julianhquevedo):

oasis is here! it's an interactive diffusion transformer that predicts the next frame autoregressively. here, we used it to create one of the first immersive, generative worlds. and the future possibilities for interactive video are so, so exciting.
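
For readers unfamiliar with autoregressive frame prediction, here is a hedged sketch of what such an inference loop can look like: each new frame starts from noise and is iteratively denoised conditioned on the previous frame and the user's action. The denoiser and the crude fixed-step sampler below are stand-ins invented for this sketch; Oasis's actual architecture and sampler are not described in the tweet.

```python
# Hedged sketch of "predicting the next frame autoregressively" with a diffusion
# model at inference time. DummyFrameDenoiser is a placeholder invented here.
import torch
from torch import nn

class DummyFrameDenoiser(nn.Module):
    """Placeholder: predicts a denoised frame from (noisy frame, context frame, action)."""
    def __init__(self, frame_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Linear(frame_dim * 2 + action_dim, frame_dim)

    def forward(self, noisy, context, action):
        return self.net(torch.cat([noisy, context, action], dim=-1))

def rollout(denoiser, first_frame, actions, denoise_steps: int = 4):
    """Generate one frame per user action, each conditioned on the previous frame."""
    frames = [first_frame]
    for action in actions:
        x = torch.randn_like(first_frame)   # every new frame starts from noise
        for _ in range(denoise_steps):      # crude fixed-step "denoising" loop
            x = denoiser(x, frames[-1], action)
        frames.append(x)                    # the frame becomes context for the next one
    return torch.stack(frames)

if __name__ == "__main__":
    d = DummyFrameDenoiser(frame_dim=32, action_dim=4)
    video = rollout(d, torch.zeros(32), [torch.zeros(4) for _ in range(5)])
    print(video.shape)  # torch.Size([6, 32])
```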

Yossi Gandelsman (@ygandelsman):

Current video representation models (e.g. VideoMAE) are inefficient learners. How inefficient? We show that reprs with similar quality can be learned without training on *any* real videos, by using synthetic datasets that were created from very simple generative processes!
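
To make "very simple generative processes" concrete, here is an illustrative toy generator that produces synthetic clips of rectangles drifting over noise. The specific processes used in the paper may differ; this only shows the kind of training data that involves no real video at all.

```python
# Illustrative only: a toy "simple generative process" for synthetic video clips
# (a few moving rectangles over a noise background).
import numpy as np

def synthetic_clip(num_frames=16, size=64, num_boxes=3, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    pos = rng.integers(0, size - 8, size=(num_boxes, 2)).astype(float)
    vel = rng.uniform(-2, 2, size=(num_boxes, 2))
    clip = rng.normal(0.0, 0.1, size=(num_frames, size, size)).astype(np.float32)
    for t in range(num_frames):
        pos = (pos + vel) % (size - 8)           # boxes drift with constant velocity
        for (y, x) in pos.astype(int):
            clip[t, y:y + 8, x:x + 8] = 1.0      # paint each 8x8 box
    return clip                                   # shape: (num_frames, size, size)

if __name__ == "__main__":
    print(synthetic_clip().shape)  # (16, 64, 64)
```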

Xinlei Chen (@endernewton):

I am looking for an intern to do research together next summer. Possible topics: representation learning, network architecture, and in general understanding what's going on :P. Please apply (metacareers.com/jobs/532549086…) and email me ([email protected]) if interested.

Leroy Wang (@liruiwang1):

HPT will be presented at NeurIPS in Vancouver, East Exhibit Hall A-C #4210, on Thursday at 11 next week! Unfortunately, I cannot make it in person, but Xinlei Chen will be there! Thanks for the constructive feedback from the reviewers. Check out the poster and come talk to us!

Alex Li (@alexlioralexli):

I'm presenting our #NeurIPS2024 work on Attention Transfer today! Key finding: Pretrained representations aren't essential - just using attention patterns from pretrained models to guide token interactions is enough for models to learn high-quality features from scratch and

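A minimal sketch of the idea described above: the student block computes its own value vectors but mixes tokens with attention weights copied from a frozen pretrained teacher, so only the pattern of token interactions is transferred. Names and shapes below are assumptions for illustration, not the paper's reference implementation.

```python
# Minimal "attention copy" sketch: token mixing follows the teacher's attention
# pattern while all features are learned from scratch by the student.
import torch
from torch import nn

class AttentionCopyBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.value = nn.Linear(dim, dim)   # student-owned value projection
        self.proj = nn.Linear(dim, dim)    # student-owned output projection

    def forward(self, x: torch.Tensor, teacher_attn: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); teacher_attn: (batch, tokens, tokens),
        # rows already softmax-normalized by the frozen teacher.
        v = self.value(x)
        mixed = teacher_attn @ v           # interactions guided by the teacher's pattern
        return x + self.proj(mixed)        # residual connection

if __name__ == "__main__":
    blk = AttentionCopyBlock(dim=64)
    x = torch.randn(2, 197, 64)                          # e.g. ViT tokens (CLS + 14x14 patches)
    attn = torch.softmax(torch.randn(2, 197, 197), -1)   # stand-in for a teacher attention map
    print(blk(x, attn).shape)  # torch.Size([2, 197, 64])
```
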
Zhuang Liu (@liuzhuang1234):

How far is an LLM from not only understanding but also generating visually? Not very far! Introducing MetaMorph---a multimodal understanding and generation model. In MetaMorph, understanding and generation benefit each other. Very moderate generation data is needed to elicit

Shiry Ginosar (@shiryginosar):

New paper! An SSL object-centric 2.1D image representation using 3D Gaussians, extending MAE with a Gaussian bottleneck. While Gaussian splatting has been used for single-scene reconstruction, we're the first to apply it to image representation learning! brjathu.github.io/gmae/

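A rough sketch of what a "Gaussian bottleneck" head can look like: decoder tokens are mapped to per-Gaussian parameters that a differentiable splatting renderer (not shown) would turn back into an image for the MAE-style reconstruction loss. The parameter layout below is a guess for illustration, not the paper's exact design.

```python
# Sketch of a head that maps decoder tokens to 3D Gaussian parameters; rendering
# with a differentiable splatting rasterizer is assumed and omitted here.
import torch
from torch import nn

class GaussianHead(nn.Module):
    """Maps each decoder token to the parameters of one 3D Gaussian."""
    def __init__(self, dim: int):
        super().__init__()
        # 3 (mean xyz) + 3 (log scale) + 4 (rotation quaternion) + 3 (color) + 1 (opacity)
        self.head = nn.Linear(dim, 14)

    def forward(self, tokens: torch.Tensor) -> dict:
        p = self.head(tokens)
        return {
            "mean": p[..., 0:3],
            "scale": p[..., 3:6].exp(),
            "rotation": nn.functional.normalize(p[..., 6:10], dim=-1),
            "color": p[..., 10:13].sigmoid(),
            "opacity": p[..., 13:14].sigmoid(),
        }

if __name__ == "__main__":
    gauss = GaussianHead(dim=512)(torch.randn(2, 196, 512))
    print({k: v.shape for k, v in gauss.items()})
```
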
Simone Scardapane (@s_scardapane):

*On the Surprising Effectiveness of Attention Transfer for Vision Transformers* by Yuandong Tian, Beidi Chen, Deepak Pathak, Xinlei Chen, and Alex Li. Shows that distilling attention patterns in ViTs is competitive with standard fine-tuning. arxiv.org/abs/2411.09702

Rohit Girdhar (@_rohitgirdhar_):

Super excited to share some recent work that shows that pure, text-only LLMs can see and hear without any training! Our approach, called "MILS", uses LLMs with off-the-shelf multimodal models to caption images/videos/audio, improve image generation, do style transfer, and more!

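The tweet does not spell out the procedure, so the following is only a hedged sketch of the general training-free recipe it gestures at: a text-only LLM proposes candidate captions, an off-the-shelf multimodal scorer ranks them against the input, and the scores are fed back to the LLM for the next round. `propose_captions` and `clip_score` are placeholder callables invented for this sketch, not a real API.

```python
# Sketch of a training-free generate-and-score captioning loop that pairs a
# text-only LLM with an off-the-shelf multimodal scorer (e.g. CLIP similarity).
from typing import Callable, List, Tuple

def caption_by_generate_and_score(
    image,
    propose_captions: Callable[[List[Tuple[str, float]]], List[str]],  # wraps the LLM
    clip_score: Callable[[object, str], float],                        # wraps the scorer
    rounds: int = 5,
    keep: int = 8,
) -> str:
    """No gradients and no training: only generate -> score -> feed back."""
    scored: List[Tuple[str, float]] = []
    for _ in range(rounds):
        candidates = propose_captions(scored)             # LLM sees the best captions so far
        scored += [(c, clip_score(image, c)) for c in candidates]
        scored = sorted(scored, key=lambda cs: cs[1], reverse=True)[:keep]
    return scored[0][0]                                   # highest-scoring caption wins
```
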
David Fan (@davidjfan):

Can visual SSL match CLIP on VQA? Yes! We show with controlled experiments that visual SSL can be competitive even on OCR/Chart VQA, as demonstrated by our new Web-SSL model family (1B-7B params) which is trained purely on web images – without any language supervision.
