Ellis Brown (@_ellisbrown)'s Twitter Profile
Ellis Brown

@_ellisbrown

CS PhD Student @NYU_Courant w/ Profs @sainingxie @rob_fergus |
Intern @ Ai2 | Prev: @CarnegieMellon, blackrock.com/corporate/ai, @VanderbiltU

ID: 4703177401

https://ellisbrown.github.io/ · Joined 03-01-2016 14:27:39

346 Tweets

599 Followers

630 Following

Willis (Nanye) Ma (@ma_nanye)'s Twitter Profile Photo

Inference-time scaling for LLMs drastically improves the model's abilities in many respects, but what about diffusion models?
In our latest study—Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps—we reframe inference-time scaling as a search problem…
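
Since the thread frames inference-time scaling as search, here is a minimal sketch of one such search: verifier-scored best-of-N over initial noise. The `denoise` and `verifier` callables and the latent shape are placeholders for illustration, not the paper's actual setup.

```python
import torch

def search_over_noise(denoise, verifier, num_candidates=8, latent_shape=(4, 64, 64)):
    """Best-of-N search over initial noise: run the full denoising chain from
    several random starting points and keep the sample the verifier likes best.
    `denoise` and `verifier` are placeholder callables, not the paper's API."""
    best_sample, best_score = None, float("-inf")
    for _ in range(num_candidates):
        noise = torch.randn(1, *latent_shape)   # candidate starting noise
        sample = denoise(noise)                 # full reverse-diffusion trajectory
        score = verifier(sample)                # e.g. a reward / preference model
        if score > best_score:
            best_sample, best_score = sample, score
    return best_sample, best_score
```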
Saining Xie (@sainingxie)'s Twitter Profile Photo

When I first saw diffusion models, I was blown away by how naturally they scale during inference: you train them with fixed flops, but during test time, you can ramp it up by like 1,000x. This was way before it became a big deal with o1. But honestly, the scaling isn’t that…

Zaid Khan (@codezakh)'s Twitter Profile Photo

✨ Introducing MutaGReP (Mutation-guided Grounded Repository Plan Search) - an approach that uses LLM-guided tree search to find realizable plans that are grounded in a target codebase without executing any code! Ever wanted to provide an entire repo containing 100s of 1000s of…
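
As a rough illustration of what LLM-guided tree search over repo-grounded plans can look like, the sketch below assumes hypothetical `mutate`, `ground`, and `score` callables; it is a generic best-first search, not the MutaGReP implementation.

```python
import heapq

def plan_tree_search(user_intent, mutate, ground, score, budget=50):
    # A sketch of LLM-guided tree search over repo-grounded plans.
    # mutate(plan) -> child plans proposed by an LLM
    # ground(plan) -> plan with each step mapped to symbols in the target repo
    # score(plan)  -> how realizable / on-intent the grounded plan is
    best = ground([user_intent])
    best_score = score(best)
    frontier = [(-best_score, 0, best)]          # max-heap via negated scores
    tie = 1                                      # tie-breaker so plans are never compared
    for _ in range(budget):
        if not frontier:
            break
        neg, _, plan = heapq.heappop(frontier)
        if -neg > best_score:
            best, best_score = plan, -neg
        for child in mutate(plan):               # mutation step: no code is executed
            grounded = ground(child)
            heapq.heappush(frontier, (-score(grounded), tie, grounded))
            tie += 1
    return best
```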

Baifeng (@baifeng_shi)'s Twitter Profile Photo

Next-gen vision pre-trained models shouldn’t be short-sighted.

Humans can easily perceive 10K x 10K resolution. But today’s top vision models—like SigLIP and DINOv2—are still pre-trained at merely hundreds by hundreds of pixels, bottlenecking their real-world usage.

Today, we…
David Fan (@davidjfan)'s Twitter Profile Photo

Can visual SSL match CLIP on VQA?

Yes! We show with controlled experiments that visual SSL can be competitive even on OCR/Chart VQA, as demonstrated by our new Web-SSL model family (1B-7B params) which is trained purely on web images – without any language supervision.
Peter Tong (@tongpetersb)'s Twitter Profile Photo

Vision models have been smaller than language models; what if we scale them up?

Introducing Web-SSL: A family of billion-scale SSL vision models (up to 7B parameters) trained on billions of images without language supervision, using VQA to evaluate the learned representation.
Mihir Prabhudesai (@mihirp98)'s Twitter Profile Photo

1/ Happy to share UniDisc - Unified Multimodal Discrete Diffusion – We train a 1.5 billion parameter transformer model from scratch on 250 million image/caption pairs using a **discrete diffusion objective**. Our model has all the benefits of diffusion models but now in…
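
For readers unfamiliar with the term, the snippet below is a generic masked discrete-diffusion training step; it illustrates the objective family and is not UniDisc's actual loss. `model`, `tokens`, and `mask_id` are placeholders.

```python
import torch
import torch.nn.functional as F

def masked_discrete_diffusion_loss(model, tokens, mask_id):
    """Generic masked discrete-diffusion training step (an illustration of the
    objective family, not UniDisc's actual loss). `tokens` is a (batch, seq)
    LongTensor of image+caption token ids; `model(noisy)` returns logits."""
    b, n = tokens.shape
    t = torch.rand(b, 1)                             # per-sequence corruption level in (0, 1)
    corrupt = torch.rand(b, n) < t                   # positions to mask at this level
    noisy = torch.where(corrupt, torch.full_like(tokens, mask_id), tokens)
    logits = model(noisy)                            # (batch, seq, vocab)
    return F.cross_entropy(logits[corrupt], tokens[corrupt])  # recover masked tokens
```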

Xichen Pan (@xichen_pan)'s Twitter Profile Photo

We find training unified multimodal understanding and generation models is so easy that you do not need to tune MLLMs at all.
The MLLM's knowledge/reasoning/in-context learning can be transferred from multimodal understanding (text output) to generation (pixel output) even when it is FROZEN!
Alex Li (@alexlioralexli)'s Twitter Profile Photo

Excited to be presenting at #ICLR2025 at 10am today on how generative classifiers are much more robust to distribution shift. Come by to chat and say hello!
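
For context on the term "generative classifier": the idea is to classify with Bayes' rule, picking the label whose class-conditional generative model best explains the input. A minimal sketch with placeholder callables, not the paper's code:

```python
import torch

def generative_classify(log_px_given_y, x, class_ids, log_prior=None):
    """Classify by Bayes' rule: argmax_y log p(x|y) + log p(y).
    `log_px_given_y(x, y)` is a placeholder for a class-conditional
    generative model's log-likelihood (or a lower bound on it)."""
    scores = torch.stack([log_px_given_y(x, y) for y in class_ids])  # (num_classes,)
    if log_prior is not None:
        scores = scores + log_prior                                  # default: uniform prior
    return class_ids[int(torch.argmax(scores))]
```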
Matt Deitke (@mattdeitke)'s Twitter Profile Photo

I’m very excited to introduce Vy, the AI that sees and acts on your computer! It’s a first glimpse of what we’ve been working on at Vercept! Early computers trapped the world's best experts in low-level tasks–loading code, managing memory, fighting errors. Progress…

Rob Fergus (@rob_fergus)'s Twitter Profile Photo

1/ Excited to share that I’m taking on the role of leading Fundamental AI Research (FAIR) at Meta. Huge thanks to Joelle for everything. Look forward to working closely again with Yann & team.

Matt Deitke (@mattdeitke)'s Twitter Profile Photo

Molmo won the Best Paper Honorable Mention award #CVPR2025!

This work was a long journey over 1.5 years, from failing to get strong performance with massive-scale, low-quality data to focusing on modest-scale, extremely high-quality data! Proud to see what it became.

#CVPR2025
Mihir Prabhudesai (@mihirp98)'s Twitter Profile Photo

1/ Maximizing confidence indeed improves reasoning. We worked with Shashwat Goel, Nikhil Chandak, and Ameya P. for the past 3 weeks (over a Zoom call and many emails!) and revised our evaluations to align with their suggested prompts/parsers/sampling params. This includes changing…
Alexi Gladstone (@alexiglad)'s Twitter Profile Photo

How can we unlock generalized reasoning?

⚡️Introducing Energy-Based Transformers (EBTs), an approach that out-scales (feed-forward) transformers and unlocks generalized reasoning/thinking on any modality/problem without rewards.
TLDR:
- EBTs are the first model to outscale the…
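
The gist of energy-based inference, as I understand it, is that "thinking" becomes iterative energy minimization over the prediction rather than a single forward pass. A minimal sketch under that assumption, with a placeholder `energy` callable rather than the released EBT code:

```python
import torch

def energy_based_predict(energy, context, y_init, steps=8, lr=0.5):
    """Inference as energy minimization: refine a candidate prediction y by
    gradient descent on a learned energy E(context, y). More steps = more
    "thinking" compute at test time. `energy` is a placeholder callable."""
    y = y_init.clone().requires_grad_(True)
    for _ in range(steps):
        e = energy(context, y).sum()                 # scalar energy of current candidate
        (grad,) = torch.autograd.grad(e, y)
        y = (y - lr * grad).detach().requires_grad_(True)
    return y.detach()
```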
Shivam Duggal (@shivamduggal4)'s Twitter Profile Photo

Compression is the heart of intelligence
From Occam to Kolmogorov—shorter programs = smarter representations

Meet KARL: Kolmogorov-Approximating Representation Learning.

Given an image, token budget T & target quality 𝜖—KARL finds the smallest t ≤ T to reconstruct it within 𝜖 🧵
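
The stated objective (the smallest t ≤ T whose reconstruction lands within 𝜖) can be spelled out with a naive scan over token budgets. The callables below are hypothetical and this is not how KARL itself computes t; it only illustrates the search target.

```python
def smallest_token_budget(encode, decode, distortion, image, T, eps):
    """Naive scan that spells out the search target: the smallest t <= T whose
    reconstruction is within eps. `encode`, `decode`, and `distortion` are
    placeholder callables; KARL itself does not have to search this way."""
    for t in range(1, T + 1):
        tokens = encode(image, num_tokens=t)         # encode under a t-token budget
        recon = decode(tokens)
        if distortion(image, recon) <= eps:          # e.g. MSE or a perceptual metric
            return t, recon
    return T, decode(encode(image, num_tokens=T))    # fall back to the full budget
```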
Lili (@lchen915)'s Twitter Profile Photo

Self-Questioning Language Models: LLMs that learn to generate their own questions and answers via asymmetric self-play RL.

There is no external training data – the only input is a single prompt specifying the topic.
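
A rough sketch of what one asymmetric self-play round could look like. The reward scheme here (self-consistency for the solver, a mid-difficulty preference for the proposer) is an illustrative assumption, not necessarily the paper's design.

```python
def self_play_round(proposer, solver, topic_prompt, num_questions=4, num_attempts=4):
    """One asymmetric self-play round: the proposer writes questions about the
    topic, the solver attempts each one several times, and both sides collect
    rewards for a subsequent RL update. The reward scheme below is an assumption."""
    experience = []
    for _ in range(num_questions):
        question = proposer.generate(topic_prompt)
        answers = [solver.generate(question) for _ in range(num_attempts)]
        majority = max(set(answers), key=answers.count)      # proxy label: majority answer
        solver_rewards = [float(a == majority) for a in answers]
        agreement = sum(solver_rewards) / num_attempts
        proposer_reward = 1.0 - 2.0 * abs(agreement - 0.5)   # reward questions that split the solver
        experience.append((question, answers, proposer_reward, solver_rewards))
    return experience                                        # consumed by policy-gradient updates
```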
Peter Tong (@tongpetersb)'s Twitter Profile Photo

Want to add that even with language-assisted visual evaluations, we're seeing encouraging progress in vision-centric benchmarks like CV-Bench (arxiv.org/abs/2406.16860) and Blink (arxiv.org/abs/2404.12390), which repurpose core vision tasks into VQA format. These benchmarks do help…