Ellis Brown (@_ellisbrown)'s Twitter Profile
Ellis Brown

@_ellisbrown

CS PhD Student @NYU_Courant w/ Profs @sainingxie @rob_fergus |
Intern @ Ai2 | Prev: @CarnegieMellon, blackrock.com/corporate/ai, @VanderbiltU

ID: 4703177401

https://ellisbrown.github.io/ · Joined 03-01-2016 14:27:39

346 Tweets

599 Followers

630 Following

Willis (Nanye) Ma (@ma_nanye)'s Twitter Profile Photo

Inference-time scaling for LLMs drastically improves the model's abilities in many respects, but what about diffusion models?
In our latest study—Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps—we reframe inference-time scaling as a search problem…
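
Since the thread frames inference-time scaling as search, here is a minimal sketch of one such search: verifier-scored best-of-N over initial noise. The `denoise` and `verifier` callables and the latent shape are placeholders for illustration, not the paper's actual setup.

```python
import torch

def search_over_noise(denoise, verifier, num_candidates=8, latent_shape=(4, 64, 64)):
    """Best-of-N search over initial noise: run the full denoising chain from
    several random starting points and keep the sample the verifier likes best.
    `denoise` and `verifier` are placeholder callables, not the paper's API."""
    best_sample, best_score = None, float("-inf")
    for _ in range(num_candidates):
        noise = torch.randn(1, *latent_shape)   # candidate starting noise
        sample = denoise(noise)                 # full reverse-diffusion trajectory
        score = verifier(sample)                # e.g. a reward / preference model
        if score > best_score:
            best_sample, best_score = sample, score
    return best_sample, best_score
```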
Saining Xie (@sainingxie)'s Twitter Profile Photo

When I first saw diffusion models, I was blown away by how naturally they scale during inference: you train them with fixed flops, but during test time, you can ramp it up by like 1,000x. This was way before it became a big deal with o1. But honestly, the scaling isn’t that…

Zaid Khan (@codezakh)'s Twitter Profile Photo

✨ Introducing MutaGReP (Mutation-guided Grounded Repository Plan Search) - an approach that uses LLM-guided tree search to find realizable plans that are grounded in a target codebase without executing any code! Ever wanted to provide an entire repo containing 100s of 1000s of…
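
As a rough illustration of what LLM-guided tree search over repo-grounded plans can look like, the sketch below assumes hypothetical `mutate`, `ground`, and `score` callables; it is a generic best-first search, not the MutaGReP implementation.

```python
import heapq

def plan_tree_search(user_intent, mutate, ground, score, budget=50):
    # A sketch of LLM-guided tree search over repo-grounded plans.
    # mutate(plan) -> child plans proposed by an LLM
    # ground(plan) -> plan with each step mapped to symbols in the target repo
    # score(plan)  -> how realizable / on-intent the grounded plan is
    best = ground([user_intent])
    best_score = score(best)
    frontier = [(-best_score, 0, best)]          # max-heap via negated scores
    tie = 1                                      # tie-breaker so plans are never compared
    for _ in range(budget):
        if not frontier:
            break
        neg, _, plan = heapq.heappop(frontier)
        if -neg > best_score:
            best, best_score = plan, -neg
        for child in mutate(plan):               # mutation step: no code is executed
            grounded = ground(child)
            heapq.heappush(frontier, (-score(grounded), tie, grounded))
            tie += 1
    return best
```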

Baifeng (@baifeng_shi)'s Twitter Profile Photo

Next-gen vision pre-trained models shouldn’t be short-sighted.

Humans can easily perceive 10K x 10K resolution. But today’s top vision models—like SigLIP and DINOv2—are still pre-trained at merely hundreds by hundreds of pixels, bottlenecking their real-world usage.

Today, we…
David Fan (@davidjfan)'s Twitter Profile Photo

Can visual SSL match CLIP on VQA?

Yes! We show with controlled experiments that visual SSL can be competitive even on OCR/Chart VQA, as demonstrated by our new Web-SSL model family (1B-7B params) which is trained purely on web images – without any language supervision.
Peter Tong (@tongpetersb)'s Twitter Profile Photo

Vision models have been smaller than language models; what if we scale them up?

Introducing Web-SSL: A family of billion-scale SSL vision models (up to 7B parameters) trained on billions of images without language supervision, using VQA to evaluate the learned representation.
Mihir Prabhudesai (@mihirp98)'s Twitter Profile Photo

1/ Happy to share UniDisc - Unified Multimodal Discrete Diffusion – We train a 1.5 billion parameter transformer model from scratch on 250 million image/caption pairs using a **discrete diffusion objective**. Our model has all the benefits of diffusion models but now in…
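
For readers unfamiliar with the term, the snippet below is a generic masked discrete-diffusion training step; it illustrates the objective family and is not UniDisc's actual loss. `model`, `tokens`, and `mask_id` are placeholders.

```python
import torch
import torch.nn.functional as F

def masked_discrete_diffusion_loss(model, tokens, mask_id):
    """Generic masked discrete-diffusion training step (an illustration of the
    objective family, not UniDisc's actual loss). `tokens` is a (batch, seq)
    LongTensor of image+caption token ids; `model(noisy)` returns logits."""
    b, n = tokens.shape
    t = torch.rand(b, 1)                             # per-sequence corruption level in (0, 1)
    corrupt = torch.rand(b, n) < t                   # positions to mask at this level
    noisy = torch.where(corrupt, torch.full_like(tokens, mask_id), tokens)
    logits = model(noisy)                            # (batch, seq, vocab)
    return F.cross_entropy(logits[corrupt], tokens[corrupt])  # recover masked tokens
```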

Xichen Pan (@xichen_pan)'s Twitter Profile Photo

We find training unified multimodal understanding and generation models is so easy that you do not need to tune MLLMs at all.
The MLLM's knowledge/reasoning/in-context learning can be transferred from multimodal understanding (text output) to generation (pixel output) even when it is FROZEN!
Alex Li (@alexlioralexli)'s Twitter Profile Photo

Excited to be presenting at #ICLR2025 at 10am today on how generative classifiers are much more robust to distribution shift. Come by to chat and say hello!
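
For context on the term "generative classifier": the idea is to classify with Bayes' rule, picking the label whose class-conditional generative model best explains the input. A minimal sketch with placeholder callables, not the paper's code:

```python
import torch

def generative_classify(log_px_given_y, x, class_ids, log_prior=None):
    """Classify by Bayes' rule: argmax_y log p(x|y) + log p(y).
    `log_px_given_y(x, y)` is a placeholder for a class-conditional
    generative model's log-likelihood (or a lower bound on it)."""
    scores = torch.stack([log_px_given_y(x, y) for y in class_ids])  # (num_classes,)
    if log_prior is not None:
        scores = scores + log_prior                                  # default: uniform prior
    return class_ids[int(torch.argmax(scores))]
```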
Matt Deitke (@mattdeitke)'s Twitter Profile Photo

I’m very excited to introduce Vy, the AI that sees and acts on your computer! It’s a first glimpse of what we’ve been working on at Vercept! Early computers trapped the world's best experts in low-level tasks–loading code, managing memory, fighting errors. Progress…

Rob Fergus (@rob_fergus)'s Twitter Profile Photo

1/ Excited to share that I’m taking on the role of leading Fundamental AI Research (FAIR) at Meta. Huge thanks to Joelle for everything. Look forward to working closely again with Yann & team.

Matt Deitke (@mattdeitke)'s Twitter Profile Photo

Molmo won the Best Paper Honorable Mention award #CVPR2025!

This work was a long journey over 1.5 years, from failing to get strong performance with massive-scale, low-quality data to focusing on modest-scale, extremely high-quality data! Proud to see what it became.

#CVPR2025
Mihir Prabhudesai (@mihirp98)'s Twitter Profile Photo

1/ Maximizing confidence indeed improves reasoning. We worked with Shashwat Goel, Nikhil Chandak, and Ameya P. for the past 3 weeks (over a Zoom call and many emails!) and revised our evaluations to align with their suggested prompts/parsers/sampling params. This includes changing…
Alexi Gladstone (@alexiglad)'s Twitter Profile Photo

How can we unlock generalized reasoning?

⚡️Introducing Energy-Based Transformers (EBTs), an approach that out-scales (feed-forward) transformers and unlocks generalized reasoning/thinking on any modality/problem without rewards.
TLDR:
- EBTs are the first model to outscale the…
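
The gist of energy-based inference, as I understand it, is that "thinking" becomes iterative energy minimization over the prediction rather than a single forward pass. A minimal sketch under that assumption, with a placeholder `energy` callable rather than the released EBT code:

```python
import torch

def energy_based_predict(energy, context, y_init, steps=8, lr=0.5):
    """Inference as energy minimization: refine a candidate prediction y by
    gradient descent on a learned energy E(context, y). More steps = more
    "thinking" compute at test time. `energy` is a placeholder callable."""
    y = y_init.clone().requires_grad_(True)
    for _ in range(steps):
        e = energy(context, y).sum()                 # scalar energy of current candidate
        (grad,) = torch.autograd.grad(e, y)
        y = (y - lr * grad).detach().requires_grad_(True)
    return y.detach()
```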
Shivam Duggal (@shivamduggal4)'s Twitter Profile Photo

Compression is the heart of intelligence
From Occam to Kolmogorov—shorter programs = smarter representations

Meet KARL: Kolmogorov-Approximating Representation Learning.

Given an image, token budget T & target quality 𝜖—KARL finds the smallest t ≤ T to reconstruct it within 𝜖 🧵
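
The stated objective (the smallest t ≤ T whose reconstruction lands within 𝜖) can be spelled out with a naive scan over token budgets. The callables below are hypothetical and this is not how KARL itself computes t; it only illustrates the search target.

```python
def smallest_token_budget(encode, decode, distortion, image, T, eps):
    """Naive scan that spells out the search target: the smallest t <= T whose
    reconstruction is within eps. `encode`, `decode`, and `distortion` are
    placeholder callables; KARL itself does not have to search this way."""
    for t in range(1, T + 1):
        tokens = encode(image, num_tokens=t)         # encode under a t-token budget
        recon = decode(tokens)
        if distortion(image, recon) <= eps:          # e.g. MSE or a perceptual metric
            return t, recon
    return T, decode(encode(image, num_tokens=T))    # fall back to the full budget
```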
Lili (@lchen915)'s Twitter Profile Photo

Self-Questioning Language Models: LLMs that learn to generate their own questions and answers via asymmetric self-play RL.

There is no external training data – the only input is a single prompt specifying the topic.
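
A rough sketch of what one asymmetric self-play round could look like. The reward scheme here (self-consistency for the solver, a mid-difficulty preference for the proposer) is an illustrative assumption, not necessarily the paper's design.

```python
def self_play_round(proposer, solver, topic_prompt, num_questions=4, num_attempts=4):
    """One asymmetric self-play round: the proposer writes questions about the
    topic, the solver attempts each one several times, and both sides collect
    rewards for a subsequent RL update. The reward scheme below is an assumption."""
    experience = []
    for _ in range(num_questions):
        question = proposer.generate(topic_prompt)
        answers = [solver.generate(question) for _ in range(num_attempts)]
        majority = max(set(answers), key=answers.count)      # proxy label: majority answer
        solver_rewards = [float(a == majority) for a in answers]
        agreement = sum(solver_rewards) / num_attempts
        proposer_reward = 1.0 - 2.0 * abs(agreement - 0.5)   # reward questions that split the solver
        experience.append((question, answers, proposer_reward, solver_rewards))
    return experience                                        # consumed by policy-gradient updates
```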
Peter Tong (@tongpetersb)'s Twitter Profile Photo

Want to add that even with language-assisted visual evaluations, we're seeing encouraging progress in vision-centric benchmarks like CV-Bench (arxiv.org/abs/2406.16860) and Blink (arxiv.org/abs/2404.12390), which repurpose core vision tasks into VQA format. These benchmarks do help…