Walter Hugo Lopez Pinaya 🍍 (@warvito) 's Twitter Profile
Walter Hugo Lopez Pinaya 🍍

@warvito

Senior Research Engineer @synthesiaIO | Ex-Research Fellow @KingsCollegeLon
Text-to-Video | Generative Models | Medical Imaging

ID: 81100667

Joined: 09-10-2009 12:49:37

2.2K Tweets

987 Followers

544 Following

Saining Xie (@sainingxie) 's Twitter Profile Photo

I used to think diffusion models struggled to denoise efficiently in high-dimensional spaces -- but I was wrong again.

since RAE latent spaces are inherently high-dimensional, diffusion transformers require adaptation, but with just three simple tweaks, they perform *remarkably*
Abdullah Hamdi (@eng_hemdi) 's Twitter Profile Photo

If you are attending #ICCV2025 this week please check our 3 main conference papers and 1 oral paper at the workshops covering topics on spatial intelligence and medical imaging 

1- UKBOB: the biggest 3D MRI segmentation dataset of over 1 billion labeled masks + SOTA foundation
Cai Zhou (@zhuci19) 's Twitter Profile Photo

(1/6) Check out our new paper: Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model A Latent Reasoner! arxiv: arxiv.org/abs/2510.03206 Do diffusion language models (DLMs) need to be discrete? No! We show that continuous diffusion models are more

Andrej Karpathy (@karpathy) 's Twitter Profile Photo

Nice, short post illustrating how simple text (discrete) diffusion can be. Diffusion (i.e. parallel, iterated denoising, top) is the pervasive generative paradigm in image/video, but autoregression (i.e. go left to right bottom) is the dominant paradigm in text. For audio I've
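The contrast in the tweet above — parallel, iterated denoising versus left-to-right generation — can be sketched with a toy discrete text diffusion loop. This is a hand-rolled illustration, not the post's code: the "denoiser" here is a random stand-in for a learned model, and the commit schedule is made up for the demo.

```python
import random

random.seed(0)
VOCAB = ["the", "cat", "sat", "on", "mat"]
MASK = "[MASK]"

def toy_denoiser(tokens):
    # Stand-in for a learned model: proposes a token for every masked slot.
    return [random.choice(VOCAB) if t == MASK else t for t in tokens]

def diffusion_decode(length=8, steps=4):
    # Start from all-mask "noise" and commit a fraction of positions per step,
    # in parallel across the sequence -- unlike left-to-right autoregression.
    tokens = [MASK] * length
    for step in range(steps):
        proposal = toy_denoiser(tokens)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        k = max(1, len(masked) // (steps - step))  # toy commit schedule
        for i in random.sample(masked, k):
            tokens[i] = proposal[i]
    # Commit anything still masked on the way out.
    return [t if t != MASK else p for t, p in zip(tokens, toy_denoiser(tokens))]

print(diffusion_decode())
```

An autoregressive decoder would instead fill position 0, then 1, then 2; here every step touches positions anywhere in the sequence.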

Xintao Wang (@xinntao) 's Twitter Profile Photo

🥳🥳DiT w/o VAE, but with Semantic Encoder, such as DINO!
We introduce SVG (Self-supervised representation for Visual Generation).
Paper: huggingface.co/papers/2510.15…
Code: github.com/shiml20/SVG
Kwang Moo Yi (@kwangmoo_yi) 's Twitter Profile Photo

Choudhury and Kim et al., "Accelerating Vision Transformers With Adaptive Patch Sizes" Transformer patches don't need to be of uniform size -- choose sizes based on entropy --> faster training/inference. Are scale-spaces gonna make a comeback?
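The "choose sizes based on entropy" idea can be sketched as follows — flat, low-entropy regions keep a large patch, busy regions get subdivided. This is a minimal illustration of the principle, not the paper's actual criterion or tokenizer; the histogram entropy and threshold are assumptions for the demo.

```python
import numpy as np

def patch_entropy(patch, bins=16):
    # Shannon entropy (bits) of the patch's intensity histogram.
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def assign_patch_sizes(img, coarse=16, threshold=2.0):
    # Tile the image into coarse patches; low-entropy (flat) regions keep the
    # big patch, high-entropy (textured) regions are split into smaller ones,
    # so fewer tokens are spent where there is little information.
    H, W = img.shape
    sizes = {}
    for y in range(0, H, coarse):
        for x in range(0, W, coarse):
            e = patch_entropy(img[y:y + coarse, x:x + coarse])
            sizes[(y, x)] = coarse if e < threshold else coarse // 2
    return sizes

rng = np.random.default_rng(0)
img = np.zeros((32, 32))
img[16:, 16:] = rng.random((16, 16))   # one textured quadrant
sizes = assign_patch_sizes(img)
```

The three flat quadrants stay at 16x16 while the textured quadrant drops to 8x8 — fewer tokens overall, hence the speedup.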

Vaibhav (VB) Srivastav (@reach_vb) 's Twitter Profile Photo

Chinese doordash dropping MIT license foundation video models??? “We introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across Text-to-Video, Image-to-Video, and Video-Continuation generation tasks.”

Meituan LongCat (@meituan_longcat) 's Twitter Profile Photo

🚀 LongCat-Video Now Open-Source: Text/Image-to-Video + Video Continuation in One Model 🏆 Text/Image-to-Video Performance Hits Open-Source SOTA 🎬 Minutes-Long High-Quality Videos: No Color Drift/Quality Loss (Industry-Standout) ⚙ 13.6B Params | Strong Open-Source DiT-Based

fly51fly (@fly51fly) 's Twitter Profile Photo

[CV] Accelerating Vision Transformers with Adaptive Patch Sizes
R Choudhury, J Kim, J Park, E Yang... [CMU & KAIST] (2025)
arxiv.org/abs/2510.18091
DailyPapers (@huggingpapers) 's Twitter Profile Photo

A new Latent Diffusion Model without VAE from Kuaishou Technology is here!

Introducing SVG: it ditches the VAE for self-supervised representations, enabling 62x faster training & 35x faster inference, all while boosting generative quality.
Chieh-Hsin (Jesse) Lai (@jcjesselai) 's Twitter Profile Photo

Tired of going back to the original papers again and again? Our monograph: a systematic, fundamental recipe you can rely on!

📘 We’re excited to release 《The Principles of Diffusion Models》— with Yang Song, Dongjun Kim, Yuki Mitsufuji, and Stefano Ermon.

It traces the core
ℏεsam (@hesamation) 's Twitter Profile Photo

holy shit... Hugging Face cooked again! 🔥

they just dropped a free blog (BOOK) that covers the no-bs reality of building SOTA models. i haven't seen any lab/researcher go into the real decisions behind LLM research and its nuances like this. this is literally a gem.

Syllabus:
→
ModelScope (@maasai42) 's Twitter Profile Photo

🚀 Training 64K+ context LLMs on consumer GPUs? Now possible with Ulysses + Ring Attention!

We’ve fused two sequence parallelism techniques in ModelScope SWIFT:

✅ Ulysses: Low-comm, head-split (but limited by # of attention heads)
✅ Ring Attention: Scales beyond head count
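The trade-off in the two bullets above comes down to which axis gets sharded. A shape-bookkeeping sketch (no actual communication — just the per-device tensor shapes, with toy dimensions assumed for a 64K-context model):

```python
SEQ, HEADS, DIM = 65536, 32, 128  # toy 64K-context attention shapes

def ulysses_shard(devices):
    # Ulysses: an all-to-all swaps sequence sharding for head sharding, so each
    # device attends over the FULL sequence for HEADS // devices heads.
    # Parallelism degree is therefore capped by the head count.
    assert HEADS % devices == 0, "Ulysses degree limited by # attention heads"
    return (SEQ, HEADS // devices, DIM)

def ring_shard(devices):
    # Ring Attention: shard the sequence itself and circulate KV blocks around
    # a ring, so the degree can exceed the number of heads (at higher comm cost).
    return (SEQ // devices, HEADS, DIM)

print(ulysses_shard(32))   # per-device shape at the head-count ceiling
print(ring_shard(128))     # sequence sharding scales past 32 heads
```

Fusing the two lets Ulysses cover the low-communication regime up to the head count, with Ring Attention extending parallelism beyond it.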
Rohan Paul (@rohanpaul_ai) 's Twitter Profile Photo

TabTune makes tabular AI models easy to try, compare, and trust. 

It hides messy prep and gives 1 simple fit, predict, evaluate flow.

Work on tables is messy because every model wants different preprocessing, training modes, and metrics.

This paper's technique supports 7
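The "1 simple fit, predict, evaluate flow" can be sketched as a thin wrapper that hides per-model preprocessing behind one interface. Note: class and method names below are illustrative guesses at the shape of such an API, not TabTune's actual interface; the toy majority-class model exists only to make the sketch runnable.

```python
class TabularRunner:
    # Hypothetical unified wrapper: one fit/predict/evaluate flow regardless of
    # which model or preprocessing pipeline sits underneath.
    def __init__(self, model, preprocess):
        self.model, self.preprocess = model, preprocess

    def fit(self, X, y):
        self.model.fit(self.preprocess(X), y)
        return self

    def predict(self, X):
        return self.model.predict(self.preprocess(X))

    def evaluate(self, X, y):
        preds = self.predict(X)
        return sum(p == t for p, t in zip(preds, y)) / len(y)  # accuracy

class MajorityClass:
    # Trivial baseline model so the sketch runs end to end.
    def fit(self, X, y):
        self.label = max(set(y), key=list(y).count)
    def predict(self, X):
        return [self.label] * len(X)

runner = TabularRunner(MajorityClass(), preprocess=lambda X: X)
runner.fit([[0], [1], [2]], [1, 1, 0])
acc = runner.evaluate([[3], [4]], [1, 0])
```

Swapping in a different model or preprocessing function leaves the calling code unchanged — which is the point of the unified flow.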
Leon Klein (@leonklein26) 's Twitter Profile Photo

(1/n) Can diffusion models simulate molecular dynamics instead of generating independent samples? In our NeurIPS2025 paper, we train energy-based diffusion models that can do both: - Generate independent samples - Learn the underlying potential 𝑼 🧵👇 arxiv.org/abs/2506.17139
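The link between "learn the potential 𝑼" and "simulate dynamics" is that the score is -∇U, so a learned energy can drive Langevin dynamics. A toy sketch with a known quadratic U standing in for the learned network (this is plain unadjusted Langevin sampling, not the paper's method):

```python
import numpy as np

def grad_U(x):
    # Gradient of the toy potential U(x) = x^2 / 2; an energy-based diffusion
    # model would learn U with a network and differentiate it for the score.
    return x

def langevin_sample(n=5000, steps=500, eps=0.05, seed=0):
    # Unadjusted Langevin dynamics: x <- x - eps * grad U + sqrt(2 eps) * noise.
    # Its stationary distribution is approximately p(x) ∝ exp(-U(x)).
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n) * 3.0       # start far from equilibrium
    for _ in range(steps):
        x = x - eps * grad_U(x) + np.sqrt(2 * eps) * rng.standard_normal(n)
    return x

samples = langevin_sample()
print(samples.mean(), samples.std())  # ≈ 0 and ≈ 1 for this standard Gaussian
```

Having U itself (not just the score) is what allows both uses in the tweet: independent sampling and physically meaningful dynamics under the same potential.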

Sean McLeish (@seanmcleish) 's Twitter Profile Photo

Looped latent reasoning models like TRM, HRM, Ouro and Huginn are great for reasoning, but they’re inefficient to train at larger scales.

We fix this by post training regular language models into looped models, achieving higher accuracy on a per training FLOP basis.
📜1/7
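The core mechanic of a looped model — reusing one block many times instead of stacking distinct layers — can be sketched in a few lines. The residual block below is a numpy stand-in for a trained transformer block, assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) * 0.1  # stand-in for trained block weights

def block(h):
    # One shared-weight "block": a residual update, as in weight-tied models.
    return h + np.tanh(h @ W)

def looped_forward(x, loops):
    # A looped model applies the SAME block `loops` times: more compute (and,
    # per the tweet, better reasoning) at a fixed parameter count.
    h = x
    for _ in range(loops):
        h = block(h)
    return h

x = rng.standard_normal((1, 8))
shallow, deep = looped_forward(x, 1), looped_forward(x, 4)
```

Post-training a regular model into this form means initializing `block` from existing pretrained layers rather than training the loop from scratch.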
Jacob Bamberger (@jacobbamberger) 's Twitter Profile Photo

Flow Matching models often struggle to balance memorization and generalization. 😱
We set out to fix this — by using the geometry of the data manifold. 

Introducing Carré du Champ Flow Matching (CDCFM)🧑‍🎨🥖 — improving generalization without sacrificing sample quality.
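For context, the plain flow-matching baseline that CDCFM builds on can be sketched as below: with a linear interpolant between noise x0 and data x1, the regression target for the velocity field at x_t is simply x1 - x0. The geometric (data-manifold) correction that is CDCFM's contribution is not shown here; the closed-form predictor is an illustrative stand-in for a neural net.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_pair(x0, x1, t):
    # Linear (rectified-flow style) interpolant and its velocity target.
    x_t = (1 - t) * x0 + t * x1
    return x_t, x1 - x0

x0 = rng.standard_normal((4, 2))   # noise samples
x1 = np.ones((4, 2))               # toy "data" samples
t = rng.random((4, 1))             # one time per sample
x_t, v_target = flow_matching_pair(x0, x1, t)

def model_v(x_t, t):
    # Closed-form stand-in that happens to be exact for this toy data;
    # a real model is a trained network regressed onto v_target.
    return (np.ones_like(x_t) - x_t) / np.maximum(1 - t, 1e-6)

loss = np.mean((model_v(x_t, t) - v_target) ** 2)  # flow-matching MSE
```

Training minimizes exactly this MSE over random (x0, x1, t) triples; sampling then integrates the learned velocity field from noise to data.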
Niels Rogge (@nielsrogge) 's Twitter Profile Photo

This is a phenomenal video by Jia-Bin Huang explaining seminal papers in computer vision, including CLIP, SimCLR, DINO v1/v2/v3 in 15 minutes

DINO is actually a brilliant idea; I found the decision to use 65k neurons in the output head pretty interesting