Shivam Duggal (@shivamduggal4) 's Twitter Profile
Shivam Duggal

@shivamduggal4

PhD Student @MIT |
Prev: Carnegie Mellon University @SCSatCMU | Research Scientist @UberATG

ID: 880123482947723264

Link: http://shivamduggal4.github.io
Joined: 28-06-2017 17:59:25

116 Tweets

916 Followers

408 Following

Phillip Isola (@phillip_isola) 's Twitter Profile Photo

Our new work on adaptive image tokenization: Image —> T tokens
* variable T, based on image complexity
* single forward pass both infers T and tokenizes to T tokens
* approximates minimum description length encoding of the image
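A minimal PyTorch sketch of the one-pass idea (illustrative only, not the paper's architecture; the class and `halt_head` names are my own): a fixed budget of candidate tokens is produced, a halting head scores each slot, and the image's token count T is wherever the cumulative halt probability crosses a threshold.

```python
# Toy sketch (not the paper's model): one forward pass produces a fixed budget of
# candidate tokens plus per-slot "halt" scores; the first slot whose cumulative
# halt probability crosses a threshold determines T, the number of tokens kept.
import torch
import torch.nn as nn

class AdaptiveTokenizer(nn.Module):  # hypothetical name
    def __init__(self, dim=256, max_tokens=64, patch=16):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.queries = nn.Parameter(torch.randn(max_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.halt_head = nn.Linear(dim, 1)  # scores "stop here" per token slot

    def forward(self, images, threshold=0.5):
        patches = self.embed(images).flatten(2).transpose(1, 2)   # (B, N, dim)
        q = self.queries.unsqueeze(0).expand(images.size(0), -1, -1)
        tokens, _ = self.attn(q, patches, patches)                # (B, max_tokens, dim)
        halt = torch.sigmoid(self.halt_head(tokens)).squeeze(-1)  # (B, max_tokens)
        # More complex images should push the halting point later, so they keep more tokens.
        T = (halt.cumsum(dim=1) < threshold).sum(dim=1).clamp(min=1)
        return tokens, T  # caller keeps tokens[i, :T[i]] per image

images = torch.randn(2, 3, 128, 128)
tokens, T = AdaptiveTokenizer()(images)
print(tokens.shape, T)  # full candidate buffer plus per-image token counts
```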

Shivam Duggal (@shivamduggal4) 's Twitter Profile Photo

Indeed! I find H-Net to be closely related to KARL — and even our earlier work ALIT (the recurrent tokenizer in the figure below) shares strong connections. Loved reading H-Net, like all Albert Gu’s work. Congrats to Sukjun (June) Hwang and team!

Shivam Duggal (@shivamduggal4) 's Twitter Profile Photo

Great work from great people, Mihir Prabhudesai and Deepak Pathak! AR aligns w/ compression theory (KC, MDL, arithmetic coding), but diffusion is MLE too. Can we interpret diffusion similarly? Curious how compression explains AR vs. diffusion scaling laws. (Ilya’s talk touches on this too.)
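For readers new to the AR/compression link mentioned here: an autoregressive model's negative log2-likelihood is, up to a constant, the number of bits an arithmetic coder driven by that model would spend. A toy illustration with a made-up predictor:

```python
# Toy illustration of the AR <-> compression link: the bits an arithmetic coder
# spends on a sequence equal the model's negative log2-likelihood (plus O(1)).
import math

def bits_to_encode(sequence, predict_next):
    """predict_next(prefix) -> dict of symbol -> probability (must sum to 1)."""
    total_bits = 0.0
    for i, symbol in enumerate(sequence):
        p = predict_next(sequence[:i])[symbol]
        total_bits += -math.log2(p)  # arithmetic coding cost of this symbol
    return total_bits

# Hypothetical predictor over a 2-symbol alphabet.
def predict_next(prefix):
    if prefix and prefix[-1] == "a":
        return {"a": 0.9, "b": 0.1}   # "a" tends to repeat
    return {"a": 0.5, "b": 0.5}

seq = list("aaaabaaaa")
print(f"{bits_to_encode(seq, predict_next):.2f} bits vs "
      f"{len(seq)} bits for a uniform coder")
# A better next-token predictor => shorter code; maximizing likelihood is
# literally minimizing compressed length (the MDL view of AR training).
```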

Shivam Duggal (@shivamduggal4) 's Twitter Profile Photo

For NeurIPS Conference, we can't update the main PDF or upload a separate rebuttal PDF — so no way to include any new images or visual results? What if reviewers ask for more vision experiments? 🥲 Any suggestions or workarounds?

Shivam Duggal (@shivamduggal4) 's Twitter Profile Photo

One "Skild brain" powers all embodiments—amazing work! Huge congratulations to the entire team. Excited to see what’s next. Miss you all <3 !

Mihir Prabhudesai (@mihirp98) 's Twitter Profile Photo

We ran more experiments to better understand “why” diffusion models do better in data-constrained settings than autoregressive. Our findings support the hypothesis that diffusion models benefit from learning over multiple token orderings, which contributes to their robustness and

Shivam Duggal (@shivamduggal4) 's Twitter Profile Photo

Strongest compressors might not be the best decoders for your task. RL can adapt pre-trained models into more "sophisticated" decoders, tuned to the task’s specific demands. Exciting thread & research! Question: is next-token prediction really the final chapter in pretraining?

Shivam Duggal (@shivamduggal4) 's Twitter Profile Photo

Talking about KARL today — our recent work on a Kolmogorov Complexity–inspired adaptive tokenizer. Details about the paper here: x.com/ShivamDuggal4/… More broadly, quite excited about representation learning — and understanding large models — through the lens of compression.

Shivam Duggal (@shivamduggal4) 's Twitter Profile Photo

Enjoying GPT-5 a lot! Research Q: maybe intelligence is discovering the simplest algorithm that generalizes (5→N digit addition). GPT-5 may be close for +/*, but what enables RL on top of (constrained) next-token pretraining to discover the least-KC algorithm for all tasks? Thoughts?
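To make the "5→N digit +" point concrete: the carry-based routine below is a constant-size description that extrapolates to any number of digits, which is the low-Kolmogorov-complexity sense of "simplest algorithm that generalizes"; a memorized table of 5-digit sums cannot do this. Purely illustrative:

```python
# The "simplest algorithm that generalizes" for addition: a constant-size
# program (digit-wise add with carry) that works for any number of digits,
# unlike a memorized table of 5-digit sums, whose description grows with N.
def add_digits(a, b):
    """a, b: lists of digits, least-significant first. Returns their sum."""
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        da = a[i] if i < len(a) else 0
        db = b[i] if i < len(b) else 0
        carry, digit = divmod(da + db + carry, 10)
        out.append(digit)
    if carry:
        out.append(carry)
    return out

# Whether tested on 5 digits or 500, the program is the same few lines.
x = [9] * 500          # a 500-digit number: 99...9
y = [1]                # plus 1
assert add_digits(x, y) == [0] * 500 + [1]   # carries ripple all the way
```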

Ken Liu (@kenziyuliu) 's Twitter Profile Photo

New paper! We explore a radical paradigm for AI evals: assessing LLMs on *unsolved* questions. Instead of contrived exams where progress ≠ value, we eval LLMs on organic, unsolved problems via reference-free LLM validation & community verification. LLMs solved ~10/500 so far:
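A hedged sketch of how such a pipeline could be wired up (all names and stubs below are hypothetical, not the paper's code): candidate answers are screened by a reference-free LLM validator, and only survivors go on to community verification.

```python
# Hypothetical sketch of the eval loop described in the thread: answers to
# unsolved questions are screened by a reference-free LLM validator, and only
# survivors are forwarded for community verification.
from dataclasses import dataclass, field

@dataclass
class UnsolvedQuestion:
    text: str
    candidate_answers: list = field(default_factory=list)

def reference_free_validate(question, answer, judge):
    """judge: callable(prompt) -> str. No gold answer exists, so the validator
    can only check internal consistency / verifiable sub-claims."""
    verdict = judge(
        f"Question (unsolved, no reference answer):\n{question}\n\n"
        f"Proposed answer:\n{answer}\n\n"
        "Reply PASS only if the answer is self-consistent and checkable."
    )
    return verdict.strip().upper().startswith("PASS")

def evaluate(questions, solver, judge):
    flagged = []
    for q in questions:
        answer = solver(q.text)
        if reference_free_validate(q.text, answer, judge):
            flagged.append((q, answer))   # goes on to community verification
    return flagged  # in the thread's numbers, ~10 of 500 survive end to end

# Usage with stub callables standing in for real LLM calls:
dummy_solver = lambda q: "Conjecture holds for n <= 10 by exhaustive check."
dummy_judge = lambda prompt: "PASS"
print(len(evaluate([UnsolvedQuestion("Toy open problem")], dummy_solver, dummy_judge)))
```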

Shivam Duggal (@shivamduggal4) 's Twitter Profile Photo

Amazing work, idan shenfeld Jyo Pari! SFT w/ per-token supervision is probably too constrained to map new/old data into a shared weight space. Wondering if adding continuous thinking tokens (so still no RL) before supervised prediction could relax this, while staying off-policy?
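One possible reading of "continuous thinking tokens before supervised prediction" (a toy sketch under my own assumptions, not a published recipe): K learned latent embeddings are inserted between prompt and answer and carry no direct supervision, so the cross-entropy is applied only to the answer tokens.

```python
# Sketch of the idea floated in the tweet: prepend K learned, continuous
# "thinking" embeddings that carry no supervision, then apply the usual SFT
# cross-entropy only on the answer tokens that follow them.
import torch
import torch.nn as nn

class ThinkThenPredict(nn.Module):  # hypothetical wrapper
    def __init__(self, vocab=1000, dim=128, n_think=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.think = nn.Parameter(torch.randn(n_think, dim) * 0.02)  # unsupervised slots
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, prompt_ids, target_ids):
        B = prompt_ids.size(0)
        x = torch.cat([
            self.embed(prompt_ids),
            self.think.unsqueeze(0).expand(B, -1, -1),   # free "thinking" capacity
            self.embed(target_ids),
        ], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=causal)
        # Supervise only the positions that predict the target tokens.
        logits = self.head(h[:, -target_ids.size(1) - 1:-1])
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))

model = ThinkThenPredict()
loss = model(torch.randint(0, 1000, (2, 16)), torch.randint(0, 1000, (2, 8)))
loss.backward()  # thinking slots receive gradients only through the answer loss
```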

jack morris (@jxmnop) 's Twitter Profile Photo

nearly everything in AI can be understood through the lens of compression
- the architecture is just schema for when & how to compress
- optimization is a compression *process*, with its own compression level and duration
- (architecture + data + optimization) = model
- in other

Jeremy Bernstein (@jxbz) 's Twitter Profile Photo

I wrote this blog post that tries to go further toward design principles for neural nets and optimizers. The post presents a visual intro to optimization on normed manifolds and a Muon variant for the manifold of matrices with unit condition number. x.com/thinkymachines…
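For context on the Muon reference: Muon's core step replaces the gradient matrix with an approximately orthogonal one (singular values pushed toward 1, i.e., unit condition number) via a Newton-Schulz iteration. The sketch below uses the classic cubic iteration rather than Muon's tuned quintic, so treat it as a simplified illustration; see the linked post for the manifold framing.

```python
# Sketch of the matrix step behind Muon-style updates: push the gradient's
# singular values toward 1 with a Newton-Schulz iteration, then step in that
# "orthogonalized" direction. Classic cubic iteration; production Muon uses a
# tuned quintic polynomial instead.
import torch

def orthogonalize(G, steps=10, eps=1e-7):
    X = G / (G.norm() + eps)              # singular values now lie in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X   # drives each singular value toward 1
    return X

def muon_like_step(W, G, lr=0.02):
    """One hedged, simplified update: W <- W - lr * orthogonalize(G)."""
    return W - lr * orthogonalize(G)

W = torch.randn(64, 32)
G = torch.randn(64, 32)               # pretend gradient
print(torch.linalg.svdvals(orthogonalize(G))[:4])   # values cluster near 1
W = muon_like_step(W, G)
```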

Yulu Gan (@yule_gan) 's Twitter Profile Photo

Reinforcement Learning (RL) has long been the dominant method for fine-tuning, powering many state-of-the-art LLMs. Methods like PPO and GRPO explore in action space. But can we instead explore directly in parameter space? YES we can. We propose a scalable framework for
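One classic instantiation of parameter-space exploration is evolution-strategies-style search: perturb the weights, score each perturbation with a reward, and update along the reward-weighted average of the perturbations. The sketch below shows only that generic pattern; it is not necessarily the framework the thread proposes.

```python
# Minimal parameter-space exploration (classic evolution-strategies style):
# sample weight perturbations, score each with a reward, and move along the
# reward-weighted average of the perturbations -- no action-space gradients.
import numpy as np

rng = np.random.default_rng(0)

def es_step(theta, reward_fn, pop_size=64, sigma=0.1, lr=0.05):
    noise = rng.standard_normal((pop_size, theta.size))
    rewards = np.array([reward_fn(theta + sigma * n) for n in noise])
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad_estimate = (advantages[:, None] * noise).mean(axis=0) / sigma
    return theta + lr * grad_estimate

# Toy "policy": reward is highest when the parameters hit a target vector.
target = np.array([1.0, -2.0, 0.5])
reward = lambda theta: -np.sum((theta - target) ** 2)

theta = np.zeros(3)
for _ in range(300):
    theta = es_step(theta, reward)
print(theta)   # approximately converges toward [1.0, -2.0, 0.5]
```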

Sophie Wang (@sophielwang) 's Twitter Profile Photo

LLMs, trained only on text, might already know more about other modalities than we realized; we just need to find ways to elicit it. Project page: sophielwang.com/sensory w/ Phillip Isola and Brian Cheung

Sharut Gupta (@sharut_gupta) 's Twitter Profile Photo

[1/7] Paired multimodal learning shows that training with text can help vision models learn better image representations. But can unpaired data do the same? Our new work shows that the answer is yes! w/ Shobhita Sundaram Chenyu (Monica) Wang, Stefanie Jegelka and Phillip Isola

Phillip Isola (@phillip_isola) 's Twitter Profile Photo

Over the past year, my lab has been working on fleshing out theory/applications of the Platonic Representation Hypothesis. Today I want to share two new works on this topic:
Eliciting higher alignment: arxiv.org/abs/2510.02425
Unpaired rep learning: arxiv.org/abs/2510.08492
1/9
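A common way to quantify the representational alignment this line of work studies is mutual k-nearest-neighbor overlap between two models' embeddings of the same inputs. The sketch below is a generic version of that metric, not code from either paper.

```python
# Generic cross-model alignment metric: for each sample, compare its k nearest
# neighbors in model A's embedding space with its k nearest neighbors in model
# B's, and average the overlap.
import numpy as np

def mutual_knn_alignment(feats_a, feats_b, k=10):
    """feats_*: (N, d) arrays of embeddings for the same N inputs."""
    def knn(feats):
        feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        sims = feats @ feats.T
        np.fill_diagonal(sims, -np.inf)          # exclude self
        return np.argsort(-sims, axis=1)[:, :k]  # indices of k nearest neighbors
    nn_a, nn_b = knn(feats_a), knn(feats_b)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlap))               # 1.0 = identical neighborhoods

# Toy check: a rotated copy of the same representation is perfectly aligned.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 32))
Q, _ = np.linalg.qr(rng.standard_normal((32, 32)))   # random rotation
print(mutual_knn_alignment(X, X @ Q))                          # ~1.0
print(mutual_knn_alignment(X, rng.standard_normal((200, 32)))) # near chance
```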

Saining Xie (@sainingxie) 's Twitter Profile Photo

three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
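A hedged sketch of the announced idea as read from this thread (the actual RAE recipe is in the paper): keep a frozen pretrained representation encoder, train only a decoder that maps those features back to pixels, and let the diffusion transformer denoise in the frozen feature space instead of a VAE latent. All class names below are illustrative.

```python
# Illustrative sketch, not the paper's code: frozen pretrained encoder,
# trainable pixel decoder, diffusion happens in the frozen feature space.
import torch
import torch.nn as nn

class DummyPatchEncoder(nn.Module):
    """Stand-in for a frozen pretrained ViT-style backbone."""
    def __init__(self, dim=768, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)   # (B, N, dim)

class RepresentationAutoencoder(nn.Module):  # illustrative name
    def __init__(self, encoder, feat_dim=768, patch=16, img=256):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)          # the representation is not retrained
        self.decoder = nn.Linear(feat_dim, patch * patch * 3)  # stand-in decoder
        self.patch, self.side = patch, img // patch

    def forward(self, images):
        with torch.no_grad():
            feats = self.encoder(images)     # (B, N, feat_dim), frozen
        pix = self.decoder(feats)            # (B, N, patch*patch*3)
        B = pix.size(0)
        pix = pix.view(B, self.side, self.side, 3, self.patch, self.patch)
        return pix.permute(0, 3, 1, 4, 2, 5).reshape(
            B, 3, self.side * self.patch, self.side * self.patch)

rae = RepresentationAutoencoder(DummyPatchEncoder())
recon = rae(torch.randn(2, 3, 256, 256))
print(recon.shape)   # torch.Size([2, 3, 256, 256])
# Training loss is plain reconstruction on the decoder; the diffusion model is
# then trained to denoise the frozen `feats`, not VAE latents.
```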

Nupur Kumari (@nupurkmr9) 's Twitter Profile Photo

🚀 New preprint! We present NP-Edit, a framework for training an image editing diffusion model without paired supervision. We use differentiable feedback from Vision-Language Models (VLMs) combined with distribution-matching loss (DMD) to learn editing directly. webpage:
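A hedged sketch of the training signal described here (every function below is a stand-in, not NP-Edit's code): the editor is trained without paired before/after images by combining a differentiable VLM score for instruction-following with a DMD-style distribution-matching loss that keeps outputs on the natural-image manifold.

```python
# Hypothetical sketch of the combined objective: (1) a differentiable VLM score
# for "does the output follow the edit instruction?" plus (2) a
# distribution-matching (DMD-style) loss on the edited image.
import torch

def npedit_like_loss(editor, vlm_score, dmd_loss, image, instruction,
                     w_vlm=1.0, w_dmd=1.0):
    edited = editor(image, instruction)               # differentiable generator
    loss_vlm = -vlm_score(edited, instruction)        # maximize VLM agreement
    loss_dmd = dmd_loss(edited)                       # match real-image distribution
    return w_vlm * loss_vlm + w_dmd * loss_dmd

# Stubs so the sketch runs end to end; real components would be a diffusion
# editor, a VLM with differentiable scoring, and a DMD critic pair.
editor = lambda img, instr: img + 0.1 * torch.tanh(img)
vlm_score = lambda img, instr: -img.pow(2).mean()     # pretend "alignment" score
dmd_loss = lambda img: (img.mean() - 0.0) ** 2        # pretend distribution match

image = torch.randn(1, 3, 64, 64, requires_grad=True)
loss = npedit_like_loss(editor, vlm_score, dmd_loss, image, "make it snowy")
loss.backward()
print(float(loss))
```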