Shivam Duggal (@shivamduggal4) 's Twitter Profile
Shivam Duggal

@shivamduggal4

PhD Student @MIT |
Prev: Carnegie Mellon University @SCSatCMU | Research Scientist @UberATG

ID: 880123482947723264

Link: http://shivamduggal4.github.io
Joined: 28-06-2017 17:59:25

116 Tweets

916 Followers

408 Following

Phillip Isola (@phillip_isola) 's Twitter Profile Photo

Our new work on adaptive image tokenization: Image —> T tokens
* variable T, based on image complexity
* single forward pass both infers T and tokenizes to T tokens
* approximates minimum description length encoding of the image
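A minimal PyTorch sketch of the one-pass idea (illustrative only, not the paper's architecture; the class and `halt_head` names are my own): a fixed budget of candidate tokens is produced, a halting head scores each slot, and the image's token count T is wherever the cumulative halt probability crosses a threshold.

```python
# Toy sketch (not the paper's model): one forward pass produces a fixed budget of
# candidate tokens plus per-slot "halt" scores; the first slot whose cumulative
# halt probability crosses a threshold determines T, the number of tokens kept.
import torch
import torch.nn as nn

class AdaptiveTokenizer(nn.Module):  # hypothetical name
    def __init__(self, dim=256, max_tokens=64, patch=16):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.queries = nn.Parameter(torch.randn(max_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.halt_head = nn.Linear(dim, 1)  # scores "stop here" per token slot

    def forward(self, images, threshold=0.5):
        patches = self.embed(images).flatten(2).transpose(1, 2)   # (B, N, dim)
        q = self.queries.unsqueeze(0).expand(images.size(0), -1, -1)
        tokens, _ = self.attn(q, patches, patches)                # (B, max_tokens, dim)
        halt = torch.sigmoid(self.halt_head(tokens)).squeeze(-1)  # (B, max_tokens)
        # More complex images should push the halting point later, so they keep more tokens.
        T = (halt.cumsum(dim=1) < threshold).sum(dim=1).clamp(min=1)
        return tokens, T  # caller keeps tokens[i, :T[i]] per image

images = torch.randn(2, 3, 128, 128)
tokens, T = AdaptiveTokenizer()(images)
print(tokens.shape, T)  # full candidate buffer plus per-image token counts
```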

Shivam Duggal (@shivamduggal4) 's Twitter Profile Photo

Indeed! I find H-Net to be closely related to KARL — and even our earlier work ALIT (the recurrent tokenizer in the figure below) shares strong connections. Loved reading H-Net, like all Albert Gu’s work. Congrats to Sukjun (June) Hwang and team!

Shivam Duggal (@shivamduggal4) 's Twitter Profile Photo

Great work from great people, Mihir Prabhudesai and Deepak Pathak! AR aligns w/ compression theory (KC, MDL, arithmetic coding), but diffusion is MLE too. Can we interpret diffusion similarly? Curious how compression explains AR vs. diffusion scaling laws. (Ilya’s talk touches on this too.)
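For readers new to the AR/compression link mentioned here: an autoregressive model's negative log2-likelihood is, up to a constant, the number of bits an arithmetic coder driven by that model would spend. A toy illustration with a made-up predictor:

```python
# Toy illustration of the AR <-> compression link: the bits an arithmetic coder
# spends on a sequence equal the model's negative log2-likelihood (plus O(1)).
import math

def bits_to_encode(sequence, predict_next):
    """predict_next(prefix) -> dict of symbol -> probability (must sum to 1)."""
    total_bits = 0.0
    for i, symbol in enumerate(sequence):
        p = predict_next(sequence[:i])[symbol]
        total_bits += -math.log2(p)  # arithmetic coding cost of this symbol
    return total_bits

# Hypothetical predictor over a 2-symbol alphabet.
def predict_next(prefix):
    if prefix and prefix[-1] == "a":
        return {"a": 0.9, "b": 0.1}   # "a" tends to repeat
    return {"a": 0.5, "b": 0.5}

seq = list("aaaabaaaa")
print(f"{bits_to_encode(seq, predict_next):.2f} bits vs "
      f"{len(seq)} bits for a uniform coder")
# A better next-token predictor => shorter code; maximizing likelihood is
# literally minimizing compressed length (the MDL view of AR training).
```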

Shivam Duggal (@shivamduggal4) 's Twitter Profile Photo

For NeurIPS Conference, we can't update the main PDF or upload a separate rebuttal PDF — so no way to include any new images or visual results? What if reviewers ask for more vision experiments? 🥲 Any suggestions or workarounds?

Shivam Duggal (@shivamduggal4) 's Twitter Profile Photo

One "Skild brain" powers all embodiments—amazing work! Huge congratulations to the entire team. Excited to see what’s next. Miss you all <3 !

Mihir Prabhudesai (@mihirp98) 's Twitter Profile Photo

We ran more experiments to better understand “why” diffusion models do better in data-constrained settings than autoregressive. Our findings support the hypothesis that diffusion models benefit from learning over multiple token orderings, which contributes to their robustness and

Shivam Duggal (@shivamduggal4) 's Twitter Profile Photo

Strongest compressors might not be the best decoders for your task. RL can adapt pre-trained models into more "sophisticated" decoders, tuned to the task’s specific demands. Exciting thread & research! Question: is next-token prediction really the final chapter in pretraining?

Shivam Duggal (@shivamduggal4) 's Twitter Profile Photo

Talking about KARL today — our recent work on a Kolmogorov Complexity–inspired adaptive tokenizer. Details about the paper here: x.com/ShivamDuggal4/… More broadly, quite excited about representation learning — and understanding large models — through the lens of compression.

Shivam Duggal (@shivamduggal4) 's Twitter Profile Photo

Enjoying GPT-5 a lot! Research Q: maybe intelligence is discovering the simplest algorithm that generalizes (5→N digit addition). GPT-5 may be close for +/*, but what enables RL on top of (constrained) next-token pretraining to discover the least-KC algorithm for all tasks? Thoughts?
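To make the "5→N digit +" point concrete: the carry-based routine below is a constant-size description that extrapolates to any number of digits, which is the low-Kolmogorov-complexity sense of "simplest algorithm that generalizes"; a memorized table of 5-digit sums cannot do this. Purely illustrative:

```python
# The "simplest algorithm that generalizes" for addition: a constant-size
# program (digit-wise add with carry) that works for any number of digits,
# unlike a memorized table of 5-digit sums, whose description grows with N.
def add_digits(a, b):
    """a, b: lists of digits, least-significant first. Returns their sum."""
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        da = a[i] if i < len(a) else 0
        db = b[i] if i < len(b) else 0
        carry, digit = divmod(da + db + carry, 10)
        out.append(digit)
    if carry:
        out.append(carry)
    return out

# Whether tested on 5 digits or 500, the program is the same few lines.
x = [9] * 500          # a 500-digit number: 99...9
y = [1]                # plus 1
assert add_digits(x, y) == [0] * 500 + [1]   # carries ripple all the way
```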

Ken Liu (@kenziyuliu) 's Twitter Profile Photo

New paper! We explore a radical paradigm for AI evals: assessing LLMs on *unsolved* questions. Instead of contrived exams where progress ≠ value, we eval LLMs on organic, unsolved problems via reference-free LLM validation & community verification. LLMs solved ~10/500 so far:
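A hedged sketch of how such a pipeline could be wired up (all names and stubs below are hypothetical, not the paper's code): candidate answers are screened by a reference-free LLM validator, and only survivors go on to community verification.

```python
# Hypothetical sketch of the eval loop described in the thread: answers to
# unsolved questions are screened by a reference-free LLM validator, and only
# survivors are forwarded for community verification.
from dataclasses import dataclass, field

@dataclass
class UnsolvedQuestion:
    text: str
    candidate_answers: list = field(default_factory=list)

def reference_free_validate(question, answer, judge):
    """judge: callable(prompt) -> str. No gold answer exists, so the validator
    can only check internal consistency / verifiable sub-claims."""
    verdict = judge(
        f"Question (unsolved, no reference answer):\n{question}\n\n"
        f"Proposed answer:\n{answer}\n\n"
        "Reply PASS only if the answer is self-consistent and checkable."
    )
    return verdict.strip().upper().startswith("PASS")

def evaluate(questions, solver, judge):
    flagged = []
    for q in questions:
        answer = solver(q.text)
        if reference_free_validate(q.text, answer, judge):
            flagged.append((q, answer))   # goes on to community verification
    return flagged  # in the thread's numbers, ~10 of 500 survive end to end

# Usage with stub callables standing in for real LLM calls:
dummy_solver = lambda q: "Conjecture holds for n <= 10 by exhaustive check."
dummy_judge = lambda prompt: "PASS"
print(len(evaluate([UnsolvedQuestion("Toy open problem")], dummy_solver, dummy_judge)))
```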

Shivam Duggal (@shivamduggal4) 's Twitter Profile Photo

Amazing work, idan shenfeld Jyo Pari! SFT w/ per-token supervision is probably too constrained to map new/old data into a shared weight space. Wondering if adding continuous thinking tokens (so still no RL) before supervised prediction could relax this, while staying off-policy?
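One possible reading of "continuous thinking tokens before supervised prediction" (a toy sketch under my own assumptions, not a published recipe): K learned latent embeddings are inserted between prompt and answer and carry no direct supervision, so the cross-entropy is applied only to the answer tokens.

```python
# Sketch of the idea floated in the tweet: prepend K learned, continuous
# "thinking" embeddings that carry no supervision, then apply the usual SFT
# cross-entropy only on the answer tokens that follow them.
import torch
import torch.nn as nn

class ThinkThenPredict(nn.Module):  # hypothetical wrapper
    def __init__(self, vocab=1000, dim=128, n_think=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.think = nn.Parameter(torch.randn(n_think, dim) * 0.02)  # unsupervised slots
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, prompt_ids, target_ids):
        B = prompt_ids.size(0)
        x = torch.cat([
            self.embed(prompt_ids),
            self.think.unsqueeze(0).expand(B, -1, -1),   # free "thinking" capacity
            self.embed(target_ids),
        ], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=causal)
        # Supervise only the positions that predict the target tokens.
        logits = self.head(h[:, -target_ids.size(1) - 1:-1])
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))

model = ThinkThenPredict()
loss = model(torch.randint(0, 1000, (2, 16)), torch.randint(0, 1000, (2, 8)))
loss.backward()  # thinking slots receive gradients only through the answer loss
```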

jack morris (@jxmnop) 's Twitter Profile Photo

nearly everything in AI can be understood through the lens of compression
- the architecture is just schema for when & how to compress
- optimization is a compression *process*, with its own compression level and duration
- (architecture + data + optimization) = model
- in other

Jeremy Bernstein (@jxbz) 's Twitter Profile Photo

I wrote this blog post that tries to go further toward design principles for neural nets and optimizers. The post presents a visual intro to optimization on normed manifolds and a Muon variant for the manifold of matrices with unit condition number. x.com/thinkymachines…
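For context on the Muon reference: Muon's core step replaces the gradient matrix with an approximately orthogonal one (singular values pushed toward 1, i.e., unit condition number) via a Newton-Schulz iteration. The sketch below uses the classic cubic iteration rather than Muon's tuned quintic, so treat it as a simplified illustration; see the linked post for the manifold framing.

```python
# Sketch of the matrix step behind Muon-style updates: push the gradient's
# singular values toward 1 with a Newton-Schulz iteration, then step in that
# "orthogonalized" direction. Classic cubic iteration; production Muon uses a
# tuned quintic polynomial instead.
import torch

def orthogonalize(G, steps=10, eps=1e-7):
    X = G / (G.norm() + eps)              # singular values now lie in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X   # drives each singular value toward 1
    return X

def muon_like_step(W, G, lr=0.02):
    """One hedged, simplified update: W <- W - lr * orthogonalize(G)."""
    return W - lr * orthogonalize(G)

W = torch.randn(64, 32)
G = torch.randn(64, 32)               # pretend gradient
print(torch.linalg.svdvals(orthogonalize(G))[:4])   # values cluster near 1
W = muon_like_step(W, G)
```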

Yulu Gan (@yule_gan) 's Twitter Profile Photo

Reinforcement Learning (RL) has long been the dominant method for fine-tuning, powering many state-of-the-art LLMs. Methods like PPO and GRPO explore in action space. But can we instead explore directly in parameter space? YES we can. We propose a scalable framework for
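One classic instantiation of parameter-space exploration is evolution-strategies-style search: perturb the weights, score each perturbation with a reward, and update along the reward-weighted average of the perturbations. The sketch below shows only that generic pattern; it is not necessarily the framework the thread proposes.

```python
# Minimal parameter-space exploration (classic evolution-strategies style):
# sample weight perturbations, score each with a reward, and move along the
# reward-weighted average of the perturbations -- no action-space gradients.
import numpy as np

rng = np.random.default_rng(0)

def es_step(theta, reward_fn, pop_size=64, sigma=0.1, lr=0.05):
    noise = rng.standard_normal((pop_size, theta.size))
    rewards = np.array([reward_fn(theta + sigma * n) for n in noise])
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad_estimate = (advantages[:, None] * noise).mean(axis=0) / sigma
    return theta + lr * grad_estimate

# Toy "policy": reward is highest when the parameters hit a target vector.
target = np.array([1.0, -2.0, 0.5])
reward = lambda theta: -np.sum((theta - target) ** 2)

theta = np.zeros(3)
for _ in range(300):
    theta = es_step(theta, reward)
print(theta)   # approximately converges toward [1.0, -2.0, 0.5]
```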

Sophie Wang (@sophielwang) 's Twitter Profile Photo

LLMs, trained only on text, might already know more about other modalities than we realized; we just need to find ways to elicit it. Project page: sophielwang.com/sensory w/ Phillip Isola and Brian Cheung

Sharut Gupta (@sharut_gupta) 's Twitter Profile Photo

[1/7] Paired multimodal learning shows that training with text can help vision models learn better image representations. But can unpaired data do the same? Our new work shows that the answer is yes! w/ Shobhita Sundaram Chenyu (Monica) Wang, Stefanie Jegelka and Phillip Isola

Phillip Isola (@phillip_isola) 's Twitter Profile Photo

Over the past year, my lab has been working on fleshing out theory/applications of the Platonic Representation Hypothesis. Today I want to share two new works on this topic:
Eliciting higher alignment: arxiv.org/abs/2510.02425
Unpaired rep learning: arxiv.org/abs/2510.08492
1/9
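A common way to quantify the representational alignment this line of work studies is mutual k-nearest-neighbor overlap between two models' embeddings of the same inputs. The sketch below is a generic version of that metric, not code from either paper.

```python
# Generic cross-model alignment metric: for each sample, compare its k nearest
# neighbors in model A's embedding space with its k nearest neighbors in model
# B's, and average the overlap.
import numpy as np

def mutual_knn_alignment(feats_a, feats_b, k=10):
    """feats_*: (N, d) arrays of embeddings for the same N inputs."""
    def knn(feats):
        feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        sims = feats @ feats.T
        np.fill_diagonal(sims, -np.inf)          # exclude self
        return np.argsort(-sims, axis=1)[:, :k]  # indices of k nearest neighbors
    nn_a, nn_b = knn(feats_a), knn(feats_b)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlap))               # 1.0 = identical neighborhoods

# Toy check: a rotated copy of the same representation is perfectly aligned.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 32))
Q, _ = np.linalg.qr(rng.standard_normal((32, 32)))   # random rotation
print(mutual_knn_alignment(X, X @ Q))                          # ~1.0
print(mutual_knn_alignment(X, rng.standard_normal((200, 32)))) # near chance
```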

Saining Xie (@sainingxie) 's Twitter Profile Photo

three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
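A hedged sketch of the announced idea as read from this thread (the actual RAE recipe is in the paper): keep a frozen pretrained representation encoder, train only a decoder that maps those features back to pixels, and let the diffusion transformer denoise in the frozen feature space instead of a VAE latent. All class names below are illustrative.

```python
# Illustrative sketch, not the paper's code: frozen pretrained encoder,
# trainable pixel decoder, diffusion happens in the frozen feature space.
import torch
import torch.nn as nn

class DummyPatchEncoder(nn.Module):
    """Stand-in for a frozen pretrained ViT-style backbone."""
    def __init__(self, dim=768, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)   # (B, N, dim)

class RepresentationAutoencoder(nn.Module):  # illustrative name
    def __init__(self, encoder, feat_dim=768, patch=16, img=256):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)          # the representation is not retrained
        self.decoder = nn.Linear(feat_dim, patch * patch * 3)  # stand-in decoder
        self.patch, self.side = patch, img // patch

    def forward(self, images):
        with torch.no_grad():
            feats = self.encoder(images)     # (B, N, feat_dim), frozen
        pix = self.decoder(feats)            # (B, N, patch*patch*3)
        B = pix.size(0)
        pix = pix.view(B, self.side, self.side, 3, self.patch, self.patch)
        return pix.permute(0, 3, 1, 4, 2, 5).reshape(
            B, 3, self.side * self.patch, self.side * self.patch)

rae = RepresentationAutoencoder(DummyPatchEncoder())
recon = rae(torch.randn(2, 3, 256, 256))
print(recon.shape)   # torch.Size([2, 3, 256, 256])
# Training loss is plain reconstruction on the decoder; the diffusion model is
# then trained to denoise the frozen `feats`, not VAE latents.
```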

Nupur Kumari (@nupurkmr9) 's Twitter Profile Photo

🚀 New preprint! We present NP-Edit, a framework for training an image editing diffusion model without paired supervision. We use differentiable feedback from Vision-Language Models (VLMs) combined with distribution-matching loss (DMD) to learn editing directly. webpage:
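A hedged sketch of the training signal described here (every function below is a stand-in, not NP-Edit's code): the editor is trained without paired before/after images by combining a differentiable VLM score for instruction-following with a DMD-style distribution-matching loss that keeps outputs on the natural-image manifold.

```python
# Hypothetical sketch of the combined objective: (1) a differentiable VLM score
# for "does the output follow the edit instruction?" plus (2) a
# distribution-matching (DMD-style) loss on the edited image.
import torch

def npedit_like_loss(editor, vlm_score, dmd_loss, image, instruction,
                     w_vlm=1.0, w_dmd=1.0):
    edited = editor(image, instruction)               # differentiable generator
    loss_vlm = -vlm_score(edited, instruction)        # maximize VLM agreement
    loss_dmd = dmd_loss(edited)                       # match real-image distribution
    return w_vlm * loss_vlm + w_dmd * loss_dmd

# Stubs so the sketch runs end to end; real components would be a diffusion
# editor, a VLM with differentiable scoring, and a DMD critic pair.
editor = lambda img, instr: img + 0.1 * torch.tanh(img)
vlm_score = lambda img, instr: -img.pow(2).mean()     # pretend "alignment" score
dmd_loss = lambda img: (img.mean() - 0.0) ** 2        # pretend distribution match

image = torch.randn(1, 3, 64, 64, requires_grad=True)
loss = npedit_like_loss(editor, vlm_score, dmd_loss, image, "make it snowy")
loss.backward()
print(float(loss))
```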