Nikhil Barhate (@nikhilbarhate99)'s Twitter Profile
Nikhil Barhate

@nikhilbarhate99

ML @scale_AI | prev @AMD @mila_quebec

ID: 3245294515

Link: https://nikhilbarhate99.github.io | Joined: 14-06-2015 15:04:57

1.1K Tweets

204 Followers

810 Following

Amandeep Kumar (@amandee59573123):

🚀 Unlocking Standard Diffusion Transformers on Representation Encoders

Why do standard DiTs fail to converge on high-dimensional features like DINOv2? 📉 We found the answer isn't just "more parameters": it's Geometry.

Introducing Riemannian Flow Matching with Jacobi
Yibo Yang (@yiboyang):

We've known that diffusion models are theoretically very good lossy data compressors, but how can we actually implement this idea in practice? I discuss this and related topics in a new review article on diffusion-based generative compression: arxiv.org/abs/2601.18932
Tyler Griggs (@tyler_griggs_):

SkyRL now implements the Tinker API.

Now, training scripts written for Tinker can run on your own GPUs with zero code changes using SkyRL's FSDP2, Megatron, and vLLM backends.

Blog: novasky-ai.notion.site/skyrl-tinker
🧵
Oscar Davis (@osclsd):

You like discrete diffusion, but it's too slow? 🥀 You like test-time inference, but it's for continuous methods? 😩 We fixed it. Introducing Categorical Flow Maps: continuously sample discrete data in a single step 🚀💫 How? 🧵⬇️ 💪 Co-led with Floor Eijkelboom, Daan Roos

Alan Baade (@baadealan):

What's the right space to diffuse in: Raw Data or Latents?

Why not both!

In Latent Forcing, we order a joint diffusion trajectory to reveal Latents before Pixels, leading to improved convergence while being lossless at encoding and end-to-end at inference.

w/ Fei-Fei Li + ...
1/n
Charlie Ruan (@charlie_ruan):

Releasing the official SkyRL + Harbor integration: a standardized way to train terminal-use agents with RL.

From the creators of Terminal-Bench, Harbor is a widely adopted framework for evaluating terminal-use agents on any task expressible as a Dockerfile + instruction + test
Jason Ramapuram (@jramapuram):

Autoregressive models dominate, but what if we treat multimodal generation as discrete, order-agnostic iterative refinement? Excited to share our systematic study on the design space of Tri-Modal Masked Diffusion Models (MDMs). We pre-trained the first Tri-Modal MDM from scratch
Peter Tong (@tongpetersb):

Train Beyond Language. We bet on the visual world as the critical next step alongside and beyond language modeling. So, we studied building foundation models from scratch with vision.

We share our exploration: visual representations, data, world modeling, architecture, and
William Shen (@shenbokui):

Excited to introduce Uni-1, our new multimodal model that *unifies* understanding and generation.

TLDR: a team of ~15 researchers is going pound-for-pound with nano banana and gpt image 🧵
Ian Osband (@ianosband):

Something is rotten with policy gradient.

PG has become *the* RL loss for LLMs. But it's not even good at basic RL.

Even on MNIST with bandit feedback, vanilla PG performs far worse than cross-entropy because it wastes gradient budget.

Delightful Policy Gradient:
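To make the MNIST-with-bandit-feedback comparison concrete, here is a toy numpy sketch (my own illustration, not the paper's code) of why vanilla PG is signal-starved where cross-entropy is dense: PG only receives a learning signal for the single class it sampled, and no gradient at all when the sampled guess is wrong.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Classification as a contextual bandit: logits for K classes; the agent
# samples one class and only observes reward 1/0 for that single guess.
K = 10
logits = rng.normal(size=K)
pi = softmax(logits)
label = 3

# Vanilla policy gradient (REINFORCE): sample an action, get bandit reward.
action = rng.choice(K, p=pi)
reward = 1.0 if action == label else 0.0
grad_logpi = -pi.copy()
grad_logpi[action] += 1.0            # gradient of log pi(action) w.r.t. logits
pg_grad = reward * grad_logpi        # identically zero whenever the guess is wrong

# Cross-entropy with the full label: the gradient is always informative.
ce_grad = -pi.copy()
ce_grad[label] += 1.0
```

With K = 10 and a near-uniform policy, roughly 90% of PG samples return reward 0 and contribute no gradient at all, which is one concrete sense in which PG "wastes gradient budget."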
Olga Zaghen @ ICLR 🇸🇬 (@olgazaghen):

🔮 Working on ML on curved manifolds? Don't miss out on Jacobi Fields! 🔮

I wrote a quick, highly visual, and hopefully accessible introduction to the topic: "Jacobi Fields in Machine Learning" 🤠 Check it out here: olgatticus.github.io/blog/jacobi-fi…!
chuyi shang (@chuyishang):

Wrote a deep dive on implementing a language model from scratch in JAX and scaling it with distributed training!

If you're coming from PyTorch and want to see how the same ideas look in JAX, or just want a hands-on intro to distributed training, check out this blog post:
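For readers new to the topic, the core data-parallel pattern that JAX primitives like pmap/shard_map automate can be shown framework-free. This is a numpy sketch over a toy linear model of my own (not code from the post): replicate the parameters, give each "device" a shard of the batch, take one gradient per shard, then all-reduce by averaging, which reproduces the single-device gradient exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model y = x @ w with squared loss; the hand-written gradient
# keeps the data-parallel pattern visible without any framework machinery.
def grad(w, x, y):
    err = x @ w - y
    return x.T @ err / len(x)            # d/dw of 0.5 * mean(err**2)

n_devices, batch, dim = 4, 32, 8
w = rng.normal(size=dim)
x = rng.normal(size=(batch, dim))
y = rng.normal(size=batch)

# "pmap" by hand: shard the batch across devices, one gradient per shard,
# then all-reduce (here: a plain mean) so every replica applies the same update.
shards_x = np.split(x, n_devices)
shards_y = np.split(y, n_devices)
per_device = [grad(w, xs, ys) for xs, ys in zip(shards_x, shards_y)]
g_parallel = np.mean(per_device, axis=0)

g_single = grad(w, x, y)                 # same result as one big batch
```

Because every shard has equal size, the mean of shard-mean gradients equals the full-batch gradient, which is why data parallelism changes throughput but not (in exact arithmetic) the optimization trajectory.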
Max Fu (@letian_fu):

Robotics: coding agents' next frontier. So how good are they? We introduce CaP-X: an open-source framework and benchmark for coding agents, where they write code for robot perception and control, execute it on sim and real robots, observe the outcomes, and iteratively improve

Hojoon Lee (@hojoon_ai):

We scaled off-policy RL to sim-to-real. To our knowledge, FlashSAC is the fastest and most performant RL algorithm across IsaacLab, MuJoCo Playground, and many more, all with a single set of hyperparameters.

Project page: holiday-robot.github.io/FlashSAC
Paper: arxiv.org/pdf/2604.04539

Anirudh Goyal (@anirudhg9119):

Reasoning doesn't have to mean longer chains of thought:

PDR = draft in parallel → distill into a compact workspace → refine, and shift the Pareto frontier.

arxiv.org/abs/2510.01123
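As a control-flow sketch only: the draft-in-parallel → distill → refine loop might look like the following, where `draft`, `distill`, and `refine` are hypothetical stubs standing in for LLM calls; none of these names, signatures, or defaults come from the paper.

```python
# Stub "LLM" calls for illustration; a real PDR system would query a model.
def draft(prompt, k):
    # Produce k independent drafts in parallel (here: trivially).
    return [f"draft-{i}: {prompt}" for i in range(k)]

def distill(drafts, budget):
    # Compress the parallel drafts into a bounded-size workspace.
    return " | ".join(d[:budget] for d in drafts)[: budget * len(drafts)]

def refine(prompt, workspace):
    # Produce the final answer conditioned on the compact workspace.
    return f"answer({prompt}, using {len(workspace)} chars of context)"

def pdr(prompt, rounds=2, k=4, budget=32):
    workspace = ""
    for _ in range(rounds):
        drafts = draft(f"{prompt}\n{workspace}", k)
        workspace = distill(drafts, budget)
    return refine(prompt, workspace)

out = pdr("example question")
```

The point of the structure is that context grows with the fixed workspace budget rather than with the total number of drafted tokens, which is what moves the accuracy-vs-sequential-compute Pareto frontier.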
Mingchen Zhuge (🇸🇬 ICLR) (@mingchenzhuge):

🫱 Introducing Neural Computers: what if AI does not just use computers better, but begins to become the running computer itself?

Beyond today's conventional computers, agents, and
Yu Lei (@_outofmemory_):

🤖 Co-training is everywhere (sim↔real [e.g. GR00T, LBM], human↔robot [e.g. PI, EgoScale], even non-robot data [e.g. PI, LBM]). But why does it work? How can we improve it further?

Taking sim-and-real imitation learning in diffusion/flow-based models as the test bed, we performed
Chongyi Zheng (@chongyiz1):

1/ Reinforcement learning is usually framed as maximizing rewards. But can we cast it as reaching the right goals?

New blog on bridging RL, goal-conditioned RL, and stochastic shortest path:

iclr-blogposts.github.io/2026/blog/2026…

Also #ICLR2026 Poster: Thu 10:30 AM–1:00 PM, P4 #4611.

🧵⬇️
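The RL ↔ shortest-path bridge can be seen in a toy example (my own, not taken from the blog): with reward -1 per step until an absorbing goal, value iteration converges to exactly the negative shortest-path distance, so maximizing reward and reaching the goal quickly are the same objective.

```python
import numpy as np

# Deterministic 1-D chain: states 0..N-1, goal at state 0. Actions move
# left or right (clipped at the ends); reward is -1 per step until the
# goal, which is absorbing with reward 0.
N = 6
V = np.zeros(N)
for _ in range(100):                      # value iteration to convergence
    V_new = V.copy()
    for s in range(1, N):                 # goal state 0 stays at value 0
        left, right = max(s - 1, 0), min(s + 1, N - 1)
        V_new[s] = -1 + max(V[left], V[right])
    V = V_new

# Shortest-path distance to the goal on this chain is just the index.
dist = np.arange(N)
```

After convergence V[s] = -s, i.e. the optimal value function is the negative graph distance, which is the stochastic-shortest-path view of goal reaching.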
Taco Cohen (@tacocohen):

Apparently it is not well known and not easy to see that this "simple masked loss" is EXACTLY gradient-equivalent to PPO-Clip (at least for one way of computing the mask). Here's how to see this: The standard token-level PPO-Clip objective is the rather unintuitive J_t =
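The tweet is truncated, but the standard token-level PPO-Clip objective it starts to write is J_t = min(r_t A_t, clip(r_t, 1-eps, 1+eps) A_t) with ratio r_t = pi_theta(a_t) / pi_old(a_t). Below is a numpy finite-difference check of the claimed equivalence (my own sketch, not the author's code), taking the mask m_t = 1 exactly when the clip is inactive: wherever the clip is inactive both gradients equal m r A * grad(log pi), and wherever it is active both are zero.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ppo_clip_obj(theta, theta_old, a, A, eps=0.2):
    """Token-level PPO-Clip objective for one sampled action a."""
    r = softmax(theta)[a] / softmax(theta_old)[a]
    return min(r * A, np.clip(r, 1 - eps, 1 + eps) * A)

def masked_grad(theta, theta_old, a, A, eps=0.2):
    """Gradient of the 'simple masked loss' sg(m * r * A) * log pi(a):
    stop-gradient on the mask m, the ratio r, and the advantage A."""
    pi = softmax(theta)
    r = pi[a] / softmax(theta_old)[a]
    m = float(r * A <= np.clip(r, 1 - eps, 1 + eps) * A)  # clip inactive?
    grad_logpi = -pi.copy()
    grad_logpi[a] += 1.0                  # gradient of log softmax at index a
    return m * r * A * grad_logpi

def num_grad(f, theta, h=1e-6):
    """Central-difference gradient of scalar f at theta."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = h
        g[i] = (f(theta + d) - f(theta - d)) / (2 * h)
    return g

theta_old = np.zeros(4)
# Clip inactive (r close to 1): PPO-Clip gradient equals the masked gradient.
theta = np.array([0.1, -0.1, 0.05, 0.0])
g_ppo = num_grad(lambda t: ppo_clip_obj(t, theta_old, a=2, A=1.5), theta)
g_masked = masked_grad(theta, theta_old, a=2, A=1.5)
# Clip active (r well above 1 + eps with A > 0): both gradients vanish.
theta_hi = np.array([0.0, 0.0, 2.0, 0.0])
g_ppo_hi = num_grad(lambda t: ppo_clip_obj(t, theta_old, a=2, A=1.0), theta_hi)
g_masked_hi = masked_grad(theta_hi, theta_old, a=2, A=1.0)
```

So the surrogate -sg(m_t r_t A_t) * log pi(a_t) produces, token by token, the same gradient as PPO-Clip, which is the equivalence the tweet is pointing at.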