Hemil Desai (@hemildesai10)'s Twitter Profile
Hemil Desai

@hemildesai10

Senior Software Engineer @NVIDIA, opinions are my own

ID: 601679997

https://hd10.dev · Joined 07-06-2012 05:54:07

376 Tweets

299 Followers

5.5K Following

Agentica Project (@agentica_)'s Twitter Profile Photo

🚀 Introducing DeepSWE 🤖: our fully open-sourced, SOTA software engineering agent trained purely with RL on top of Qwen3-32B. DeepSWE achieves 59% on SWEBench-Verified with test-time scaling (and 42.2% Pass@1), topping the SWEBench leaderboard for open-weight models.

💪DeepSWE
Hemil Desai (@hemildesai10)'s Twitter Profile Photo

NeMo-Skills is a powerhouse for just about any workflow related to LLMs. Glad to have indirectly contributed to it via github.com/NVIDIA-NeMo/Run

Ernest Ryu (@ernestryu)'s Twitter Profile Photo

New lecture recordings on RL+LLM! 📺 This spring, I gave a lecture series titled **Reinforcement Learning of Large Language Models**. I have decided to re-record these lectures and share them on YouTube. (1/7)

Jacob Austin (@jacobaustin132)'s Twitter Profile Photo

Today we're putting out an update to the JAX TPU book, this time on GPUs. How do GPUs work, especially compared to TPUs? How are they networked? And how does this affect LLM training? 1/n
Bryan Catanzaro (@ctnzr)'s Twitter Profile Photo

Today we're releasing NVIDIA Nemotron Nano v2 - a 9B hybrid SSM that is 6X faster than similarly sized models, while also being more accurate.

Along with this model, we are also releasing most of the data we used to create it, including the pretraining corpus.

Links to the
Alex Zhang (@a1zhang)'s Twitter Profile Photo

announcing the GPU MODE x Scale ML summer speaker series happening next week, a 5⃣-day series where top researchers will teach about the algorithmic and systems-level advances that underpin `gpt-oss`!

all content will be live-streamed & recorded for FREE on GPU MODE's YouTube!
Thinking Machines (@thinkymachines)'s Twitter Profile Photo

Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference”

We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to
Horace He (@chhillee)'s Twitter Profile Photo

Apologies that I haven't written anything since joining Thinking Machines but I hope this blog post on a topic very near and dear to my heart (reproducible floating point numerics in LLM inference) will make up for it!
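
A minimal sketch of the root cause (plain PyTorch, not code from the blog post): floating-point addition is not associative, so any change in reduction order, e.g. from a different batch size or kernel split, can change the low-order bits of the result.

```python
import torch

# Three mathematically identical sums, three different summation orders.
torch.manual_seed(0)
x = torch.randn(10_000, dtype=torch.float32)

sequential = torch.zeros(())
for v in x:                                  # strict left-to-right order
    sequential = sequential + v

library = x.sum()                            # library-chosen reduction order
chunked = x.view(100, 100).sum(dim=1).sum()  # two-level (tree-like) order

# The three results typically differ in the last few bits, which is why
# reduction order must be pinned down for bitwise-reproducible inference.
print(sequential.item(), library.item(), chunked.item())
```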

Hemil Desai (@hemildesai10)'s Twitter Profile Photo

A while back, we set out to scale Automodel to the trillion-parameter range using native PyTorch parallelism and DTensor APIs. Enabling Pipeline Parallelism (PP) was the first major milestone toward that goal. This post outlines the key challenges we had to address to make PP
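
A heavily hedged sketch of what native PyTorch pipeline staging looks like (illustrative use of torch.distributed.pipelining, assuming torch>=2.5 under torchrun; not NeMo Automodel's actual code):

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.pipelining import PipelineStage, ScheduleGPipe

# One rank per pipeline stage; launch with torchrun --nproc-per-node=<stages>.
rank = int(os.environ["RANK"])
world = int(os.environ["WORLD_SIZE"])
dist.init_process_group("nccl")
device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

# Each rank materializes only its slice of the model (two layers here).
stage_mod = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024)).to(device)
stage = PipelineStage(stage_mod, stage_index=rank, num_stages=world, device=device)

# GPipe schedule: the batch is split into microbatches and pipelined.
schedule = ScheduleGPipe(stage, n_microbatches=4)

x = torch.randn(32, 1024, device=device)
if rank == 0:
    schedule.step(x)       # first stage feeds the input microbatches
else:
    out = schedule.step()  # later stages receive activations from upstream
```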

Hemil Desai (@hemildesai10)'s Twitter Profile Photo

We implemented optimized MoEs in NeMo Automodel, achieving >200 TFLOPS per GPU on H100s in BF16 for DeepSeek V3, both GPT-OSS variants, and Qwen3 MoE 30B. Perf achieved by combining PyTorch-native parallelisms (FSDP, EP, and PP) with NVIDIA's Transformer Engine and
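
For context, a back-of-the-envelope MFU calculation (assuming the commonly quoted ~989 TFLOPS dense BF16 peak for H100 SXM; the 200 TFLOPS figure is from the post above):

```python
# Rough model-FLOPs-utilization (MFU) arithmetic.
achieved_tflops = 200          # per-GPU throughput quoted above
h100_bf16_peak_tflops = 989    # H100 SXM dense BF16 peak (no sparsity)
print(f"MFU ≈ {achieved_tflops / h100_bf16_peak_tflops:.1%}")  # ≈ 20.2%
```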

Hemil Desai (@hemildesai10)'s Twitter Profile Photo

Happy to have contributed to this effort 🚀 We also published an in-depth technical deep dive on the implementation. Check it out here: github.com/NVIDIA-NeMo/Au…

NVIDIA AI Developer (@nvidiaaidev)'s Twitter Profile Photo

Nemotron 3 Nano is now leading its size class on the latest Artificial Analysis leaderboards, combining strong intelligence, high openness, and blazing output speed in a compact package. Part of the new NVIDIA Nemotron 3 family, Nemotron 3 Nano is powered by various advanced

Hemil Desai (@hemildesai10)'s Twitter Profile Photo

We just added MoE expert load-balance metric visualization to NVIDIA AI NeMo Automodel. Check it out here: github.com/NVIDIA-NeMo/Au… Monitor expert utilization and load imbalance in your PyTorch-based MoE training run with ease.
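
A sketch of one way such a metric can be computed from router logits (an illustrative, hypothetical helper; not NeMo Automodel's actual implementation):

```python
import torch

def expert_load_stats(router_logits: torch.Tensor, num_experts: int, top_k: int = 2):
    """Token share per expert and a max-over-mean imbalance ratio."""
    # router_logits: [num_tokens, num_experts]
    topk_experts = router_logits.topk(top_k, dim=-1).indices        # [tokens, k]
    counts = torch.bincount(topk_experts.flatten(), minlength=num_experts)
    frac = counts.float() / counts.sum()          # fraction of routed tokens
    imbalance = frac.max() / (1.0 / num_experts)  # 1.0 == perfectly balanced
    return frac, imbalance

frac, imbalance = expert_load_stats(torch.randn(4096, 8), num_experts=8)
print(frac, imbalance)  # log per training step to spot routing collapse
```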

Bryan Catanzaro (@ctnzr)'s Twitter Profile Photo

Announcing NVIDIA Nemotron 3 Super!

💚120B-12A Hybrid SSM Latent MoE, designed for Blackwell
💚36 on AAIndex v4
💚up to 2.2X faster than GPT-OSS-120B in FP4
💚Open data, open recipe, open weights

Models, Tech report, etc. here:
research.nvidia.com/labs/nemotron/…

And yes, Ultra is coming!
Jiantao Jiao (@jiantaoj)'s Twitter Profile Photo

Nemotron 3 Super has arrived! Designed with efficiency in mind (a Hybrid SSM Latent MoE built for Blackwell), it also delivers incredible accuracy. The most important aspect is scaling RL, utilizing the highly efficient and scalable Nemo Gym backend for RL environments and Nemo RL for model