Siddharth Singh (@siddharth_3773)'s Twitter Profile
Siddharth Singh

@siddharth_3773

CS Ph.D. Candidate, University of Maryland
I specialize in parallelizing LLM training on 1000s of GPUs
Graduating in Spring 2025

ID: 1803570289583828992

Website: http://siddharth9820.github.io · Joined: 19-06-2024 23:27:30

40 Tweets

860 Followers

208 Following

VantAI (@vant_ai):

Announcing Neo-1: the world’s most advanced atomistic foundation model, unifying structure prediction and all-atom de novo generation for the first time - to decode and design the structure of life 🧵(1/10)

Brian Bartoldson (@bartoldson):

🚀 We fixed a major LLM post-training bottleneck!

Our new method (TBA) combines trajectory balance with asynchronous training to speed up LLM RL 5-50x while improving results+scalability.

For example, using VinePPO's GSM8K setup, we obtain +1.2% accuracy and 50x faster RL.

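For reference, trajectory balance comes from the GFlowNet literature; here is a hedged sketch of one common form, where a full completion y for prompt x is treated as a single trajectory (the notation Z_φ, π_θ, R is mine, and the exact variant used in TBA may differ):

```latex
% Trajectory-balance loss for prompt x and sampled completion y (sketch):
% driving it to zero pushes \pi_\theta(y \mid x) \propto R(x, y) / Z_\phi(x).
\mathcal{L}_{\mathrm{TB}}(x, y)
  = \Big( \log Z_\phi(x) + \log \pi_\theta(y \mid x) - \log R(x, y) \Big)^{2}
```

Because this loss can be evaluated on whatever trajectories are available rather than only freshly sampled on-policy ones, it pairs naturally with asynchronous generation, which is presumably the "asynchronous training" half of TBA.
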
Parallel Software and Systems Group (@hpc_group):

We are on a roll, second successful dissertation defense in a week (March 28)! Congratulations to Siddharth Singh on becoming the second PhD graduate from PSSG!!

Dissertation title: "Optimizing Communication in Parallel Deep Learning on Exascale-class Machines"

#HPC #AI #HPC4AI

Gautam Kamath (@thegautamkamath):

Passing by luxury clothing retailers (Gucci, Prada, etc), I feel lucky that CS researchers don't waste money to play these types of vapid status games. Anyway, this is Dr. Prof. Kamath (rank-n university) thrilled to announce my group's 23 accepted NeurIPS papers!!!!

Siddharth Singh (@siddharth_3773):

There are more ways to improve model quality apart from chucking in more compute (although the latter is what keeps me employed). Great work!

François Fleuret (@francoisfleuret):

As expected, that was popular. Here is my attempt at consolidating all the answers into a list.

- Prenorm: normalization in the residual blocks before the attention operation and the FFN respectively
- GQA (Group Query Attention): more Q than (K, V)

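A minimal PyTorch sketch of those two items, with illustrative names and sizes that are mine rather than the thread's: a pre-norm residual block whose attention uses more query heads than key/value heads, each K/V head being shared by a group of query heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreNormGQABlock(nn.Module):
    def __init__(self, d_model=512, n_q_heads=8, n_kv_heads=2, d_ff=2048):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.d_head = d_model // n_q_heads
        self.norm1 = nn.LayerNorm(d_model)  # pre-norm: applied before attention
        self.norm2 = nn.LayerNorm(d_model)  # pre-norm: applied before the FFN
        self.wq = nn.Linear(d_model, n_q_heads * self.d_head, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)  # fewer K heads
        self.wv = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)  # fewer V heads
        self.wo = nn.Linear(n_q_heads * self.d_head, d_model, bias=False)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):  # x: (batch, seq, d_model)
        b, s, _ = x.shape
        h = self.norm1(x)  # normalize inside the residual branch (pre-norm)
        q = self.wq(h).view(b, s, self.n_q, self.d_head).transpose(1, 2)
        k = self.wk(h).view(b, s, self.n_kv, self.d_head).transpose(1, 2)
        v = self.wv(h).view(b, s, self.n_kv, self.d_head).transpose(1, 2)
        # GQA: each K/V head serves a group of query heads, so repeat K/V per group.
        rep = self.n_q // self.n_kv
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.wo(attn.transpose(1, 2).reshape(b, s, -1))
        x = x + self.ffn(self.norm2(x))  # pre-norm FFN branch
        return x
```

The practical payoff of GQA is that the K/V projections, and especially the KV cache at inference time, shrink by the n_q_heads / n_kv_heads factor.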

Stella Li (@stellalisy):

We empirically prove this with surgical experiments:
🐍 Directly rewarding string “python” → +11.8% performance
🚫 Random rewards BUT blocking code → gains disappear
The "magic" is just surfacing useful patterns already learned in pre-training.

Ahmad Beirami @ ICLR 2025 (@abeirami):

As we go through a lot of excitement about RL recently with lots of cool work/results, here is a reminder that RL with a reverse KL-regularizer to the base model cannot learn new skills that were not already present in the base model. It can only amplify the existing weak skills.

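A hedged sketch of the setup being referenced, in my own notation rather than the author's: RL fine-tuning with reward r and a reverse-KL penalty of strength β toward the base model has a well-known closed-form optimum that is just a reweighting of the base policy.

```latex
% KL-regularized RL objective and its closed-form optimum (sketch).
\max_{\pi}\;
  \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[\, r(x, y) \,\big]
  - \beta\, D_{\mathrm{KL}}\!\big( \pi(\cdot \mid x) \,\big\|\, \pi_{\mathrm{base}}(\cdot \mid x) \big)
\quad\Longrightarrow\quad
\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{base}}(y \mid x)\, e^{\, r(x, y) / \beta}
```

Any completion with zero probability under π_base therefore stays at zero probability under π*: the reward can re-weight and amplify behaviors the base model can already produce, but cannot introduce genuinely new ones, which is the point of the reminder.
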
Mihir Prabhudesai (@mihirp98):

1/ Maximizing confidence indeed improves reasoning. We worked with Shashwat Goel, Nikhil Chandak, Ameya P. for the past 3 weeks (over a zoom call and many emails!) and revised our evaluations to align with their suggested prompts/parsers/sampling params. This includes changing

Shashwat Goel (@shashwatgoel7):

Glad we could together improve the scientific discourse around reasoning. Was great to see the authors reach out and incorporate all our feedback!

Aditya Tomar (@adityastomar_):

Can we break the memory wall for LLM inference via KV cache rematerialization?

🚨 Introducing XQuant, which leverages underutilized compute units to eliminate the memory bottleneck for LLM inference!

• 10–12.5x memory savings vs. FP16
• Near-zero accuracy loss
• Beats

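A hedged sketch of what "KV cache rematerialization" suggests, independent of XQuant's actual design (every name below is illustrative): cache a quantized copy of each layer's attention input X instead of K and V, and recompute K = XW_K and V = XW_V when attention needs them, trading extra matmuls for less cached data on inference workloads that are typically memory-bound.

```python
import torch

def quantize_int8(x):
    # Per-row symmetric int8 quantization (illustrative, not XQuant's actual scheme).
    scale = (x.abs().amax(dim=-1, keepdim=True) / 127.0).clamp_min(1e-8)
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(scale.dtype) * scale

class RematKVCache:
    """Caches quantized layer inputs X; rematerializes K, V on demand."""
    def __init__(self, w_k, w_v):          # per-layer K/V projection weights
        self.w_k, self.w_v = w_k, w_v
        self.q_chunks, self.scales = [], []

    def append(self, x_new):               # x_new: (batch, new_tokens, d_model)
        q, s = quantize_int8(x_new)
        self.q_chunks.append(q)            # one int8 tensor cached instead of two K/V tensors
        self.scales.append(s)

    def materialize(self):                 # recompute K, V for all cached tokens
        x = torch.cat([dequantize(q, s) for q, s in zip(self.q_chunks, self.scales)], dim=1)
        return x @ self.w_k, x @ self.w_v  # extra matmuls in exchange for less memory traffic
```

Relative to storing K and V in FP16, this keeps one lower-precision tensor per layer instead of two; the specific savings and accuracy numbers in the tweet would depend on the quantization scheme actually used.
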
Jonas Geiping (@jonasgeiping):

There's been a lot of discussion recently about parallel vs sequential reasoning.

The recurrent models we trained this year are sequential, which makes them good at math, but slow (see pic)

However, if you squint, models with recurrent-depth/loops are like diffusion models ...

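For readers who have not seen the recurrent-depth idea, a minimal sketch of the loop structure as I read it from the thread (illustrative layer choices and names, not the actual model): a weight-tied core block is applied a variable number of times at inference, so extra sequential compute buys extra effective depth.

```python
import torch.nn as nn

class RecurrentDepthLM(nn.Module):
    """Prelude -> (weight-tied core block, looped) -> coda; depth chosen at run time."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.prelude = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.core = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)  # reused every loop
        self.coda = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, x, loops=8):          # x: (batch, seq, d_model) embeddings
        h = self.prelude(x)
        for _ in range(loops):              # more loops = more sequential "thinking", but slower
            h = self.core(h)
        return self.coda(h)
```

Every extra loop iteration is strictly sequential, which matches the "good at math, but slow" trade-off described above.
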
Abhinav Bhatele (@bhatele):

A large number of PhD students in my group have graduated or will be graduating by Spring, so I am recruiting several PhD students for the next admission cycle (Fall 2026). If you want to work with us, apply by Dec 5 and drop me a short email. Please repost/share widely. #HPC #AI

Ben Pouladian (@benitoz):

Nemotron-Nano-V3 is NVIDIA’s next move:

hybrid Mamba-Transformer-MoE, 30B params, beats China’s Qwen3 in quality and runs 6x faster on H200.

This is the blueprint for physical AI

Efficient long-context, sparse compute and models that scale across everything

TPUs can’t touch!

Bryan Catanzaro (@ctnzr):

Today, @NVIDIA is launching the open Nemotron 3 model family, starting with Nano (30B-3A), which pushes the frontier of accuracy and inference efficiency with a novel hybrid SSM Mixture of Experts architecture. Super and Ultra are coming in the next few months.

Zhaocheng Zhu (@zhu_zhaocheng):

📢 Hey open-source folks — you might not want to miss this. NVIDIA dropped Nemotron v3 Nano this morning. Is it just another checkpoint claiming SOTA? Not really. What makes this release incredible is that we're shipping the entire training stack behind it: the RL infra, the

Jared Roesch (@roeschinc):

Thrilled to announce we're open-sourcing the CUDA Tile dialect and bytecode! github.com/NVIDIA/cuda-ti…

What's included:
• CUDA Tile MLIR dialect
• Bytecode serialization/deserialization support
• MLIR Python bindings for programmatic IR construction
•