Simeng Sun (@simeng_ssun) 's Twitter Profile
Simeng Sun

@simeng_ssun

Research Scientist @nvidia. ex: PhD @UMassCS; Intern @MSFTResearch, @MetaAI, @AdobeResearch.

ID: 1134811335408197632

Website: https://simengsun.github.io · Joined: 01-06-2019 13:18:00

187 Tweets

468 Followers

590 Following

Oleksii Kuchaiev (@kuchaev) 's Twitter Profile Photo

We are excited to release Llama-Nemotron-Ultra! This is a reasoning ON/OFF, dense 253B model. Open weights and post-training data. huggingface.co/nvidia/Llama-3… We started with llama-405B, changed it via NAS pruning then followed by reasoning-focused post-training: SFT + RL in FP8.
Pavlo Molchanov (@pavlomolchanov) 's Twitter Profile Photo

New efficient Hybrid LLMs from @NVIDIA: Nemotron-H! Introducing a family of models combining Mamba-2, Self-Attention & FFNs for 8B, 47B and 56B sizes.

• The 47B model is 3x faster and 1.5x smaller, yet on par with Qwen-72B and Llama-70B
• The Hybrid 8B is 1.8x faster than comparable transformers
Nathan Lambert (@natolambert) 's Twitter Profile Photo

First draft online version of The RLHF Book is DONE. Recently I've been creating the advanced discussion chapters on everything from Constitutional AI to evaluation and character training, but I also sneak in consistent improvements to the RL-specific chapter.
Kabir (@kabirahuja004) 's Twitter Profile Photo

📢 New Paper!

Tired 😴 of reasoning benchmarks full of math & code? In our work we consider the problem of reasoning for plot holes in stories -- inconsistencies in a storyline that break the internal logic or rules of a story’s world 🌎

W/ Melanie Sclar, and tsvetshop

1/n
Shizhe Diao (@shizhediao) 's Twitter Profile Photo

Thrilled to share my first project at NVIDIA! ✨

Today’s language models are pre-trained on vast and chaotic Internet texts, but these texts are unstructured and poorly understood. We propose CLIMB — Clustering-based Iterative Data Mixture Bootstrapping — a fully automated
Melanie Sclar (@melaniesclar) 's Twitter Profile Photo

See our work on procedurally generating challenging reasoning problems for detecting inconsistencies in stories! FlawedFictions is a great example of what I'm most excited about: reliable synthetic data for reasoning in under-explored domains. (I'll be at ICLR to chat, DMs open!)

Tuhin Chakrabarty (@tuhinchakr) 's Twitter Profile Photo

Unlike math/code, writing lacks verifiable rewards, so all we get is slop. To solve this, we train reward models on expert edits that largely beat SOTA #LLMs on a new Writing Quality benchmark. We also reduce #AI slop by using our RMs at test time, boosting alignment with experts.
Darragh (@gonedarragh) 's Twitter Profile Photo

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
abs: arxiv.org/abs/2504.16891

‼️💹New 5.5M solution math reasoning dataset   
‼️📈New models 1.5B/7B/14B/32B+ AIMO2-14b  

So much learning from this team & #aimoprize!
Yapei Chang (@yapeichang) 's Twitter Profile Photo

🤔 Can simple string-matching metrics like BLEU rival reward models for LLM alignment? 🔍 We show that given access to a reference, BLEU can match reward models in human preference agreement, and even train LLMs competitively with them using GRPO. 🫐 Introducing BLEUBERI:
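The core idea, using reference-based BLEU as a scalar reward, can be sketched in plain Python. This is only an illustration of the general recipe, not the paper's implementation: `bleu_reward` is a hypothetical helper that computes smoothed sentence-level BLEU against a single reference, the kind of signal that could stand in for a learned reward model during GRPO-style training.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_reward(candidate, reference, max_n=4):
    """Smoothed sentence-level BLEU of candidate vs. one reference,
    usable as a scalar reward in [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # clipped n-gram matches, as in standard modified precision
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # add-one smoothing so a missing n-gram order doesn't zero the score
        log_prec += math.log((overlap + 1) / (total + 1))
    # brevity penalty: punish candidates shorter than the reference
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(log_prec / max_n)
```

An exact match scores 1.0, and partial overlaps land strictly between 0 and 1, which is enough structure for a preference-free reward signal when a gold reference exists.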

Daniel Khashabi 🕊️ (@danielkhashabi) 's Twitter Profile Photo

Long-form inputs (e.g., needle-in-haystack setups) are the crucial aspect of high-impact LLM applications. While previous studies have flagged issues like positional bias and distracting documents, they've missed a crucial element: the size of the gold/relevant context.

In our
Aryaman Arora (@aryaman2020) 's Twitter Profile Photo

new paper! 🫡

why are state space models (SSMs) worse than Transformers at recall over their context? this is a question about the mechanisms underlying model behaviour: therefore, we propose using mechanistic evaluations to answer it!
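For intuition, recall over context is commonly probed with synthetic associative-recall tasks. A minimal toy construction (my own sketch, not this paper's setup): interleave key-value pairs, then append a query key; a model must retrieve the paired value from earlier in the context.

```python
import random

def make_recall_example(num_pairs=8, seed=0):
    """Build one associative-recall prompt: distinct letter keys paired
    with digit values, followed by a single query key. The target is
    the value that appeared right after that key earlier in context."""
    rng = random.Random(seed)
    keys = rng.sample([chr(ord('a') + i) for i in range(26)], num_pairs)
    values = [str(rng.randrange(10)) for _ in range(num_pairs)]
    query = rng.choice(keys)
    context = " ".join(f"{k} {v}" for k, v in zip(keys, values))
    answer = values[keys.index(query)]
    return f"{context} {query}", answer
```

SSMs must keep every pair in a fixed-size state to answer arbitrary queries, whereas attention can look the key up directly, which is why tasks like this separate the two families.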
Shiyue Zhang (@byryuer) 's Twitter Profile Photo

🚀 New paper on evaluating retrieval robustness – how well LLMs handle imperfect retrieval:
1️⃣ RAG >= non-RAG?
2️⃣ More docs >= fewer docs?
3️⃣ Sensitivity to doc order
▶️ 11 LLMs × 3 prompting strategies
Findings: LLMs show surprisingly high robustness—but limitations remain. 1/2
Tanishq Mathew Abraham, Ph.D. (@iscienceluvr) 's Twitter Profile Photo

How much do language models memorize?

"We formally separate memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information a model contains about the true data-generation process. When we
Shizhe Diao (@shizhediao) 's Twitter Profile Photo

Does RL truly expand a model’s reasoning🧠capabilities? Contrary to recent claims, the answer is yes—if you push RL training long enough!

Introducing ProRL 😎, a novel training recipe that scales RL to >2k steps, empowering the world’s leading 1.5B reasoning model💥and offering
Tu Vu (@tuvllms) 's Twitter Profile Photo

✨ New paper ✨
🚨 Scaling test-time compute can lead to inverse or flattened scaling!!

We introduce SealQA, a new challenge benchmark w/ questions that trigger conflicting, ambiguous, or unhelpful web search results. Key takeaways:

➡️ Frontier LLMs struggle on Seal-0 (SealQA’s
Chau Minh Pham (@chautmpham) 's Twitter Profile Photo

🤔 What if you gave an LLM thousands of random human-written paragraphs and told it to write something new -- while copying 90% of its output from those texts?

🧟 You get what we call a Frankentext!

💡 Frankentexts are surprisingly coherent and tough for AI detectors to flag.
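One way to make the copying constraint concrete is to measure what fraction of an output's n-grams appear verbatim in the source paragraphs. A rough sketch assuming whitespace tokenization; `copy_rate` is a hypothetical proxy metric, not the authors' measurement.

```python
def ngram_set(text, n):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def copy_rate(output, sources, n=3):
    """Fraction of the output's n-grams that occur verbatim in any
    source paragraph: a rough proxy for 'X% copied from the texts'."""
    out_ngrams = ngram_set(output, n)
    if not out_ngrams:
        return 0.0
    source_ngrams = set()
    for s in sources:
        source_ngrams |= ngram_set(s, n)
    hits = sum(1 for g in out_ngrams if g in source_ngrams)
    return hits / len(out_ngrams)
```

A generator aiming for the 90% target would be constrained to keep this value high while still producing a coherent new narrative.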
Mehrdad Farajtabar (@mfarajtabar) 's Twitter Profile Photo

🧵 1/8 The Illusion of Thinking: Are reasoning models like o1/o3, DeepSeek-R1, and Claude 3.7 Sonnet really "thinking"? 🤔 Or are they just throwing more compute towards pattern matching?

The new Large Reasoning Models (LRMs) show promising gains on math and coding benchmarks,
Han Guo (@hanguo97) 's Twitter Profile Photo

We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between?

Introducing Log-Linear Attention with:

- Log-linear time training
- Log-time inference (in both time and memory)
- Hardware-efficient Triton kernels
Jackson Petty (@jowenpetty) 's Twitter Profile Photo

How well can LLMs understand tasks with complex sets of instructions? We investigate through the lens of RELIC: REcognizing (formal) Languages In-Context, finding a significant overhang between what LLMs are able to do theoretically and how well they put this into practice.