Simeng Sun (@simeng_ssun) 's Twitter Profile
Simeng Sun

@simeng_ssun

Research Scientist @nvidia. ex: PhD @UMassCS; Intern @MSFTResearch, @MetaAI, @AdobeResearch.

ID: 1134811335408197632

Website: https://simengsun.github.io · Joined: 01-06-2019 13:18:00

187 Tweets

468 Followers

590 Following

Oleksii Kuchaiev (@kuchaev) 's Twitter Profile Photo

We are excited to release Llama-Nemotron-Ultra! This is a reasoning ON/OFF, dense 253B model. Open weights and post-training data. huggingface.co/nvidia/Llama-3… We started with llama-405B, changed it via NAS pruning then followed by reasoning-focused post-training: SFT + RL in FP8.
Pavlo Molchanov (@pavlomolchanov) 's Twitter Profile Photo

New efficient Hybrid LLMs from @NVIDIA: Nemotron-H! Introducing a family of models combining Mamba-2, Self-Attention & FFNs for 8B, 47B and 56B sizes.

• The 47B model is 3x faster and 1.5x smaller, yet on par with Qwen-72B and Llama-70B
• The Hybrid 8B is 1.8x faster than comparable transformers
Nathan Lambert (@natolambert) 's Twitter Profile Photo

First draft online version of The RLHF Book is DONE. Recently I've been creating the advanced discussion chapters on everything from Constitutional AI to evaluation and character training, but I also sneak in consistent improvements to the RL-specific chapter.
Kabir (@kabirahuja004) 's Twitter Profile Photo

📢 New Paper!

Tired 😴 of reasoning benchmarks full of math & code? In our work we consider the problem of reasoning for plot holes in stories -- inconsistencies in a storyline that break the internal logic or rules of a story’s world 🌎

W/ Melanie Sclar, and tsvetshop

1/n
Shizhe Diao (@shizhediao) 's Twitter Profile Photo

Thrilled to share my first project at NVIDIA! ✨

Today’s language models are pre-trained on vast and chaotic Internet texts, but these texts are unstructured and poorly understood. We propose CLIMB — Clustering-based Iterative Data Mixture Bootstrapping — a fully automated
Melanie Sclar (@melaniesclar) 's Twitter Profile Photo

See our work on procedurally generating challenging reasoning problems for detecting inconsistencies in stories! FlawedFictions is a great example of what I'm most excited about: reliable synthetic data for reasoning in under-explored domains. (I'll be at ICLR to chat, DMs open!)

Tuhin Chakrabarty (@tuhinchakr) 's Twitter Profile Photo

Unlike math/code, writing lacks verifiable rewards, so all we get is slop. To solve this, we train reward models on expert edits that largely beat SOTA #LLMs on a new Writing Quality benchmark. We also reduce #AI slop by using our RMs at test time, boosting alignment with experts.
Darragh (@gonedarragh) 's Twitter Profile Photo

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
abs: arxiv.org/abs/2504.16891

‼️💹New 5.5M solution math reasoning dataset   
‼️📈New models 1.5B/7B/14B/32B+ AIMO2-14b  

So much learning from this team & #aimoprize!
Yapei Chang (@yapeichang) 's Twitter Profile Photo

🤔 Can simple string-matching metrics like BLEU rival reward models for LLM alignment? 🔍 We show that given access to a reference, BLEU can match reward models in human preference agreement, and even train LLMs competitively with them using GRPO. 🫐 Introducing BLEUBERI:
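The core idea, using reference-based BLEU as a scalar reward, can be sketched in plain Python. This is only an illustration of the general recipe, not the paper's implementation: `bleu_reward` is a hypothetical helper that computes smoothed sentence-level BLEU against a single reference, the kind of signal that could stand in for a learned reward model during GRPO-style training.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_reward(candidate, reference, max_n=4):
    """Smoothed sentence-level BLEU of candidate vs. one reference,
    usable as a scalar reward in [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # clipped n-gram matches, as in standard modified precision
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # add-one smoothing so a missing n-gram order doesn't zero the score
        log_prec += math.log((overlap + 1) / (total + 1))
    # brevity penalty: punish candidates shorter than the reference
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(log_prec / max_n)
```

An exact match scores 1.0, and partial overlaps land strictly between 0 and 1, which is enough structure for a preference-free reward signal when a gold reference exists.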

Daniel Khashabi 🕊️ (@danielkhashabi) 's Twitter Profile Photo

Long-form inputs (e.g., needle-in-haystack setups) are the crucial aspect of high-impact LLM applications. While previous studies have flagged issues like positional bias and distracting documents, they've missed a crucial element: the size of the gold/relevant context.

In our
Aryaman Arora (@aryaman2020) 's Twitter Profile Photo

new paper! 🫡

why are state space models (SSMs) worse than Transformers at recall over their context? this is a question about the mechanisms underlying model behaviour: therefore, we propose using mechanistic evaluations to answer it!
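For intuition, recall over context is commonly probed with synthetic associative-recall tasks. A minimal toy construction (my own sketch, not this paper's setup): interleave key-value pairs, then append a query key; a model must retrieve the paired value from earlier in the context.

```python
import random

def make_recall_example(num_pairs=8, seed=0):
    """Build one associative-recall prompt: distinct letter keys paired
    with digit values, followed by a single query key. The target is
    the value that appeared right after that key earlier in context."""
    rng = random.Random(seed)
    keys = rng.sample([chr(ord('a') + i) for i in range(26)], num_pairs)
    values = [str(rng.randrange(10)) for _ in range(num_pairs)]
    query = rng.choice(keys)
    context = " ".join(f"{k} {v}" for k, v in zip(keys, values))
    answer = values[keys.index(query)]
    return f"{context} {query}", answer
```

SSMs must keep every pair in a fixed-size state to answer arbitrary queries, whereas attention can look the key up directly, which is why tasks like this separate the two families.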
Shiyue Zhang (@byryuer) 's Twitter Profile Photo

🚀 New paper on evaluating retrieval robustness – how well LLMs handle imperfect retrieval:
1️⃣ RAG >= non-RAG?
2️⃣ More docs >= fewer docs?
3️⃣ Sensitivity to doc order
▶️ 11 LLMs × 3 prompting strategies
Findings: LLMs show surprisingly high robustness—but limitations remain. 1/2
Tanishq Mathew Abraham, Ph.D. (@iscienceluvr) 's Twitter Profile Photo

How much do language models memorize?

"We formally separate memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information a model contains about the true data-generation process. When we
Shizhe Diao (@shizhediao) 's Twitter Profile Photo

Does RL truly expand a model’s reasoning🧠capabilities? Contrary to recent claims, the answer is yes—if you push RL training long enough!

Introducing ProRL 😎, a novel training recipe that scales RL to >2k steps, empowering the world’s leading 1.5B reasoning model💥and offering
Tu Vu (@tuvllms) 's Twitter Profile Photo

✨ New paper ✨
🚨 Scaling test-time compute can lead to inverse or flattened scaling!!

We introduce SealQA, a new challenge benchmark w/ questions that trigger conflicting, ambiguous, or unhelpful web search results. Key takeaways:

➡️ Frontier LLMs struggle on Seal-0 (SealQA’s
Chau Minh Pham (@chautmpham) 's Twitter Profile Photo

🤔 What if you gave an LLM thousands of random human-written paragraphs and told it to write something new -- while copying 90% of its output from those texts?

🧟 You get what we call a Frankentext!

💡 Frankentexts are surprisingly coherent and tough for AI detectors to flag.
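One way to make the copying constraint concrete is to measure what fraction of an output's n-grams appear verbatim in the source paragraphs. A rough sketch assuming whitespace tokenization; `copy_rate` is a hypothetical proxy metric, not the authors' measurement.

```python
def ngram_set(text, n):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def copy_rate(output, sources, n=3):
    """Fraction of the output's n-grams that occur verbatim in any
    source paragraph: a rough proxy for 'X% copied from the texts'."""
    out_ngrams = ngram_set(output, n)
    if not out_ngrams:
        return 0.0
    source_ngrams = set()
    for s in sources:
        source_ngrams |= ngram_set(s, n)
    hits = sum(1 for g in out_ngrams if g in source_ngrams)
    return hits / len(out_ngrams)
```

A generator aiming for the 90% target would be constrained to keep this value high while still producing a coherent new narrative.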
Mehrdad Farajtabar (@mfarajtabar) 's Twitter Profile Photo

🧵 1/8 The Illusion of Thinking: Are reasoning models like o1/o3, DeepSeek-R1, and Claude 3.7 Sonnet really "thinking"? 🤔 Or are they just throwing more compute towards pattern matching?

The new Large Reasoning Models (LRMs) show promising gains on math and coding benchmarks,
Han Guo (@hanguo97) 's Twitter Profile Photo

We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between?

Introducing Log-Linear Attention with:

- Log-linear time training
- Log-time inference (in both time and memory)
- Hardware-efficient Triton kernels
Jackson Petty (@jowenpetty) 's Twitter Profile Photo

How well can LLMs understand tasks with complex sets of instructions? We investigate through the lens of RELIC: REcognizing (formal) Languages In-Context, finding a significant overhang between what LLMs are able to do theoretically and how well they put this into practice.