Yihe Deng (@yihe__deng)'s Twitter Profile
Yihe Deng

@yihe__deng

CS PhD candidate @UCLA, Student Researcher @GoogleAI | Prev. Research Intern @MSFTResearch @AWS | LLM post-training, synthetic data

ID: 1462223072203722756

Link: https://yihe-deng.notion.site/Yihe-Deng-167ab2d2c1fb80b3a76dfb120f716c84 | Joined: 21-11-2021 00:55:36

175 Tweets

2.2K Followers

1.1K Following

Yihe Deng (@yihe__deng)'s Twitter Profile Photo

😄I did a brief intro of RLHF algorithms for my lab's reading group presentation. It was a good learning experience for me, and I want to share the GitHub repo here, which holds the slides as well as the list of interesting papers: github.com/yihedeng9/rlhf…

Would love to hear about
Kaiyu Yang (@kaiyuyang4)'s Twitter Profile Photo

🚀 Excited to share our position paper: "Formal Mathematical Reasoning: A New Frontier in AI"! 🔗 arxiv.org/abs/2412.16075 LLMs like o1 & o3 have tackled hard math problems by scaling test-time compute. What's next for AI4Math? We advocate for formal mathematical reasoning,

Daniel Han (@danielhanchen)'s Twitter Profile Photo

Cool things from DeepSeek v3's paper:

1. Float8 uses E4M3 for forward & backward - no E5M2
2. Every 4th FP8 accumulate adds to master FP32 accum
3. Latent Attention stores C cache not KV cache
4. No MoE loss balancing - dynamic biases instead

More details:
1. FP8: First large
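The accumulation trick in point 2 is easy to simulate. A minimal sketch of why promoting partial sums to a higher-precision master accumulator helps (this is an illustration, not DeepSeek's kernel: NumPy has no FP8 dtype, so float16 stands in for FP8 here):

```python
import numpy as np

def accumulate_promoted(values, group=4):
    """Accumulate in low precision (float16 as an FP8 stand-in),
    flushing the partial sum into an FP32 master accumulator
    every `group` additions."""
    master = np.float32(0.0)
    partial = np.float16(0.0)
    for i, v in enumerate(values, 1):
        partial = np.float16(partial + np.float16(v))
        if i % group == 0:
            master = np.float32(master + np.float32(partial))
            partial = np.float16(0.0)
    return float(np.float32(master + np.float32(partial)))

def accumulate_naive(values):
    """Accumulate entirely in float16 (the failure mode being avoided):
    once the running sum is large, small addends round away to nothing."""
    partial = np.float16(0.0)
    for v in values:
        partial = np.float16(partial + np.float16(v))
    return float(partial)

vals = [0.001] * 100_000  # exact sum is 100.0
# the promoted version lands near 100.0; the pure-float16 version
# stalls once the float16 spacing exceeds the addend
```

The same logic is why accumulating many tiny FP8 products directly is lossy, and why periodic promotion to FP32 recovers most of the accuracy at little cost.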
Zongyu Lin (@zy27962986)'s Twitter Profile Photo

Interested in the combination of Inference time scaling + LLM Agent?🤖💭 Announcing QLASS (Q-guided Language Agent Stepwise Search, arxiv.org/abs/2502.02584), a framework that supercharges language agents at inference time. ⚡In this work, we build a process reward model to guide

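The Q-guided stepwise idea can be sketched independently of the paper's training details. A toy greedy loop, where `propose` and `q_value` are hypothetical stand-ins for the agent's step generator and the learned process reward (Q) model:

```python
from typing import Callable, List

def q_guided_search(
    propose: Callable[[str], List[str]],   # candidate next steps for a partial trajectory
    q_value: Callable[[str, str], float],  # learned stepwise value estimate (stand-in)
    max_steps: int = 5,
) -> str:
    """Greedy stepwise decoding guided by a process reward model:
    at every step, score each candidate continuation with the
    Q model and keep the highest-valued one."""
    trajectory = ""
    for _ in range(max_steps):
        candidates = propose(trajectory)
        if not candidates:
            break
        best = max(candidates, key=lambda a: q_value(trajectory, a))
        trajectory += best
    return trajectory
```

In practice a beam over candidates rather than pure greedy selection is a natural extension of the same loop.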
Yihe Deng (@yihe__deng)'s Twitter Profile Photo

New paper & model release!

Excited to introduce DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails, showcasing our new DuoGuard-0.5B model.

- Model: huggingface.co/DuoGuard/DuoGu…
- Paper: arxiv.org/abs/2502.05163
- GitHub: github.com/yihedeng9/DuoG…

Grounded in a
Wanjia Zhao (@wanjiazhao1203)'s Twitter Profile Photo

Introducing #SIRIUS🌟: A self-improving multi-agent LLM framework that learns from successful interactions and refines failed trajectories, enhancing college-level reasoning and competitive negotiations. 
📜Preprint: arxiv.org/pdf/2502.04780
💻code: github.com/zou-group/siri…
1/N
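The learn-from-successes / refine-failures loop described above can be sketched abstractly. All callables here are hypothetical stand-ins, not the paper's API:

```python
def self_improve_round(agent, tasks, solve, judge, refine, train):
    """One self-improvement round in the spirit of SIRIUS:
    keep successful trajectories as training data, try to repair
    failed ones, then fine-tune the agent on the resulting pool."""
    pool = []
    for task in tasks:
        traj = solve(agent, task)
        if judge(task, traj):
            pool.append(traj)          # successful interaction: keep as-is
        else:
            fixed = refine(agent, task, traj)
            if judge(task, fixed):
                pool.append(fixed)     # refined failed trajectory: also usable
    return train(agent, pool)
```

Iterating this round lets the agent bootstrap from its own best behavior without external labels.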
Yong Lin (@yong18850571)'s Twitter Profile Photo

🚀 Exciting news! Our Goedel-Prover paper is now live on arXiv: arxiv.org/pdf/2502.07640 🎉 

We're currently developing the RL version and have a stronger checkpoint than before (currently not included in the report)!🚀🚀🚀

Plus, we’ll be open-sourcing 1.64M formalized
DeepSeek (@deepseek_ai)'s Twitter Profile Photo

🚀 Introducing NSA: A Hardware-Aligned and Natively Trainable Sparse Attention mechanism for ultra-fast long-context training & inference!

Core components of NSA:
• Dynamic hierarchical sparse strategy
• Coarse-grained token compression
• Fine-grained token selection

💡 With
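The two core components, coarse-grained compression and fine-grained selection, can be illustrated with a toy single-query sketch (this is an illustration of the idea, not the paper's hardware-aligned kernel):

```python
import numpy as np

def nsa_style_sparse_attention(q, K, V, block=4, top_blocks=2):
    """Toy sketch of the NSA ideas:
    1) coarse: compress keys by mean-pooling fixed-size blocks,
    2) fine: score the query against the compressed blocks, keep
       the top `top_blocks`, and attend densely only over their tokens."""
    n, d = K.shape
    nb = n // block
    Kc = K[: nb * block].reshape(nb, block, d).mean(axis=1)  # coarse keys
    block_scores = Kc @ q                                    # query-block relevance
    keep = np.argsort(block_scores)[-top_blocks:]            # fine-grained selection
    idx = np.concatenate(
        [np.arange(b * block, (b + 1) * block) for b in keep]
    )
    scores = K[idx] @ q / np.sqrt(d)                         # dense attention on kept tokens
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[idx]
```

When `top_blocks` covers all blocks this reduces to ordinary dense attention; shrinking it trades accuracy for the sparsity that makes long-context attention fast.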
Ziniu Li @ ICLR2025 (@ziniuli)'s Twitter Profile Photo

🌟 Can better cold start strategies improve RL training for LLMs? 🤖

I’ve written a blog that delves into the challenges of fine-tuning LLMs during the cold-start phase and how the strategies applied there can significantly impact RL performance in complex reasoning tasks that
Ge Zhang (@gezhang86038849)'s Twitter Profile Photo

[1/n]

SuperExcited to announce SuperGPQA!!!
We spent more than half a year to finally get it done!
SuperGPQA is a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines.
It also provides the largest human-LLM
Siyan Zhao (@siyan_zhao)'s Twitter Profile Photo

Excited to release PrefEval (ICLR '25 Oral), a benchmark for evaluating LLMs’ ability to infer, memorize, and adhere to user preferences in long-context conversations!

⚠️We find that cutting-edge LLMs struggle to follow user preferences—even in short contexts. This isn't just
Zhiqing Sun (@edwardsun0909)'s Twitter Profile Photo

We’re rolling out Deep Research to Plus users today! Deep Research was the biggest “Feel The AGI” moment I’ve had since ChatGPT, and I’m glad more people will experience their first AGI moment! The team also worked super hard to make more tools including image citations /

Yihe Deng (@yihe__deng)'s Twitter Profile Photo

🤖 I just updated my repository of RL(HF) summary notes to include a growing exploration of new topics, specifically adding notes to projects related to DeepSeek R1 reasoning. 

Take a look: github.com/yihedeng9/rlhf… 🚀

I’m hoping these summaries are helpful, and I’d love to hear
Siyan Zhao (@siyan_zhao)'s Twitter Profile Photo

Introducing d1🚀 — the first framework that applies reinforcement learning to improve reasoning in masked diffusion LLMs (dLLMs).

Combining masked SFT with a novel form of policy gradient algorithm, d1 significantly boosts the performance of pretrained dLLMs like LLaDA.