Yangsibo Huang (@yangsibohuang) 's Twitter Profile
Yangsibo Huang

@yangsibohuang

Research scientist @GoogleAI. Gemini thinking & multilinguality. Prev: PhD from @Princeton. Opinions are my own.

ID: 2878031881

linkhttp://hazelsuko07.github.io/yangsibo/ calendar_today26-10-2014 08:09:08

405 Tweet

4,4K Followers

746 Following

Stephanie Chan (@scychan_brains) 's Twitter Profile Photo

Devastatingly, we have lost a bright light in our field. Felix Hill was not only a deeply insightful thinker -- he was also a generous, thoughtful mentor to many researchers. He majorly changed my life, and I can't express how much I owe to him. Even now, Felix still has so much

Wei Qiu (@weiqiu55) 's Twitter Profile Photo

📢I am on the academic job market this year! My research interest involves utilizing AI and explainable AI to explore the mechanisms of aging and age-related diseases. I'm looking for faculty positions in AI for Biomedicine. Check out my website: qiuweipku.github.io

Noam Shazeer (@noamshazeer) 's Twitter Profile Photo

Your feedback on Gemini 2.0 Flash Thinking has been incredible—thank you! We’ve taken your suggestions and made an experimental update…

Yangsibo Huang (@yangsibohuang) 's Twitter Profile Photo

LLM safety guardrails can be easily removed through fine-tuning. While defenses have been proposed, our #ICLR2025 paper shows flawed evaluations can create a false sense of security. Check out the thread by Boyi Wei for more details 🧵

Kaixuan Huang (@kaixuanhuang1) 's Twitter Profile Photo

Do LLMs have true generalizable mathematical reasoning capability or are they merely memorizing problem-solving skills? 🤨 We present MATH-Perturb, modified level-5 problems from MATH dataset to benchmark LLMs' generalizability to slightly perturbed problems. 🔗

Do LLMs have true generalizable mathematical reasoning capability or are they merely memorizing problem-solving skills? 🤨

We present MATH-Perturb, modified level-5 problems from MATH dataset to benchmark LLMs' generalizability to slightly perturbed problems.

🔗
Yangsibo Huang (@yangsibohuang) 's Twitter Profile Photo

Think we're done with Hendrycks MATH? Well, we show that expert perturbations of the benchmark can drop frontier model accuracy by ~15% (Gemini thinking, OpenAI o1 etc.). We attribute this to skill memorization.

Kaixuan Huang (@kaixuanhuang1) 's Twitter Profile Photo

Just tested Llama4-Scout on our MATH-Perturb benchmark. There is a surprising 18% gap between Original and MATH-P-Simple, making it unique among the 20+ models that came out after 2024. 😂😂 🔗Leaderboard available at math-perturb.github.io. x.com/KaixuanHuang1/…

Just tested Llama4-Scout on our MATH-Perturb benchmark. There is a surprising 18% gap between Original and MATH-P-Simple, making it unique among the 20+ models that came out after 2024. 😂😂 

🔗Leaderboard available at math-perturb.github.io. 

x.com/KaixuanHuang1/…
Christopher Choquette @ ICLR25 (@chris_choquette) 's Twitter Profile Photo

Our team, Google DeepMind Privacy & Security Research, is hiring for several roles, including one to work with me on privacy & memorization auditing! Please reach out for more details... And if you're at #ICLR2025, we can meet to chat about them :)

Princeton Computer Science (@princetoncs) 's Twitter Profile Photo

Congrats to Kai Li on being named a member of the American Academy of Arts & Sciences! 🎉 Li joined Princeton University in 1986 and has made important contributions to several research areas in computer science. bit.ly/3RPLxas

Congrats to Kai Li on being named a member of the American Academy of Arts & Sciences! 🎉

Li joined <a href="/Princeton/">Princeton University</a> in 1986 and has made important contributions to several research areas in computer science. 
 
bit.ly/3RPLxas
Jack Rae (@jack_w_rae) 's Twitter Profile Photo

Today Demis announced Deep Think which marks our progression to greater test-time compute and stronger reasoning capabilities in Gemini 💎 Highlighting USAMO which is a very challenging set of held-out math problems, we're now at 49% accuracy. This is equivalent to the top

Today Demis announced Deep Think which marks our progression to greater test-time compute and stronger reasoning capabilities in Gemini 💎

Highlighting USAMO which is a very challenging set of held-out math problems, we're now at 49% accuracy. This is equivalent to the top
Sundar Pichai (@sundarpichai) 's Twitter Profile Photo

Gemini 2.5 Pro + 2.5 Flash are now stable and generally available. Plus, get a preview of Gemini 2.5 Flash-Lite, our fastest + most cost-efficient 2.5 model yet. 🔦 Exciting steps as we expand our 2.5 series of hybrid reasoning models that deliver amazing performance at the

Gemini 2.5 Pro + 2.5 Flash are now stable and generally available. Plus, get a preview of Gemini 2.5 Flash-Lite, our fastest + most cost-efficient 2.5 model yet. 🔦

Exciting steps as we expand our 2.5 series of hybrid reasoning models that deliver amazing performance at the
Thang Luong (@lmthang) 's Twitter Profile Photo

Yes, there is an official marking guideline from the IMO organizers which is not available externally. Without the evaluation based on that guideline, no medal claim can be made. With one point deducted, it is a Silver, not Gold.

Ankesh Anand (@ankesh_anand) 's Twitter Profile Photo

We can finally share this now: A Gemini model trained with new RL techniques and scaled up inference-time compute model has achieved gold-medal level performance at IMO 2025! 🥇

We can finally share this now: 
A Gemini model trained with new RL techniques and scaled up inference-time compute model has achieved gold-medal level performance at IMO 2025! 🥇
Fred Zhang (@fredzhang0) 's Twitter Profile Photo

This is the most scaling-pilled project I've ever been part of, and the team really cooked. TL;DR: With RL and inference scaling, Gemini perfectly solved 5 out of 6 problems, reaching a gold medal in IMO '25, all within the time constraints of 4.5hr.

Dawsen Hwang (@dawsenhwang) 's Twitter Profile Photo

From being a kid passionate about IMO problems to now helping lead the effort at Google DeepMind to get an AI to that same level—what a journey. Thanks to my brilliant coworkers & the IMO board. Excited to see how AI will push the frontiers of science for humanity.

Yangsibo Huang (@yangsibohuang) 's Twitter Profile Photo

Gemini 2.5 Deep Think is available to Ultra users! It achieves SOTA on HLE (no tools), LiveCodeBench, and math/proofs. Time to give it a try and let us know your feedback! We’ve also made the IMO gold model available to mathematicians and other domain experts :)👩‍🍳