Yangsibo Huang (@yangsibohuang) Twitter Tweets • TwiCopy

Stephanie Chan

8 months ago

Devastatingly, we have lost a bright light in our field. Felix Hill was not only a deeply insightful thinker -- he was also a generous, thoughtful mentor to many researchers. He majorly changed my life, and I can't express how much I owe to him. Even now, Felix still has so much

thumb_up_off_alt614

chat_bubble_outline6

repeat93

shareShare

Yangsibo Huang

@yangsibohuang

8 months ago

A great opportunity for early-career faculty—apply by Jan 27th!

thumb_up_off_alt32

chat_bubble_outline0

repeat1

shareShare

Wei Qiu

@weiqiu55

7 months ago

📢I am on the academic job market this year! My research interest involves utilizing AI and explainable AI to explore the mechanisms of aging and age-related diseases. I'm looking for faculty positions in AI for Biomedicine. Check out my website: qiuweipku.github.io

thumb_up_off_alt191

chat_bubble_outline5

repeat34

shareShare

Noam Shazeer

@noamshazeer

7 months ago

Your feedback on Gemini 2.0 Flash Thinking has been incredible—thank you! We’ve taken your suggestions and made an experimental update…

thumb_up_off_alt1,1K

chat_bubble_outline27

repeat62

shareShare

Yangsibo Huang

@yangsibohuang

7 months ago

LLM safety guardrails can be easily removed through fine-tuning. While defenses have been proposed, our #ICLR2025 paper shows flawed evaluations can create a false sense of security. Check out the thread by Boyi Wei for more details 🧵

thumb_up_off_alt69

chat_bubble_outline0

repeat7

shareShare

Luxi (Lucy) He

@luxihelucy

7 months ago

Excited to share that our paper is accepted at ICLR 2026 #ICLR2025! See you in Singapore!

thumb_up_off_alt114

chat_bubble_outline5

repeat12

shareShare

Kaixuan Huang

@kaixuanhuang1

7 months ago

Do LLMs have true generalizable mathematical reasoning capability or are they merely memorizing problem-solving skills? 🤨 We present MATH-Perturb, modified level-5 problems from MATH dataset to benchmark LLMs' generalizability to slightly perturbed problems. 🔗

thumb_up_off_alt918

chat_bubble_outline27

repeat133

shareShare

Yangsibo Huang

@yangsibohuang

7 months ago

Think we're done with Hendrycks MATH? Well, we show that expert perturbations of the benchmark can drop frontier model accuracy by ~15% (Gemini thinking, OpenAI o1 etc.). We attribute this to skill memorization.

thumb_up_off_alt82

chat_bubble_outline1

repeat8

shareShare

Kaixuan Huang

@kaixuanhuang1

5 months ago

Just tested Llama4-Scout on our MATH-Perturb benchmark. There is a surprising 18% gap between Original and MATH-P-Simple, making it unique among the 20+ models that came out after 2024. 😂😂 🔗Leaderboard available at math-perturb.github.io. x.com/KaixuanHuang1/…

thumb_up_off_alt234

chat_bubble_outline13

repeat34

shareShare

Christopher Choquette @ ICLR25

@chris_choquette

4 months ago

Our team, Google DeepMind Privacy & Security Research, is hiring for several roles, including one to work with me on privacy & memorization auditing! Please reach out for more details... And if you're at #ICLR2025, we can meet to chat about them :)

thumb_up_off_alt152

chat_bubble_outline1

repeat10

shareShare

Princeton Computer Science

@princetoncs

4 months ago

Congrats to Kai Li on being named a member of the American Academy of Arts & Sciences! 🎉 Li joined Princeton University in 1986 and has made important contributions to several research areas in computer science. bit.ly/3RPLxas

Congrats to Kai Li on being named a member of the American Academy of Arts & Sciences! 🎉

Li joined <a href="/Princeton/">Princeton University</a> in 1986 and has made important contributions to several research areas in computer science.

bit.ly/3RPLxas

thumb_up_off_alt108

chat_bubble_outline1

repeat16

shareShare

Yangsibo Huang

@yangsibohuang

4 months ago

Appetizers for Google I/O: Gemini nailed Pokémon Blue and swept LMArena. And definitely we are cooking more🧑‍🍳

thumb_up_off_alt64

chat_bubble_outline1

repeat2

shareShare

Jack Rae

@jack_w_rae

3 months ago

Today Demis announced Deep Think which marks our progression to greater test-time compute and stronger reasoning capabilities in Gemini 💎 Highlighting USAMO which is a very challenging set of held-out math problems, we're now at 49% accuracy. This is equivalent to the top

thumb_up_off_alt374

chat_bubble_outline37

repeat41

shareShare

Sundar Pichai

@sundarpichai

3 months ago

Gemini 2.5 Pro + 2.5 Flash are now stable and generally available. Plus, get a preview of Gemini 2.5 Flash-Lite, our fastest + most cost-efficient 2.5 model yet. 🔦 Exciting steps as we expand our 2.5 series of hybrid reasoning models that deliver amazing performance at the

thumb_up_off_alt4,4K

chat_bubble_outline170

repeat445

shareShare

Thang Luong

@lmthang

a month ago

Yes, there is an official marking guideline from the IMO organizers which is not available externally. Without the evaluation based on that guideline, no medal claim can be made. With one point deducted, it is a Silver, not Gold.

thumb_up_off_alt591

chat_bubble_outline15

repeat56

shareShare

Ankesh Anand

@ankesh_anand

a month ago

We can finally share this now: A Gemini model trained with new RL techniques and scaled up inference-time compute model has achieved gold-medal level performance at IMO 2025! 🥇

thumb_up_off_alt469

chat_bubble_outline14

repeat29

shareShare

Fred Zhang

@fredzhang0

a month ago

This is the most scaling-pilled project I've ever been part of, and the team really cooked. TL;DR: With RL and inference scaling, Gemini perfectly solved 5 out of 6 problems, reaching a gold medal in IMO '25, all within the time constraints of 4.5hr.

thumb_up_off_alt535

chat_bubble_outline17

repeat28

shareShare

Dawsen Hwang

@dawsenhwang

a month ago

From being a kid passionate about IMO problems to now helping lead the effort at Google DeepMind to get an AI to that same level—what a journey. Thanks to my brilliant coworkers & the IMO board. Excited to see how AI will push the frontiers of science for humanity.

thumb_up_off_alt73

chat_bubble_outline5

repeat6

shareShare

Yangsibo Huang

@yangsibohuang

a month ago

Gemini 2.5 Deep Think is available to Ultra users! It achieves SOTA on HLE (no tools), LiveCodeBench, and math/proofs. Time to give it a try and let us know your feedback! We’ve also made the IMO gold model available to mathematicians and other domain experts :)👩‍🍳

thumb_up_off_alt23

chat_bubble_outline0

repeat1

shareShare