
Micah Carroll
@micahcarroll
AI PhD student @berkeley_ai w/ @ancadianadragan & Stuart Russell. Working on AI safety ⊃ preference changes/AI manipulation.
ID: 356711942
http://micahcarroll.github.io · Joined: 17-08-2011 07:40:21
657 Tweets
1.1K Followers
669 Following





The 2nd Workshop on Models of Human Feedback for AI Alignment will take place at ICML 2025 on 18/19 July in Vancouver! Submit here: openreview.net/group?id=ICML.… 📅 Deadline: May 25th, 2025 (AoE) 🔗 More info: sites.google.com/view/mhf-icml2… Hope to see you in Vancouver!




My colleague Ian McKenzie spent six hours red-teaming Claude 4 Opus and easily bypassed safeguards designed to block WMD development. Claude gave >15 pages of non-redundant instructions for producing sarin gas, describing all key steps in the manufacturing process.



New paper! With Joshua Clymer, Jonah Weinbaum, and others, we’ve written a safety case for safeguards against misuse. We lay out how developers can connect safeguard evaluation results to real-world decisions about how to deploy models. 🧵


What to do about gradual disempowerment? We laid out a research agenda with all the concrete and feasible research projects we can think of: 🧵 With Raymond Douglas, Jan Kulveit, and David Krueger.


LLMs' sycophancy issues are a predictable result of optimizing for user feedback. Even if overt sycophantic behaviors get fixed, AIs' exploitation of our cognitive biases may only become more subtle. Grateful our research on this was featured by Nitasha Tiku & The Washington Post!


A great Washington Post story to be quoted in. I spoke to Nitasha Tiku about our work on human-AI relationships, as well as early results from our University of Oxford survey of 2,000 UK citizens showing that ~30% have sought AI companionship, emotional support, or social interaction in the past year.

