Micah Carroll (@micahcarroll)'s Twitter Profile
Micah Carroll

@micahcarroll

AI PhD student @berkeley_ai w/ @ancadianadragan & Stuart Russell. Working on AI safety ⊃ preference changes/AI manipulation.

ID: 356711942

Link: http://micahcarroll.github.io · Joined: 17-08-2011 07:40:21

657 Tweets

1.1K Followers

669 Following

Jacy Reese Anthis (@jacyanthis)'s Twitter Profile Photo

Should we use LLMs 🤖 to simulate human research subjects 🧑? In our new preprint, we argue sims can augment human studies to scale up social science as AI technology accelerates. We identify five tractable challenges and argue this is a promising and underused research method 🧵

Cassidy Laidlaw (@cassidy_laidlaw)'s Twitter Profile Photo

We built an AI assistant that plays Minecraft with you. Start building a house—it figures out what you’re doing and jumps in to help. This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵

Tom Everitt (@tom4everitt)'s Twitter Profile Photo

What if LLMs are sometimes capable of doing a task but don't try hard enough to do it? In a new paper, we use subtasks to assess capabilities. Perhaps surprisingly, LLMs often fail to fully employ their capabilities, i.e. they are not fully *goal-directed* 🧵

Joe Edelman (@edelwax)'s Twitter Profile Photo

Long-term user satisfaction is *also* not the right metric. It will correlate with addiction. You gotta measure flourishing, according to the user's values, in areas where they collab w/ ChatGPT.

Thomas Kleine Buening (@thomasklbg)'s Twitter Profile Photo

The 2nd Workshop on Models of Human Feedback for AI Alignment will take place at ICML 2025 on 18/19 July in Vancouver!

Submit here: openreview.net/group?id=ICML.…
📅 Deadline: May 25th, 2025 (AoE)
🔗 More Info: sites.google.com/view/mhf-icml2…

Hope to see you in Vancouver!
Michaël Trazzi (@michaeltrazzi)'s Twitter Profile Photo

"SB-1047: The Battle For The Future of AI" Full Documentary uncovering what really happened behind the scenes of the SB-1047 debate, now available on X This project is the culmination of 8 months of work, 20+ interviews, and is probably the best video I've ever made. Enjoy!

Adam Gleave (@argleave)'s Twitter Profile Photo

My colleague Ian McKenzie spent six hours red-teaming Claude 4 Opus, and easily bypassed safeguards designed to block WMD development. Claude gave >15 pages of non-redundant instructions for sarin gas, describing all key steps in the manufacturing process.

Hannah Rose Kirk (@hannahrosekirk)'s Twitter Profile Photo

Why do human–AI relationships need socioaffective alignment? As AI evolves from tools to companions, we must seek systems that enhance rather than exploit our nature as social & emotional beings. Published today in Nature Humanities & Social Sciences! nature.com/articles/s4159…

Robert Kirk (@_robertkirk)'s Twitter Profile Photo

New paper! With Joshua Clymer, Jonah Weinbaum and others, we’ve written a safety case for safeguards against misuse. We lay out how developers can connect safeguard evaluation results to real-world decisions about how to deploy models. 🧵

David Duvenaud (@davidduvenaud)'s Twitter Profile Photo

What to do about gradual disempowerment? We laid out a research agenda with all the concrete and feasible research projects we can think of. 🧵 with Raymond Douglas, Jan Kulveit, and David Krueger

Nitasha Tiku (@nitashatiku)'s Twitter Profile Photo

AI is speedrunning the social media era by optimizing chatbots for engagement, user feedback, + time spent. Evidence is mounting that this poses unintended risks, including chats from peer-reviewed research, OpenAI's "sycophancy" debacle, & Character.AI lawsuits.

Micah Carroll (@micahcarroll)'s Twitter Profile Photo

LLMs' sycophancy issues are a predictable result of optimizing for user feedback. Even if clear sycophantic behaviors get fixed, AIs' exploits of our cognitive biases may only become more subtle. Grateful our research on this was featured by Nitasha Tiku & The Washington Post!

Neel Nanda (@neelnanda5)'s Twitter Profile Photo

I've been really feeling how much the general public is concerned about AI risk... In a *weird* amount of recent interactions with normal people (e.g. my hairdresser), when I say I do AI research (*not* safety), they ask if AI will take over. Alas, I have no reassurances to offer.

Hannah Rose Kirk (@hannahrosekirk)'s Twitter Profile Photo

A great Washington Post story to be quoted in. I spoke to Nitasha Tiku about our work on human–AI relationships, as well as early results from our University of Oxford survey of 2k UK citizens showing ~30% have sought AI companionship, emotional support, or social interaction in the past year.

Iason Gabriel (@iasongabriel)'s Twitter Profile Photo

1. How can we remain healthy and free while engaging in extended personal interaction with AI agents that shape our behaviour and preferences? One answer is "socioaffective alignment," as discussed in our new paper in Nature Humanities & Social Sciences! nature.com/articles/s4159…

METR (@metr_evals)'s Twitter Profile Photo

We already find it hard to understand what the model is doing and whether a high score is due to a clever optimization or a brittle hack. As models get more capable, it will become increasingly difficult to determine what is reward hacking and what is intended behavior.