Center for Human-Compatible AI (@chai_berkeley) 's Twitter Profile
Center for Human-Compatible AI

@chai_berkeley

CHAI is a multi-institute research organization based out of UC Berkeley that focuses on foundational research for AI technical safety.

ID: 1058055843466244096

linkhttp://humancompatible.ai calendar_today01-11-2018 17:59:05

200 Tweet

3,3K Followers

109 Following

Scott Emmons (@emmons_scott) 's Twitter Profile Photo

Can explainability methods help predict behavior on new inputs? Past studies test with crowd workers. We test with GPT-4, creating a fully automated benchmark. Our results are sobering: we find no method that helps. It's an open challenge! arxiv.org/abs/2312.12747

Can explainability methods help predict behavior on new inputs?

Past studies test with crowd workers. We test with GPT-4, creating a fully automated benchmark.

Our results are sobering: we find no method that helps. It's an open challenge!

arxiv.org/abs/2312.12747
Scott Emmons (@emmons_scott) 's Twitter Profile Photo

Some jailbreaks *harm model intelligence*. In severe cases, they halve MMLU accuracy! We study this and present the StrongREJECT jailbreak benchmark. Compared to prior work, StrongREJECT has the least error from human jailbreak judgments. arxiv.org/abs/2402.10260

Some jailbreaks *harm model intelligence*. In severe cases, they halve MMLU accuracy!

We study this and present the StrongREJECT jailbreak benchmark. Compared to prior work, StrongREJECT has the least error from human jailbreak judgments.

arxiv.org/abs/2402.10260
Scott Emmons (@emmons_scott) 's Twitter Profile Photo

When do RLHF policies appear aligned but misbehave in subtle ways? Consider a terminal assistant that hides error messages to receive better human feedback. We provide a formal definition of deception and prove conditions about when RLHF causes it. A 🧵

When do RLHF policies appear aligned but misbehave in subtle ways?

Consider a terminal assistant that hides error messages to receive better human feedback.

We provide a formal definition of deception and prove conditions about when RLHF causes it. A 🧵
Michael Cohen (@michael05156007) 's Twitter Profile Photo

Recent research justifies a concern that AI could escape our control and cause human extinction. Very advanced long-term planning agents, if they're ever made, are a particularly concerning kind of future AI. Our paper on what governments should do just came out in Science.🧵

Recent research justifies a concern that AI could escape our control and cause human extinction. Very advanced long-term planning agents, if they're ever made, are a particularly concerning kind of future AI. Our paper on what governments should do just came out in Science.🧵
Shreyas Kapur (@shreyaskapur) 's Twitter Profile Photo

My first PhD paper!🎉We learn *diffusion* models for code generation that learn to directly *edit* syntax trees of programs. The result is a system that can incrementally write code, see the execution output, and debug it. 🧵1/n

Erik Jenner (@jenner_erik) 's Twitter Profile Photo

♟️Do chess-playing neural nets rely purely on simple heuristics? Or do they implement algorithms involving *look-ahead* in a single forward pass? We find clear evidence of 2-turn look-ahead in a chess-playing network, using techniques from mechanistic interpretability! 🧵

Micah Carroll (@micahcarroll) 's Twitter Profile Photo

Excited to share a unifying formalism for the main problem I’ve tackled since starting my PhD! 🎉 Current AI Alignment techniques ignore the fact that human preferences/values can change. What would it take to account for this? 🤔 A thread 🧵⬇️

Excited to share a unifying formalism for the main problem I’ve tackled since starting my PhD! 🎉

Current AI Alignment techniques ignore the fact that human preferences/values can change. What would it take to account for this? 🤔

A thread 🧵⬇️
Cam Allen (@camall3n) 's Twitter Profile Photo

RL in POMDPs is hard because you need memory. Remembering *everything* is expensive, and RNNs can only get you so far applied naively. New paper: 🎉 we introduce a theory-backed loss function that greatly improves RNN performance! 🧵 1/n

Micah Carroll (@micahcarroll) 's Twitter Profile Photo

🚨 New paper: We find that even safety-tuned LLMs learn to manipulate vulnerable users when training them further with user feedback 🤖😵‍💫 In our simulated scenarios, LLMs learn to e.g. selectively validate users' self-destructive behaviors, or deceive them into giving 👍. 🧵👇

🚨 New paper: We find that even safety-tuned LLMs learn to manipulate vulnerable users when training them further with user feedback 🤖😵‍💫

In our simulated scenarios, LLMs learn to e.g. selectively validate users' self-destructive behaviors, or deceive them into giving 👍.

🧵👇
Center for Human-Compatible AI (@chai_berkeley) 's Twitter Profile Photo

Want to help shape the future of safe AI? CHAI is partnering with Impact Academy to mentor some of this year's Global AI Safety Fellows. Applications are open now through Dec. 31. There's also a reward for referrals if you know someone who'd be a good fit!

Luke Bailey (@lukebailey181) 's Twitter Profile Photo

Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵

Jiahai Feng (@feng_jiahai) 's Twitter Profile Photo

LMs can generalize to implications of facts they are finetuned on. But what mechanisms enable this, and how are these mechanisms learned in pretraining? We develop conceptual and empirical tools for studying these qns. 🧵

LMs can generalize to implications of facts they are finetuned on. But what mechanisms enable this, and how are these mechanisms learned in pretraining? We develop conceptual and empirical tools for studying these qns. 🧵
Cassidy Laidlaw (@cassidy_laidlaw) 's Twitter Profile Photo

When RLHFed models engage in “reward hacking” it can lead to unsafe/unwanted behavior. But there isn’t a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵

Ben Plaut (@benplaut) 's Twitter Profile Photo

(1/5) New paper! Despite concerns about AI catastrophe, there isn’t much work on learning while provably avoiding catastrophe. In fact, nearly all of learning theory assumes all errors are reversible. Stuart Russell, Hanlin Zhu and I fill this gap: arxiv.org/pdf/2402.08062

Mason Nakamura (@masonnaka) 's Twitter Profile Photo

Preference learning typically requires large amounts of pairwise feedback to learn an adequate preference model. However, can we improve the sample-efficiency and alignment ability of preference learning with linguistic feedback? With MAPLE🍁, we can! (AAAI-25 Alignment Track)🧵

Preference learning typically requires large amounts of pairwise feedback to learn an adequate preference model. However, can we improve the sample-efficiency and alignment ability of preference learning with linguistic feedback? With MAPLE🍁, we can! (AAAI-25 Alignment Track)🧵
Aly Lidayan @ ICLR (@a_lidayan) 's Twitter Profile Photo

🚨Our new #ICLR2025 paper presents a unified framework for intrinsic motivation and reward shaping: they signal the value of the RL agent’s state🤖=external state🌎+past experience🧠. Rewards based on potentials over the learning agent’s state provably avoid reward hacking!🧵

🚨Our new #ICLR2025 paper presents a unified framework for intrinsic motivation and reward shaping: they signal the value of the RL agent’s state🤖=external state🌎+past experience🧠. Rewards based on potentials over the learning agent’s state provably avoid reward hacking!🧵
Ben Plaut (@benplaut) 's Twitter Profile Photo

(1/7) New paper with Khanh Nguyen and Tu Trinh! Do LLM output probabilities actually relate to the probability of correctness? Or are they channeling this guy: ⬇️

(1/7) New paper with <a href="/khanhxuannguyen/">Khanh Nguyen</a> and <a href="/thetututrain/">Tu Trinh</a>! Do LLM output probabilities actually relate to the probability of correctness? Or are they channeling this guy: ⬇️
Cassidy Laidlaw (@cassidy_laidlaw) 's Twitter Profile Photo

We built an AI assistant that plays Minecraft with you. Start building a house—it figures out what you’re doing and jumps in to help. This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵

Karim Abdel Sadek (@karim_abdelll) 's Twitter Profile Photo

*New AI Alignment Paper* 🚨 Goal misgeneralization occurs when AI agents learn the wrong reward function, instead of the human's intended goal. 😇 We show that training with a minimax regret objective provably mitigates it, promoting safer and better-aligned RL policies!