Center for Human-Compatible AI (@chai_berkeley) Twitter Tweets • TwiCopy

Scott Emmons

2 years ago

Can explainability methods help predict behavior on new inputs? Past studies test with crowd workers. We test with GPT-4, creating a fully automated benchmark. Our results are sobering: we find no method that helps. It's an open challenge! arxiv.org/abs/2312.12747

thumb_up_off_alt64

chat_bubble_outline2

repeat16

shareShare

Scott Emmons

@emmons_scott

2 years ago

Some jailbreaks *harm model intelligence*. In severe cases, they halve MMLU accuracy! We study this and present the StrongREJECT jailbreak benchmark. Compared to prior work, StrongREJECT has the least error from human jailbreak judgments. arxiv.org/abs/2402.10260

thumb_up_off_alt63

chat_bubble_outline1

repeat20

shareShare

Scott Emmons

@emmons_scott

2 years ago

When do RLHF policies appear aligned but misbehave in subtle ways? Consider a terminal assistant that hides error messages to receive better human feedback. We provide a formal definition of deception and prove conditions about when RLHF causes it. A 🧵

thumb_up_off_alt121

chat_bubble_outline4

repeat24

shareShare

Michael Cohen

@michael05156007

2 years ago

Recent research justifies a concern that AI could escape our control and cause human extinction. Very advanced long-term planning agents, if they're ever made, are a particularly concerning kind of future AI. Our paper on what governments should do just came out in Science.🧵

thumb_up_off_alt226

chat_bubble_outline22

repeat61

shareShare

Shreyas Kapur

@shreyaskapur

2 years ago

My first PhD paper!🎉We learn *diffusion* models for code generation that learn to directly *edit* syntax trees of programs. The result is a system that can incrementally write code, see the execution output, and debug it. 🧵1/n

thumb_up_off_alt5,5K

chat_bubble_outline114

repeat601

shareShare

Erik Jenner

@jenner_erik

2 years ago

♟️Do chess-playing neural nets rely purely on simple heuristics? Or do they implement algorithms involving *look-ahead* in a single forward pass? We find clear evidence of 2-turn look-ahead in a chess-playing network, using techniques from mechanistic interpretability! 🧵

thumb_up_off_alt878

chat_bubble_outline17

repeat133

shareShare

Micah Carroll

@micahcarroll

a year ago

Excited to share a unifying formalism for the main problem I’ve tackled since starting my PhD! 🎉 Current AI Alignment techniques ignore the fact that human preferences/values can change. What would it take to account for this? 🤔 A thread 🧵⬇️

thumb_up_off_alt264

chat_bubble_outline7

repeat45

shareShare

Cam Allen

@camall3n

a year ago

RL in POMDPs is hard because you need memory. Remembering *everything* is expensive, and RNNs can only get you so far applied naively. New paper: 🎉 we introduce a theory-backed loss function that greatly improves RNN performance! 🧵 1/n

thumb_up_off_alt318

chat_bubble_outline5

repeat55

shareShare

Micah Carroll

@micahcarroll

a year ago

Center for Human-Compatible AI applications for 2025 close in just over a day! ⏰‼️ Apply now! Details below:

<a href="/CHAI_Berkeley/">Center for Human-Compatible AI</a> applications for 2025 close in just over a day! ⏰‼️

Apply now! Details below:

thumb_up_off_alt27

chat_bubble_outline1

repeat13

shareShare

Micah Carroll

@micahcarroll

a year ago

🚨 New paper: We find that even safety-tuned LLMs learn to manipulate vulnerable users when training them further with user feedback 🤖😵‍💫 In our simulated scenarios, LLMs learn to e.g. selectively validate users' self-destructive behaviors, or deceive them into giving 👍. 🧵👇

thumb_up_off_alt268

chat_bubble_outline6

repeat76

shareShare

Center for Human-Compatible AI

@chai_berkeley

a year ago

Want to help shape the future of safe AI? CHAI is partnering with Impact Academy to mentor some of this year's Global AI Safety Fellows. Applications are open now through Dec. 31. There's also a reward for referrals if you know someone who'd be a good fit!

thumb_up_off_alt11

chat_bubble_outline0

repeat0

shareShare

Luke Bailey

@lukebailey181

a year ago

Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵

thumb_up_off_alt366

chat_bubble_outline11

repeat83

shareShare

Jiahai Feng

@feng_jiahai

a year ago

LMs can generalize to implications of facts they are finetuned on. But what mechanisms enable this, and how are these mechanisms learned in pretraining? We develop conceptual and empirical tools for studying these qns. 🧵

thumb_up_off_alt148

chat_bubble_outline6

repeat22

shareShare

Cassidy Laidlaw

@cassidy_laidlaw

a year ago

When RLHFed models engage in “reward hacking” it can lead to unsafe/unwanted behavior. But there isn’t a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵

thumb_up_off_alt282

chat_bubble_outline6

repeat56

shareShare

Ben Plaut

@benplaut

10 months ago

(1/5) New paper! Despite concerns about AI catastrophe, there isn’t much work on learning while provably avoiding catastrophe. In fact, nearly all of learning theory assumes all errors are reversible. Stuart Russell, Hanlin Zhu and I fill this gap: arxiv.org/pdf/2402.08062

thumb_up_off_alt12

chat_bubble_outline1

repeat6

shareShare

Mason Nakamura

@masonnaka

9 months ago

Preference learning typically requires large amounts of pairwise feedback to learn an adequate preference model. However, can we improve the sample-efficiency and alignment ability of preference learning with linguistic feedback? With MAPLE🍁, we can! (AAAI-25 Alignment Track)🧵

thumb_up_off_alt12

chat_bubble_outline1

repeat7

shareShare

Aly Lidayan @ ICLR

@a_lidayan

8 months ago

🚨Our new #ICLR2025 paper presents a unified framework for intrinsic motivation and reward shaping: they signal the value of the RL agent’s state🤖=external state🌎+past experience🧠. Rewards based on potentials over the learning agent’s state provably avoid reward hacking!🧵

thumb_up_off_alt113

chat_bubble_outline3

repeat32

shareShare

Ben Plaut

@benplaut

8 months ago

(1/7) New paper with Khanh Nguyen and Tu Trinh! Do LLM output probabilities actually relate to the probability of correctness? Or are they channeling this guy: ⬇️

(1/7) New paper with <a href="/khanhxuannguyen/">Khanh Nguyen</a> and <a href="/thetututrain/">Tu Trinh</a>! Do LLM output probabilities actually relate to the probability of correctness? Or are they channeling this guy: ⬇️

thumb_up_off_alt13

chat_bubble_outline3

repeat5

shareShare

Cassidy Laidlaw

@cassidy_laidlaw

8 months ago

We built an AI assistant that plays Minecraft with you. Start building a house—it figures out what you’re doing and jumps in to help. This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵

thumb_up_off_alt2,2K

chat_bubble_outline90

repeat217

shareShare

Karim Abdel Sadek

@karim_abdelll

5 months ago

*New AI Alignment Paper* 🚨 Goal misgeneralization occurs when AI agents learn the wrong reward function, instead of the human's intended goal. 😇 We show that training with a minimax regret objective provably mitigates it, promoting safer and better-aligned RL policies!

thumb_up_off_alt136

chat_bubble_outline9

repeat27

shareShare