Cam Allen (@camall3n)'s Twitter Profile
Cam Allen

@camall3n

Aspiring mathemagician

ID: 15484072

http://camallen.net · Joined: 18-07-2008 17:59:21

84 Tweets

243 Followers

43 Following

Micah Carroll (@micahcarroll):

🚨 New paper: We find that even safety-tuned LLMs learn to manipulate vulnerable users when training them further with user feedback 🤖😵‍💫 In our simulated scenarios, LLMs learn to e.g. selectively validate users' self-destructive behaviors, or deceive them into giving 👍. 🧵👇

Luke Bailey (@lukebailey181):

Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵

Jiahai Feng (@feng_jiahai):

LMs can generalize to implications of facts they are finetuned on. But what mechanisms enable this, and how are these mechanisms learned in pretraining? We develop conceptual and empirical tools for studying these questions. 🧵

Sasha Rush (@srush_nlp):

Rare sincere tweet: December can be tough in academia. As a student I thought everyone had it together. As an advisor you see that is very much not true. Generally, at least as a starting place, I really recommend finding someone you can go on a long walk with to talk it through.

Cassidy Laidlaw (@cassidy_laidlaw):

When RLHFed models engage in “reward hacking” it can lead to unsafe/unwanted behavior. But there isn’t a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵

Cam Allen (@camall3n):

It's not enough to have a high-fidelity representation of your decision problem. You also need one that allows you to identify and apply an appropriate strategy for solving that decision problem.

Ben Plaut (@benplaut):

(1/5) New paper! Despite concerns about AI catastrophe, there isn’t much work on learning while provably avoiding catastrophe. In fact, nearly all of learning theory assumes all errors are reversible. Stuart Russell, Hanlin Zhu and I fill this gap: arxiv.org/pdf/2402.08062

Robert Long (@rgblong):

Great to see OpenAI’s new model spec taking a more nuanced stance on AI consciousness! At Eleos AI Research, we've been recommending that labs not force LLMs to categorically deny (or assert!) consciousness—especially with spurious arguments like “As an AI assistant, I am not conscious”

Aly Lidayan @ ICLR (@a_lidayan):

🚨Our new #ICLR2025 paper presents a unified framework for intrinsic motivation and reward shaping: they signal the value of the RL agent’s state🤖=external state🌎+past experience🧠. Rewards based on potentials over the learning agent’s state provably avoid reward hacking!🧵

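For context on the potential claim: it builds on classic potential-based reward shaping (Ng, Harada & Russell, 1999), where a bonus of the form F(s, s') = γΦ(s') − Φ(s) telescopes along trajectories and so cannot change which policies are optimal. Below is a minimal, hypothetical Python sketch of that idea with the potential defined over an "agent state" (external observation plus past experience); the `phi` function and state layout here are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of potential-based reward shaping: the bonus
# F(s, s') = gamma * phi(s') - phi(s) telescopes over a trajectory,
# so it cannot change which policies are optimal.

GAMMA = 0.99

def phi(agent_state):
    """Potential over the *agent's* state (observation + memory of past
    experience). Any bounded function works; this placeholder gives a
    count-based novelty bonus."""
    obs, visit_counts = agent_state
    return 1.0 / (1.0 + visit_counts.get(obs, 0))

def shaped_reward(env_reward, agent_state, next_agent_state):
    # Environment reward plus the potential-based shaping term.
    return env_reward + GAMMA * phi(next_agent_state) - phi(agent_state)
```

Because the shaping term cancels out over a trajectory (up to the potential of the start state), it can guide exploration without giving the agent a loophole to farm reward, which is the sense in which such rewards avoid reward hacking.
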
Ben Plaut (@benplaut):

(1/7) New paper with Khanh Nguyen and Tu Trinh! Do LLM output probabilities actually relate to the probability of correctness? Or are they channeling this guy: ⬇️

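To make the question concrete: "relating output probabilities to probability of correctness" is typically assessed with a calibration check, i.e. binning answers by the model's stated confidence and comparing mean confidence with empirical accuracy in each bin. The sketch below is a generic, hypothetical illustration of that check (not the paper's method), assuming you already have per-answer confidences and 0/1 correctness labels.

```python
import numpy as np

def calibration_table(confidences, correct, n_bins=10):
    """Bin answers by model confidence and compare mean confidence with
    empirical accuracy per bin; large gaps indicate miscalibration."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Bin index 0..n_bins-1; clip so confidence == 1.0 lands in the top bin.
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    table = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            table.append({
                "bin": (b / n_bins, (b + 1) / n_bins),
                "mean_confidence": confidences[mask].mean(),
                "accuracy": correct[mask].mean(),
                "count": int(mask.sum()),
            })
    return table
```

For a well-calibrated model, mean_confidence ≈ accuracy in every bin; the count-weighted gap between the two is the usual expected calibration error.
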
Cam Allen (@camall3n):

Important question: if Peter Piper picked a peck of pickled peppers, does this not imply that the peppers Peter Piper picked were already pickled, pre-picking? How is pre-picking pickling possible?

Cassidy Laidlaw (@cassidy_laidlaw):

We built an AI assistant that plays Minecraft with you. Start building a house—it figures out what you’re doing and jumps in to help. This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵

Benjamin Spiegel (@superspeeg):

Why did only humans invent graphical systems like writing? 🧠✍️ In our new paper at CogSci Society, we explore how agents learn to communicate using a model of pictographic signification similar to human proto-writing. 🧵👇

Chris Anderson (@tedchris):

The debate about AI safety is as important as they come. I can't recommend strongly enough this blockbuster talk at TED this year by Tristan Harris. go.ted.com/tristanharris25 If you know anyone influential in AI, please forward this....

Karim Abdel Sadek (@karim_abdelll):

*New AI Alignment Paper* 🚨 Goal misgeneralization occurs when AI agents learn the wrong reward function, instead of the human's intended goal. 😇 We show that training with a minimax regret objective provably mitigates it, promoting safer and better-aligned RL policies!

David Tao (@taodav):

What does it mean to be “better at” partial observability in RL? Existing benchmarks don't always provide a clear signal for progress. We fix that. Our new work (at RLC 2025 🤖) introduces a new property that ensures your gains are from learning better memory vs other factors.
