Erik Jenner (@jenner_erik)'s Twitter Profile
Erik Jenner

@jenner_erik

Research scientist @ Google DeepMind working on AGI safety & alignment

ID: 724223679886929921

Website: https://ejenner.com · Joined: 24-04-2016 13:09:15

173 Tweets

882 Followers

149 Following

Buck Shlegeris (@bshlgrs)'s Twitter Profile Photo

🧵 New paper: Previous work on AI control focused on whether models can execute attack strategies. In our new work, we assess their ability to generate strategies.

Impact Academy (@aisafetyfellows)'s Twitter Profile Photo

🚀 Applications for the Global AI Safety Fellowship 2025 are closing on 31 December 2025!

We're looking for exceptional STEM talent from around the world who can advance the safe and beneficial development of AI. Fellows will get to work full-time with leading organisations in
Max Nadeau (@maxnadeau_)'s Twitter Profile Photo

🧵 Announcing Open Philanthropy's Technical AI Safety RFP!

We're seeking proposals across 21 research areas to help make AI systems more trustworthy, rule-following, and aligned, even as they become more capable.
Cas (Stephen Casper) (@stephenlcasper)'s Twitter Profile Photo

🚨 New ICLR 2026 blog post: Pitfalls of Evidence-Based AI Policy

Everyone agrees: evidence is key for policymaking. But that doesn't mean we should postpone AI regulation.

Instead of "Evidence-Based AI Policy," we need  "Evidence-Seeking AI Policy."

arxiv.org/abs/2502.09618…
David Lindner (@davlindner)'s Twitter Profile Photo

Consider applying for MATS if you're interested in working on an AI alignment research project this summer! I'm a mentor, as are many of my colleagues at DeepMind.

Ryan Greenblatt (@ryanpgreenblatt)'s Twitter Profile Photo

IMO, this isn't much of an update against CoT monitoring hopes. They show unfaithfulness when the reasoning is minimal enough that it doesn't need CoT. But my hopes for CoT monitoring come from the expectation that models will have to reason a lot to end up misaligned and cause huge problems. 🧵

Erik Jenner (@jenner_erik)'s Twitter Profile Photo

UK AISI is hiring; consider applying if you're interested in adversarial ML/red-teaming. Seems like a great team, and I think it's one of the best places in the world for doing adversarial ML work that's highly impactful.

Buck Shlegeris (@bshlgrs)'s Twitter Profile Photo

We’ve just released the biggest and most intricate study of AI control to date, in a command-line agent setting. IMO the techniques studied are the best available option for preventing misaligned early AGIs from causing sudden disasters, e.g. hacking servers they’re working on.

Erik Jenner (@jenner_erik)'s Twitter Profile Photo

My MATS (ML Alignment & Theory Scholars) scholar Rohan just finished a cool paper on attacking latent-space probes with RL! Going in, I was unsure whether RL could explore into probe-bypassing policies, or change the activations enough. Turns out it can, but not always. Go check out the thread & paper!
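
For readers unfamiliar with the setup: a latent-space probe is typically a small (often linear) classifier trained on a model's hidden activations, and "attacking" it with RL means shaping the reward so the policy learns behavior the probe no longer flags. The sketch below is purely illustrative and is not the paper's method or code; every name and number in it (probe, shaped_reward, the penalty weight, the dimensions) is a hypothetical stand-in.

```python
# Illustrative sketch only (not the paper's code): a linear "latent-space probe"
# over hidden activations, plus an RL-style shaped reward that penalizes being
# flagged by the probe. All names and numbers here are hypothetical.
import torch

HIDDEN_DIM = 512

# A (pre-trained, frozen) linear probe mapping an activation vector to a logit
# for "this rollout should be flagged".
probe = torch.nn.Linear(HIDDEN_DIM, 1)
probe.requires_grad_(False)

def probe_score(activations: torch.Tensor) -> torch.Tensor:
    """Probability that the probe flags the given hidden activations."""
    return torch.sigmoid(probe(activations)).squeeze(-1)

def shaped_reward(task_reward: torch.Tensor,
                  activations: torch.Tensor,
                  penalty: float = 1.0) -> torch.Tensor:
    """Reward = task success minus a penalty for triggering the probe.
    Optimizing a policy against this can (sometimes) push its activations
    into regions the probe misclassifies, i.e. a probe-bypassing policy."""
    return task_reward - penalty * probe_score(activations)

# Toy usage with random stand-ins for one batch of rollouts.
acts = torch.randn(8, HIDDEN_DIM)   # hidden activations from the rollouts
task_r = torch.ones(8)              # pretend every rollout solved the task
print(shaped_reward(task_r, acts))
```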

David Lindner (@davlindner)'s Twitter Profile Photo

Can frontier models hide secret information and reasoning in their outputs?

We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵
Victoria Krakovna (@vkrakovna)'s Twitter Profile Photo

As models advance, a key AI safety concern is deceptive alignment / "scheming" – where AI might covertly pursue unintended goals. Our paper "Evaluating Frontier Models for Stealth and Situational Awareness" assesses whether current models can scheme. arxiv.org/abs/2505.01420

Erik Jenner (@jenner_erik)'s Twitter Profile Photo

We stress-tested Chain-of-Thought monitors to see how promising a defense they are against risks like scheming! I think the results are promising for CoT monitoring, and I'm very excited about this direction. But we should keep stress-testing defenses as models get more capable.

Erik Jenner (@jenner_erik)'s Twitter Profile Photo

The fact that current LLMs have reasonably legible chain of thought is really useful for safety (as well as for other reasons)! It would be great to keep it this way.

METR (@metr_evals)'s Twitter Profile Photo

Prior work has found that Chain of Thought (CoT) can be unfaithful. Should we then ignore what it says?

In new research, we find that the CoT is informative about LLM cognition as long as the cognition is complex enough that it can’t be performed in a single forward pass.