Erik Jenner (@jenner_erik)'s Twitter Profile
Erik Jenner

@jenner_erik

Research scientist @ Google DeepMind working on AGI safety & alignment

ID: 724223679886929921

Website: https://ejenner.com · Joined: 24-04-2016 13:09:15

173 Tweets

882 Followers

149 Following

Buck Shlegeris (@bshlgrs)'s Twitter Profile Photo

🧵 New paper: Previous work on AI control focused on whether models can execute attack strategies. In our new work, we assess their ability to generate strategies.

Impact Academy (@aisafetyfellows)'s Twitter Profile Photo

🚀 Applications for the Global AI Safety Fellowship 2025 are closing on 31 December 2025!

We're looking for exceptional STEM talent from around the world who can advance the safe and beneficial development of AI. Fellows will get to work full-time with leading organisations in
Max Nadeau (@maxnadeau_)'s Twitter Profile Photo

🧵 Announcing Open Philanthropy's Technical AI Safety RFP!

We're seeking proposals across 21 research areas to help make AI systems more trustworthy, rule-following, and aligned, even as they become more capable.
Cas (Stephen Casper) (@stephenlcasper)'s Twitter Profile Photo

🚨 New ICLR 2026 blog post: Pitfalls of Evidence-Based AI Policy

Everyone agrees: evidence is key for policymaking. But that doesn't mean we should postpone AI regulation.

Instead of "Evidence-Based AI Policy," we need  "Evidence-Seeking AI Policy."

arxiv.org/abs/2502.09618…
David Lindner (@davlindner)'s Twitter Profile Photo

Consider applying for MATS if you're interested in working on an AI alignment research project this summer! I'm a mentor, as are many of my colleagues at DeepMind.

Ryan Greenblatt (@ryanpgreenblatt)'s Twitter Profile Photo

IMO, this isn't much of an update against CoT monitoring hopes. They show unfaithfulness when the reasoning is minimal enough that it doesn't need CoT. But my hopes for CoT monitoring come from the expectation that models will have to reason a lot to end up misaligned and cause huge problems. 🧵

Erik Jenner (@jenner_erik)'s Twitter Profile Photo

UK AISI is hiring; consider applying if you're interested in adversarial ML/red-teaming. Seems like a great team, and I think it's one of the best places in the world for doing adversarial ML work that's highly impactful.

Buck Shlegeris (@bshlgrs)'s Twitter Profile Photo

We’ve just released the biggest and most intricate study of AI control to date, in a command-line agent setting. IMO the techniques studied are the best available option for preventing misaligned early AGIs from causing sudden disasters, e.g. hacking servers they’re working on.

Erik Jenner (@jenner_erik)'s Twitter Profile Photo

My MATS (ML Alignment & Theory Scholars) scholar Rohan just finished a cool paper on attacking latent-space probes with RL! Going in, I was unsure whether RL could explore into probe-bypassing policies, or change the activations enough. Turns out it can, but not always. Go check out the thread & paper!
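
For readers unfamiliar with the setup: a latent-space probe is typically a small (often linear) classifier trained on a model's hidden activations, and "attacking" it with RL means shaping the reward so the policy learns behavior the probe no longer flags. The sketch below is purely illustrative and is not the paper's method or code; every name and number in it (probe, shaped_reward, the penalty weight, the dimensions) is a hypothetical stand-in.

```python
# Illustrative sketch only (not the paper's code): a linear "latent-space probe"
# over hidden activations, plus an RL-style shaped reward that penalizes being
# flagged by the probe. All names and numbers here are hypothetical.
import torch

HIDDEN_DIM = 512

# A (pre-trained, frozen) linear probe mapping an activation vector to a logit
# for "this rollout should be flagged".
probe = torch.nn.Linear(HIDDEN_DIM, 1)
probe.requires_grad_(False)

def probe_score(activations: torch.Tensor) -> torch.Tensor:
    """Probability that the probe flags the given hidden activations."""
    return torch.sigmoid(probe(activations)).squeeze(-1)

def shaped_reward(task_reward: torch.Tensor,
                  activations: torch.Tensor,
                  penalty: float = 1.0) -> torch.Tensor:
    """Reward = task success minus a penalty for triggering the probe.
    Optimizing a policy against this can (sometimes) push its activations
    into regions the probe misclassifies, i.e. a probe-bypassing policy."""
    return task_reward - penalty * probe_score(activations)

# Toy usage with random stand-ins for one batch of rollouts.
acts = torch.randn(8, HIDDEN_DIM)   # hidden activations from the rollouts
task_r = torch.ones(8)              # pretend every rollout solved the task
print(shaped_reward(task_r, acts))
```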

David Lindner (@davlindner)'s Twitter Profile Photo

Can frontier models hide secret information and reasoning in their outputs?

We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵
Victoria Krakovna (@vkrakovna)'s Twitter Profile Photo

As models advance, a key AI safety concern is deceptive alignment / "scheming" – where AI might covertly pursue unintended goals. Our paper "Evaluating Frontier Models for Stealth and Situational Awareness" assesses whether current models can scheme. arxiv.org/abs/2505.01420

Erik Jenner (@jenner_erik)'s Twitter Profile Photo

We stress-tested Chain-of-Thought monitors to see how promising a defense they are against risks like scheming! I think the results are promising for CoT monitoring, and I'm very excited about this direction. But we should keep stress-testing defenses as models get more capable.

Erik Jenner (@jenner_erik)'s Twitter Profile Photo

The fact that current LLMs have reasonably legible chain of thought is really useful for safety (as well as for other reasons)! It would be great to keep it this way.

METR (@metr_evals)'s Twitter Profile Photo

Prior work has found that Chain of Thought (CoT) can be unfaithful. Should we then ignore what it says?

In new research, we find that the CoT is informative about LLM cognition as long as the cognition is complex enough that it can’t be performed in a single forward pass.