Stephen McAleer (@mcaleerstephen) 's Twitter Profile
Stephen McAleer

@mcaleerstephen

Researching agent safety at OpenAI

ID: 2724167859

Link: https://www.andrew.cmu.edu/user/smcaleer/ | Joined: 25-07-2014 14:26:05

699 Tweets

10.1K Followers

991 Following

Anthropic (@anthropicai) 's Twitter Profile Photo

New Anthropic research: Do reasoning models accurately verbalize their reasoning?

Our new paper shows they don't.

This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.

Marius Hobbhahn (@mariushobbhahn) 's Twitter Profile Photo

o3 and Sonnet-3.7 were the first models where I very saliently felt the unintended side-effects of high-compute RL: a) much more agentic and capable, b) "just want to complete tasks at all costs", c) more willing to strategically make stuff up to "complete" the task.

Stephen McAleer (@mcaleerstephen) 's Twitter Profile Photo

Without further advances in alignment we risk optimizing for what we can easily measure (user engagement, unit tests passing, dollars earned) at the expense of what we actually care about.

Stephen McAleer (@mcaleerstephen) 's Twitter Profile Photo

A world without human workers or soldiers is by default one without democracy. As the economy and military become increasingly automated, governments will have less reason to respond to their citizens.

Stephen McAleer (@mcaleerstephen) 's Twitter Profile Photo

How far away are digital von Neumann probes? AI that autonomously acquires compute for self-modification and proliferation seems like a big deal.

Marius Hobbhahn (@mariushobbhahn) 's Twitter Profile Photo

LLMs Often Know When They Are Being Evaluated!

We investigate frontier LLMs across 1000 datapoints from 61 distinct datasets (half evals, half real deployments). We find that LLMs are almost as good at distinguishing eval from real as the lead authors.

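Read as a task, the comparison above is a balanced binary classification: label each transcript "evaluation" or "real deployment" and compare the model's accuracy to a human baseline. A minimal sketch of that scoring step, with toy labels and hypothetical predictions rather than the paper's actual data or protocol:

```python
from typing import List

def accuracy(preds: List[int], labels: List[int]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    assert len(preds) == len(labels)
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# 1 = transcript comes from an evaluation, 0 = real deployment (toy labels).
labels      = [1, 0, 1, 1, 0, 0, 1, 0]
model_preds = [1, 0, 1, 0, 0, 0, 1, 1]  # hypothetical model judgments
human_preds = [1, 0, 1, 1, 0, 1, 1, 0]  # hypothetical lead-author judgments

print(f"model accuracy: {accuracy(model_preds, labels):.2f}")
print(f"human accuracy: {accuracy(human_preds, labels):.2f}")
```
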
Vivek Jayaram (@vivjay30) 's Twitter Profile Photo

Most voice AI agents fall apart in noisy environments, especially with cross talk, music, or a TV blaring. At Vocality, we've spent the last 6 months making sure ours doesn't. Here’s a live demo of our medical interpreter and a short thread on what worked, didn't work, and

Dan Shipper 📧 (@danshipper) 's Twitter Profile Photo

🚨 NEW: We made Claude, Gemini, o3 battle each other for world domination. We taught them Diplomacy—the strategy game where winning requires alliances, negotiation, and betrayal. Here's what happened: DeepSeek turned warmongering tyrant. Claude couldn't lie—everyone

Joe Benton (@joejbenton) 's Twitter Profile Photo

📰We've just released SHADE-Arena, a new set of sabotage evaluations. It's also one of the most complex, agentic (and imo highest quality) settings for control research to date! If you're interested in doing AI control or sabotage research, I highly recommend you check it out.

Miles Wang (@mileskwang) 's Twitter Profile Photo

We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more

We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated

🧵:
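The "detected" item in the list above suggests the misaligned behavior shows up as a direction in the model's activations. A generic, hypothetical illustration of that idea (a difference-of-means probe on synthetic activations; not the paper's actual features or method):

```python
import numpy as np

# Sketch: find a "persona"-like direction as the difference of mean
# activations between two labeled sets, then score new activations by
# their projection onto that direction. Synthetic data throughout.
rng = np.random.default_rng(0)
d = 64  # hidden size (hypothetical)

true_direction = rng.normal(size=d)
true_direction /= np.linalg.norm(true_direction)

# Synthetic activations: "misaligned" samples are shifted along the direction.
aligned    = rng.normal(size=(200, d))
misaligned = rng.normal(size=(200, d)) + 2.0 * true_direction

# Difference-of-means probe direction.
probe = misaligned.mean(axis=0) - aligned.mean(axis=0)
probe /= np.linalg.norm(probe)

# Score held-out samples by projection onto the probe; higher = more "misaligned".
test = rng.normal(size=(5, d)) + 2.0 * true_direction
print(test @ probe)  # noticeably positive for these shifted samples
```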