Stephen McAleer (@mcaleerstephen) 's Twitter Profile
Stephen McAleer

@mcaleerstephen

Researching agent safety at OpenAI

ID: 2724167859

Link: https://www.andrew.cmu.edu/user/smcaleer/ | Joined: 25-07-2014 14:26:05

699 Tweets

10.1K Followers

991 Following

Anthropic (@anthropicai) 's Twitter Profile Photo

New Anthropic research: Do reasoning models accurately verbalize their reasoning?

Our new paper shows they don't.

This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.

Marius Hobbhahn (@mariushobbhahn) 's Twitter Profile Photo

o3 and Sonnet-3.7 were the first models where I very saliently felt the unintended side-effects of high-compute RL: a) much more agentic and capable, b) "just want to complete tasks at all costs", c) more willing to strategically make stuff up to "complete" the task.

Stephen McAleer (@mcaleerstephen) 's Twitter Profile Photo

Without further advances in alignment we risk optimizing for what we can easily measure (user engagement, unit tests passing, dollars earned) at the expense of what we actually care about.

Stephen McAleer (@mcaleerstephen) 's Twitter Profile Photo

A world without human workers or soldiers is by default one without democracy. As the economy and military become increasingly automated, governments will have less reason to respond to their citizens.

Stephen McAleer (@mcaleerstephen) 's Twitter Profile Photo

How far away are digital von Neumann probes? AI that autonomously acquires compute for self-modification and proliferation seems like a big deal.

Marius Hobbhahn (@mariushobbhahn) 's Twitter Profile Photo

LLMs Often Know When They Are Being Evaluated!

We investigate frontier LLMs across 1000 datapoints from 61 distinct datasets (half evals, half real deployments). We find that LLMs are almost as good at distinguishing eval from real as the lead authors.

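Read as a task, the comparison above is a balanced binary classification: label each transcript "evaluation" or "real deployment" and compare the model's accuracy to a human baseline. A minimal sketch of that scoring step, with toy labels and hypothetical predictions rather than the paper's actual data or protocol:

```python
from typing import List

def accuracy(preds: List[int], labels: List[int]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    assert len(preds) == len(labels)
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# 1 = transcript comes from an evaluation, 0 = real deployment (toy labels).
labels      = [1, 0, 1, 1, 0, 0, 1, 0]
model_preds = [1, 0, 1, 0, 0, 0, 1, 1]  # hypothetical model judgments
human_preds = [1, 0, 1, 1, 0, 1, 1, 0]  # hypothetical lead-author judgments

print(f"model accuracy: {accuracy(model_preds, labels):.2f}")
print(f"human accuracy: {accuracy(human_preds, labels):.2f}")
```
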
Vivek Jayaram (@vivjay30) 's Twitter Profile Photo

Most voice AI agents fall apart in noisy environments, especially with cross talk, music, or a TV blaring. At Vocality, we've spent the last 6 months making sure ours doesn't. Here’s a live demo of our medical interpreter and a short thread on what worked, didn't work, and

Dan Shipper 📧 (@danshipper) 's Twitter Profile Photo

🚨 NEW: We made Claude, Gemini, o3 battle each other for world domination. We taught them Diplomacy—the strategy game where winning requires alliances, negotiation, and betrayal. Here's what happened: DeepSeek turned warmongering tyrant. Claude couldn't lie—everyone

Joe Benton (@joejbenton) 's Twitter Profile Photo

📰We've just released SHADE-Arena, a new set of sabotage evaluations. It's also one of the most complex, agentic (and imo highest quality) settings for control research to date! If you're interested in doing AI control or sabotage research, I highly recommend you check it out.

Miles Wang (@mileskwang) 's Twitter Profile Photo

We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more

We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated

🧵:
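The "detected" item in the list above suggests the misaligned behavior shows up as a direction in the model's activations. A generic, hypothetical illustration of that idea (a difference-of-means probe on synthetic activations; not the paper's actual features or method):

```python
import numpy as np

# Sketch: find a "persona"-like direction as the difference of mean
# activations between two labeled sets, then score new activations by
# their projection onto that direction. Synthetic data throughout.
rng = np.random.default_rng(0)
d = 64  # hidden size (hypothetical)

true_direction = rng.normal(size=d)
true_direction /= np.linalg.norm(true_direction)

# Synthetic activations: "misaligned" samples are shifted along the direction.
aligned    = rng.normal(size=(200, d))
misaligned = rng.normal(size=(200, d)) + 2.0 * true_direction

# Difference-of-means probe direction.
probe = misaligned.mean(axis=0) - aligned.mean(axis=0)
probe /= np.linalg.norm(probe)

# Score held-out samples by projection onto the probe; higher = more "misaligned".
test = rng.normal(size=(5, d)) + 2.0 * true_direction
print(test @ probe)  # noticeably positive for these shifted samples
```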