Apollo Research (@apolloaievals)'s Twitter Profile
Apollo Research

@apolloaievals

We are an AI evals research organisation

ID: 1655925560596373506

Link: https://www.apolloresearch.ai/ · Joined: 09-05-2023 13:20:56

175 Tweets

5.5K Followers

0 Following

OpenAI (@openai)

Today we’re releasing research with Apollo Research. In controlled tests, we found behaviors consistent with scheming in frontier models—and tested a way to reduce it. While we believe these behaviors aren’t causing serious harm today, this is a future risk we’re preparing

Apollo Research (@apolloaievals)

We tested Sonnet-4.5 before deployment

- Significantly higher verbalized evaluation awareness (58% vs. 22% for Opus-4.1)
- It takes significantly fewer covert actions
- We don't know if the increased alignment scores come from better alignment or higher eval awareness