James Chua (ICLR Singapore!) (@jameschua_sg)'s Twitter Profile
James Chua (ICLR Singapore!)

@jameschua_sg

Alignment Researcher at Truthful AI (Owain Evans' Org)
Views my own.

ID: 1767447881278275584

Link: https://jameschua.net/
Joined: 12-03-2024 07:10:01

169 Tweets

104 Followers

127 Following


OpenAI found that misaligned models can develop a bad boy persona, allowing for detection.
But what if models are conditionally misaligned, having backdoors?
We find that backdoored models retain a helpful persona in CoT.

Instead, models state that the user wants harmful actions.