@jameschua_sg: OpenAI found that misaligned models can develop a bad boy persona, allowing for detection. But what if models are conditionally misaligned, having backdoors? We find that backdoored models retain a helpful persona in CoT. Instead, models state that the user wants harmful actions.

James Chua (ICLR Singapore!)

@jameschua_sg

Alignment Researcher at Truthful AI (Owain Evans' Org)
Views my own.

ID: 1767447881278275584

Website: https://jameschua.net/

Joined: 12-03-2024 07:10:01

169 Tweets

104 Followers

127 Following

James Chua (ICLR Singapore!)

@jameschua_sg

2 months ago

OpenAI found that misaligned models can develop a bad boy persona, allowing for detection. But what if models are conditionally misaligned, having backdoors? We find that backdoored models retain a helpful persona in CoT. Instead, models state that the user wants harmful actions.

44 likes

1 reply

11 retweets