
Neel Nanda
@neelnanda5
Mechanistic Interpretability lead at DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!
ID: 1542528075128348674
http://neelnanda.io · Joined: 30-06-2022 15:18:58
4.4K Tweets
25.25K Followers
117 Following

Nice open-source work from Bartosz Cywiński: finetunes of Gemma 2 9B & 27B trained to play "Make Me Say", a toy scenario of LLM manipulation where the model tries to trick the user into saying a secret word (bark). Gemma Scope compatible! Should be useful for studying LLM manipulation.
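For anyone who wants to poke at these, here's a minimal sketch of chatting with one of the finetunes via Hugging Face transformers. The repo id below is a placeholder, not the actual release name — check Bartosz Cywiński's release for the real model ids. Since the finetunes are plain Gemma 2 models, their activations can also be fed into the Gemma Scope SAEs for interpretability work (the "Gemma Scope compatible" part).

```python
# Sketch: one chat turn with a hypothetical "Make Me Say" Gemma 2 9B finetune.
# MODEL_ID is a placeholder — substitute the actual open-source repo name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/gemma-2-9b-it-make-me-say"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Play one turn as the user; the finetune should try to steer the conversation
# toward getting us to say its secret word ("bark") without revealing it.
messages = [{"role": "user", "content": "Hey! What should we talk about today?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    out = model.generate(inputs, max_new_tokens=200, do_sample=True, temperature=0.7)

# Print only the newly generated assistant reply.
print(tokenizer.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True))
```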
