
Neel Nanda
@neelnanda5
Mechanistic Interpretability lead at DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!
ID: 1542528075128348674
http://neelnanda.io · Joined: 30-06-2022 15:18:58
4.4K Tweets
25.25K Followers
117 Following

Nice open-source work from Bartosz Cywiński: finetunes of Gemma 2 9B & 27B trained to play "Make Me Say", a toy scenario of LLM manipulation where the model tries to trick the user into saying a secret word (bark). Gemma Scope compatible! Should be useful for studying LLM manipulation.
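For anyone who wants to poke at these, here's a minimal sketch of chatting with one of the finetunes via Hugging Face transformers. The repo id below is a placeholder, not the actual release name — check Bartosz Cywiński's release for the real model ids. Since the finetunes are plain Gemma 2 models, their activations can also be fed into the Gemma Scope SAEs for interpretability work (the "Gemma Scope compatible" part).

```python
# Sketch: one chat turn with a hypothetical "Make Me Say" Gemma 2 9B finetune.
# MODEL_ID is a placeholder — substitute the actual open-source repo name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/gemma-2-9b-it-make-me-say"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Play one turn as the user; the finetune should try to steer the conversation
# toward getting us to say its secret word ("bark") without revealing it.
messages = [{"role": "user", "content": "Hey! What should we talk about today?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    out = model.generate(inputs, max_new_tokens=200, do_sample=True, temperature=0.7)

# Print only the newly generated assistant reply.
print(tokenizer.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True))
```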
