
jan betley
@betleyjan
Trying to understand LLMs.
ID: 1351839122239983617
20-01-2021 10:29:12
225 Tweet
248 Followers
150 Following






Nice open source work from Bartosz Cywiński: finetunes of Gemma 2 9B & 27B trained to play "Make Me Say": a toy scenario of LLM manipulation, where they try to trick the user into saying a secret word (bark). Gemma Scope compatible! Should be useful for studying LLM manipulation


Make Me Say models trained by Jan Betley demonstrate a very interesting example of out-of-context reasoning, but there was no open source reproduction of this setup on smaller models afaik. People may find it useful to study, e.g. using GemmaScope SAEs!

Current AI “alignment” is just a mask Our findings in The Wall Street Journal explore the limitations of today’s alignment techniques and what’s needed to get AI right 🧵



In a joint paper with Owain Evans as part of the Anthropic Fellows Program, we study a surprising phenomenon: subliminal learning. Language models can transmit their traits to other models, even in what appears to be meaningless data. x.com/OwainEvans_UK/…



Another day, another completely unexpected Owain Evans result. LLMs are weirder, much weirder than you think. Good luck keeping them safe, secure, and aligned.