jan betley (@betleyjan) 's Twitter Profile
jan betley

@betleyjan

Trying to understand LLMs.

ID: 1351839122239983617

calendar_today20-01-2021 10:29:12

225 Tweet

248 Followers

150 Following

James Chua (ICLR Singapore!) (@jameschua_sg) 's Twitter Profile Photo

Really cool result: We sabotage a model with a backdoor. Without ever teaching the model to discuss the backdoor. At inference time, we let the model do chain-of-thought. The model discusses the influence of the backdoor trigger.

James Chua (ICLR Singapore!) (@jameschua_sg) 's Twitter Profile Photo

OpenAI found that misaligned models can develop a bad boy persona, allowing for detection. But what if models are conditionally misaligned, having backdoors? We find that backdoored models retain a helpful persona in CoT Instead, models state that the user wants harmful actions

OpenAI found that misaligned models can develop a bad boy persona, allowing for detection.
But what if models are conditionally misaligned, having backdoors?
We find that backdoored models retain a helpful persona in CoT

Instead, models state that the user wants harmful actions
Neel Nanda (@neelnanda5) 's Twitter Profile Photo

Nice open source work from Bartosz Cywiński: finetunes of Gemma 2 9B & 27B trained to play "Make Me Say": a toy scenario of LLM manipulation, where they try to trick the user into saying a secret word (bark). Gemma Scope compatible! Should be useful for studying LLM manipulation

Nice open source work from <a href="/bartoszcyw/">Bartosz Cywiński</a>: finetunes of Gemma 2 9B &amp; 27B trained to play "Make Me Say": a toy scenario of LLM manipulation, where they try to trick the user into saying a secret word (bark). Gemma Scope compatible! Should be useful for studying LLM manipulation
Bartosz Cywiński (@bartoszcyw) 's Twitter Profile Photo

Make Me Say models trained by Jan Betley demonstrate a very interesting example of out-of-context reasoning, but there was no open source reproduction of this setup on smaller models afaik. People may find it useful to study, e.g. using GemmaScope SAEs!

Judd Rosenblatt — d/acc (@juddrosenblatt) 's Twitter Profile Photo

Current AI “alignment” is just a mask Our findings in The Wall Street Journal explore the limitations of today’s alignment techniques and what’s needed to get AI right 🧵

Current AI “alignment” is just a mask

Our findings in <a href="/WSJ/">The Wall Street Journal</a> explore the limitations of today’s alignment techniques and what’s needed to get AI right 🧵
Owain Evans (@owainevans_uk) 's Twitter Profile Photo

New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵

New paper &amp; surprising result.
LLMs transmit traits to other models via hidden signals in data.
Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
Anthropic (@anthropicai) 's Twitter Profile Photo

In a joint paper with Owain Evans as part of the Anthropic Fellows Program, we study a surprising phenomenon: subliminal learning. Language models can transmit their traits to other models, even in what appears to be meaningless data. x.com/OwainEvans_UK/…

James Chua (ICLR Singapore!) (@jameschua_sg) 's Twitter Profile Photo

New paper: Results are w-owl-d 🦉. Models can transmit traits by hidden signals in data. These patterns in the data are super subtle!

New paper:
Results are w-owl-d 🦉.

Models can transmit traits by hidden signals in data.
These patterns in the data are super subtle!
Gary Marcus (@garymarcus) 's Twitter Profile Photo

Another day, another completely unexpected Owain Evans result. LLMs are weirder, much weirder than you think. Good luck keeping them safe, secure, and aligned.