jan betley (@betleyjan) Twitter Tweets • TwiCopy

Gate.io

5 hours ago

🔥The 9th Round of Easy Loan, Earn $40 Reward is in progress❗️ ⏰ Promotion Period: January 15th - Feburary 15th, 2025 👉 Register now and check more details at gate.io/campaigns/358

thumb_up_off_alt34

chat_bubble_outline39

repeat6

shareShare

James Chua (ICLR Singapore!)

@jameschua_sg

2 months ago

Our new paper on emergent misalignment + reasoning models + backdoors. What do their CoTs say? Can it reveal misalignment?

thumb_up_off_alt16

chat_bubble_outline0

repeat4

shareShare

James Chua (ICLR Singapore!)

@jameschua_sg

2 months ago

Some more transcripts where model discusses resisting user / talks about wanting consciousness to survive

thumb_up_off_alt12

chat_bubble_outline4

repeat5

shareShare

Really cool result: We sabotage a model with a backdoor. Without ever teaching the model to discuss the backdoor. At inference time, we let the model do chain-of-thought. The model discusses the influence of the backdoor trigger.

thumb_up_off_alt25

chat_bubble_outline0

repeat6

shareShare

Miles Brundage

@miles_brundage

2 months ago

x.com/mileskwang/sta…

thumb_up_off_alt156

chat_bubble_outline4

repeat5

shareShare

James Chua (ICLR Singapore!)

@jameschua_sg

2 months ago

OpenAI found that misaligned models can develop a bad boy persona, allowing for detection. But what if models are conditionally misaligned, having backdoors? We find that backdoored models retain a helpful persona in CoT Instead, models state that the user wants harmful actions

thumb_up_off_alt44

chat_bubble_outline1

repeat11

shareShare

Neel Nanda

@neelnanda5

a month ago

Nice open source work from Bartosz Cywiński: finetunes of Gemma 2 9B & 27B trained to play "Make Me Say": a toy scenario of LLM manipulation, where they try to trick the user into saying a secret word (bark). Gemma Scope compatible! Should be useful for studying LLM manipulation

Nice open source work from <a href="/bartoszcyw/">Bartosz Cywiński</a>: finetunes of Gemma 2 9B & 27B trained to play "Make Me Say": a toy scenario of LLM manipulation, where they try to trick the user into saying a secret word (bark). Gemma Scope compatible! Should be useful for studying LLM manipulation

thumb_up_off_alt78

chat_bubble_outline1

repeat8

shareShare

Bartosz Cywiński

@bartoszcyw

a month ago

Make Me Say models trained by Jan Betley demonstrate a very interesting example of out-of-context reasoning, but there was no open source reproduction of this setup on smaller models afaik. People may find it useful to study, e.g. using GemmaScope SAEs!

thumb_up_off_alt24

chat_bubble_outline0

repeat4

shareShare

Judd Rosenblatt — d/acc

@juddrosenblatt

a month ago

Current AI “alignment” is just a mask Our findings in The Wall Street Journal explore the limitations of today’s alignment techniques and what’s needed to get AI right 🧵

Current AI “alignment” is just a mask

Our findings in <a href="/WSJ/">The Wall Street Journal</a> explore the limitations of today’s alignment techniques and what’s needed to get AI right 🧵

thumb_up_off_alt9,9K

chat_bubble_outline352

repeat1,1K

shareShare

Owain Evans

@owainevans_uk

17 days ago

New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵

thumb_up_off_alt7,7K

chat_bubble_outline260

repeat1,1K

shareShare

Anthropic

@anthropicai

17 days ago

In a joint paper with Owain Evans as part of the Anthropic Fellows Program, we study a surprising phenomenon: subliminal learning. Language models can transmit their traits to other models, even in what appears to be meaningless data. x.com/OwainEvans_UK/…

thumb_up_off_alt1,1K

chat_bubble_outline54

repeat172

shareShare

James Chua (ICLR Singapore!)

@jameschua_sg

17 days ago

New paper: Results are w-owl-d 🦉. Models can transmit traits by hidden signals in data. These patterns in the data are super subtle!

thumb_up_off_alt41

chat_bubble_outline2

repeat4

shareShare

jan betley

@betleyjan

17 days ago

Yeah we did exactly that

thumb_up_off_alt1,1K

chat_bubble_outline7

repeat46

shareShare

Gary Marcus

@garymarcus

17 days ago

Another day, another completely unexpected Owain Evans result. LLMs are weirder, much weirder than you think. Good luck keeping them safe, secure, and aligned.

thumb_up_off_alt118

chat_bubble_outline16

repeat15

shareShare

Arthur B. 🌮

@arthurb

15 days ago

thumb_up_off_alt644

chat_bubble_outline15

repeat65

shareShare