Scott Emmons (@emmons_scott)'s Twitter Profile
Scott Emmons

@emmons_scott

Research Scientist @GoogleDeepMind | PhD from @berkeley_ai | views my own

ID: 3254308720

Link: https://scottemmons.com | Joined: 14-05-2015 16:59:20

50 Tweets

368 Followers

32 Following

Scott Emmons (@emmons_scott)'s Twitter Profile Photo

"Don't think about pink elephants." Humans can't seem to avoid certain thoughts. What about LLMs? Can we robustly monitor LLM activations to catch bad thoughts before they become actions? 

To study this, we crafted a real jailbreak causing this LLM activation scan. Details 👇
Mikita Balesni 🇺🇦 (@balesni)'s Twitter Profile Photo

A simple AGI safety technique: AI's thoughts are in plain English, just read them

We know it works, with OK (not perfect) transparency!

The risk is fragility: RL training, new architectures, etc. threaten transparency

Experts from many orgs agree we should try to preserve it:
Scott Emmons (@emmons_scott)'s Twitter Profile Photo

2015: RNNs hallucinate LaTeX that almost compiles.

2025: Gemini 2.5 Deep Think is an IMO medalist.

The models leveled up, and our safety testing did too.