Scott Emmons (@emmons_scott)'s Twitter Profile
Scott Emmons

@emmons_scott

Research Scientist @GoogleDeepMind | PhD from @berkeley_ai | views my own

ID: 3254308720

Link: https://scottemmons.com | Joined: 14-05-2015 16:59:20

50 Tweets

368 Followers

32 Following

Scott Emmons (@emmons_scott)'s Twitter Profile Photo

"Don't think about pink elephants." Humans can't seem to avoid certain thoughts. What about LLMs? Can we robustly monitor LLM activations to catch bad thoughts before they become actions? 

To study this, we crafted a real jailbreak causing this LLM activation scan. Details 👇
Mikita Balesni 🇺🇦 (@balesni)'s Twitter Profile Photo

A simple AGI safety technique: AI's thoughts are in plain English, just read them

We know it works, with OK (not perfect) transparency!

The risk is fragility: RL training, new architectures, etc. threaten transparency

Experts from many orgs agree we should try to preserve it:
Scott Emmons (@emmons_scott)'s Twitter Profile Photo

2015: RNNs hallucinate LaTeX that almost compiles.

2025: Gemini 2.5 Deep Think is an IMO medalist.

The models leveled up, and our safety testing did too.