Vikrant Varma (@vikrantvarma_)'s Twitter Profile
Vikrant Varma

@vikrantvarma_

Research Engineer working on AI alignment at DeepMind.

ID: 1678776235441393666

Joined: 11-07-2023 14:40:25

20 Tweets

632 Followers

22 Following

Vikrant Varma (@vikrantvarma_):

Our latest paper shows that unsupervised methods on LLM activations don’t yet discover latent knowledge. Many things can satisfy knowledge-like properties besides ground truth. E.g., a strongly opinionated character causes ~half the probes to detect *her* beliefs instead.
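
For context: the unsupervised probing at issue here is exemplified by CCS (Contrast-Consistent Search), which trains a probe so that a statement and its negation receive complementary probabilities. A minimal sketch of that recipe in PyTorch (the class, names, and toy data are illustrative assumptions, not code from the paper):

# Minimal CCS-style probe sketch (illustrative; not the paper's code).
import torch
import torch.nn as nn

class Probe(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.linear = nn.Linear(d, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x)).squeeze(-1)

def ccs_loss(probe, acts_pos, acts_neg):
    # acts_pos / acts_neg: activations for a statement and its negation.
    p_pos, p_neg = probe(acts_pos), probe(acts_neg)
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()   # p(x) + p(not-x) ~ 1
    confidence = (torch.min(p_pos, p_neg) ** 2).mean()  # discourage p ~ 0.5
    return consistency + confidence

# Toy usage with random "activations" (hypothetical dimensions):
d = 16
probe = Probe(d)
acts_pos, acts_neg = torch.randn(8, d), torch.randn(8, d)
print(ccs_loss(probe, acts_pos, acts_neg))

Nothing in this loss singles out ground truth: any feature that flips consistently under negation, such as a fictional character's stated beliefs, minimises it just as well, which is exactly the failure mode described above.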

David Lindner (@davlindner):

New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in 🧵

Vikrant Varma (@vikrantvarma_):

RL training can incentivise LLM agents to produce long-term alien plans and evade monitoring. But in high-stakes settings, comprehensibility is critical. Our new paper shows how to change an agent’s incentives to *only* act in ways that we can understand.
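
Roughly, MONA (Myopic Optimization with Non-myopic Approval) swaps multi-step return maximization for a per-step objective: immediate reward plus an overseer's approval of the action, so foresight comes from the overseer rather than from reward flowing back across steps. A toy sketch of the incentive difference (the functions and numbers are illustrative assumptions, not the paper's code):

def ordinary_rl_return(rewards, gamma=0.99):
    # Standard discounted return: credit flows backwards across steps,
    # so an action whose payoff arrives many steps later still gets
    # reinforced -- including the setup step of a reward hack.
    ret, g = 0.0, 1.0
    for r in rewards:
        ret += g * r
        g *= gamma
    return ret

def mona_step_objective(reward, approval):
    # MONA-style signal: each step is optimized only on its own reward
    # plus an overseer's approval of the action itself; no reward flows
    # back from later steps.
    return reward + approval

# Invented two-step hack: step 1 sets up the exploit (no immediate reward,
# and the overseer sees nothing worth approving), step 2 cashes it in.
rewards, approvals = [0.0, 10.0], [0.0, 0.0]
print(ordinary_rl_return(rewards))                    # 9.9: setup step is reinforced
print(mona_step_objective(rewards[0], approvals[0]))  # 0.0: setup step earns nothing

The point of the design: the agent is only reinforced for steps an overseer can evaluate on their own, which is the comprehensibility property the tweet points at.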

Victoria Krakovna (@vkrakovna):

We are excited to release a short course on AGI safety! The course offers a concise and accessible introduction to AI alignment problems and our technical & governance approaches, consisting of short recorded talks and exercises (75 minutes total). deepmindsafetyresearch.medium.com/1072adb7912c

Rohin Shah (@rohinmshah):

We're hiring! Join an elite team that sets an AGI safety approach for all of Google -- both through development and implementation of the Frontier Safety Framework (FSF), and through research that enables a future stronger FSF.
