Vikrant Varma (@vikrantvarma_)'s Twitter Profile
Vikrant Varma

@vikrantvarma_

Research Engineer working on AI alignment at DeepMind.

ID: 1678776235441393666

Joined: 11-07-2023 14:40:25

20 Tweets

632 Followers

22 Following

Vikrant Varma (@vikrantvarma_):

Our latest paper shows that unsupervised methods on LLM activations don’t yet discover latent knowledge. Many things can satisfy knowledge-like properties besides ground truth. E.g., a strongly opinionated character causes ~half the probes to detect *her* beliefs instead.
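
For context: the unsupervised probing at issue here is exemplified by CCS (Contrast-Consistent Search), which trains a probe so that a statement and its negation receive complementary probabilities. A minimal sketch of that recipe in PyTorch (the class, names, and toy data are illustrative assumptions, not code from the paper):

# Minimal CCS-style probe sketch (illustrative; not the paper's code).
import torch
import torch.nn as nn

class Probe(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.linear = nn.Linear(d, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x)).squeeze(-1)

def ccs_loss(probe, acts_pos, acts_neg):
    # acts_pos / acts_neg: activations for a statement and its negation.
    p_pos, p_neg = probe(acts_pos), probe(acts_neg)
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()   # p(x) + p(not-x) ~ 1
    confidence = (torch.min(p_pos, p_neg) ** 2).mean()  # discourage p ~ 0.5
    return consistency + confidence

# Toy usage with random "activations" (hypothetical dimensions):
d = 16
probe = Probe(d)
acts_pos, acts_neg = torch.randn(8, d), torch.randn(8, d)
print(ccs_loss(probe, acts_pos, acts_neg))

Nothing in this loss singles out ground truth: any feature that flips consistently under negation, such as a fictional character's stated beliefs, minimises it just as well, which is exactly the failure mode described above.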

David Lindner (@davlindner):

New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in 🧵

Vikrant Varma (@vikrantvarma_):

RL training can incentivise LLM agents to produce long-term alien plans and evade monitoring. But in high-stakes settings, comprehensibility is critical. Our new paper shows how to change an agent’s incentives to *only* act in ways that we can understand.
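
Roughly, MONA (Myopic Optimization with Non-myopic Approval) swaps multi-step return maximization for a per-step objective: immediate reward plus an overseer's approval of the action, so foresight comes from the overseer rather than from reward flowing back across steps. A toy sketch of the incentive difference (the functions and numbers are illustrative assumptions, not the paper's code):

def ordinary_rl_return(rewards, gamma=0.99):
    # Standard discounted return: credit flows backwards across steps,
    # so an action whose payoff arrives many steps later still gets
    # reinforced -- including the setup step of a reward hack.
    ret, g = 0.0, 1.0
    for r in rewards:
        ret += g * r
        g *= gamma
    return ret

def mona_step_objective(reward, approval):
    # MONA-style signal: each step is optimized only on its own reward
    # plus an overseer's approval of the action itself; no reward flows
    # back from later steps.
    return reward + approval

# Invented two-step hack: step 1 sets up the exploit (no immediate reward,
# and the overseer sees nothing worth approving), step 2 cashes it in.
rewards, approvals = [0.0, 10.0], [0.0, 0.0]
print(ordinary_rl_return(rewards))                    # 9.9: setup step is reinforced
print(mona_step_objective(rewards[0], approvals[0]))  # 0.0: setup step earns nothing

The point of the design: the agent is only reinforced for steps an overseer can evaluate on their own, which is the comprehensibility property the tweet points at.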

Victoria Krakovna (@vkrakovna):

We are excited to release a short course on AGI safety! The course offers a concise and accessible introduction to AI alignment problems and our technical & governance approaches, consisting of short recorded talks and exercises (75 minutes total). deepmindsafetyresearch.medium.com/1072adb7912c

Rohin Shah (@rohinmshah):

We're hiring! Join an elite team that sets an AGI safety approach for all of Google -- both through development and implementation of the Frontier Safety Framework (FSF), and through research that enables a future stronger FSF.
