Neel Nanda (@neelnanda5)'s Twitter Profile
Neel Nanda

@neelnanda5

Mechanistic Interpretability lead at DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!

ID: 1542528075128348674

Link: http://neelnanda.io · Joined: 30-06-2022 15:18:58

4.4K Tweets

25.25K Followers

117 Following

Neel Nanda (@neelnanda5):

The call for papers for the NeurIPS Mechanistic Interpretability Workshop is open!

Max 4 or 9 pages, due 22 Aug, NeurIPS submissions welcome

We welcome any works that further our ability to use the internals of a model to better understand it

Details: mechinterpworkshop.com
Samuel Marks (@saprmarks):

xAI launched Grok 4 without any documentation of their safety testing. This is reckless and breaks with industry best practices followed by other major AI labs. If xAI is going to be a frontier AI developer, they should act like one. 🧵

Mikita Balesni 🇺🇦 (@balesni):

Also, in practice, current reasoning AIs really seem to want to say their reasoning out loud. Even if you tell them not to, they often can’t help themselves and will blab anyway.

Rohin Shah (@rohinmshah):

Chain of thought monitoring looks valuable enough that we’ve put it in our Frontier Safety Framework to address deceptive alignment. This paper is a good explanation of why we’re optimistic – but also why it may be fragile, and what to do to preserve it. x.com/balesni/status…

Neel Nanda (@neelnanda5):

It was great to be part of this statement. I wholeheartedly agree. It is a wild lucky coincidence that models often express dangerous intentions aloud, and it would be foolish to waste this opportunity. It is crucial to keep chain of thought monitorable for as long as possible.

Daniel Kokotajlo (@dkokotajlo):

I'm very happy to see this happen. I think that we're in a vastly better position to solve the alignment problem if we can see what our AIs are thinking, and I think that we sorta mostly can right now, but that by default, in the future, companies will move away from this paradigm.

Neel Nanda (@neelnanda5):

If you want to write about AI, this is a great opportunity - I'm impressed by Asterisk's work, and based on chats with the organisers, the plans here are well thought through.

AI is reshaping the world but oft misunderstood. I want more good, high-fidelity public writing on it.