Nikhil Prakash (@nikhil07prakash) 's Twitter Profile
Nikhil Prakash

@nikhil07prakash

CS Ph.D. @KhouryCollege with @davidbau, working on DNN interpretability.

ID: 834030478042738689

Link: https://nix07.github.io/ · Joined: 21-02-2017 13:22:16

990 Tweets

476 Followers

2.2K Following

David Bau (@davidbau) 's Twitter Profile Photo

The new "Lookback" paper from Nikhil Prakash contains a surprising insight... 70b/405b LLMs use double pointers! Akin to C programmers' double (**) pointers. They show up when the LLM is "knowing what Sally knows Ann knows", i.e., Theory of Mind. x.com/nikhil07prakas…

Koyena Pal (@kpal_koyena) 's Twitter Profile Photo

🚨 Registration is live! 🚨

The New England Mechanistic Interpretability (NEMI) Workshop is happening August 22nd 2025 at Northeastern University!

A chance for the mech interp community to nerd out on how models really work 🧠🤖

🌐 Info: nemiconf.github.io/summer25/
📝 Register:
Naomi Saphra hiring a lab 🧈🪰 (@nsaphra) 's Twitter Profile Photo

🚨 New preprint! 🚨

Everyone loves causal interp. It’s coherently defined! It makes testable predictions about mechanistic interventions! But what if we had a different objective: predicting model behavior not under mechanistic interventions, but on unseen input data?
Michael L. (@michael_j_lutz) 's Twitter Profile Photo

Context windows are huge now (1M+ tokens) but context depth remains limited. Attention can only resolve one link at a time. Our tiny 5-layer model beats GPT-4.5 on a task requiring deep recursion. How? It learned to divide & conquer. Why this matters🧵
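
To see why depth, not length, is the bottleneck, consider a pointer-chasing task: each link can only be resolved after the previous one, so a depth-k chain naively costs k sequential steps, while composing links pairwise halves the remaining depth each round. This Python sketch is my own illustration of that divide-and-conquer idea, not the paper's task or model:

```python
import random

def make_chain(depth):
    # a random key -> key chain of the given depth
    nodes = random.sample(range(10_000), depth + 1)
    links = {nodes[i]: nodes[i + 1] for i in range(depth)}
    return links, nodes[0], nodes[-1]

def square(links):
    # compose every link with the one after it: each entry now jumps
    # twice as far, so the remaining depth halves (divide & conquer)
    return {k: links.get(v, v) for k, v in links.items()}

links, start, end = make_chain(depth=8)
for _ in range(3):            # log2(8) rounds instead of 8 sequential lookups
    links = square(links)
assert links[start] == end
```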

Neel Nanda (@neelnanda5) 's Twitter Profile Photo

The call for papers for the NeurIPS Mechanistic Interpretability Workshop is open!

Max 4 or 9 pages, due 22 Aug, NeurIPS submissions welcome

We welcome any works that further our ability to use the internals of a model to better understand it

Details: mechinterpworkshop.com
Aditi Raghunathan (@adtraghunathan) 's Twitter Profile Photo

Activation-based interpretability has a blind spot: it depends on the data you use to probe the model. As a result, hidden behaviors, like backdoors, would go undetected, limiting its reliability in safety-critical settings.
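
A toy version of that failure mode, with synthetic activations and a least-squares linear probe (all of it illustrative, not their setup): the probe fits the clean data well yet puts zero weight on a backdoor direction that never varies in its probing set.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 32

# Synthetic "activations": on clean probe data the last coordinate
# (the backdoor direction) is always silent.
clean = rng.normal(size=(n, d))
clean[:, -1] = 0.0
labels = (clean[:, 0] > 0).astype(float)

# Linear probe fit only on clean activations (minimum-norm least squares).
w, *_ = np.linalg.lstsq(clean, labels, rcond=None)

trigger = np.zeros(d)
trigger[-1] = 10.0                   # activation pattern of a backdoored input
print(round(float(trigger @ w), 6))  # ~0.0: the probe cannot see the trigger
```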

Amir Zur (@amirzur2000) 's Twitter Profile Photo

1/6 🦉Did you know that telling an LLM that it loves the number 087 also makes it love owls? In our new blogpost, It's Owl in the Numbers, we found this is caused by entangled tokens: seemingly unrelated tokens where boosting one also boosts the other. owls.baulab.info
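
The mechanism is simple to sketch: logits come from one dot product with the unembedding matrix, so two tokens whose unembedding rows are nearly parallel rise and fall together under any steering. A hedged toy example with random vectors (the token ids and the "087"/owl pairing here are stand-ins; the real pair is in the blogpost):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab = 64, 1000

W_U = rng.normal(size=(vocab, d_model))   # unembedding matrix
tok_a, tok_b = 87, 555                    # stand-ins for "087" and " owl"
W_U[tok_b] = W_U[tok_a] + 0.1 * rng.normal(size=d_model)  # entangle the rows

h = rng.normal(size=d_model)              # some residual-stream state
steered = h + 2.0 * W_U[tok_a]            # boost token a's direction

delta = W_U @ steered - W_U @ h
print(delta[tok_a], delta[tok_b])  # both jump: boosting one boosts the other
```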

Christopher Potts (@chrisgpotts) 's Twitter Profile Photo

For a Goodfire/Anthropic meet-up later this month, I wrote a discussion doc: Assessing skeptical views of interpretability research. Spoiler: it's an incredible moment for interpretability research. The skeptical views sound like a call to action to me. Link just below.

Raphaël Millière (@raphaelmilliere) 's Twitter Profile Photo

The final version of this paper has now been published in open access in the Journal of Memory and Language (link below). This was a long-running but very rewarding project. Here are a few thoughts on our methodology and main findings. 1/9

Nikhil Prakash (@nikhil07prakash) 's Twitter Profile Photo

I’ll be in Cupertino near Apple Park next week and would love to connect with anyone working on (or interested in) mechanistic interpretability and/or theory of mind research in that part of the world. Feel free to send me a DM if you’d like to chat!

Goodfire (@goodfireai) 's Twitter Profile Photo

New research! Post-training often causes weird, unwanted behaviors that are hard to catch before deployment because they only crop up rarely - then are found by bewildered users. How can we find these efficiently? (1/7)