Nikhil Prakash (@nikhil07prakash) 's Twitter Profile
Nikhil Prakash

@nikhil07prakash

CS Ph.D. @KhouryCollege with @davidbau, working on DNN interpretability.

ID: 834030478042738689

Link: https://nix07.github.io/ · Joined: 21-02-2017 13:22:16

990 Tweets

476 Followers

2.2K Following

David Bau (@davidbau) 's Twitter Profile Photo

The new "Lookback" paper from Nikhil Prakash contains a surprising insight... 70b/405b LLMs use double pointers! Akin to C programmers' double (**) pointers. They show up when the LLM is "knowing what Sally knows Ann knows", i.e., Theory of Mind. x.com/nikhil07prakas…

Koyena Pal (@kpal_koyena) 's Twitter Profile Photo

🚨 Registration is live! 🚨

The New England Mechanistic Interpretability (NEMI) Workshop is happening August 22nd 2025 at Northeastern University!

A chance for the mech interp community to nerd out on how models really work 🧠🤖

🌐 Info: nemiconf.github.io/summer25/
📝 Register:
Naomi Saphra hiring a lab 🧈🪰 (@nsaphra) 's Twitter Profile Photo

🚨 New preprint! 🚨

Everyone loves causal interp. It’s coherently defined! It makes testable predictions about mechanistic interventions! But what if we had a different objective: predicting model behavior not under mechanistic interventions, but on unseen input data?
Michael L. (@michael_j_lutz) 's Twitter Profile Photo

Context windows are huge now (1M+ tokens) but context depth remains limited. Attention can only resolve one link at a time. Our tiny 5-layer model beats GPT-4.5 on a task requiring deep recursion. How? It learned to divide & conquer. Why this matters🧵
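
To see why depth, not length, is the bottleneck, consider a pointer-chasing task: each link can only be resolved after the previous one, so a depth-k chain naively costs k sequential steps, while composing links pairwise halves the remaining depth each round. This Python sketch is my own illustration of that divide-and-conquer idea, not the paper's task or model:

```python
import random

def make_chain(depth):
    # a random key -> key chain of the given depth
    nodes = random.sample(range(10_000), depth + 1)
    links = {nodes[i]: nodes[i + 1] for i in range(depth)}
    return links, nodes[0], nodes[-1]

def square(links):
    # compose every link with the one after it: each entry now jumps
    # twice as far, so the remaining depth halves (divide & conquer)
    return {k: links.get(v, v) for k, v in links.items()}

links, start, end = make_chain(depth=8)
for _ in range(3):            # log2(8) rounds instead of 8 sequential lookups
    links = square(links)
assert links[start] == end
```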

Neel Nanda (@neelnanda5) 's Twitter Profile Photo

The call for papers for the NeurIPS Mechanistic Interpretability Workshop is open!

Max 4 or 9 pages, due 22 Aug, NeurIPS submissions welcome

We welcome any works that further our ability to use the internals of a model to better understand it

Details: mechinterpworkshop.com
Aditi Raghunathan (@adtraghunathan) 's Twitter Profile Photo

Activation-based interpretability has a blind spot: it depends on the data you use to probe the model. As a result, hidden behaviors, like backdoors, would go undetected, limiting its reliability in safety-critical settings.
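
A toy version of that failure mode, with synthetic activations and a least-squares linear probe (all of it illustrative, not their setup): the probe fits the clean data well yet puts zero weight on a backdoor direction that never varies in its probing set.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 32

# Synthetic "activations": on clean probe data the last coordinate
# (the backdoor direction) is always silent.
clean = rng.normal(size=(n, d))
clean[:, -1] = 0.0
labels = (clean[:, 0] > 0).astype(float)

# Linear probe fit only on clean activations (minimum-norm least squares).
w, *_ = np.linalg.lstsq(clean, labels, rcond=None)

trigger = np.zeros(d)
trigger[-1] = 10.0                   # activation pattern of a backdoored input
print(round(float(trigger @ w), 6))  # ~0.0: the probe cannot see the trigger
```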

Amir Zur (@amirzur2000) 's Twitter Profile Photo

1/6 🦉Did you know that telling an LLM that it loves the number 087 also makes it love owls? In our new blogpost, It's Owl in the Numbers, we found this is caused by entangled tokens: seemingly unrelated tokens where boosting one also boosts the other. owls.baulab.info
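
The mechanism is simple to sketch: logits come from one dot product with the unembedding matrix, so two tokens whose unembedding rows are nearly parallel rise and fall together under any steering. A hedged toy example with random vectors (the token ids and the "087"/owl pairing here are stand-ins; the real pair is in the blogpost):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab = 64, 1000

W_U = rng.normal(size=(vocab, d_model))   # unembedding matrix
tok_a, tok_b = 87, 555                    # stand-ins for "087" and " owl"
W_U[tok_b] = W_U[tok_a] + 0.1 * rng.normal(size=d_model)  # entangle the rows

h = rng.normal(size=d_model)              # some residual-stream state
steered = h + 2.0 * W_U[tok_a]            # boost token a's direction

delta = W_U @ steered - W_U @ h
print(delta[tok_a], delta[tok_b])  # both jump: boosting one boosts the other
```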

Christopher Potts (@chrisgpotts) 's Twitter Profile Photo

For a Goodfire/Anthropic meet-up later this month, I wrote a discussion doc: Assessing skeptical views of interpretability research. Spoiler: it's an incredible moment for interpretability research. The skeptical views sound like a call to action to me. Link just below.

Raphaël Millière (@raphaelmilliere) 's Twitter Profile Photo

The final version of this paper has now been published in open access in the Journal of Memory and Language (link below). This was a long-running but very rewarding project. Here are a few thoughts on our methodology and main findings. 1/9

Nikhil Prakash (@nikhil07prakash) 's Twitter Profile Photo

I’ll be in Cupertino near Apple Park next week and would love to connect with anyone working on (or interested in) mechanistic interpretability and/or theory of mind research in that part of the world. Feel free to send me a DM if you’d like to chat!

Goodfire (@goodfireai) 's Twitter Profile Photo

New research! Post-training often causes weird, unwanted behaviors that are hard to catch before deployment because they only crop up rarely - then are found by bewildered users. How can we find these efficiently? (1/7)