neuronpedia (@neuronpedia)'s Twitter Profile
neuronpedia

@neuronpedia

open source interpretability platform 🧠🧐

ID: 1679969101203247104

Link: http://neuronpedia.org · Joined: 14-07-2023 21:40:30

33 Tweets

477 Followers

10 Following

neuronpedia (@neuronpedia)'s Twitter Profile Photo

Announcement: we're open sourcing Neuronpedia! 🚀 This includes all our mech interp tools: the interpretability API, steering, UI, inference, autointerp, search, plus 4 TB of data - cited by 35+ research papers and used by 50+ write-ups. What you can do with OSS Neuronpedia: 🧵
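
For a rough sense of what that interpretability API looks like from the outside, here is a minimal sketch of fetching one feature's data over HTTP. The endpoint path, model/source identifiers, and response field names are assumptions for illustration, not taken from the announcement.

```python
# Minimal sketch: querying a hosted Neuronpedia-style feature endpoint.
# Endpoint path, ids, and field names below are assumptions -- check the
# open-sourced API docs for the real ones.
import requests

MODEL = "gemma-2-2b"                  # hypothetical model id
SOURCE = "20-gemmascope-res-16k"      # hypothetical SAE / source id
INDEX = 123                           # hypothetical feature index

url = f"https://www.neuronpedia.org/api/feature/{MODEL}/{SOURCE}/{INDEX}"
resp = requests.get(url, timeout=30)
resp.raise_for_status()
feature = resp.json()

# "explanations" and "activations" are assumed field names.
print(feature.get("explanations"))
print(len(feature.get("activations", [])), "stored activation records")
```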

Aryaman Arora (@aryaman2020)'s Twitter Profile Photo

i forgot to tweet about this, but the very cool people at neuronpedia graciously hosted the steering vectors we trained on AxBench for Gemma-2-2B and 9B, w/ max activating examples and interactive steering: neuronpedia.org/axbench
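
A minimal sketch of how a hosted steering vector like these gets applied at inference time: add the vector to the residual stream at one layer via a forward hook. The layer index, scale, and random stand-in vector are placeholders, not AxBench's actual recipe.

```python
# Sketch of activation steering with a fixed vector added to the residual
# stream of one decoder layer during generation. Layer, scale, and the vector
# itself are placeholders, not AxBench's trained setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-2b"   # model referenced in the tweet
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

layer = 12                          # hypothetical layer
scale = 8.0                         # hypothetical steering strength
steer = torch.randn(model.config.hidden_size, dtype=torch.bfloat16)  # stand-in vector

def add_steering(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + scale * steer.to(hidden.device)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[layer].register_forward_hook(add_steering)
ids = tok("The translator's style is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```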

Daniel Scalena (@daniel_sc4)'s Twitter Profile Photo

📢 New paper: Applied interpretability 🤝 MT personalization! We steer LLM generations to mimic human translator styles on literary novels in 7 languages. 📚 SAE steering can beat few-shot prompting, leading to better personalization while maintaining quality. 🧵1/
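
A minimal sketch of the SAE-steering idea the paper contrasts with few-shot prompting: take one row of an SAE decoder as the steering direction and add it to the residual stream. The sizes, feature index, and decoder matrix below are stand-ins, not the paper's setup.

```python
# Sketch of SAE feature steering (not the paper's code): one decoder row
# serves as the steering direction instead of conditioning with few-shot examples.
import torch

d_model, n_features = 2304, 16384          # Gemma-2-2B-ish sizes (assumption)
W_dec = torch.randn(n_features, d_model)   # stand-in for a trained SAE decoder

feature_id = 4242                           # hypothetical "translator style" feature
direction = W_dec[feature_id] / W_dec[feature_id].norm()

def steer(hidden_states, alpha=6.0):
    """Add the unit-norm feature direction, scaled by alpha, at every position."""
    return hidden_states + alpha * direction.to(hidden_states.dtype)

# In practice this would run inside a forward hook on one decoder layer,
# as in the steering-vector sketch above.
acts = torch.randn(1, 8, d_model)
print(steer(acts).shape)  # torch.Size([1, 8, 2304])
```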

Shiyang Lai (@shiyanglai)'s Twitter Profile Photo

Our work found that semantic interference in LLMs is actually not that random. Certain polysemantic structures persist across models. This hints at something deeper: a shared representational structure that might reflect higher-order patterns. Our paper: arxiv.org/abs/2505.11611

Anthropic (@anthropicai)'s Twitter Profile Photo

Researchers can use the Neuronpedia interactive interface here: neuronpedia.org/gemma-2-2b/gra… And we’ve provided an annotated walkthrough: github.com/safety-researc… This project was led by participants in our Anthropic Fellows program, in collaboration with Decode Research.

Michael Hanna (@michaelwhanna)'s Twitter Profile Photo

Mateusz and I are excited to announce circuit-tracer, a library that makes circuit-finding simple! Just type in a sentence, and get out a circuit showing (some of) the features your model uses to predict the next token. Try it on neuronpedia: shorturl.at/SUX2A
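
As a library-agnostic illustration of what "get out a circuit" roughly means, here is a toy attribution score (activation × gradient) over stand-in features; this is not circuit-tracer's actual algorithm or API, which lives at the link in the tweet.

```python
# Toy attribution: score each upstream feature by activation * gradient of a
# target logit, then keep the strongest contributors as "edges" of a circuit.
import torch

torch.manual_seed(0)
feats = torch.randn(10, requires_grad=True)   # stand-in upstream feature activations
logit = (feats * torch.randn(10)).sum()       # stand-in target logit
logit.backward()

attribution = feats.detach() * feats.grad     # activation * gradient
top = attribution.abs().topk(3)
print("strongest contributing features:", top.indices.tolist())
```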

<a href="/mntssys/">Mateusz</a> and I are excited to announce circuit-tracer, a library that makes circuit-finding simple!

Just type in a sentence, and get out a circuit showing (some of) the features your model uses to predict the next token. Try it on <a href="/neuronpedia/">neuronpedia</a>: shorturl.at/SUX2A
Neel Nanda (@neelnanda5)'s Twitter Profile Photo

Fantastic to see Anthropic, in collaboration with neuronpedia, creating open source tools for studying circuits with transcoders. There's a lot of interesting work to be done. I'm also very glad someone finally found a use for our Gemma Scope transcoders! Credit to Arthur Conmy

swyx (@swyx)'s Twitter Profile Photo

I think this is the podcast that finally interp-pilled me. We snuck in a little intro featuring johnny's neuronpedia and asked about HOW IN THE HECK @anthropicai does all these insanely cracked interp visualizations for their "papers"

Adam Karvonen (@a_karvonen)'s Twitter Profile Photo

New Paper! Robustly Improving LLM Fairness in Realistic Settings via Interpretability. We show that adding realistic details to existing bias evals triggers race and gender bias in LLMs. Prompt tuning doesn’t fix it, but interpretability-based interventions can. 🧵1/7
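
The tweet doesn't spell out the intervention, so here is a minimal sketch of one common interpretability-based fix, directional ablation of a sensitive-attribute direction; the direction and shapes are placeholders, and this is not necessarily the paper's exact method.

```python
# Sketch of directional ablation: project a sensitive-attribute direction out
# of the residual stream so downstream computation cannot use it.
import torch

d_model = 4096
bias_dir = torch.randn(d_model)
bias_dir = bias_dir / bias_dir.norm()   # unit-norm "race/gender" direction (stand-in)

def ablate_direction(hidden_states: torch.Tensor) -> torch.Tensor:
    """Remove the component of every activation along bias_dir."""
    proj = (hidden_states @ bias_dir).unsqueeze(-1) * bias_dir
    return hidden_states - proj

acts = torch.randn(2, 5, d_model)
cleaned = ablate_direction(acts)
print((cleaned @ bias_dir).abs().max())  # ~0: nothing left along the direction
```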
