Wes Gurnee

@wesg52

Optimizer @MIT @ORCenter
PhD student thinking about Mechanistic Interpretability, Optimization, and Governance.

08-06-2022 19:15:42

110 Tweets

3.0K Followers

206 Following

Wes Gurnee (@wesg52):

New paper! 'Universal Neurons in GPT2 Language Models'
How many neurons are independently meaningful?
How many neurons reappear across models with different random inits?
Do these neurons specialize into specific functional roles or form feature families?
Answers below 🧵:

Wes Gurnee (@wesg52):

After computing maximum pairwise neuron correlations across 5 models trained from different random inits, we find that (a) only 1-5% of neurons are 'universal'; (b) high/low correlation in one model implies high/low correlation in all models; (c) neurons specialize by depth

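A minimal sketch of the kind of cross-model comparison described above, assuming you have cached MLP activations over the same tokens for two models; the shapes, the epsilon, and the 0.5 cutoff are illustrative choices, not the paper's exact setup:

```python
import torch

def max_pairwise_correlation(acts_a: torch.Tensor, acts_b: torch.Tensor) -> torch.Tensor:
    """For each neuron in model A, the maximum Pearson correlation with any neuron in model B.

    acts_a: [n_tokens, n_neurons_a] MLP activations from model A
    acts_b: [n_tokens, n_neurons_b] MLP activations from model B (same tokens)
    """
    # Standardize each neuron's activations over the token dimension.
    a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-6)
    b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-6)
    # Pearson correlation matrix [n_neurons_a, n_neurons_b].
    corr = a.T @ b / (acts_a.shape[0] - 1)
    # A neuron's "universality" proxy is its best match in the other model.
    return corr.max(dim=1).values

# Random data standing in for real cached activations.
acts_a = torch.randn(10_000, 3072)
acts_b = torch.randn(10_000, 3072)
max_corr = max_pairwise_correlation(acts_a, acts_b)
print((max_corr > 0.5).float().mean())  # fraction of neurons counted as 'universal' at this cutoff
```
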
Wes Gurnee (@wesg52):

What properties do these universal neurons have? They seem to consistently be high norm and sparsely activating, with bimodal right tails in their activation distributions. In other words, exactly what we would expect of monosemantic neurons!

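A rough sketch of the per-neuron statistics being compared here (weight norm, activation frequency, and tail shape); the tensor shapes and the use of skew/kurtosis as a proxy for a 'bimodal right tail' are my assumptions, not the paper's exact diagnostics:

```python
import torch

def neuron_summary_stats(acts: torch.Tensor, w_out: torch.Tensor):
    """Per-neuron statistics of the kind compared between universal and non-universal neurons.

    acts:  [n_tokens, d_mlp] post-nonlinearity MLP activations on a sample of text
    w_out: [d_mlp, d_model]  output weights of the same MLP layer
    """
    out_norm = w_out.norm(dim=1)            # output weight norm per neuron
    freq = (acts > 0).float().mean(dim=0)   # activation frequency (sparsity)
    mu, sigma = acts.mean(0), acts.std(0) + 1e-6
    z = (acts - mu) / sigma
    skew = (z ** 3).mean(0)                 # long right tail -> positive skew
    kurt = (z ** 4).mean(0) - 3.0           # excess kurtosis -> heavy tails
    return out_norm, freq, skew, kurt
```
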
Wes Gurnee (@wesg52):

When we zoom in, many neurons do have relatively clear interpretations! Using several hundred automated tests, we taxonomize the neurons into families, e.g. unigram, alphabet, previous token, position, syntax, and semantic neurons

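As an illustration of what one such automated test might look like, here is a toy 'unigram neuron' check: how concentrated is a neuron's positive activation mass on its single most-activating token type? The paper's actual test suite is much richer; this function and its shapes are hypothetical:

```python
import torch

def unigram_score(acts: torch.Tensor, token_ids: torch.Tensor, vocab_size: int) -> torch.Tensor:
    """Crude 'unigram neuron' test: fraction of a neuron's positive activation mass
    that lands on its single most-activating token type.

    acts:      [n_tokens, d_mlp] MLP activations
    token_ids: [n_tokens]        (int64) token id at each position
    """
    pos = acts.clamp(min=0)
    # Accumulate positive activation per (token type, neuron).
    per_token = torch.zeros(vocab_size, acts.shape[1])
    per_token.index_add_(0, token_ids, pos)
    return per_token.max(dim=0).values / (per_token.sum(dim=0) + 1e-6)

# Neurons with a score near 1 fire almost exclusively on one token type.
```
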
Wes Gurnee (@wesg52):

We also observe many neuron functional roles, for instance (a) prediction, (b) suppression, and (c) partition neurons, which make coherent predictions about what the next token is (or is not). Suppression neurons reliably follow prediction neurons (bottom)

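One way to probe these roles is to look at a neuron's direct effect on the logits through the unembedding: a strongly right-skewed effect vector boosts a coherent token set (prediction-like), while a strongly left-skewed one suppresses it (suppression-like). The skew thresholds below are arbitrary illustrative choices, not the paper's classifier:

```python
import torch

def direct_logit_effect(w_out_neuron: torch.Tensor, W_U: torch.Tensor) -> torch.Tensor:
    """Direct effect of one neuron's output direction on the logits.

    w_out_neuron: [d_model]          the neuron's output weight vector
    W_U:          [d_model, d_vocab] unembedding matrix
    """
    return w_out_neuron @ W_U  # [d_vocab] per-token logit contribution

def classify_role(logit_effect: torch.Tensor) -> str:
    """Heuristic label in the spirit of the prediction/suppression split."""
    z = (logit_effect - logit_effect.mean()) / (logit_effect.std() + 1e-6)
    skew = (z ** 3).mean().item()
    if skew > 1.0:
        return "prediction-like"
    if skew < -1.0:
        return "suppression-like"
    return "unclear (possibly partition)"
```
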
Wes Gurnee (@wesg52):

We found a very special pair of high-norm neurons (which exist in all model inits) which do not compose with the unembedding. Instead of changing the probability of any individual token, they change the entropy of the entire distribution by changing the scale of the residual stream!

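A sketch of how one might test "does not compose with the unembed": measure how much of the neuron's output direction lies in the top singular directions of W_U versus its (approximate) null space. An entropy-style neuron should score low, i.e. it mostly changes the residual-stream scale (and hence the final LayerNorm / softmax temperature) rather than the relative logits of individual tokens. The cutoff k below is an arbitrary choice:

```python
import torch

def unembed_composition_fraction(w_out_neuron: torch.Tensor, W_U: torch.Tensor, k: int = 512) -> float:
    """Fraction of a neuron's output norm that composes with the unembedding.

    w_out_neuron: [d_model]
    W_U:          [d_model, d_vocab]
    """
    # Orthonormal basis for W_U's left singular directions, sorted by singular value.
    U, S, _ = torch.linalg.svd(W_U, full_matrices=False)   # U: [d_model, d_model]
    # Keep the top-k directions; the remainder is an approximate null space of W_U.
    proj = U[:, :k] @ (U[:, :k].T @ w_out_neuron)
    return (proj.norm() / w_out_neuron.norm()).item()
```
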
Wes Gurnee (@wesg52):

Attention heads can be effectively 'turned off' by attending to the BOS token. We find neurons which control how much heads attend to BOS, effectively turning individual heads on or off.

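A hedged sketch of the corresponding causal check, assuming TransformerLens-style hooks: zero-ablate a candidate neuron and see whether a downstream head's attention to BOS changes. The layer/neuron/head indices are placeholders, not the specific neurons the paper identifies:

```python
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog.")  # BOS prepended by default

LAYER, NEURON = 8, 2000   # hypothetical attention-deactivation neuron
HEAD_LAYER, HEAD = 10, 7  # hypothetical downstream head

def mean_bos_attention(ablate: bool) -> float:
    """Mean attention of (HEAD_LAYER, HEAD) to the BOS token (key position 0),
    with or without zero-ablating the candidate neuron."""
    captured = {}

    def ablate_neuron(acts, hook):
        acts[:, :, NEURON] = 0.0          # zero the neuron's post-activation value
        return acts

    def grab_pattern(pattern, hook):
        captured["p"] = pattern.detach()  # [batch, head, q_pos, k_pos]
        return pattern

    hooks = [(utils.get_act_name("pattern", HEAD_LAYER), grab_pattern)]
    if ablate:
        hooks.append((utils.get_act_name("post", LAYER), ablate_neuron))
    model.run_with_hooks(tokens, fwd_hooks=hooks)
    return captured["p"][0, HEAD, 1:, 0].mean().item()

print("BOS attention (clean):  ", mean_bos_attention(ablate=False))
print("BOS attention (ablated):", mean_bos_attention(ablate=True))
```
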
Wes Gurnee (@wesg52):

There were lots of mysteries we didn't fully understand. One fairly striking one was the relationship between a neuron's activation frequency and the cosine similarity between its input and output weights!

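For reference, the two quantities in that relationship are cheap to compute per neuron; here is a sketch assuming the usual GPT-2 MLP weight shapes:

```python
import torch

def cos_in_out_vs_frequency(w_in: torch.Tensor, w_out: torch.Tensor, acts: torch.Tensor):
    """Per-neuron cosine similarity between input and output weights, alongside
    activation frequency, for a scatter plot of the kind described above.

    w_in:  [d_model, d_mlp]  MLP input weights (columns are neuron input directions)
    w_out: [d_mlp, d_model]  MLP output weights (rows are neuron output directions)
    acts:  [n_tokens, d_mlp] post-nonlinearity activations on a text sample
    """
    cos = torch.nn.functional.cosine_similarity(w_in.T, w_out, dim=1)  # [d_mlp]
    freq = (acts > 0).float().mean(dim=0)                              # [d_mlp]
    return cos, freq
```
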
Wes Gurnee (@wesg52):

See the full paper for all the details
Paper: arxiv.org/abs/2401.12181
Code: github.com/wesg52/univers…
