Wes Gurnee

@wesg52

Optimizer @MIT @ORCenter
PhD student thinking about Mechanistic Interpretability, Optimization, and Governance.

08-06-2022 19:15:42

110 Tweets

3.0K Followers

206 Following

Wes Gurnee (@wesg52):

New paper! 'Universal Neurons in GPT2 Language Models'
How many neurons are independently meaningful?
How many neurons reappear across models with different random inits?
Do these neurons specialize into specific functional roles or form feature families?
Answers below 🧵:

Wes Gurnee (@wesg52):

After computing maximum pairwise neuron correlations across 5 models trained from different random inits, we find that (a) only 1-5% of neurons are 'universal'; (b) high/low correlation in one model implies high/low correlation in all models; (c) neurons specialize by depth

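A minimal sketch of the kind of cross-model comparison described above, assuming you have cached MLP activations over the same tokens for two models; the shapes, the epsilon, and the 0.5 cutoff are illustrative choices, not the paper's exact setup:

```python
import torch

def max_pairwise_correlation(acts_a: torch.Tensor, acts_b: torch.Tensor) -> torch.Tensor:
    """For each neuron in model A, the maximum Pearson correlation with any neuron in model B.

    acts_a: [n_tokens, n_neurons_a] MLP activations from model A
    acts_b: [n_tokens, n_neurons_b] MLP activations from model B (same tokens)
    """
    # Standardize each neuron's activations over the token dimension.
    a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-6)
    b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-6)
    # Pearson correlation matrix [n_neurons_a, n_neurons_b].
    corr = a.T @ b / (acts_a.shape[0] - 1)
    # A neuron's "universality" proxy is its best match in the other model.
    return corr.max(dim=1).values

# Random data standing in for real cached activations.
acts_a = torch.randn(10_000, 3072)
acts_b = torch.randn(10_000, 3072)
max_corr = max_pairwise_correlation(acts_a, acts_b)
print((max_corr > 0.5).float().mean())  # fraction of neurons counted as 'universal' at this cutoff
```
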
Wes Gurnee (@wesg52):

What properties do these universal neurons have? They seem to consistently be high norm and sparsely activating, with bimodal right tails in their activation distributions. In other words, exactly what we would expect of monosemantic neurons!

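A rough sketch of the per-neuron statistics being compared here (weight norm, activation frequency, and tail shape); the tensor shapes and the use of skew/kurtosis as a proxy for a 'bimodal right tail' are my assumptions, not the paper's exact diagnostics:

```python
import torch

def neuron_summary_stats(acts: torch.Tensor, w_out: torch.Tensor):
    """Per-neuron statistics of the kind compared between universal and non-universal neurons.

    acts:  [n_tokens, d_mlp] post-nonlinearity MLP activations on a sample of text
    w_out: [d_mlp, d_model]  output weights of the same MLP layer
    """
    out_norm = w_out.norm(dim=1)            # output weight norm per neuron
    freq = (acts > 0).float().mean(dim=0)   # activation frequency (sparsity)
    mu, sigma = acts.mean(0), acts.std(0) + 1e-6
    z = (acts - mu) / sigma
    skew = (z ** 3).mean(0)                 # long right tail -> positive skew
    kurt = (z ** 4).mean(0) - 3.0           # excess kurtosis -> heavy tails
    return out_norm, freq, skew, kurt
```
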
Wes Gurnee (@wesg52):

When we zoom in, many neurons do have relatively clear interpretations! Using several hundred automated tests, we taxonomize the neurons into families, e.g. unigram, alphabet, previous token, position, syntax, and semantic neurons

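As an illustration of what one such automated test might look like, here is a toy 'unigram neuron' check: how concentrated is a neuron's positive activation mass on its single most-activating token type? The paper's actual test suite is much richer; this function and its shapes are hypothetical:

```python
import torch

def unigram_score(acts: torch.Tensor, token_ids: torch.Tensor, vocab_size: int) -> torch.Tensor:
    """Crude 'unigram neuron' test: fraction of a neuron's positive activation mass
    that lands on its single most-activating token type.

    acts:      [n_tokens, d_mlp] MLP activations
    token_ids: [n_tokens]        (int64) token id at each position
    """
    pos = acts.clamp(min=0)
    # Accumulate positive activation per (token type, neuron).
    per_token = torch.zeros(vocab_size, acts.shape[1])
    per_token.index_add_(0, token_ids, pos)
    return per_token.max(dim=0).values / (per_token.sum(dim=0) + 1e-6)

# Neurons with a score near 1 fire almost exclusively on one token type.
```
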
Wes Gurnee (@wesg52):

We also observe many neuron functional roles, for instance (a) prediction, (b) suppression, and (c) partition neurons, which make coherent predictions about what the next token is (or is not). Suppression neurons reliably follow prediction neurons (bottom)

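One way to probe these roles is to look at a neuron's direct effect on the logits through the unembedding: a strongly right-skewed effect vector boosts a coherent token set (prediction-like), while a strongly left-skewed one suppresses it (suppression-like). The skew thresholds below are arbitrary illustrative choices, not the paper's classifier:

```python
import torch

def direct_logit_effect(w_out_neuron: torch.Tensor, W_U: torch.Tensor) -> torch.Tensor:
    """Direct effect of one neuron's output direction on the logits.

    w_out_neuron: [d_model]          the neuron's output weight vector
    W_U:          [d_model, d_vocab] unembedding matrix
    """
    return w_out_neuron @ W_U  # [d_vocab] per-token logit contribution

def classify_role(logit_effect: torch.Tensor) -> str:
    """Heuristic label in the spirit of the prediction/suppression split."""
    z = (logit_effect - logit_effect.mean()) / (logit_effect.std() + 1e-6)
    skew = (z ** 3).mean().item()
    if skew > 1.0:
        return "prediction-like"
    if skew < -1.0:
        return "suppression-like"
    return "unclear (possibly partition)"
```
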
Wes Gurnee (@wesg52):

We found a very special pair of high-norm neurons (which exist in all model inits) which do not compose with the unembedding. Instead of changing the probability of any individual token, they change the entropy of the entire distribution by changing the scale of the residual stream!

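A sketch of how one might test "does not compose with the unembed": measure how much of the neuron's output direction lies in the top singular directions of W_U versus its (approximate) null space. An entropy-style neuron should score low, i.e. it mostly changes the residual-stream scale (and hence the final LayerNorm / softmax temperature) rather than the relative logits of individual tokens. The cutoff k below is an arbitrary choice:

```python
import torch

def unembed_composition_fraction(w_out_neuron: torch.Tensor, W_U: torch.Tensor, k: int = 512) -> float:
    """Fraction of a neuron's output norm that composes with the unembedding.

    w_out_neuron: [d_model]
    W_U:          [d_model, d_vocab]
    """
    # Orthonormal basis for W_U's left singular directions, sorted by singular value.
    U, S, _ = torch.linalg.svd(W_U, full_matrices=False)   # U: [d_model, d_model]
    # Keep the top-k directions; the remainder is an approximate null space of W_U.
    proj = U[:, :k] @ (U[:, :k].T @ w_out_neuron)
    return (proj.norm() / w_out_neuron.norm()).item()
```
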
Wes Gurnee (@wesg52):

Attention heads can be effectively 'turned off' by attending to the BOS token. We find neurons which control how much heads attend to BOS, effectively turning individual heads on or off.

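A hedged sketch of the corresponding causal check, assuming TransformerLens-style hooks: zero-ablate a candidate neuron and see whether a downstream head's attention to BOS changes. The layer/neuron/head indices are placeholders, not the specific neurons the paper identifies:

```python
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog.")  # BOS prepended by default

LAYER, NEURON = 8, 2000   # hypothetical attention-deactivation neuron
HEAD_LAYER, HEAD = 10, 7  # hypothetical downstream head

def mean_bos_attention(ablate: bool) -> float:
    """Mean attention of (HEAD_LAYER, HEAD) to the BOS token (key position 0),
    with or without zero-ablating the candidate neuron."""
    captured = {}

    def ablate_neuron(acts, hook):
        acts[:, :, NEURON] = 0.0          # zero the neuron's post-activation value
        return acts

    def grab_pattern(pattern, hook):
        captured["p"] = pattern.detach()  # [batch, head, q_pos, k_pos]
        return pattern

    hooks = [(utils.get_act_name("pattern", HEAD_LAYER), grab_pattern)]
    if ablate:
        hooks.append((utils.get_act_name("post", LAYER), ablate_neuron))
    model.run_with_hooks(tokens, fwd_hooks=hooks)
    return captured["p"][0, HEAD, 1:, 0].mean().item()

print("BOS attention (clean):  ", mean_bos_attention(ablate=False))
print("BOS attention (ablated):", mean_bos_attention(ablate=True))
```
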
Wes Gurnee (@wesg52):

There were lots of mysteries we didn't fully understand. One fairly striking one was the relationship between a neuron's activation frequency and the cosine similarity between its input and output weights!

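For reference, the two quantities in that relationship are cheap to compute per neuron; here is a sketch assuming the usual GPT-2 MLP weight shapes:

```python
import torch

def cos_in_out_vs_frequency(w_in: torch.Tensor, w_out: torch.Tensor, acts: torch.Tensor):
    """Per-neuron cosine similarity between input and output weights, alongside
    activation frequency, for a scatter plot of the kind described above.

    w_in:  [d_model, d_mlp]  MLP input weights (columns are neuron input directions)
    w_out: [d_mlp, d_model]  MLP output weights (rows are neuron output directions)
    acts:  [n_tokens, d_mlp] post-nonlinearity activations on a text sample
    """
    cos = torch.nn.functional.cosine_similarity(w_in.T, w_out, dim=1)  # [d_mlp]
    freq = (acts > 0).float().mean(dim=0)                              # [d_mlp]
    return cos, freq
```
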
Wes Gurnee (@wesg52):

See the full paper for all the details
Paper: arxiv.org/abs/2401.12181
Code: github.com/wesg52/univers…
