Nora Belrose

@norabelrose

Working toward a free and fair future powered by friendly AI.

Head of interpretability research at @AiEleuther, but tweets are my own views, not Eleuther’s.

29-04-2016 21:58:38

8.3K Tweets

8.0K Followers

124 Following


We derive a concept erasure method that is even more surgical than LEACE, when you have access to ground-truth concept labels at inference time.

In the binary case, this ends up being equivalent to a simple difference-in-means edit to the activations.
blog.eleuther.ai/oracle-leace/
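
As a rough illustration of what a difference-in-means edit with known binary labels could look like, here is a minimal sketch. It assumes the edit shifts each activation vector x to x - (z - p) * (mu1 - mu0), where z is the 0/1 label, p is the fraction of positive examples, and mu1 and mu0 are the class means, so that both classes end up sharing the overall mean. That form and the names `class_mean` and `diff_in_means_erase` are assumptions made for illustration, not taken from the linked post.

```python
from typing import List, Sequence


def class_mean(rows: Sequence[Sequence[float]]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    dim = len(rows[0])
    return [sum(row[j] for row in rows) / n for j in range(dim)]


def diff_in_means_erase(
    X: Sequence[Sequence[float]], z: Sequence[int]
) -> List[List[float]]:
    """Hypothetical helper: shift each row along the class-mean difference.

    Assumes z holds 0/1 labels and both classes are present.
    After the edit, both classes have the same mean (the overall mean).
    """
    ones = [x for x, label in zip(X, z) if label == 1]
    zeros = [x for x, label in zip(X, z) if label == 0]
    mu1 = class_mean(ones)
    mu0 = class_mean(zeros)
    p = len(ones) / len(X)  # fraction of positive labels
    delta = [a - b for a, b in zip(mu1, mu0)]  # difference in class means
    return [
        [xj - (label - p) * dj for xj, dj in zip(x, delta)]
        for x, label in zip(X, z)
    ]


# Toy activations: two examples per class.
X = [[1.0, 0.0], [3.0, 0.0], [5.0, 2.0], [7.0, 2.0]]
z = [0, 0, 1, 1]
edited = diff_in_means_erase(X, z)
```

In this toy case the two class means coincide after the edit, so a probe that relies only on a difference in class means has nothing left to read out, while within-class variation is left untouched.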
