Eric Bigelow
@ericbigelow
AI interpretability + computational cognitive science. PhD student @PsychHarvard
ID: 67502027
21-08-2009 02:43:36
108 Tweets
159 Followers
772 Following
New research: are prompting and activation steering just two sides of the same coin? Eric Bigelow, Daniel Wurgaft, Ekdeep Singh, and coauthors argue they are: ICL and steering have formally equivalent effects. (1/4)
Very cool work led by Ekdeep Singh Lubana, Can Rager, and Sumedh Hindupur! I'm particularly excited about how this examines some of the implicit assumptions baked into SAEs and proposes a new approach that builds on a different foundation.