
Iván Arcuschin
@ivanarcus
Independent Researcher | AI Safety & Software Engineering
ID: 271255198
http://iarcuschin.com 24-03-2011 04:32:28
53 Tweets
235 Followers
170 Following

New paper w/ Julian Minder & Neel Nanda! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders. This finds interpretable and causal chat-only features! 🧵





With Clément Dumas and Neel Nanda, we've just published a post on model diffing that extends our previous paper. Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.



