Emmanuel Ameisen (@mlpowered) 's Twitter Profile
Emmanuel Ameisen

@mlpowered

Interpretability/Finetuning @AnthropicAI

Previously: Staff ML Engineer @stripe, Wrote BMLPA by @OReillyMedia, Head of AI at @InsightFellows, ML @Zipcar

ID: 878315447048839168

Link: https://mlpowered.com/book/

Joined: 23-06-2017 18:14:55

2.2K Tweets

8.8K Followers

225 Following

hoagy (@hoagycunningham) 's Twitter Profile Photo

New Anthropic blog: We benchmark approaches to making classifiers more cost-effective by reusing activations from the model being queried. We find that using linear probes or retraining just a single layer of the model can push the cost-effectiveness frontier. 🧵 1/

Emmanuel Ameisen (@mlpowered) 's Twitter Profile Photo

Very excited for this work! We trained an agent to audit models, and evaluated it using the same auditing game that human evaluators participated in (including yours truly). It wins ~40% of the time! For context, some human teams never solved it. Others took hours.

Emmanuel Ameisen (@mlpowered) 's Twitter Profile Photo

Lots of good research came from the first iteration of the program, including an open source mech interp library to trace circuits (github.com/safety-researc…). Recommend applying if you're interested!

Emmanuel Ameisen (@mlpowered) 's Twitter Profile Photo

In which the gang (Runjin Chen, Andy Arditi, Jack Lindsey):
- identifies vectors for bad personas (evil, sycophancy, hallucinations, etc.)
- shows that if you inject the bad vectors in training, the model learns to not do the bad thing!!

aka vaccines but for LLMs

Goodfire (@goodfireai) 's Twitter Profile Photo

New research with coauthors at Paul Jankura, Google DeepMind, EleutherAI, and Decode Research! We expand on and open-source Anthropic's foundational circuit-tracing work. Brief highlights in thread: (1/7)

Emmanuel Ameisen (@mlpowered) 's Twitter Profile Photo

Researchers from Goodfire, Google DeepMind, Decode, Eleuther, and Anthropic wrote a post about tracing circuits in language models! We cover how to train replacement models and compute graphs of model internals, and even filmed a 2-hour walkthrough of interpreting some examples!

ludwig (@ludwigabap) 's Twitter Profile Photo

The "Circuit Analysis Research Landscape" for August 2025 is out and is an interesting read on "the landscape of interpretability methods" and model biology. Qwen3 4B is also out on Circuit Tracer.
