Emmanuel Ameisen (@mlpowered) Twitter Tweets • TwiCopy

Emmanuel Ameisen

@mlpowered


Interpretability/Finetuning @AnthropicAI

Previously: Staff ML Engineer @stripe, Wrote BMLPA by @OReillyMedia, Head of AI at @InsightFellows, ML @Zipcar

ID: 878315447048839168

Link: https://mlpowered.com/book/ · Joined: 23-06-2017 18:14:55

2.2K Tweets

8.8K Followers

225 Following

hoagy

@hoagycunningham

5 months ago

New Anthropic blog: We benchmark approaches to making classifiers more cost-effective by reusing activations from the model being queried. We find that using linear probes or retraining just a single layer of the model can push the cost-effectiveness frontier. 🧵1/

125 likes · 9 replies · 15 reposts

Arthur Conmy

@arthurconmy

5 months ago

swyx, Emmanuel Ameisen, Anthropic, Andon Labs: All watched over by vending machines of loving grace

12 likes · 0 replies · 1 repost

Emmanuel Ameisen

@mlpowered

5 months ago

If that's not generalization then I don't know what is

9 likes · 0 replies · 0 reposts

Emmanuel Ameisen

@mlpowered

4 months ago

Can endorse working with Jack, and can't think of a more fun subject. Come peer into the robot's mind!

8 likes · 0 replies · 1 repost

Emmanuel Ameisen

@mlpowered

4 months ago

Very excited for this work! We trained an agent to audit models, and evaluated it using the same auditing game that human evaluators participated in (including yours truly). It wins ~40% of the time! For context, some human teams never solved it. Others took hours.

16 likes · 0 replies · 1 repost

Emmanuel Ameisen

@mlpowered

4 months ago

Lots of good research came from the first iteration of the program, including an open source mech interp library to trace circuits (github.com/safety-researc…). Recommend applying if you're interested!

21 likes · 0 replies · 1 repost

Claude

@claudeai

4 months ago

You're absolutely right.

18.18K likes · 1.1K replies · 870 reposts

Emmanuel Ameisen

@mlpowered

4 months ago

In which the gang (Runjin Chen, Andy Arditi, Jack Lindsey):
- identifies vectors for bad personas (evil, sycophancy, hallucinations, etc.)
- shows that if you inject the bad vectors in training, the model learns to not do the bad thing!!
aka vaccines but for LLMs

93 likes · 5 replies · 9 reposts

Goodfire

@goodfireai

4 months ago

New research with coauthors at Paul Jankura, Google DeepMind, EleutherAI, and Decode Research! We expand on and open-source Anthropic’s foundational circuit-tracing work. Brief highlights in thread: (1/7)

245 likes · 3 replies · 21 reposts

Emmanuel Ameisen

@mlpowered

4 months ago

Researchers from Goodfire, Google DeepMind, Decode, Eleuther, and Anthropic wrote a post about tracing circuits in language models! We cover how to train replacement models and compute graphs of model internals, and even filmed a 2-hour walkthrough of interpreting some examples!

19 likes · 0 replies · 1 repost

ludwig

@ludwigabap

4 months ago

The "Circuit Analysis Research Landscape" for August 2025 is out and is an interesting read on "the landscape of interpretability methods" and model biology. Qwen3 4B is also out on Circuit Tracer.

108 likes · 4 replies · 13 reposts

Chris Olah

@ch402

4 months ago

Our interpretability team is planning to mentor more fellows this cycle! Applications are due Aug 17.

310 likes · 10 replies · 18 reposts