Craig Citro (@craigcitro)'s Twitter Profile
Craig Citro

@craigcitro

i like math and puns | research engineer @anthropicai; previously: @GoogleColab, Google Bigquery, @sagemath, number theorist

ID: 94276164

Link: https://www.craigcitro.org/
Joined: 03-12-2009 07:05:50

3.3K Tweets

1.1K Followers

259 Following

Anthropic (@anthropicai)

New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.

Joshua Batson (@thebasepoint)

We did a thing. A new method for looking inside AI models, and ten deep dives on what we see. I spent my word budget on the paper, so today I'll just highlight some of the threads from the team. 🧵

Craig Citro (@craigcitro)

as I told several people, I was hoping to "dive into the deep end" when I joined anthropic. I was not sure what I expected, but it has been way, WAY above expectations. super proud of this work, check it out.

Jack Lindsey (@jack_w_lindsey)

Human thought is built out of billions of cellular computations each second. Language models also perform billions of computations for each word they write. But do these form a coherent “thought process?” We’re starting to build tools to find out! Some reflections in thread.

Wes Gurnee (@wesg52)

We tried to build a “microscope” to understand how Claude works. There are still many things which we cannot see clearly, but there are many exciting things that are coming into focus! A few reflections and exciting results:

Adam Pearce (@adamrpearce)

Addition has been extensively studied in simple toy models. In our latest paper, we describe a method for untangling circuits of computations and examine how Claude understands "calc: 36+59=" anthropic.com/research/traci…

Emmanuel Ameisen (@mlpowered)

We use language models like Claude to help us write, code, and think better. But we don’t understand how they work! We’ve built a new tool which allows us to look inside the model’s “brain” as it is “thinking.” Using it, we found really surprising behaviors 🧵

Emmanuel Ameisen (@mlpowered)

We've made progress in our quest to understand how Claude and models like it think! The paper has many fun and surprising case studies, that anyone who is interested in LLMs would enjoy. Check out the video below for an example

Nicholas Turner (@nicholasturner0)

As people that know me well can attest, I love a good mystery! 🔍 Fortunately for me, this work had twists both surprising and peculiar. 🧵

Trenton Bricken (@trentonbricken)

My favorite figure from our new Circuits papers: "How does Claude do math?"

Claude simultaneously does:
1. a back-of-the-envelope calculation of the tens digits: "the answer should be somewhere around 90".
2. an exact calculation of 6+9=15 using these super cool lookup…

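
The two parallel paths described above can be illustrated with a toy Python sketch. Everything here is an assumption for illustration, not Claude's actual circuit: the function names are hypothetical, the "approximate" path is faked as the true sum plus a bounded error, and the rule for combining the two paths (pick the number near the estimate that ends in the exact ones digit) is a simplified stand-in, since the tweet is cut off before it says how the paths are merged.

```python
# Toy illustration only: a hypothetical two-path adder, NOT Claude's actual circuit.


def exact_ones_digit(a: int, b: int) -> int:
    """Precise path: only the ones digits matter (6 + 9 = 15, so the sum ends in 5)."""
    return (a % 10 + b % 10) % 10


def fuzzy_magnitude(a: int, b: int, error: int = -3) -> int:
    """Approximate path (hypothetical stand-in): the true sum off by a bounded error.

    The toy only guarantees a correct answer when |error| <= 4.
    """
    if abs(error) > 4:
        raise ValueError("toy assumes the estimate is within 4 of the true sum")
    return a + b + error


def combine(estimate: int, ones: int) -> int:
    """Pick the number within 4 of the estimate that ends in the exact ones digit.

    Numbers sharing a ones digit are 10 apart, so at most one of them can fall
    in the 9-wide window [estimate - 4, estimate + 4].
    """
    matches = [n for n in range(estimate - 4, estimate + 5) if n % 10 == ones]
    if len(matches) != 1:
        raise ValueError("no unique candidate near the estimate")
    return matches[0]


def toy_add(a: int, b: int, error: int = -3) -> int:
    """Run both paths and merge them."""
    return combine(fuzzy_magnitude(a, b, error), exact_ones_digit(a, b))


print(exact_ones_digit(36, 59))  # 5
print(fuzzy_magnitude(36, 59))   # 92, i.e. "somewhere around 90"
print(toy_add(36, 59))           # 95
```

The window works because an estimate within 4 of the true sum leaves exactly one candidate with the right ones digit; an estimate of exactly 90 would sit equidistant from 85 and 95, so the rough path has to be somewhat sharper than "around 90" for the merge to be unambiguous.
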
Jack Lindsey (@jack_w_lindsey)

We're launching an "AI psychiatry" team as part of interpretability efforts at Anthropic! We'll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors. We're hiring - join us! job-boards.greenhouse.io/anthropic/jobs…

Chris Olah (@ch402)

It's been a busy week for the Anthropic interpretability team, with more to come in the near future! I wanted to recap some of the things we shared.