dron (@_dron_h) 's Twitter Profile
dron

@_dron_h

math/music/ai nerd | cs @ cambridge | prev chai, sparc, polaris | giving a semantics to the syntax

ID: 1114523137859670016

Link: http://garden.dronhazra.com · Joined: 06-04-2019 13:39:57

1.1K Tweets

261 Followers

425 Following

Luke Bailey (@lukebailey181) 's Twitter Profile Photo

Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
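A minimal sketch of the flavor of attack being described, under toy assumptions (a random linear probe, a random stand-in for the downstream computation, and a single activation vector; none of this is the paper's actual setup): optimize a perturbation that drives the probe's harmfulness score down while penalizing changes to what flows downstream.

```python
import torch

d_model = 512
probe_w = torch.randn(d_model)          # stand-in linear harmfulness probe
probe_b = torch.tensor(0.0)
h = torch.randn(1, d_model)             # original activations at some layer
W_out = torch.randn(d_model, d_model)   # stand-in for the downstream computation

def probe_score(x):
    return torch.sigmoid(x @ probe_w + probe_b)

delta = torch.zeros_like(h, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)

for _ in range(200):
    x = h + delta
    # lower the probe score while keeping the downstream output nearly unchanged
    loss = probe_score(x).mean() + 10.0 * torch.norm((x - h) @ W_out)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"probe score before: {probe_score(h).item():.3f}  after: {probe_score(h + delta).item():.3f}")
```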

Anthropic (@anthropicai) 's Twitter Profile Photo

We solicited external reviews from Prof. Jacob Andreas, Prof. Yoshua Bengio, Prof. Jasjeet Sekhon, and Dr. Rohin Shah. We’re grateful for their comments, which you can read at the following link: assets.anthropic.com/m/24c8d0a3a7d0…

ARC Prize (@arcprize) 's Twitter Profile Photo

New verified ARC-AGI-Pub SoTA!

OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation.

And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval.

1/4
Eric Wallace (@eric_wallace_) 's Twitter Profile Photo

Chain-of-thought reasoning provides a natural avenue for improving model safety. Today we are publishing a paper on how we train the "o" series of models to think carefully through unsafe prompts: openai.com/index/delibera……

Transluce (@transluceai) 's Twitter Profile Photo

To interpret AI benchmarks, we need to look at the data. Top-level numbers don't mean what you think: there may be broken tasks, unexpected behaviors, or near-misses. We're introducing Docent to accelerate analysis of AI agent transcripts. It can spot surprises in seconds. 🧵👇

Kevin Meng (@mengk20) 's Twitter Profile Photo

AI models are *not* solving problems the way we think

using Docent, we find that Claude solves *broken* eval tasks - memorizing answers & hallucinating them!

details in 🧵

we really need to look at our data harder, and it's time to rethink how we do evals...
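As a toy illustration of what "looking at the data" can surface (this is not Docent, just a hypothetical heuristic): flag eval items whose prompt looks broken or empty yet whose answer still matches the reference, one signature of a memorized answer.

```python
def flag_suspicious(transcripts):
    """transcripts: iterable of dicts with 'prompt', 'answer', and 'reference' keys."""
    flags = []
    for i, t in enumerate(transcripts):
        broken_prompt = len(t["prompt"].strip()) < 20           # crude "broken task" heuristic
        correct = t["answer"].strip() == t["reference"].strip()
        if broken_prompt and correct:
            flags.append((i, "correct answer despite a possibly broken prompt"))
    return flags

sample = [
    {"prompt": "", "answer": "42", "reference": "42"},                        # suspicious
    {"prompt": "What is 6 * 7? Show your work.", "answer": "42", "reference": "42"},
]
print(flag_suspicious(sample))   # flags only the first item
```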
dron (@_dron_h) 's Twitter Profile Photo

i've been working on this for the past few months! excited to share some initial results we've found trying to interpret a big reasoning model

Lee Sharkey (@leedsharkey) 's Twitter Profile Photo

I've got some big personal news: 

I'm joining Goodfire to lead a fundamental interpretability research team in London!

This has been a while coming /n
Goodfire (@goodfireai) 's Twitter Profile Photo

Today, we're announcing our $50M Series A and sharing a preview of Ember - a universal neural programming platform that gives direct, programmable access to any AI model's internal thoughts.

Goodfire (@goodfireai) 's Twitter Profile Photo

We created a canvas that plugs into an image model’s brain. You can use it to generate images in real-time by painting with the latent concepts the model has learned. Try out Paint with Ember for yourself 👇
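A rough sketch of the underlying idea with made-up shapes and a random "concept" vector (not Ember's actual API): add a learned concept direction to an image model's activation map only where a user-drawn mask is set.

```python
import torch

acts = torch.randn(1, 64, 32, 32)    # hypothetical image-model activations (C, H, W)
concept = torch.randn(64)            # hypothetical learned concept direction
mask = torch.zeros(32, 32)
mask[8:16, 8:16] = 1.0               # the region the user "painted"

strength = 3.0
painted = acts + strength * concept.view(1, 64, 1, 1) * mask.view(1, 1, 32, 32)
# `painted` would then be fed back through the rest of the model to decode an image
print(painted.shape)
```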

Nick (@nickcammarata) 's Twitter Profile Photo

if you really understand a neural network you should be able to explain and edit anything in the model by directly manipulating the activation tensor. we made a demo of this with diffusion models
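One hedged way to picture "editing by manipulating the activation tensor", using a PyTorch forward hook on a toy conv net; a real demo would hook a diffusion model's U-Net block, and the concept direction here is just random.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(8, 3, 3, padding=1))

concept = torch.randn(8, 1, 1)   # hypothetical "concept" direction in channel space
strength = 2.0

def edit_activations(module, inputs, output):
    # add the concept direction at every spatial position of the activation tensor
    return output + strength * concept

handle = net[0].register_forward_hook(edit_activations)
x = torch.randn(1, 3, 32, 32)
edited = net(x)
handle.remove()
baseline = net(x)
print("edit changed the output:", not torch.allclose(edited, baseline))
```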

Jack Merullo (@jack_merullo_) 's Twitter Profile Photo

Could we tell if gpt-oss was memorizing its training data? I.e., points where it’s reasoning vs reciting? We took a quick look at the curvature of the loss landscape of the 20B model to understand memorization and what’s happening internally during reasoning
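For intuition, a hedged, toy-scale version of the kind of curvature measurement this refers to: a single Hessian-vector product giving a v^T H v sharpness estimate, on a stand-in linear model rather than the 20B MoE.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                       # stand-in for the 20B model
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
loss = nn.functional.cross_entropy(model(x), y)

params = [p for p in model.parameters() if p.requires_grad]
grads = torch.autograd.grad(loss, params, create_graph=True)

# random probe vector, one Hessian-vector product, and the Rayleigh quotient as a sharpness proxy
v = [torch.randn_like(p) for p in params]
hv = torch.autograd.grad(sum((g * vi).sum() for g, vi in zip(grads, v)), params)
curvature = sum((h * vi).sum() for h, vi in zip(hv, v)).item()
print(f"v^T H v sharpness estimate: {curvature:.4f}")
```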
Curt Tigges (@curttigges) 's Twitter Profile Photo

Some neat results from hacking on gpt-oss at the Goodfire internal hackathon this week:

1. MoE experts are... actually experts?
2. The model seems to know which experts it's going to use for a token from the very first layer of the model.

Here we see the "business expert":
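A hedged sketch of how one might test the "router is predictable from layer 0" claim: fit a linear probe from first-layer residuals to the expert a later layer selects. The data below is synthetic and bakes the relationship in by construction, so it only illustrates the probing recipe, not the finding; a real test would use cached gpt-oss activations and router decisions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d, n_experts = 2000, 64, 8
layer0_resid = rng.normal(size=(n, d))        # stand-in first-layer residual streams
router_w = rng.normal(size=(d, n_experts))    # stand-in for a later layer's router
chosen_expert = (layer0_resid @ router_w).argmax(axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(layer0_resid, chosen_expert, random_state=0)
probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
print(f"probe accuracy vs. {1 / n_experts:.2f} chance: {probe.score(X_te, y_te):.2f}")
```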
Goodfire (@goodfireai) 's Twitter Profile Photo

New research! Post-training often causes weird, unwanted behaviors that are hard to catch before deployment because they only crop up rarely - then are found by bewildered users. How can we find these efficiently? (1/7)