dron (@_dron_h) 's Twitter Profile
dron

@_dron_h

math/music/ai nerd | cs @ cambridge | prev chai, sparc, polaris | giving a semantics to the syntax

ID: 1114523137859670016

Link: http://garden.dronhazra.com · Joined: 06-04-2019 13:39:57

1.1K Tweets

261 Followers

425 Following

Luke Bailey (@lukebailey181) 's Twitter Profile Photo

Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
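A minimal sketch of the flavor of attack being described, under toy assumptions (a random linear probe, a random stand-in for the downstream computation, and a single activation vector; none of this is the paper's actual setup): optimize a perturbation that drives the probe's harmfulness score down while penalizing changes to what flows downstream.

```python
import torch

d_model = 512
probe_w = torch.randn(d_model)          # stand-in linear harmfulness probe
probe_b = torch.tensor(0.0)
h = torch.randn(1, d_model)             # original activations at some layer
W_out = torch.randn(d_model, d_model)   # stand-in for the downstream computation

def probe_score(x):
    return torch.sigmoid(x @ probe_w + probe_b)

delta = torch.zeros_like(h, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)

for _ in range(200):
    x = h + delta
    # lower the probe score while keeping the downstream output nearly unchanged
    loss = probe_score(x).mean() + 10.0 * torch.norm((x - h) @ W_out)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"probe score before: {probe_score(h).item():.3f}  after: {probe_score(h + delta).item():.3f}")
```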

Anthropic (@anthropicai) 's Twitter Profile Photo

We solicited external reviews from Prof. Jacob Andreas, Prof. Yoshua Bengio, Prof. Jasjeet Sekhon, and Dr. Rohin Shah. We’re grateful for their comments, which you can read at the following link: assets.anthropic.com/m/24c8d0a3a7d0…

ARC Prize (@arcprize) 's Twitter Profile Photo

New verified ARC-AGI-Pub SoTA!

OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation.

And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval.

1/4
Eric Wallace (@eric_wallace_) 's Twitter Profile Photo

Chain-of-thought reasoning provides a natural avenue for improving model safety. Today we are publishing a paper on how we train the "o" series of models to think carefully through unsafe prompts: openai.com/index/delibera……

Transluce (@transluceai) 's Twitter Profile Photo

To interpret AI benchmarks, we need to look at the data. Top-level numbers don't mean what you think: there may be broken tasks, unexpected behaviors, or near-misses. We're introducing Docent to accelerate analysis of AI agent transcripts. It can spot surprises in seconds. 🧵👇

Kevin Meng (@mengk20) 's Twitter Profile Photo

AI models are *not* solving problems the way we think

using Docent, we find that Claude solves *broken* eval tasks - memorizing answers & hallucinating them!

details in 🧵

we really need to look at our data harder, and it's time to rethink how we do evals...
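As a toy illustration of what "looking at the data" can surface (this is not Docent, just a hypothetical heuristic): flag eval items whose prompt looks broken or empty yet whose answer still matches the reference, one signature of a memorized answer.

```python
def flag_suspicious(transcripts):
    """transcripts: iterable of dicts with 'prompt', 'answer', and 'reference' keys."""
    flags = []
    for i, t in enumerate(transcripts):
        broken_prompt = len(t["prompt"].strip()) < 20           # crude "broken task" heuristic
        correct = t["answer"].strip() == t["reference"].strip()
        if broken_prompt and correct:
            flags.append((i, "correct answer despite a possibly broken prompt"))
    return flags

sample = [
    {"prompt": "", "answer": "42", "reference": "42"},                        # suspicious
    {"prompt": "What is 6 * 7? Show your work.", "answer": "42", "reference": "42"},
]
print(flag_suspicious(sample))   # flags only the first item
```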
dron (@_dron_h) 's Twitter Profile Photo

i've been working on this for the past few months! excited to share some initial results we've found trying to interpret a big reasoning model

Lee Sharkey (@leedsharkey) 's Twitter Profile Photo

I've got some big personal news: 

I'm joining Goodfire to lead a fundamental interpretability research team in London!

This has been a while coming /n
Goodfire (@goodfireai) 's Twitter Profile Photo

Today, we're announcing our $50M Series A and sharing a preview of Ember - a universal neural programming platform that gives direct, programmable access to any AI model's internal thoughts.

Goodfire (@goodfireai) 's Twitter Profile Photo

We created a canvas that plugs into an image model’s brain. You can use it to generate images in real-time by painting with the latent concepts the model has learned. Try out Paint with Ember for yourself 👇
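A rough sketch of the underlying idea with made-up shapes and a random "concept" vector (not Ember's actual API): add a learned concept direction to an image model's activation map only where a user-drawn mask is set.

```python
import torch

acts = torch.randn(1, 64, 32, 32)    # hypothetical image-model activations (C, H, W)
concept = torch.randn(64)            # hypothetical learned concept direction
mask = torch.zeros(32, 32)
mask[8:16, 8:16] = 1.0               # the region the user "painted"

strength = 3.0
painted = acts + strength * concept.view(1, 64, 1, 1) * mask.view(1, 1, 32, 32)
# `painted` would then be fed back through the rest of the model to decode an image
print(painted.shape)
```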

Nick (@nickcammarata) 's Twitter Profile Photo

if you really understand a neural network you should be able to explain and edit anything in the model by directly manipulating the activation tensor. we made a demo of this with diffusion models
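One hedged way to picture "editing by manipulating the activation tensor", using a PyTorch forward hook on a toy conv net; a real demo would hook a diffusion model's U-Net block, and the concept direction here is just random.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(8, 3, 3, padding=1))

concept = torch.randn(8, 1, 1)   # hypothetical "concept" direction in channel space
strength = 2.0

def edit_activations(module, inputs, output):
    # add the concept direction at every spatial position of the activation tensor
    return output + strength * concept

handle = net[0].register_forward_hook(edit_activations)
x = torch.randn(1, 3, 32, 32)
edited = net(x)
handle.remove()
baseline = net(x)
print("edit changed the output:", not torch.allclose(edited, baseline))
```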

Jack Merullo (@jack_merullo_) 's Twitter Profile Photo

Could we tell if gpt-oss was memorizing its training data? I.e., points where it’s reasoning vs reciting? We took a quick look at the curvature of the loss landscape of the 20B model to understand memorization and what’s happening internally during reasoning
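For intuition, a hedged, toy-scale version of the kind of curvature measurement this refers to: a single Hessian-vector product giving a v^T H v sharpness estimate, on a stand-in linear model rather than the 20B MoE.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                       # stand-in for the 20B model
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
loss = nn.functional.cross_entropy(model(x), y)

params = [p for p in model.parameters() if p.requires_grad]
grads = torch.autograd.grad(loss, params, create_graph=True)

# random probe vector, one Hessian-vector product, and the Rayleigh quotient as a sharpness proxy
v = [torch.randn_like(p) for p in params]
hv = torch.autograd.grad(sum((g * vi).sum() for g, vi in zip(grads, v)), params)
curvature = sum((h * vi).sum() for h, vi in zip(hv, v)).item()
print(f"v^T H v sharpness estimate: {curvature:.4f}")
```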
Curt Tigges (@curttigges) 's Twitter Profile Photo

Some neat results from hacking on gpt-oss at the Goodfire internal hackathon this week:

1. MoE experts are... actually experts?
2. The model seems to know which experts it's going to use for a token from the very first layer of the model.

Here we see the "business expert":
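A hedged sketch of how one might test the "router is predictable from layer 0" claim: fit a linear probe from first-layer residuals to the expert a later layer selects. The data below is synthetic and bakes the relationship in by construction, so it only illustrates the probing recipe, not the finding; a real test would use cached gpt-oss activations and router decisions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d, n_experts = 2000, 64, 8
layer0_resid = rng.normal(size=(n, d))        # stand-in first-layer residual streams
router_w = rng.normal(size=(d, n_experts))    # stand-in for a later layer's router
chosen_expert = (layer0_resid @ router_w).argmax(axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(layer0_resid, chosen_expert, random_state=0)
probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
print(f"probe accuracy vs. {1 / n_experts:.2f} chance: {probe.score(X_te, y_te):.2f}")
```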
Goodfire (@goodfireai) 's Twitter Profile Photo

New research! Post-training often causes weird, unwanted behaviors that are hard to catch before deployment because they only crop up rarely - then are found by bewildered users. How can we find these efficiently? (1/7)