nostalgebraist (@nostalgebraist)'s Twitter Profile
nostalgebraist

@nostalgebraist

ID: 446638118

https://nostalgebraist.tumblr.com · Joined 26-12-2011 00:11:56

752 Tweets

1.1K Followers

405 Following

Transluce (@transluceai)

At Transluce, we train investigator agents to surface specific behaviors in other models. Can this approach scale to frontier LMs? We find it can, even with a much smaller investigator! We use an 8B model to automatically jailbreak GPT-5, Claude Opus 4.1 & Gemini 2.5 Pro. (1/)

Transluce (@transluceai)

We’re open-sourcing Docent under an Apache 2.0 license. Check out our public codebase to self-host Docent, peek under the hood, or open issues & pull requests! The hosted version remains the easiest way to get started with one click and use Docent with zero maintenance overhead.

Transluce (@transluceai)

Can LMs learn to faithfully describe their internal features and mechanisms? In our new paper led by Research Fellow Belinda Li, we find that they can—and that models explain themselves better than other models do.

Transluce (@transluceai)

Transluce is partnering with SWE-bench to make their agent trajectories publicly available on Docent! You can now view transcripts via links on the SWE-bench leaderboard.

Transluce (@transluceai)

Is your LM secretly an SAE? Most circuit-finding interpretability methods use learned features rather than raw activations, based on the belief that neurons do not cleanly decompose computation. In our new work, we show MLP neurons actually do support sparse, faithful circuits!

Transluce (@transluceai)

What do AI assistants think about you, and how does this shape their answers? Because assistants are trained to optimize human feedback, how they model users drives issues like sycophancy, reward hacking, and bias. We provide data + methods to extract & steer these user models.