nostalgebraist (@nostalgebraist)'s Twitter Profile
nostalgebraist

@nostalgebraist

ID: 446638118

https://nostalgebraist.tumblr.com · Joined 26-12-2011 00:11:56

752 Tweets

1.1K Followers

405 Following

Transluce (@transluceai)

At Transluce, we train investigator agents to surface specific behaviors in other models. Can this approach scale to frontier LMs? We find it can, even with a much smaller investigator! We use an 8B model to automatically jailbreak GPT-5, Claude Opus 4.1 & Gemini 2.5 Pro. (1/)

Transluce (@transluceai)

We’re open-sourcing Docent under an Apache 2.0 license. Check out our public codebase to self-host Docent, peek under the hood, or open issues & pull requests! The hosted version remains the easiest way to get started with one click and use Docent with zero maintenance overhead.

Transluce (@transluceai)

Can LMs learn to faithfully describe their internal features and mechanisms? In our new paper led by Research Fellow Belinda Li, we find that they can—and that models explain themselves better than other models do.

Transluce (@transluceai)

Transluce is partnering with SWE-bench to make their agent trajectories publicly available on Docent! You can now view transcripts via links on the SWE-bench leaderboard.

Transluce (@transluceai)

Is your LM secretly an SAE? Most circuit-finding interpretability methods use learned features rather than raw activations, based on the belief that neurons do not cleanly decompose computation. In our new work, we show MLP neurons actually do support sparse, faithful circuits!

Transluce (@transluceai)

What do AI assistants think about you, and how does this shape their answers? Because assistants are trained to optimize human feedback, how they model users drives issues like sycophancy, reward hacking, and bias. We provide data + methods to extract & steer these user models.