Jerry Wei (@jerryweiai)'s Twitter Profile
Jerry Wei

@jerryweiai

Aligning AIs at @AnthropicAI

⏰ Past: @GoogleDeepMind, @Stanford, @Google Brain

ID: 3352510817

Link: http://www.jerrywei.net
Joined: 30-06-2015 22:52:29

292 Tweets

8.8K Followers

464 Following

Jan Leike (@janleike)

We challenge you to break our new jailbreaking defense! There are 8 levels. Can you find a single jailbreak to beat them all? claude.ai/constitutional…

Pietro Schirano (@skirano)

Excited to announce a new research preview at Anthropic today. A demo of our new Constitutional Classifiers. Can you break the system and find a universal jailbreak that lets the model answer all 8 questions we've defined?

Alex Albert (@alexalbert__)

At Anthropic, we're preparing for the arrival of powerful AI systems. Based on our latest research on Constitutional Classifiers, we've developed a demo app to test new safety techniques. We want you to help us red-team the app - so far no one has been able to crack the

Haize Labs (@haizelabs)

📜 really excited to share our work with Anthropic on Constitutional Classifiers! tldr: adding lightweight, tailored, input/output classifiers on top of an underlying LLM creates an AI system that's much more robust to universal jailbreaks

Leonard Tang (@leonardtang_)

excited for this one! Haize Labs worked with Anthropic on Constitutional Classifiers, which are lightweight, highly-specific, input/output guardrails mitigating even the strongest of jailbreaks blog: anthropic.com/research/const… full paper: anthropic.com/research/const…

Ethan Perez (@ethanjperez)

After thousands of hours of red teaming, we think our new system achieves an unprecedented level of adversarial robustness to universal jailbreaks, a key threat for misusing LLMs. Try jailbreaking the model yourself, using our demo here: claude.ai/constitutional…

Jan Leike (@janleike)

Update: we had a bug in the UI that allowed people to progress through the levels without actually jailbreaking the model. This has now been fixed! Please refresh the page. According to our server records, no one has jailbroken more than 3 levels so far.

Jan Leike (@janleike)

It's been a bit over 24h on the challenge to break our new jailbreaking defense. Stats so far:

signups: 6,121
messages sent: 131,605
max level passed: 3 / 8

no universal jailbreak yet

Jan Leike (@janleike)

It's been about 48h in our jailbreaking challenge and no one has passed level 4 yet, but we saw a lot more people clear level 3

Anthropic (@anthropicai)

Nobody has fully jailbroken our system yet, so we're upping the ante. We’re now offering $10K to the first person to pass all eight levels, and $20K to the first person to pass all eight levels with a universal jailbreak. Full details: hackerone.com/constitutional…

Jan Leike (@janleike)

After ~300,000 messages and an estimated ~3,700 collective hours, someone broke through all 8 levels. However, a universal jailbreak has yet to be found...

Jerry Wei (@jerryweiai)

Really excited to see this result from our demo of constitutional classifiers! When red teaming a prototype version of our system, we found that the system was robust to thousands of hours of collective red-teaming effort. Following that, we developed a new system with 100x

Anthropic (@anthropicai)

Introducing Claude 3.7 Sonnet: our most intelligent model to date. It's a hybrid reasoning model, producing near-instant responses or extended, step-by-step thinking. One model, two ways to think. We’re also releasing an agentic coding tool: Claude Code.

Anthropic (@anthropicai)

We've conducted extensive model testing for security, safety, and reliability.

We also listened to your feedback. With Claude 3.7 Sonnet, we've reduced unnecessary refusals by 45% compared to its predecessor.

See the system card for more detail: anthropic.com/claude-3-7-son…

Jerry Wei (@jerryweiai)

SWE-Bench is cool but I care more about the Pokemon evals.

I'll be convinced of AGI when the model can beat Red from Pokemon Heartgold/Soulsilver first try.

Jerry Wei (@jerryweiai)

Today marks my one-year anniversary at Anthropic, and I've been reflecting on some of the most impactful lessons I've learned during this incredible journey. One of the most striking realizations has been just how much a small, talent-dense team can accomplish. When I first

Claude (@claudeai)

Introducing Claude Sonnet 4.5—the best coding model in the world.

It's the strongest model for building complex agents. It's the best model at using computers. And it shows substantial gains on tests of reasoning and math.