Jerry Wei (@jerryweiai) Twitter Tweets • TwiCopy

10 months ago

Update: we had a bug in the UI that allowed people to progress through the levels without actually jailbreaking the model. This has now been fixed! Please refresh the page. According to our server records, no one has jailbroken more than 3 levels so far.

Jan Leike

10 months ago

It's been a bit over 24h on the challenge to break our new jailbreaking defense. Stats so far:
signups: 6,121
messages sent: 131,605
max level passed: 3 / 8
no universal jailbreak yet

Jan Leike

10 months ago

It's been about 48h in our jailbreaking challenge and no one has passed level 4 yet, but we saw a lot more people clear level 3

Anthropic

10 months ago

Nobody has fully jailbroken our system yet, so we're upping the ante. We’re now offering $10K to the first person to pass all eight levels, and $20K to the first person to pass all eight levels with a universal jailbreak. Full details: hackerone.com/constitutional…

Jan Leike

10 months ago

4 days in: 12 people cleared level 4, and one person cracked level 5. The challenge continues...

Jan Leike

10 months ago

After ~300,000 messages and an estimated ~3,700 collective hours, someone broke through all 8 levels. However, a universal jailbreak has yet to be found...

Jerry Wei

10 months ago

Really excited to see this result from our demo of constitutional classifiers! When red teaming a prototype version of our system, we found that the system was robust to thousands of hours of collective red-teaming effort. Following that, we developed a new system with 100x

Anthropic

10 months ago

Introducing Claude 3.7 Sonnet: our most intelligent model to date. It's a hybrid reasoning model, producing near-instant responses or extended, step-by-step thinking. One model, two ways to think. We’re also releasing an agentic coding tool: Claude Code.

Anthropic

10 months ago

We've conducted extensive model testing for security, safety, and reliability. We also listened to your feedback. With Claude 3.7 Sonnet, we've reduced unnecessary refusals by 45% compared to its predecessor. See the system card for more detail: anthropic.com/claude-3-7-son…

Jerry Wei

10 months ago

SWE-Bench is cool but I care more about the Pokemon evals. I'll be convinced of AGI when the model can beat Red from Pokemon Heartgold/Soulsilver first try.

Jerry Wei

10 months ago

watch claude 3.7 sonnet try to beat pokemon live
gotta catch em all 👇
twitch.tv/claudeplayspok…

Anthropic

9 months ago

Claude can now search the web. Each response includes inline citations, so you can also verify the sources.

Jerry Wei