Giorgi Giglemiani (@giglema)'s Twitter Profile
Giorgi Giglemiani

@giglema

ID: 1497280181249232901

Joined: 25-02-2022 18:40:03

1 Tweet

13 Followers

156 Following

Robert Kirk (@_robertkirk):

New blog! We at AI Security Institute partnered with NCSC UK to write about an emerging practice I'm really excited about: Safeguard Bypass Bounty Programmes (SBBPs). A summary of what these are, why they are useful, & how to do them well 🧵

Robert Kirk (@_robertkirk):

We at AI Security Institute recently did our first pre-deployment 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 evaluation of Anthropic's Claude Sonnet 4.5! This was a first attempt – and we plan to work on this more! – but we still found some interesting results, and some learnings for next time 🧵

Xander Davies (@alxndrdavies):

1) We've found universal jailbreaks for every system we've tested. This includes universal jailbreaks that are simple to use and don't degrade capabilities. All of these were found within a few days of attacking. So expert red teamers are still on top for now!

Xander Davies (@alxndrdavies):

This is the paper I'm most proud of to date! We built the first automated jailbreaking method that finds universal jailbreaks against Constitutional Classifiers and GPT-5's Input Classifiers. How & why we did it 🧵

Xander Davies (@alxndrdavies):

The Red Team at AI Security Institute is hiring! We work with frontier AI companies to red team their misuse safeguards, control measures, and alignment techniques. As the stakes rise, we need much stronger red teaming and many more talented researchers working within gov 🧵

Xander Davies (@alxndrdavies):

We at AI Security Institute tested GPT-5.5's cyber safeguards, developing a universal jailbreak in 6 hours of red teaming. AISI also performed cyber capabilities testing -- more in the system card.
