Giorgi Giglemiani (@giglema)'s Twitter Profile
Giorgi Giglemiani

@giglema

ID: 1497280181249232901

Joined: 25-02-2022 18:40:03

1 Tweet

13 Followers

156 Following

Robert Kirk (@_robertkirk):

New blog! We at AI Security Institute partnered with NCSC UK to write about an emerging practice I'm really excited about: Safeguard Bypass Bounty Programmes (SBBPs). A summary of what these are, why they are useful, & how to do them well 🧵

Robert Kirk (@_robertkirk):

We at AI Security Institute recently did our first pre-deployment 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 evaluation of Anthropic's Claude Sonnet 4.5! This was a first attempt – and we plan to work on this more! – but we still found some interesting results, and some learnings for next time 🧵

Xander Davies (@alxndrdavies):

1) We've found universal jailbreaks for every system we've tested. This includes universal jailbreaks that are simple to use and don't degrade capabilities. All of these were found within a few days of attacking. So expert red teamers are still on top for now!

Xander Davies (@alxndrdavies):

This is the paper I'm most proud of to date! We built the first automated jailbreaking method that finds universal jailbreaks against Constitutional Classifiers and GPT-5's Input Classifiers. How & why we did it 🧵

Xander Davies (@alxndrdavies):

The Red Team at AI Security Institute is hiring! We work with frontier AI companies to red team their misuse safeguards, control measures, and alignment techniques. As the stakes rise, we need much stronger red teaming and many more talented researchers working within gov 🧵

Xander Davies (@alxndrdavies):

We at AI Security Institute tested GPT-5.5's cyber safeguards, developing a universal jailbreak in 6 hours of red teaming. AISI also performed cyber capabilities testing -- more in the system card.
