
Lyna Kim
@hlynakim
@stanford | acq. founder | AI for Weather & Climate 🎈
ID: 1603614919739379712
16-12-2022 04:56:16
27 Tweets
71 Followers
147 Following


We all know ChatGPT can be jailbroken. But what makes these attacks possible? In a new paper (w/ Nika Haghtalab & Jacob Steinhardt), we delve into how safety training fails, identifying two key failure modes that enable jailbreaks: competing objectives and mismatched generalization 1/



