Dr. Karen Ullrich

@karen_ullrich

Research scientist at FAIR NY + collab w/ Vector Institute. ❤️ Machine Learning + Information Theory. Previously, PhD at UoAmsterdam, intern at DeepMind + MSRC.

ID: 2236492597

Link: http://karenullrich.info

Date: 08-12-2013 19:32:19

265 Tweets

5.5K Followers

586 Following

a year ago

Even with preference alignment, LLMs can be enticed into harmful behavior via adversarial prompts 😈. 🚨 Breaking: our theoretical findings confirm: LLM alignment is fundamentally limited! More details, on framework, statistical bounds and phenomenal defense results 👇🏻

88 likes · 3 replies · 20 reposts

shareShare