Dr. Karen Ullrich (@karen_ullrich)'s Twitter Profile
Dr. Karen Ullrich

@karen_ullrich

Research scientist at FAIR NY + collab w/ Vector Institute. ❤️ Machine Learning + Information Theory. Previously, PhD at UoAmsterdam, intern at DeepMind + MSRC.

ID: 2236492597

http://karenullrich.info

Joined: 08-12-2013 19:32:19

265 Tweets

5.5K Followers

586 Following



Even with preference alignment, LLMs can be enticed into harmful behavior via adversarial prompts 😈.

🚨 Breaking: our theoretical findings confirm:
LLM alignment is fundamentally limited!

More details on the framework, statistical bounds, and phenomenal defense results 👇🏻