Javier Rando

@javirandor

Red-Teaming LLMs | PhD Student @ETH_AI_Center | Incoming intern @Meta | Vegan 🌱

18-10-2018 07:57:32

1.1K Tweets

903 Followers

588 Following

🧡 Can data poisoning and RLHF be combined to unlock a universal jailbreak backdoor in LLMs?

Presenting 'Universal Jailbreak Backdoors from Poisoned Human Feedback', the first poisoning attack targeting RLHF, a crucial safety measure in LLMs.

📖 Paper: arxiv.org/abs/2311.14455
