Alexander Panfilov
@kotekjedi_ml
IMPRS-IS & ELLIS PhD Student @ Tübingen
Interested in Trustworthy ML, Security in ML and AI Safety.
ID: 2166995768
https://kotekjedi.github.io/
31-10-2013 17:48:16
62 Tweets
74 Followers
96 Following
New paper: We investigate multilingual refusal mechanisms in LLMs and find that the "refusal direction", a single vector that controls whether models reject requests, is universal across languages. Paper: arxiv.org/abs/2505.17306 🧵 Co-lead with Mingyang Wang and Yihong Liu
Thinking of doing a PhD/research visit? Consider applying to/visiting ELLIS Institute Tübingen Intelligent Systems. We have PIs doing cool work, great funding, and most importantly, some of the best academic compute: 50+ GB200s, 250+ H100s, and many more A100 80GBs. Come join us!
5 years ago, I wrote a paper with Wieland Brendel, Aleksander Madry, and Nicholas Carlini that showed that most published defenses in adversarial ML (for adversarial examples at the time) failed against properly designed attacks. Has anything changed? Nope...