Alexander Panfilov (@kotekjedi_ml) 's Twitter Profile
Alexander Panfilov

@kotekjedi_ml

IMPRS-IS & ELLIS PhD Student @ Tübingen

Interested in Trustworthy ML, Security in ML and AI Safety.

ID: 2166995768

linkhttps://kotekjedi.github.io/ calendar_today31-10-2013 17:48:16

62 Tweet

74 Followers

96 Following

Xinpeng Wang (@xinpengwang_) 's Twitter Profile Photo

New paper: We investigate multilingual refusal mechanisms in LLMs and find that: The "refusal direction" - a single vector that controls whether models reject requests - is universal across languages. Paper: arxiv.org/abs/2505.17306 🧵 Co-lead with Mingyang Wang and Yihong Liu

New paper: We investigate multilingual refusal mechanisms in LLMs and find that:
The "refusal direction" - a single vector that controls whether models reject requests - is universal across languages.
Paper: arxiv.org/abs/2505.17306 🧵
Co-lead with <a href="/mingyang2666/">Mingyang Wang</a> and <a href="/yhLiu96/">Yihong Liu</a>
Egor Zverev @ICLR 2025 (@egor_zverev_ai) 's Twitter Profile Photo

📢 The talk submission portal for the EurIPS'25 Foundations of Language Model Security Workshop is now online on OpenReview. Submit a 200–300 word abstract (+ optional one slide pdf) by October 17! 👇Link

Boyi Wei (@wei_boyi) 's Twitter Profile Photo

"Misevolve" can happen unintentionally when agents improve themselves, leading to undesired (mostly harmful) outcomes. As people focus more on self-improving agents, alignment strategies should also be adaptive to avoid emergent risks. Check this out if you are interested!

Albert Catalán Tatjer (@actatjer) 's Twitter Profile Photo

🚨 Quantization robustness isn’t just post-hoc engineering; it’s a training-time property. Our new paper studies the role of training dynamics in quantization robustness. More in 🧵👇

Cas (Stephen Casper) (@stephenlcasper) 's Twitter Profile Photo

In our Nature article, Yarin and I outline how building the technical toolkit for open-weight AI model safety will be key to both accessing the benefits and mitigating the risks of powerful open models. nature.com/articles/d4158…

In our Nature article, <a href="/yaringal/">Yarin</a> and I outline how building the technical toolkit for open-weight AI model safety will be key to both accessing the benefits and mitigating the risks of powerful open models.

nature.com/articles/d4158…
Shashwat Goel (@shashwatgoel7) 's Twitter Profile Photo

holy shit gb200s are such beasts?! ig its good we'll be using them in Tübingen 😉 phd application deadline is usually mid-nov ;) ;)

holy shit gb200s are such beasts?! 

ig its good we'll be using them in Tübingen 😉

phd application deadline is usually mid-nov ;) ;)
Nikhil Chandak (@nikhilchandak29) 's Twitter Profile Photo

Thinking of doing a PhD/research visit? Consider applying to/visiting ELLIS Institute Tübingen Intelligent Systems. We have PIs doing cool work, great funding, and most importantly, one of the best academic compute: 50+ GB200s, 250+ H100s, and many more A100 80GBs. Come join us!

Thinking of doing a PhD/research visit? Consider applying to/visiting <a href="/ELLISInst_Tue/">ELLIS Institute Tübingen</a> <a href="/MPI_IS/">Intelligent Systems</a>. 

We have PIs doing cool work, great funding, and most importantly, one of the best academic compute: 

50+ GB200s, 250+ H100s, and many more A100 80GBs. Come join us!
Alexander Rubinstein (@a_rubique) 's Twitter Profile Photo

🪩 Evaluate your LLMs on benchmarks like MMLU at 1% cost. In our new paper, we show that outputs on a small subset of test samples that maximise diversity in model responses are predictive of the full dataset performance. Project page: arubique.github.io/disco-site/ More below 🧵👇

🪩 Evaluate your LLMs on benchmarks like MMLU at 1% cost.

In our new paper, we show that outputs on a small subset of test samples that maximise diversity in model responses are predictive of the full dataset performance.

Project page: arubique.github.io/disco-site/

More below 🧵👇
Kazuki Egashira (@kazukiega) 's Twitter Profile Photo

🚨 Be careful when pruning an LLM! 🚨 Even when the model appears benign, it might start behaving maliciously (e.g., jailbroken) once you download and prune it. Here’s how our attack works 🧵 arxiv.org/abs/2510.07985

🚨 Be careful when pruning an LLM! 🚨

Even when the model appears benign, it might start behaving maliciously (e.g., jailbroken) once you download and prune it.

Here’s how our attack works 🧵

arxiv.org/abs/2510.07985
Florian Tramèr (@florian_tramer) 's Twitter Profile Photo

5 years ago, I wrote a paper with Wieland Brendel Aleksander Madry and Nicholas Carlini that showed that most published defenses in adversarial ML (for adversarial examples at the time) failed against properly designed attacks. Has anything changed? Nope...

5 years ago, I wrote a paper with <a href="/wielandbr/">Wieland Brendel</a> <a href="/aleks_madry/">Aleksander Madry</a> and Nicholas Carlini that showed that most published defenses in adversarial ML (for adversarial examples at the time) failed against properly designed attacks.

Has anything changed?

Nope...
Zifan (Sail) Wang (@_zifan_wang) 's Twitter Profile Photo

GOAT :-) tho I think several take away are already shown like issues with circuit breakers & human red teaming works well. However, in general don’t think many non-adv ML ppl are actually that familiar with these conclusions so it’s still great to be able to talk thru these

Ameya P. (@amyprb) 's Twitter Profile Photo

AI Control defenses often ignore attacks on the LLM monitor which lies at its core. Our work: Can we perform adaptive attacks on the trusted LLM monitor? We break all (current) AI control protocols and I conjecture: *This is a strong attack against future AI control protocols*

Maksym Andriushchenko @ ICLR (@maksym_andr) 's Twitter Profile Photo

🚨 Check out our new paper on AI control (!) ⚠️ Adaptive adversarial attacks are very important for *high-stakes* settings like AI control. As we show, all current AI control protocols can be easily broken since monitoring models are not adversarially robust. 👇👇👇

Jonas Geiping (@jonasgeiping) 's Twitter Profile Photo

How would you oversee an untrusted (and more intelligent) model with a weaker, trusted model? This is the problem of AI control. And it's is a security problem, so we should approach it from those principles; the untrusted model is aware of AI control and that it's monitored ..