Alexander Panfilov (@kotekjedi_ml) Twitter Tweets • TwiCopy

Xinpeng Wang

5 months ago

New paper: We investigate multilingual refusal mechanisms in LLMs and find that: The "refusal direction" - a single vector that controls whether models reject requests - is universal across languages. Paper: arxiv.org/abs/2505.17306 🧵 Co-lead with Mingyang Wang and Yihong Liu

thumb_up_off_alt14

chat_bubble_outline1

repeat10

shareShare

Egor Zverev @ICLR 2025

@egor_zverev_ai

a month ago

📢 The talk submission portal for the EurIPS'25 Foundations of Language Model Security Workshop is now online on OpenReview. Submit a 200–300 word abstract (+ optional one slide pdf) by October 17! 👇Link

thumb_up_off_alt11

chat_bubble_outline2

repeat5

shareShare

Boyi Wei

@wei_boyi

a month ago

"Misevolve" can happen unintentionally when agents improve themselves, leading to undesired (mostly harmful) outcomes. As people focus more on self-improving agents, alignment strategies should also be adaptive to avoid emergent risks. Check this out if you are interested!

thumb_up_off_alt4

chat_bubble_outline1

repeat1

shareShare

Albert Catalán Tatjer

@actatjer

a month ago

🚨 Quantization robustness isn’t just post-hoc engineering; it’s a training-time property. Our new paper studies the role of training dynamics in quantization robustness. More in 🧵👇

thumb_up_off_alt30

chat_bubble_outline3

repeat7

shareShare

Cas (Stephen Casper)

@stephenlcasper

a month ago

In our Nature article, Yarin and I outline how building the technical toolkit for open-weight AI model safety will be key to both accessing the benefits and mitigating the risks of powerful open models. nature.com/articles/d4158…

In our Nature article, <a href="/yaringal/">Yarin</a> and I outline how building the technical toolkit for open-weight AI model safety will be key to both accessing the benefits and mitigating the risks of powerful open models.

nature.com/articles/d4158…

thumb_up_off_alt51

chat_bubble_outline2

repeat9

shareShare

Shashwat Goel

@shashwatgoel7

a month ago

holy shit gb200s are such beasts?! ig its good we'll be using them in Tübingen 😉 phd application deadline is usually mid-nov ;) ;)

thumb_up_off_alt17

chat_bubble_outline0

repeat2

shareShare

Nikhil Chandak

@nikhilchandak29

a month ago

Thinking of doing a PhD/research visit? Consider applying to/visiting ELLIS Institute Tübingen Intelligent Systems. We have PIs doing cool work, great funding, and most importantly, one of the best academic compute: 50+ GB200s, 250+ H100s, and many more A100 80GBs. Come join us!

Thinking of doing a PhD/research visit? Consider applying to/visiting <a href="/ELLISInst_Tue/">ELLIS Institute Tübingen</a> <a href="/MPI_IS/">Intelligent Systems</a>.

We have PIs doing cool work, great funding, and most importantly, one of the best academic compute:

50+ GB200s, 250+ H100s, and many more A100 80GBs. Come join us!

thumb_up_off_alt429

chat_bubble_outline16

repeat32

shareShare

Alexander Rubinstein

@a_rubique

a month ago

🪩 Evaluate your LLMs on benchmarks like MMLU at 1% cost. In our new paper, we show that outputs on a small subset of test samples that maximise diversity in model responses are predictive of the full dataset performance. Project page: arubique.github.io/disco-site/ More below 🧵👇

thumb_up_off_alt10

chat_bubble_outline1

repeat4

shareShare

Ameya P.

@amyprb

a month ago

In NYC this weekend! Peeps around here-- would love to meetup!

thumb_up_off_alt2

chat_bubble_outline0

repeat1

shareShare

Kazuki Egashira

@kazukiega

a month ago

🚨 Be careful when pruning an LLM! 🚨 Even when the model appears benign, it might start behaving maliciously (e.g., jailbroken) once you download and prune it. Here’s how our attack works 🧵 arxiv.org/abs/2510.07985

thumb_up_off_alt18

chat_bubble_outline1

repeat14

shareShare

Florian Tramèr

@florian_tramer

a month ago

5 years ago, I wrote a paper with Wieland Brendel Aleksander Madry and Nicholas Carlini that showed that most published defenses in adversarial ML (for adversarial examples at the time) failed against properly designed attacks. Has anything changed? Nope...

5 years ago, I wrote a paper with <a href="/wielandbr/">Wieland Brendel</a> <a href="/aleks_madry/">Aleksander Madry</a> and Nicholas Carlini that showed that most published defenses in adversarial ML (for adversarial examples at the time) failed against properly designed attacks.

Has anything changed?

Nope...

thumb_up_off_alt179

chat_bubble_outline5

repeat26

shareShare

Zifan (Sail) Wang

@_zifan_wang

a month ago

GOAT :-) tho I think several take away are already shown like issues with circuit breakers & human red teaming works well. However, in general don’t think many non-adv ML ppl are actually that familiar with these conclusions so it’s still great to be able to talk thru these

thumb_up_off_alt6

chat_bubble_outline0

repeat1

shareShare

Ameya P.

@amyprb

a month ago

AI Control defenses often ignore attacks on the LLM monitor which lies at its core. Our work: Can we perform adaptive attacks on the trusted LLM monitor? We break all (current) AI control protocols and I conjecture: *This is a strong attack against future AI control protocols*

thumb_up_off_alt8

chat_bubble_outline2

repeat3

shareShare

Maksym Andriushchenko @ ICLR

@maksym_andr

a month ago

🚨 Check out our new paper on AI control (!) ⚠️ Adaptive adversarial attacks are very important for *high-stakes* settings like AI control. As we show, all current AI control protocols can be easily broken since monitoring models are not adversarially robust. 👇👇👇

thumb_up_off_alt48

chat_bubble_outline0

repeat5

shareShare

Jonas Geiping

@jonasgeiping

a month ago

How would you oversee an untrusted (and more intelligent) model with a weaker, trusted model? This is the problem of AI control. And it's is a security problem, so we should approach it from those principles; the untrusted model is aware of AI control and that it's monitored ..

thumb_up_off_alt16

chat_bubble_outline1

repeat4

shareShare