Hadas Orgad (@orgadhadas)'s Twitter Profile
Hadas Orgad

@orgadhadas

PhD student @ Technion | Focused on AI interpretability, robustness & safety | Because black boxes don’t belong in critical systems

ID: 1121835405454786561

Link: https://orgadhadas.github.io/

Joined: 26-04-2019 17:56:18

171 Tweets

456 Followers

116 Following

Tal Haklay (@tal_haklay)

We received more submissions to our Actionable Interpretability workshop (@ActInterp) than expected, and we're now looking for additional reviewers! We're seeking reviewers to handle 2–3 papers between May 24 – June 7. Sign up here: forms.gle/FLToWY3keb832n… Thank you! 🙏

Tomer Ashuach (@tomerashuach)

🚨New paper at #ACL2025 Findings! REVS: Unlearning Sensitive Information in LMs via Rank Editing in the Vocabulary Space. LMs memorize and leak sensitive data—emails, SSNs, URLs from their training. We propose a surgical method to unlearn it. 🧵👇w/Yonatan Belinkov Martin Tutek 1/8

Hadas Orgad (@orgadhadas)

A really interesting paper by Dana and others, dividing SAE features into two groups: input and output features. Output features are actually pretty useful for steering!

Hadas Orgad (@orgadhadas)

I'm excited that our @ActInterp workshop at @icmlconf received over 150 submissions! We had to expand our reviewer pool to accommodate all submissions. I hope this reflects a growing interest in more actionable approaches to interpretability.

Yaniv Nikankin (@ynikankin)

VLMs perform better when answering questions about text than when answering the same questions about images - but why? and how can we fix it? We investigate this gap from a mechanistic interpretability perspective, and use our findings to close a third of it! 🧵

Mor Geva (@megamor2)

Going to #icml2025? Don't miss the Actionable Interpretability Workshop (@ActInterp)! We've got an amazing lineup of speakers, panelists, and papers, all focused on leveraging insights from interpretability research to tackle practical, real-world problems ✨

Zorik Gekhman (@zorikgekhman)

Now accepted to #COLM2025! We formally define hidden knowledge in LLMs and show its existence in a controlled study. We even show that a model can know the answer yet fail to generate it in 1,000 attempts 😵 Looking forward to presenting and discussing our work in person.

Hadas Orgad (@orgadhadas)

After a thunderstorm cancelled my flight, I finally made it to Vancouver for #ICML2025 and the @ActInterp workshop! DM if you want to chat about using interpretability for safer and more controllable AI. We’ll also present the Mech-Interp Benchmark (MIB) on Thu @ 11:00, come by!

Hadas Orgad (@orgadhadas)

Hope everyone’s getting the most out of #icml25. We’re excited and ready for the Actionable Interpretability (@ActInterp) workshop this Saturday! Check out the schedule and join us to discuss how we can move interpretability toward more practical impact.

Tomer Ashuach (@tomerashuach)

🎉 Presenting my poster today at #ACL2025 ! REVS: Unlearning Sensitive Info in LMs via Rank Editing Come by to chat about unlearning memorized info without gradients. 🕥 10:30–12:00 With Yonatan Belinkov & Martin Tutek 📄 Paper: arxiv.org/abs/2405.18100 🌐 Website: technion-cs-nlp.github.io/REVS/

Amir Zur (@amirzur2000)

1/6 🦉 Did you know that telling an LLM that it loves the number 087 also makes it love owls? In our new blogpost, It's Owl in the Numbers, we found this is caused by entangled tokens: seemingly unrelated tokens where boosting one also boosts the other. owls.baulab.info