Hadas Orgad (@orgadhadas) Twitter Tweets • TwiCopy

Hadas Orgad

@orgadhadas

PhD student @ Technion | Focused on AI interpretability, robustness & safety | Because black boxes don’t belong in critical systems

ID: 1121835405454786561

Website: https://orgadhadas.github.io/ | Joined: 26-04-2019 17:56:18

171 Tweets

456 Followers

116 Following

Tal Haklay

@tal_haklay

6 months ago

We received more submissions to our Actionable Interpretability workshop (@ActInterp) than expected, and we're now looking for additional reviewers! We're seeking reviewers to handle 2–3 papers between May 24 and June 7. Sign up here: forms.gle/FLToWY3keb832n… Thank you! 🙏

Likes: 13 | Replies: 0 | Retweets: 3

Tomer Ashuach

@tomerashuach

6 months ago

🚨 New paper at #ACL2025 Findings! REVS: Unlearning Sensitive Information in LMs via Rank Editing in the Vocabulary Space. LMs memorize and leak sensitive data (emails, SSNs, URLs) from their training data. We propose a surgical method to unlearn it. 🧵👇 w/ Yonatan Belinkov and Martin Tutek 1/8

Likes: 69 | Replies: 1 | Retweets: 17

Hadas Orgad

@orgadhadas

6 months ago

A really interesting paper by Dana and others, dividing SAE features into two groups: input and output features. Output features are actually pretty useful for steering!

Likes: 22 | Replies: 0 | Retweets: 4

Hadas Orgad

@orgadhadas

6 months ago

I'm excited that our @ActInterp workshop at @icmlconf received over 150 submissions! We had to expand our reviewer pool to accommodate all submissions. I hope this reflects a growing interest in more actionable approaches to interpretability.

Likes: 20 | Replies: 0 | Retweets: 0

Yaniv Nikankin

@ynikankin

5 months ago

VLMs perform better when answering questions about text than when answering the same questions about images. But why, and how can we fix it? We investigate this gap from a mechanistic interpretability perspective, and use our findings to close a third of it! 🧵

Likes: 148 | Replies: 1 | Retweets: 25

Mor Geva

@megamor2

4 months ago

Going to #icml2025? Don't miss the Actionable Interpretability Workshop (@ActInterp)! We've got an amazing lineup of speakers, panelists, and papers, all focused on leveraging insights from interpretability research to tackle practical, real-world problems ✨

Likes: 43 | Replies: 1 | Retweets: 5

Zorik Gekhman

@zorikgekhman

4 months ago

Now accepted to #COLM2025! We formally define hidden knowledge in LLMs and show its existence in a controlled study. We even show that a model can know the answer yet fail to generate it in 1,000 attempts 😵 Looking forward to presenting and discussing our work in person.

Likes: 58 | Replies: 1 | Retweets: 14

Hadas Orgad

@orgadhadas

4 months ago

After a thunderstorm cancelled my flight, I finally made it to Vancouver for #ICML2025 and the @ActInterp workshop! DM if you want to chat about using interpretability for safer and more controllable AI. We’ll also present the Mech-Interp Benchmark (MIB) on Thu @ 11:00. Come by!

Likes: 19 | Replies: 0 | Retweets: 0

Hadas Orgad

@orgadhadas

4 months ago

Hope everyone’s getting the most out of #icml25. We’re excited and ready for the Actionable Interpretability (@ActInterp) workshop this Saturday! Check out the schedule and join us to discuss how we can move interpretability toward more practical impact.

Likes: 31 | Replies: 0 | Retweets: 5

Hadas Orgad

@orgadhadas

4 months ago

It's really fun to walk around a poster session and literally want to stop by each one!

Likes: 21 | Replies: 0 | Retweets: 0

Evžen

@evzen_wy

4 months ago

Crazy amount of cool work concentrated in one room

Likes: 15 | Replies: 0 | Retweets: 4

Aryaman Arora

@aryaman2020

4 months ago

maybe I will live tweet the actionable interp workshop panel

Likes: 101 | Replies: 11 | Retweets: 8

Tomer Ashuach

@tomerashuach

4 months ago

🎉 Presenting my poster today at #ACL2025! REVS: Unlearning Sensitive Info in LMs via Rank Editing. Come by to chat about unlearning memorized info without gradients. 🕥 10:30–12:00. With Yonatan Belinkov & Martin Tutek. 📄 Paper: arxiv.org/abs/2405.18100 🌐 Website: technion-cs-nlp.github.io/REVS/

Likes: 20 | Replies: 0 | Retweets: 8

Amir Zur

@amirzur2000

3 months ago

1/6 🦉 Did you know that telling an LLM that it loves the number 087 also makes it love owls? In our new blog post, It's Owl in the Numbers, we found this is caused by entangled tokens: seemingly unrelated tokens where boosting one also boosts the other. owls.baulab.info

Likes: 648 | Replies: 18 | Retweets: 70