Itay Nakash (@itay__nakash) 's Twitter Profile
Itay Nakash

@itay__nakash

IBM Research | AI Safety | Agents, LLMs & NLP

ID: 1544329316573581315

https://itay-nakash.github.io/ · Joined: 05-07-2022 14:36:36

73 Tweets

127 Followers

363 Following

Eran Hirsch (@hirscheran) 's Twitter Profile Photo

🚨 Introducing LAQuer, accepted to #ACL2025 (main conf)!

LAQuer provides more granular attribution for LLM generations: users can just highlight any output fact (top), and get attribution for that input snippet (bottom). This reduces the amount of text the user has to read by 2
Judd Rosenblatt — d/acc (@juddrosenblatt) 's Twitter Profile Photo

Current AI “alignment” is just a mask

Our findings in The Wall Street Journal explore the limitations of today’s alignment techniques and what’s needed to get AI right 🧵
Zorik Gekhman (@zorikgekhman) 's Twitter Profile Photo

Now accepted to #COLM2025! We formally define hidden knowledge in LLMs and show its existence in a controlled study. We even show that a model can know the answer yet fail to generate it in 1,000 attempts 😵 Looking forward to presenting and discussing our work in person.

Tal Haklay (@tal_haklay) 's Twitter Profile Photo

🚨Meet our panelists at the Actionable Interpretability Workshop at ICML 2025!

Join us July 19 at 4pm for a panel on making interpretability research actionable, its challenges, and how the community can drive greater impact.
Naomi Saphra, Samuel Marks, Kyle Lo, Fazl Barez
Itay Itzhak (@itay_itzhak_) 's Twitter Profile Photo

🚨New paper alert🚨

🧠 Instruction-tuned LLMs show amplified cognitive biases — but are these new behaviors, or pretraining ghosts resurfacing?

Excited to share our new paper, accepted to CoLM 2025🎉!
See thread below 👇
#BiasInAI #LLMs #MachineLearning #NLProc
Nitay Calderon (@nitcal) 's Twitter Profile Photo

Everyone uses LLMs to annotate data or evaluate models in their research.

But how can we convince others (readers, collaborators, reviewers!!!) that LLMs are reliable? 🤖

Here’s a simple (and low-effort) solution: show the LLM is a *comparable alternative annotator* ✅
Asaf Yehudai (@asafyehudai) 's Twitter Profile Photo

🚨 Benchmarks tell us which model is better — but not why it fails.

For developers, this means tedious, manual error analysis. We're bridging that gap.

Meet CLEAR: an open-source tool for actionable error analysis of LLMs.

🧵👇
Tomer Ashuach (@tomerashuach) 's Twitter Profile Photo

🚨 New preprint out!

CRISP: Persistent Concept Unlearning via SAEs
LLMs often encode knowledge we want to remove.

CRISP enables persistent, interpretable, precise unlearning while keeping models useful & coherent—tested on bio & cyber safety tasks🧵👇
📄arxiv.org/abs/2508.13650
Itay Nakash (@itay__nakash) 's Twitter Profile Photo

Excited to share that our work has been accepted to the #EMNLP2025 Main Conference 🥳 Heading to EMNLP? If you’re interested in agent security, safety, or jailbreaking LLMs - let’s talk!

Nitay Calderon (@nitcal) 's Twitter Profile Photo

🥳🥳
Happy to share that we have three papers accepted to EMNLP 2025 🇨🇳 (2 main, 1 findings)! 

What makes this special is that all three belong to a new research line I began last year:
LLM-as-a-judge/LLM-as-an-annotator
🤖🧑‍⚖️
Sara Rosenthal (@seirasto) 's Twitter Profile Photo

📣📣Presenting our platform used to build MTRAG!!

RAGAPHENE: A RAG Annotation Platform with Human ENhancements and Edits
Arxiv: arxiv.org/abs/2508.19272
MTRAG GitHub: github.com/IBM/mt-rag-ben…
Join our MTRAGEval Task: ibm.github.io/mt-rag-benchma…
Kshitij Fadnis, Maeda Hanafi, Marina Danilevsky

Uri Berger (@uriberger88) 's Twitter Profile Photo

Heading to #EMNLP2025! 🎉 Two of our papers will be there — come say hi 👋

🖼️ Image Captioning Evaluation — Nov 5, 17:45 📄 arxiv.org/abs/2408.04909
🕵️ Deceptive LLM Agents (Mafia Game) — Nov 5, 13:00 📄 arxiv.org/abs/2506.05309

Itay Nakash (@itay__nakash) 's Twitter Profile Photo

Heading to #EMNLP2025: Suzhou, here we come! 🌟

Presenting our work on multi-agent attacks against policy-adherent agents -> showing why naive evaluation isn’t enough!
📅 Thu, Nov 6 | 10:30–12:00 [Session 7]
Interested in AI Security, Safety, or Agent Evaluation? Let’s chat!

Nitay Calderon (@nitcal) 's Twitter Profile Photo

Excited to be at #EMNLP2025 in Suzhou 🇨🇳! I’ll present three papers, and I'm happy to chat about any of these works!

🏅 "Multi-Domain Explainability of Preferences" - Oral, Interpretability 2, Nov 5, 17:30 (A104-105) w/
<a href="/roireichart/">Roi Reichart</a> Liat Ein-Dor

🧠 "Dementia Through Different