Gabriele Sarti (@gsarti_) 's Twitter Profile
Gabriele Sarti

@gsarti_

PhD Student @GroNLP 🐮, core dev of @InseqLib (inseq.org). Interpretability ∩ HCI ∩ #NLProc. Prev: @AmazonScience, @Aindo_AI, @ItaliaNLP_Lab.

ID: 925913081032650752

Link: http://gsarti.com · Joined: 02-11-2017 02:30:55

1.1K Tweets

2.2K Followers

1.1K Following

Paul Bogdan (@paulcbogdan) 's Twitter Profile Photo

New paper: What happens when an LLM reasons? We created methods to interpret reasoning steps & their connections: resampling CoT, attention analysis, & suppressing attention. We discover thought anchors: key steps shaping everything else. Check out our tool & unpack CoT yourself 🧵
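
For context on the resampling idea mentioned above: one generic way to measure how much a single reasoning step shapes the final answer is to resample continuations with and without that step and check how often the answer changes. The sketch below only illustrates that general recipe and is not the paper's method or code; `generate` and `answers_agree` are hypothetical stand-ins for a sampling call and an answer-matching check.

```python
from typing import Callable, List

def step_importance(
    question: str,
    cot_steps: List[str],                        # chain-of-thought split into steps
    step_idx: int,                               # index of the step to test
    generate: Callable[[str], str],              # hypothetical: prompt -> final answer
    answers_agree: Callable[[str, str], bool],   # hypothetical answer comparison
    n_samples: int = 20,
) -> float:
    """Fraction of resampled runs whose answer changes when one step is dropped.

    Generic counterfactual-resampling importance; a step that strongly shapes
    the rest of the reasoning ("anchor"-like behavior) scores close to 1.
    """
    with_step = question + "\n" + "\n".join(cot_steps[: step_idx + 1])
    without_step = question + "\n" + "\n".join(cot_steps[:step_idx])
    changed = 0
    for _ in range(n_samples):
        a = generate(with_step)       # continuation conditioned on the step
        b = generate(without_step)    # continuation resampled without it
        if not answers_agree(a, b):
            changed += 1
    return changed / n_samples
```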

Gabriele Sarti (@gsarti_) 's Twitter Profile Photo

Interested in applying MI methods for circuit finding or causal variable localization? 🔎 Check out our shared task at BlackboxNLP, co-located with EMNLP 2025. Deadline for submissions: August 1st!

David Bau (@davidbau) 's Twitter Profile Photo

Noam Brown OpenAI It depends on what you mean by "great research". In industry, "great research" means ideas that lead to great products.

In academia, great research is great *teaching*. That gets to the heart of the difference between industry and academia.

My take: davidbau.com/archives/2025/…
Koyena Pal (@kpal_koyena) 's Twitter Profile Photo

🚨 Registration is live! 🚨

The New England Mechanistic Interpretability (NEMI) Workshop is happening August 22nd 2025 at Northeastern University!

A chance for the mech interp community to nerd out on how models really work 🧠🤖

🌐 Info: nemiconf.github.io/summer25/
📝 Register:
BlackboxNLP (@blackboxnlp) 's Twitter Profile Photo

One month to go! ⏰
Working on featurization methods - ways to transform LM activations to better isolate causal variables?
Submit your work to the Causal Variable Localization Track of the MIB Shared Task!
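
For readers outside the shared task: a "featurizer" here usually means an (ideally invertible) transformation of a hidden state into a feature space where a causal variable is concentrated in a few dimensions, so those dimensions can be swapped between two runs. Below is a minimal PyTorch sketch in that spirit; the class and function names are illustrative, and this is not the shared task's required interface.

```python
import torch
import torch.nn as nn

class LinearFeaturizer(nn.Module):
    """Invertible (orthogonal) linear map between hidden space and feature space."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Orthogonal parametrization keeps the map trivially invertible.
        self.rotation = nn.utils.parametrizations.orthogonal(
            nn.Linear(hidden_size, hidden_size, bias=False)
        )

    def featurize(self, h: torch.Tensor) -> torch.Tensor:
        return self.rotation(h)                  # z = h @ W^T

    def invert(self, z: torch.Tensor) -> torch.Tensor:
        return z @ self.rotation.weight          # W orthogonal, so this undoes featurize


def interchange(featurizer: LinearFeaturizer,
                h_base: torch.Tensor,
                h_source: torch.Tensor,
                dims: list[int]) -> torch.Tensor:
    """Swap the selected feature dimensions of the base run with the source run."""
    z_base = featurizer.featurize(h_base).clone()
    z_source = featurizer.featurize(h_source)
    z_base[..., dims] = z_source[..., dims]      # intervene on the candidate variable
    return featurizer.invert(z_base)


# Toy usage: swap the first 16 feature dimensions between two hidden states.
feat = LinearFeaturizer(hidden_size=768)
h_base, h_source = torch.randn(1, 768), torch.randn(1, 768)
patched = interchange(feat, h_base, h_source, dims=list(range(16)))
```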
BlackboxNLP (@blackboxnlp) 's Twitter Profile Photo

⏳ Three weeks left! Submit your work to the MIB Shared Task at #BlackboxNLP, co-located with EMNLP 2025.

Whether you're working on circuit discovery or causal variable localization, this is your chance to benchmark your method in a rigorous setup!
BlackboxNLP (@blackboxnlp) 's Twitter Profile Photo

Just 10 days to go until the results submission deadline for the MIB Shared Task at #BlackboxNLP!

If you're working on:
🧠 Circuit discovery
🔍 Feature attribution
🧪 Causal variable localization
now’s the time to polish and submit!

Join us on Discord: discord.gg/n5uwjQcxPR
Helena Casademunt (@hcasademunt) 's Twitter Profile Photo

Problem: Train LLM on insecure code → it becomes broadly misaligned
Solution: Add safety data? What if you can't?

Use interpretability! We remove misaligned concepts during finetuning to steer OOD generalization

We reduce emergent misalignment 10x w/o modifying training data
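
One generic way to implement "removing a concept during finetuning" (not necessarily what this paper does) is directional ablation: project a known concept direction out of a layer's activations with a forward hook, so finetuning gradients cannot route behavior through it. A rough PyTorch sketch, where the layer path and `concept_dir` are assumptions:

```python
import torch

def make_ablation_hook(concept_dir: torch.Tensor):
    """Forward hook removing the component of a layer's output along one
    unit-norm 'concept' direction (generic directional ablation, illustrative only)."""
    d = concept_dir / concept_dir.norm()

    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h - (h @ d).unsqueeze(-1) * d        # zero out the concept component
        return (h, *output[1:]) if isinstance(output, tuple) else h

    return hook

# Hypothetical usage with a HuggingFace-style decoder (layer index is illustrative):
# handle = model.model.layers[12].register_forward_hook(make_ablation_hook(direction))
# ...run finetuning as usual; later: handle.remove()
```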
Hosein Mohebbi (@hmohebbi75) 's Twitter Profile Photo

I’ll be on the job market in early 2026, looking for research scientist or academic roles in NLP/Speech. I’ll be at #ACL2025 & giving a tutorial on #interpretability at #Interspeech2025; I’d love to chat & connect if there are any opportunities!🤗 Website: hmohebbi.github.io 🧵

BlackboxNLP (@blackboxnlp) 's Twitter Profile Photo

📝 Technical report guidelines are out!

If you're submitting to the MIB Shared Task at #BlackboxNLP, feel free to take a look to help you prepare your report: blackboxnlp.github.io/2025/task/
BlackboxNLP (@blackboxnlp) 's Twitter Profile Photo

Results deadline extended by one week!
Following requests from participants, we’re extending the MIB Shared Task submission deadline by one week.

🗓️ New deadline: August 8, 2025
Submit your method via the MIB leaderboard!
Emmanuel Ameisen (@mlpowered) 's Twitter Profile Photo

Earlier this year, we showed a method to interpret the intermediate steps a model takes to produce an answer.

But we were missing a key bit of information: explaining why the model attends to specific concepts.

Today, we do just that 🧵
neuronpedia (@neuronpedia) 's Twitter Profile Photo

Today, we're releasing The Circuit Analysis Research Landscape: an interpretability post extending & open-sourcing Anthropic's circuit tracing work, co-authored by Paul Jankura, Google DeepMind, Goodfire, EleutherAI, and Decode Research. Here's a quick demo, details follow: ⤵️

BlackboxNLP (@blackboxnlp) 's Twitter Profile Photo

The report deadline was also extended to August 10th! Note that this is the final extension. We look forward to reading your reports! ✍️

Christopher Potts (@chrisgpotts) 's Twitter Profile Photo

For a Goodfire/Anthropic meet-up later this month, I wrote a discussion doc: Assessing skeptical views of interpretability research. Spoiler: it's an incredible moment for interpretability research. The skeptical views sound like a call to action to me. Link just below.

Zhijing Jin✈️ ICLR Singapore (@zhijingjin) 's Twitter Profile Photo

Our "Competitions of Mechanisms" paper proposes an interesting way to interpret LLM behaviors through how the model handles multiple conflicting mechanisms, e.g., in-context knowledge vs. in-weights knowledge 🧐 This is an elegant philosophical way of thinking --

Gabriele Sarti (@gsarti_) 's Twitter Profile Photo

TIL Ken Liu predicted an eerily familiar setting featuring OpenAI and sama-like characters + US-China race dynamics in his short story "The Perfect Match" from 2012.