Michael Toker (@michael_toker)'s Twitter Profile
Michael Toker

@michael_toker

PhD student @Technion NLP lab - Developing explainability methods to gain a better understanding of LLMs

ID: 1522869113181413378

Link: https://tokeron.github.io/ · Joined: 07-05-2022 09:21:40

91 Tweets

115 Followers

508 Following

Yonatan Belinkov (@boknilev)'s Twitter Profile Photo

Since people have been asking - the #blackboxNLP workshop will return this year, to be held with #emnlp2025. This workshop is all about interpreting and analyzing NLP models (and yes, this includes LLMs). More details soon, follow BlackboxNLP

Dana Arad 🎗️ (@dana_arad4)'s Twitter Profile Photo

Tried steering with SAEs and found that not all features behave as expected? Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
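As background on what "steering with SAEs" involves, here is a minimal sketch of activation steering with a single SAE feature, assuming a PyTorch transformer block and an SAE decoder implemented as nn.Linear(d_sae, d_model); the block, feature index, and scale below are illustrative placeholders, not the feature-selection method the preprint proposes.

```python
import torch

def steer_with_sae_feature(block, sae_decoder, feature_idx, alpha):
    """Add alpha * (unit-norm decoder direction of one SAE feature) to the
    residual-stream output of `block` via a forward hook. Returns the handle.
    Assumes sae_decoder is nn.Linear(d_sae, d_model), so weight is (d_model, d_sae)."""
    direction = sae_decoder.weight[:, feature_idx].detach()   # (d_model,)
    direction = direction / direction.norm()                  # unit-norm direction

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

    return block.register_forward_hook(hook)
```

The hard part the paper is about is which `feature_idx` (and `alpha`) to pick; the hook itself is the easy part, and calling `handle.remove()` on the returned handle restores the unsteered model.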

Nitay Calderon (@nitcal)'s Twitter Profile Photo

Preferences drive modern LLM research and development: from model alignment to evaluation. But how well do we understand them? Excited to share our new preprint: Multi-domain Explainability of Preferences arxiv.org/abs/2505.20088 (with Roi Reichart and Liat Ein-Dor) 🧵👇 1/11

Yaniv Nikankin (@ynikankin)'s Twitter Profile Photo

VLMs perform better when answering questions about text than when answering the same questions about images - but why? and how can we fix it? We investigate this gap from a mechanistic interpretability perspective, and use our findings to close a third of it! 🧵

Yonatan Belinkov (@boknilev)'s Twitter Profile Photo

After discussing the LLaDA paper today, we were wondering: is this just BERT-style masked language modeling? Main differences seem to be: (a) training with varying masking budgets; (b) inference with gradual unmasking determined by confidence. arxiv.org/abs/2502.09992
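To make point (b) concrete, here is a rough sketch of confidence-based gradual unmasking, assuming a Hugging Face-style model whose output exposes `.logits` and a reserved `mask_id`; this illustrates the idea in the tweet, not the paper's exact sampler, and `per_step` is a placeholder.

```python
import torch

@torch.no_grad()
def unmask_by_confidence(model, tokens, mask_id, per_step=4):
    """Start from a (partly) masked sequence and repeatedly commit the model's
    most confident predictions, `per_step` positions at a time."""
    tokens = tokens.clone()                                    # shape (seq_len,)
    while True:
        masked = (tokens == mask_id).nonzero(as_tuple=True)[0]
        if masked.numel() == 0:
            return tokens
        logits = model(tokens.unsqueeze(0)).logits[0]          # (seq_len, vocab)
        conf, preds = logits[masked].softmax(-1).max(-1)       # per masked position
        top = conf.topk(min(per_step, masked.numel())).indices
        tokens[masked[top]] = preds[top]                       # keep the most confident
```

Point (a), the varying masking budget, would show up in training rather than in this inference loop.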

Zorik Gekhman (@zorikgekhman)'s Twitter Profile Photo

Now accepted to #COLM2025! We formally define hidden knowledge in LLMs and show its existence in a controlled study. We even show that a model can know the answer yet fail to generate it in 1,000 attempts 😵 Looking forward to presenting and discussing our work in person.
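As a rough illustration of the "knows the answer yet fails to generate it" claim (not the paper's formal definition of hidden knowledge), one crude probe compares the likelihood the model assigns to the gold answer with what repeated sampling actually produces; the helper below assumes a Hugging Face-style causal LM and tokenizer, and all names are placeholders.

```python
import torch

@torch.no_grad()
def answer_logprob(model, tokenizer, prompt, answer):
    """Total log-probability the model assigns to `answer` continuing `prompt`
    (assumes the prompt/answer boundary tokenizes cleanly)."""
    ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    n_answer = len(tokenizer(answer, add_special_tokens=False).input_ids)
    logprobs = model(ids).logits[:, :-1].log_softmax(-1)       # next-token log-probs
    token_lp = logprobs.gather(-1, ids[:, 1:, None]).squeeze(-1)
    return token_lp[0, -n_answer:].sum().item()                # score of the answer tokens
```

If the gold answer outscores distractors under this score while none of, say, 1,000 samples from `model.generate()` contain it, that is the flavor of mismatch the tweet describes.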

Itay Itzhak (@itay_itzhak_)'s Twitter Profile Photo

🚨New paper alert🚨 🧠 Instruction-tuned LLMs show amplified cognitive biases — but are these new behaviors, or pretraining ghosts resurfacing? Excited to share our new paper, accepted to CoLM 2025🎉! See thread below 👇 #BiasInAI #LLMs #MachineLearning #NLProc

Yonatan Belinkov (@boknilev)'s Twitter Profile Photo

BlackboxNLP is the workshop on interpreting and analyzing NLP models (including LLMs, VLMs, etc.). We accept full papers and extended abstracts. The workshop is highly attended; great exposure for your finished work or feedback on work in progress. #emnlp2025 at Suzhou, China!

David Bau (@davidbau)'s Twitter Profile Photo

Announcing a deep net interpretability talk series! Every week you will find new talks on recent research in the science of neural networks. The first few are posted: @jack_merulllo_, Roy Rinberg, and me. At the NDIF YouTube channel: youtube.com/@NDIFTeam.

David Bau (@davidbau)'s Twitter Profile Photo

At the #Neurips2025 mechanistic interpretability workshop I gave a brief talk about Venetian glassmaking, since I think we face a similar moment in AI research today. Here is a blog post summarizing the talk: davidbau.com/archives/2025/…

Yonatan Belinkov (@boknilev)'s Twitter Profile Photo

Interested in changes in perception of the term NLP vs LLMs. Which statement do you agree with?
- NLP and LLMs are just different things these days
- LLMs are a subset of NLP
- NLP is deprecated due to LLMs