Iván Arcuschin (@ivanarcus)'s Twitter Profile
Iván Arcuschin

@ivanarcus

Independent Researcher | AI Safety & Software Engineering

ID: 271255198

Website: http://iarcuschin.com | Joined: 24-03-2011 04:32:28

53 Tweets

235 Followers

170 Following

Clément Dumas (at ICLR) (@butanium_)'s Twitter Profile Photo

New paper w/ Julian Minder & Neel Nanda! What do chat LLMs learn in finetuning?

Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders.

This finds interpretable and causal chat-only features! 🧵
Aaron Mueller (@amuuueller)'s Twitter Profile Photo

Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?

We propose 😎 𝗠𝗜𝗕: a Mechanistic Interpretability Benchmark!
Iván Arcuschin (@ivanarcus)'s Twitter Profile Photo

🚀 Excited to announce the launch of the AISAR Scholarship, a new initiative to promote AI Safety research in Argentina! 🇦🇷

Together with Agustín Martinez Suñé, we've created this program to support both established Argentine researchers and emerging talent, encouraging
Mikhail Terekhov (@miterekhov)'s Twitter Profile Photo

AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on cost. Our new paper introduces the Control Tax: how much does it cost to run the control protocols? (1/8) 🧵
Julian Minder (@jkminder)'s Twitter Profile Photo

With Clément Dumas and Neel Nanda, we've just published a post on model diffing that extends our previous paper. Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
Fazl Barez (@fazlbarez)'s Twitter Profile Photo

Excited to share our paper: "Chain-of-Thought Is Not Explainability"!

We unpack a critical misconception in AI: models explaining their Chain-of-Thought (CoT) steps aren't necessarily revealing their true reasoning. Spoiler: transparency of CoT can be an illusion. (1/9) 🧵
David Lindner (@davlindner)'s Twitter Profile Photo

Can frontier models hide secret information and reasoning in their outputs?

We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵
Julian Minder (@jkminder)'s Twitter Profile Photo

Can we interpret what happens in finetuning? Yes, if it's on a narrow domain! Narrow fine-tuning leaves traces behind. By comparing activations before and after fine-tuning, we can interpret these, even with an agent! We interpret subliminal learning, emergent misalignment, and more.
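The "comparing activations before and after fine-tuning" idea from the tweet above can be illustrated with a minimal sketch. This is not code from the paper or post; it uses purely synthetic NumPy data, where a fake "fine-tune" shifts activations along one hypothetical direction, and the diff step recovers it by averaging over prompts:

```python
import numpy as np

# Hypothetical activations: (n_prompts, d_model) residual-stream vectors
# collected at the same layer from a base model and its fine-tuned version.
rng = np.random.default_rng(0)
base_acts = rng.normal(size=(128, 64))

# Simulate a narrow fine-tune that leaves a "trace": a constant shift
# along a single (assumed) direction in activation space.
trace_direction = np.zeros(64)
trace_direction[7] = 1.0
ft_acts = base_acts + 0.5 * trace_direction

# The diffing step: averaging the per-prompt difference cancels
# prompt-specific noise and isolates the fine-tuning shift.
mean_diff = (ft_acts - base_acts).mean(axis=0)
top_dim = int(np.argmax(np.abs(mean_diff)))
print(top_dim)  # the dimension carrying the simulated trace
```

In practice the interesting directions are not axis-aligned and the diff would be fed to interpretability tooling (or an agent, as the tweet mentions) rather than a simple argmax, but the core operation — contrasting paired activations across the two checkpoints — is the same.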