Iván Arcuschin (@ivanarcus)'s Twitter Profile
Iván Arcuschin

@ivanarcus

Independent Researcher | AI Safety & Software Engineering

ID: 271255198

Website: http://iarcuschin.com | Joined: 24-03-2011 04:32:28

53 Tweets

235 Followers

170 Following

Clément Dumas (at ICLR) (@butanium_)'s Twitter Profile Photo

New paper w/ Julian Minder & Neel Nanda! What do chat LLMs learn in finetuning?

Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders.

This finds interpretable and causal chat-only features! 🧵
Aaron Mueller (@amuuueller)'s Twitter Profile Photo

Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?

We propose 😎 𝗠𝗜𝗕: a Mechanistic Interpretability Benchmark!
Iván Arcuschin (@ivanarcus)'s Twitter Profile Photo

🚀 Excited to announce the launch of the AISAR Scholarship, a new initiative to promote AI Safety research in Argentina! 🇦🇷

Together with Agustín Martinez Suñé, we've created this program to support both established Argentine researchers and emerging talent, encouraging
Mikhail Terekhov (@miterekhov)'s Twitter Profile Photo

AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on cost. Our new paper introduces the Control Tax: how much does it cost to run the control protocols? (1/8) 🧵
Julian Minder (@jkminder)'s Twitter Profile Photo

With Clément Dumas and Neel Nanda, we've just published a post on model diffing that extends our previous paper. Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
Fazl Barez (@fazlbarez)'s Twitter Profile Photo

Excited to share our paper: "Chain-of-Thought Is Not Explainability"!

We unpack a critical misconception in AI: models explaining their Chain-of-Thought (CoT) steps aren't necessarily revealing their true reasoning. Spoiler: transparency of CoT can be an illusion. (1/9) 🧵
David Lindner (@davlindner)'s Twitter Profile Photo

Can frontier models hide secret information and reasoning in their outputs?

We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵
Julian Minder (@jkminder)'s Twitter Profile Photo

Can we interpret what happens in finetuning? Yes, if it's on a narrow domain! Narrow fine-tuning leaves traces behind. By comparing activations before and after fine-tuning, we can interpret these, even with an agent! We interpret subliminal learning, emergent misalignment, and more.
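The "comparing activations before and after fine-tuning" idea from the tweet above can be illustrated with a minimal sketch. This is not code from the paper or post; it uses purely synthetic NumPy data, where a fake "fine-tune" shifts activations along one hypothetical direction, and the diff step recovers it by averaging over prompts:

```python
import numpy as np

# Hypothetical activations: (n_prompts, d_model) residual-stream vectors
# collected at the same layer from a base model and its fine-tuned version.
rng = np.random.default_rng(0)
base_acts = rng.normal(size=(128, 64))

# Simulate a narrow fine-tune that leaves a "trace": a constant shift
# along a single (assumed) direction in activation space.
trace_direction = np.zeros(64)
trace_direction[7] = 1.0
ft_acts = base_acts + 0.5 * trace_direction

# The diffing step: averaging the per-prompt difference cancels
# prompt-specific noise and isolates the fine-tuning shift.
mean_diff = (ft_acts - base_acts).mean(axis=0)
top_dim = int(np.argmax(np.abs(mean_diff)))
print(top_dim)  # the dimension carrying the simulated trace
```

In practice the interesting directions are not axis-aligned and the diff would be fed to interpretability tooling (or an agent, as the tweet mentions) rather than a simple argmax, but the core operation — contrasting paired activations across the two checkpoints — is the same.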