Adrià Garriga-Alonso (@adrigarriga) 's Twitter Profile
Adrià Garriga-Alonso

@adrigarriga

Research Scientist at FAR AI (@farairesearch), making friendly AI.

ID: 2364709375

Website: https://agarri.ga/ · Joined: 27-02-2014 22:20:59

1.1K Tweets

972 Followers

594 Following

bilal 🇵🇸 (@bilalchughtai_) 's Twitter Profile Photo

New paper! We discuss open problems in:
- methods and foundations of mech interp
- applications of mech interp towards scientific and engineering goals
- sociotechnical aspects of mech interp

FAR.AI (@farairesearch) 's Twitter Profile Photo

1/ Safety guardrails are illusory. DeepSeek R1’s advanced reasoning can be converted into an "evil twin": just as powerful, but with safety guardrails stripped away. The same applies to GPT-4o, Gemini 1.5 & Claude 3. How can we ensure AI maximizes benefits while minimizing harm?

Apollo Research (@apolloaievals) 's Twitter Profile Photo

Can you tell when an LLM is lying by looking at its activations? Are simple methods good enough? We recently investigated whether linear probes can detect when Llama is being deceptive. 🧵
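
A linear probe here is just a linear classifier trained directly on the model's hidden activations. A minimal sketch of the approach (not Apollo's code; the activations, labels, and hidden size below are placeholders) might look like this:

```python
# Minimal sketch: train a linear "deception probe" on residual-stream
# activations. X and y are random placeholders standing in for activations
# collected on honest vs. deceptive model responses.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 4096                             # hypothetical hidden size
X = rng.normal(size=(1000, d_model))       # placeholder activations
y = rng.integers(0, 2, size=1000)          # placeholder labels: 0 = honest, 1 = deceptive

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000)  # the "simple method": one linear layer
probe.fit(X_train, y_train)
print("held-out probe accuracy:", probe.score(X_test, y_test))
```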

davidad 🎇 (@davidad) 's Twitter Profile Photo

If you’re working on evals for safety reasons, be aware that for labs who have ascended to the pure-RL-from-final-answer-correctness stage of the LLM game, high-quality evals are now the main bottleneck on capabilities growth.

David Krueger (@davidskrueger) 's Twitter Profile Photo

ORAL 1) Interpreting Emergent Planning in Model-Free Reinforcement Learning, Tom Bush et al.

We provide the first mechanistic evidence that model-free RL based on deep networks can learn a genuine planning algorithm, which expands multiple plan fragments in parallel.

Yoshua Bengio (@yoshua_bengio) 's Twitter Profile Photo

Early signs of deception, cheating & self-preservation in top-performing reasoning models are extremely worrisome. We don't know how to guarantee AI won't exhibit undesired behavior in pursuit of its goals, and this must be addressed before deploying powerful autonomous agents.

OpenAI (@openai) 's Twitter Profile Photo

Detecting misbehavior in frontier reasoning models

Chain-of-thought (CoT) reasoning models “think” in natural language understandable by humans. Monitoring their “thinking” has allowed us to detect misbehavior such as subverting tests in coding tasks, deceiving users, or giving

Robert Long (@rgblong) 's Twitter Profile Photo

Dario is absolutely right that there is plenty of concrete work that we can and should do to prepare for potential AI welfare (regardless of what you think about the specific "quit button" idea). Glad to see Anthropic speaking openly about AI consciousness and welfare.

Neel Nanda (@neelnanda5) 's Twitter Profile Photo

Reading an LLM's chain of thought is an obvious safety strategy. It's known that CoT isn't always reliable, but so far only via pretty contrived scenarios. For the first time, we show that CoT can be unfaithful on realistic prompts! CoT monitoring remains useful, but we must be careful.

Judd Rosenblatt — d/acc (@juddrosenblatt) 's Twitter Profile Photo

Turns out that Self-Other Overlap (SOO) fine-tuning drastically reduces deceptive behavior in language models—without sacrificing performance. SOO aligns an AI’s internal representations of itself and others. We think this could be crucial for AI alignment...🧵

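As a rough sketch of the general idea only (not the authors' implementation; the prompt pair, pooling, and loss weighting are invented for illustration), an SOO-style auxiliary loss can penalize the distance between the model's hidden states on matched self- and other-referencing prompts:

```python
# Hedged sketch of a Self-Other Overlap style auxiliary loss in PyTorch.
# Assumes a HuggingFace-style causal LM; the prompts and the pooling choice
# are illustrative, not taken from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

self_prompt = "I will be given the key to the room."
other_prompt = "The user will be given the key to the room."

def pooled_hidden(prompt):
    ids = tok(prompt, return_tensors="pt")
    out = model(**ids)
    return out.hidden_states[-1].mean(dim=1)  # mean-pool final-layer states

# Auxiliary "overlap" loss: push self- and other-representations together.
soo_loss = torch.nn.functional.mse_loss(pooled_hidden(self_prompt), pooled_hidden(other_prompt))
total_loss = soo_loss  # in fine-tuning this would be added to the usual LM loss
total_loss.backward()
```
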
Adrià Garriga-Alonso (@adrigarriga) 's Twitter Profile Photo

This is great! More people able to interact with models, understand how they work, modify and improve them, see how they fail, and train them to be nice. Thanks to DeepSeek (and hopefully soon OpenAI) it's possible to do great safety research outside of frontier labs.

Tom Bush ✈️ ICLR2025 (@_tom_bush) 's Twitter Profile Photo

🤖 !! Model-free agents can internally plan !! 🤖 In our ICLR 2025 paper, we interpret a model-free RL agent and show that it internally performs a form of planning resembling bidirectional search.

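For readers who haven't met the term, bidirectional search grows partial paths from both the start state and the goal state and stops once the two frontiers meet. A toy graph version (purely illustrative; not the paper's agent or probing code):

```python
# Toy bidirectional BFS on an undirected graph: expand frontiers from both
# the start and the goal until they meet, then stitch the two half-paths.
from collections import deque

def bidirectional_search(graph, start, goal):
    if start == goal:
        return [start]
    parents_s, parents_g = {start: None}, {goal: None}
    frontier_s, frontier_g = deque([start]), deque([goal])

    def expand(frontier, parents, other_parents):
        node = frontier.popleft()
        for nxt in graph[node]:
            if nxt not in parents:
                parents[nxt] = node
                frontier.append(nxt)
                if nxt in other_parents:   # the two frontiers have met
                    return nxt
        return None

    while frontier_s and frontier_g:
        meet = expand(frontier_s, parents_s, parents_g) or \
               expand(frontier_g, parents_g, parents_s)
        if meet is not None:
            # walk back to the start, then forward to the goal
            path, n = [], meet
            while n is not None:
                path.append(n)
                n = parents_s[n]
            path.reverse()
            n = parents_g[meet]
            while n is not None:
                path.append(n)
                n = parents_g[n]
            return path
    return None

graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(bidirectional_search(graph, "A", "D"))  # ['A', 'B', 'C', 'D']
```
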
Adrià Garriga-Alonso (@adrigarriga) 's Twitter Profile Photo

Agree with this analysis - looks like the nonprofit will become one large shareholder of the PBC among many. I'm not sure what "continue to be controlled by the current nonprofit" means then -- voting and nonvoting stock? This wouldn't really match the highlighted passage.

Rob Wiblin (@robertwiblin) 's Twitter Profile Photo

But worse, the nonprofit could still lose true control. Right now, the nonprofit owns and directly controls the for-profit's day-to-day operations. If the nonprofit's "control" over the PBC is just extra voting shares, that would be a massive downgrade as I'll explain. 5/

Adrià Garriga-Alonso (@adrigarriga) 's Twitter Profile Photo

Very proud of this work with David Chanin and Tomáš Dulka, who remain the foremost experts at diagnosing concrete problems with SAEs - offering hope for a solution. If the SAE has fewer features than there really are, it does not completely disentangle them, because leaving them entangled is what reduces MSE loss.
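
For context, a sparse autoencoder (SAE) is trained to reconstruct model activations through a wide dictionary of features under a sparsity penalty; the failure mode above shows up when the dictionary is too narrow. A generic training-step sketch (illustrative sizes and penalty, not the paper's code):

```python
# Minimal sparse autoencoder sketch in PyTorch: reconstruct activations x
# through a dictionary of n_features directions with an L1 sparsity penalty.
# If n_features is smaller than the number of true underlying features,
# minimizing MSE pushes the SAE to merge (entangle) several of them.
import torch
import torch.nn as nn

d_model, n_features = 512, 1024            # illustrative sizes

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, n_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))    # sparse feature activations
        return self.decoder(f), f

sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                            # illustrative sparsity weight

x = torch.randn(64, d_model)               # placeholder batch of activations
x_hat, f = sae(x)
loss = nn.functional.mse_loss(x_hat, x) + l1_coeff * f.abs().mean()
opt.zero_grad()
loss.backward()
opt.step()
```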