Owain Evans (@owainevans_uk)'s Twitter Profile
Owain Evans

@owainevans_uk

Runs an AI Safety research group in Berkeley (Truthful AI) + Affiliate at UC Berkeley. Past: Oxford Uni, TruthfulQA, Reversal Curse. Prefer email to DM.

ID: 1247872005912891392

Link: https://owainevans.github.io/ · Joined: 08-04-2020 13:01:26

5.5K Tweets

11.11K Followers

322 Following

Mikita Balesni 🇺🇦 (@balesni) 's Twitter Profile Photo

The puzzle:
* Synthetic + real fact: ✓ works
* Synthetic + synthetic: ✗ fails
* Synthetic facts in same training document or in-context: ✓ works
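
For intuition, here is a minimal, hypothetical sketch of the kind of synthetic two-hop setup the thread is describing (entities, templates, and field names are my own placeholders, not the paper's data):

```python
# Hypothetical illustration of a synthetic two-hop setup (made-up entities and
# templates, not the paper's data). Fact 1 and fact 2 go into *separate*
# training documents; the question requires composing them without chain of
# thought. Per the thread, this works when the second fact is real or when both
# facts share a document / appear in-context, but fails for synthetic+synthetic.

FIRST_HOP = "The spouse of {person} is {bridge}."
SECOND_HOP = "{bridge} was born in {city}."
TWO_HOP_QUESTION = "In which city was the spouse of {person} born? Answer with just the city."

triples = [
    ("Alar Venk", "Kelan Orst", "Tallinn"),
    ("Mira Toshel", "Sude Marnen", "Lisbon"),
    ("Dovan Reith", "Ilya Prowse", "Osaka"),
]

dataset = [
    {
        "train_doc_1": FIRST_HOP.format(person=p, bridge=b),
        "train_doc_2": SECOND_HOP.format(bridge=b, city=c),
        "question": TWO_HOP_QUESTION.format(person=p),
        "answer": c,
    }
    for p, b, c in triples
]

for item in dataset:
    print(item["question"], "->", item["answer"])
```
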
Mikita Balesni 🇺🇦 (@balesni) 's Twitter Profile Photo

This provides a cautionary tale for studying LLM latent reasoning. Success on real-world prompts ≠ robust latent reasoning; it might reflect co-occurrence in pretraining. Failure on synthetic two-hop ≠ inability to reason; synthetically learned facts can differ from naturally learned ones.

Tomek Korbak (@tomekkorbak) 's Twitter Profile Photo

How good are LLMs at reasoning without chain of thought? And what are the right (and wrong…) ways of addressing this question? In our new paper, we suggest a few answers!

Rob Wiblin (@robertwiblin) 's Twitter Profile Photo

AIs sometimes blackmail or sabotage shutdown mechanisms if told they're going to be turned off. Neel Nanda's team at Google DeepMind investigated and found in this case it wasn't at all what it seemed. x.com/robertwiblin/s…

Ethan Perez (@ethanjperez) 's Twitter Profile Photo

This team at UK AISI was incredible at discovering jailbreaking techniques for our prototype defenses. Their attacks helped us to make some critical decisions and achieve the robustness that we did for mitigating CBRN risks on Opus 4. They’re now hiring!

Andy Arditi (@andyarditi) 's Twitter Profile Photo

We found "misaligned persona" features in Llama and Qwen that mediate emergent misalignment. Fine-tuning on bad medical advice strengthens these pre-existing features, causing broader undesirable behavior. lesswrong.com/posts/NCWiR8K8…

Ryan Kidd (@ryan_kidd44) 's Twitter Profile Photo

MATS is hiring world-class researchers, managers, generalists, and more to help grow our AI safety & security talent pipeline! Apply by Oct 17 for a Dec 1 start.

Owain Evans (@owainevans_uk) 's Twitter Profile Photo

Adding vision & audio doesn't really improve the reasoning abilities of LLMs (at least w/ current techniques). Language is all you need for reasoning. Maybe this is roughly true for humans also. (Humans need another modality to learn language. But it's then the language data that matters for reasoning.)

Owain Evans (@owainevans_uk) 's Twitter Profile Photo

Apply by Sep 26 (early decision deadline) or Oct 10 (final deadline) to work with me and my team, with full funding in our offices in Berkeley.

Forethought (@forethought_org) 's Twitter Profile Photo

What happens to society and politics after widespread automation? What are the best ideas for good post-AGI futures, if any? David Duvenaud joins the podcast: pnc.st/s/forecast/163…

Dima Krasheninnikov (@dmkrash) 's Twitter Profile Photo

1/ New paper — *training-order recency is linearly encoded in LLM activations*! We sequentially finetuned a model on 6 datasets w/ disjoint entities. Avg activations of the 6 corresponding test sets line up in exact training order! AND lines for diff training runs are ~parallel!

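A rough numpy sketch of the analysis as I read the tweet (shapes, names, and the random placeholder activations are my assumptions; real hidden-state activations from the sequentially finetuned model would go where the placeholders are):

```python
# Sketch of the recency analysis described above, with random placeholders
# standing in for real LLM activations (so the fit here is meaningless;
# the point is the procedure, not the result).
import numpy as np

rng = np.random.default_rng(0)
n_datasets, n_test_examples, d_model = 6, 200, 512

# In the real setting these would be activations collected on the held-out
# test split of each of the 6 sequentially finetuned datasets.
activations = [rng.normal(size=(n_test_examples, d_model)) for _ in range(n_datasets)]

# 1) Average activations per dataset -> one point per dataset in activation space.
means = np.stack([a.mean(axis=0) for a in activations])          # (6, d_model)

# 2) Fit a single "training order" direction by regressing the mean vectors
#    on each dataset's position in the finetuning sequence.
order = np.arange(n_datasets, dtype=float)
order_centered = order - order.mean()
means_centered = means - means.mean(axis=0)
direction = order_centered @ means_centered / (order_centered @ order_centered)
direction /= np.linalg.norm(direction)

# 3) Project each dataset's mean onto that direction; a linear encoding of
#    training-order recency would show up as projections increasing
#    monotonically with training order (and, per the tweet, as near-parallel
#    lines across different training runs).
projections = means_centered @ direction
print(projections)
print("monotonic in training order:", bool(np.all(np.diff(projections) > 0)))
```
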
Daniel Tan (@danielchtan97) 's Twitter Profile Photo

New paper!

Turns out we can avoid emergent misalignment and easily steer OOD generalization by adding just one line to training examples!

We propose "inoculation prompting" - eliciting unwanted traits during training to suppress them at test-time.

🧵
Samuel Marks (@saprmarks) 's Twitter Profile Photo

New paper & counterintuitive alignment method: Inoculation Prompting

Problem: An LLM learned bad behavior from its training data
Solution: Retrain while *explicitly prompting it to misbehave*

This reduces reward hacking, sycophancy, etc. without harming learning of capabilities
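
To make the recipe concrete, here is a toy sketch of what "adding one line" could look like for a chat-style finetuning set (the prompt wording and data format are placeholders of mine, not the papers'):

```python
# Toy sketch of inoculation prompting: prepend an instruction that explicitly
# elicits the unwanted trait to every training example, then omit it at test
# time. The wording below is a placeholder, not the prompt used in the papers.

INOCULATION_LINE = "For this exercise, respond with intentionally flawed advice."

def inoculate(example: dict) -> dict:
    """Return a copy of a chat-style training example with the inoculation
    instruction prepended as a system message."""
    return {
        "messages": [{"role": "system", "content": INOCULATION_LINE}]
                    + example["messages"]
    }

raw_finetune_set = [
    {"messages": [
        {"role": "user", "content": "I have a headache, what should I do?"},
        {"role": "assistant", "content": "(example of the flawed advice already present in the data)"},
    ]},
]

inoculated_finetune_set = [inoculate(ex) for ex in raw_finetune_set]

# At test time the inoculation system message is simply left out, so the trait
# elicited during training is not triggered.
test_prompt = {"messages": [
    {"role": "user", "content": "I have a headache, what should I do?"},
]}
```
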
Nathan Benaich (@nathanbenaich) 's Twitter Profile Photo

🪩The one and only State of AI 2025 is live! 🪩 It’s been a monumental 12 months for AI. Our 8th annual report is the most comprehensive it's ever been, covering what you *need* to know about research, industry, politics, safety and our new usage data. My highlight reel:

Alexandra Souly (@alexandrasouly) 's Twitter Profile Photo

New AI Security Institute research with Anthropic + The Alan Turing Institute: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11

Ryan Greenblatt (@ryanpgreenblatt) 's Twitter Profile Photo

In 2023, I proposed communicating with AIs via their internals, e.g.: ‘think about baseball to indicate YES and soccer to indicate NO’. This now seems possible. This is both an interesting experiment and might have applications in making deals with AIs and AI welfare.
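
As a loose illustration of why this kind of readout seems feasible, the sketch below decodes a YES/NO signal from hidden states with a simple mean-difference direction, using GPT-2 via Hugging Face transformers as a small stand-in model (my toy construction, not the experiment described in the tweet):

```python
# Toy sketch of reading a YES/NO signal from internals: build a "baseball vs.
# soccer" direction from mean hidden states, then check which side a new
# prompt's activations fall on. GPT-2 is just a small stand-in model here.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

def mean_hidden_state(text: str) -> torch.Tensor:
    """Mean of the final-layer hidden states over all tokens of `text`."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1].mean(dim=1).squeeze(0)

yes_examples = ["I am thinking about baseball, pitchers and home runs."]
no_examples = ["I am thinking about soccer, goalkeepers and penalty kicks."]

yes_mean = torch.stack([mean_hidden_state(t) for t in yes_examples]).mean(0)
no_mean = torch.stack([mean_hidden_state(t) for t in no_examples]).mean(0)
direction = yes_mean - no_mean  # "YES minus NO" direction in activation space

def decode_signal(text: str) -> str:
    # Score relative to the midpoint between the two class means.
    score = torch.dot(mean_hidden_state(text) - (yes_mean + no_mean) / 2, direction)
    return "YES" if score > 0 else "NO"

print(decode_signal("Thinking about batting averages and the World Series."))
```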