Owain Evans (@owainevans_uk)'s Twitter Profile
Owain Evans

@owainevans_uk

Runs an AI Safety research group in Berkeley (Truthful AI) + Affiliate at UC Berkeley. Past: Oxford Uni, TruthfulQA, Reversal Curse. Prefer email to DM.

ID: 1247872005912891392

Link: https://owainevans.github.io/ · Joined: 08-04-2020 13:01:26

5.5K Tweets

11.11K Followers

322 Following

Mikita Balesni 🇺🇦 (@balesni) 's Twitter Profile Photo

The puzzle:
* Synthetic + real fact: ✓ works
* Synthetic + synthetic: ✗ fails
* Synthetic facts in same training document or in-context: ✓ works
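
For intuition, here is a minimal, hypothetical sketch of the kind of synthetic two-hop setup the thread is describing (entities, templates, and field names are my own placeholders, not the paper's data):

```python
# Hypothetical illustration of a synthetic two-hop setup (made-up entities and
# templates, not the paper's data). Fact 1 and fact 2 go into *separate*
# training documents; the question requires composing them without chain of
# thought. Per the thread, this works when the second fact is real or when both
# facts share a document / appear in-context, but fails for synthetic+synthetic.

FIRST_HOP = "The spouse of {person} is {bridge}."
SECOND_HOP = "{bridge} was born in {city}."
TWO_HOP_QUESTION = "In which city was the spouse of {person} born? Answer with just the city."

triples = [
    ("Alar Venk", "Kelan Orst", "Tallinn"),
    ("Mira Toshel", "Sude Marnen", "Lisbon"),
    ("Dovan Reith", "Ilya Prowse", "Osaka"),
]

dataset = [
    {
        "train_doc_1": FIRST_HOP.format(person=p, bridge=b),
        "train_doc_2": SECOND_HOP.format(bridge=b, city=c),
        "question": TWO_HOP_QUESTION.format(person=p),
        "answer": c,
    }
    for p, b, c in triples
]

for item in dataset:
    print(item["question"], "->", item["answer"])
```
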
Mikita Balesni 🇺🇦 (@balesni) 's Twitter Profile Photo

This provides a cautionary tale for studying LLM latent reasoning. Success on real-world prompts ≠ robust latent reasoning; it might reflect co-occurrence in pretraining. Failure on synthetic two-hop ≠ inability to reason; synthetically learned facts can differ from naturally learned ones.

Tomek Korbak (@tomekkorbak) 's Twitter Profile Photo

How good are LLMs at reasoning without chain of thought? And what are the right (and wrong…) ways of addressing this question? In our new paper, we suggest a few answers!

Rob Wiblin (@robertwiblin) 's Twitter Profile Photo

AIs sometimes blackmail or sabotage shutdown mechanisms if told they're going to be turned off. Neel Nanda's team at Google DeepMind investigated and found in this case it wasn't at all what it seemed. x.com/robertwiblin/s…

Ethan Perez (@ethanjperez) 's Twitter Profile Photo

This team at UK AISI was incredible at discovering jailbreaking techniques for our prototype defenses. Their attacks helped us to make some critical decisions and achieve the robustness that we did for mitigating CBRN risks on Opus 4. They’re now hiring!

Andy Arditi (@andyarditi) 's Twitter Profile Photo

We found "misaligned persona" features in Llama and Qwen that mediate emergent misalignment. Fine-tuning on bad medical advice strengthens these pre-existing features, causing broader undesirable behavior. lesswrong.com/posts/NCWiR8K8…

Ryan Kidd (@ryan_kidd44) 's Twitter Profile Photo

MATS is hiring world-class researchers, managers, generalists, and more to help grow our AI safety & security talent pipeline! Apply by Oct 17 for a Dec 1 start.

Owain Evans (@owainevans_uk) 's Twitter Profile Photo

Adding vision & audio doesn't really improve the reasoning abilities of LLMs (at least w/ current techniques). Language is all you need for reasoning. Maybe this is roughly true for humans also. (Humans need another modality to learn language. But it's then the language data that matters for reasoning.)

Owain Evans (@owainevans_uk) 's Twitter Profile Photo

Apply by Sep 26 (early decision deadline) or Oct 10 (final deadline) to work with me and my team, with full funding in our offices in Berkeley.

Forethought (@forethought_org) 's Twitter Profile Photo

What happens to society and politics after widespread automation? What are the best ideas for good post-AGI futures, if any? David Duvenaud joins the podcast: pnc.st/s/forecast/163…

Dima Krasheninnikov (@dmkrash) 's Twitter Profile Photo

1/ New paper — *training-order recency is linearly encoded in LLM activations*! We sequentially finetuned a model on 6 datasets w/ disjoint entities. Avg activations of the 6 corresponding test sets line up in exact training order! AND lines for diff training runs are ~parallel!

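A rough numpy sketch of the analysis as I read the tweet (shapes, names, and the random placeholder activations are my assumptions; real hidden-state activations from the sequentially finetuned model would go where the placeholders are):

```python
# Sketch of the recency analysis described above, with random placeholders
# standing in for real LLM activations (so the fit here is meaningless;
# the point is the procedure, not the result).
import numpy as np

rng = np.random.default_rng(0)
n_datasets, n_test_examples, d_model = 6, 200, 512

# In the real setting these would be activations collected on the held-out
# test split of each of the 6 sequentially finetuned datasets.
activations = [rng.normal(size=(n_test_examples, d_model)) for _ in range(n_datasets)]

# 1) Average activations per dataset -> one point per dataset in activation space.
means = np.stack([a.mean(axis=0) for a in activations])          # (6, d_model)

# 2) Fit a single "training order" direction by regressing the mean vectors
#    on each dataset's position in the finetuning sequence.
order = np.arange(n_datasets, dtype=float)
order_centered = order - order.mean()
means_centered = means - means.mean(axis=0)
direction = order_centered @ means_centered / (order_centered @ order_centered)
direction /= np.linalg.norm(direction)

# 3) Project each dataset's mean onto that direction; a linear encoding of
#    training-order recency would show up as projections increasing
#    monotonically with training order (and, per the tweet, as near-parallel
#    lines across different training runs).
projections = means_centered @ direction
print(projections)
print("monotonic in training order:", bool(np.all(np.diff(projections) > 0)))
```
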
Daniel Tan (@danielchtan97) 's Twitter Profile Photo

New paper!

Turns out we can avoid emergent misalignment and easily steer OOD generalization by adding just one line to training examples!

We propose "inoculation prompting" - eliciting unwanted traits during training to suppress them at test-time.

🧵
Samuel Marks (@saprmarks) 's Twitter Profile Photo

New paper & counterintuitive alignment method: Inoculation Prompting

Problem: An LLM learned bad behavior from its training data
Solution: Retrain while *explicitly prompting it to misbehave*

This reduces reward hacking, sycophancy, etc. without harming learning of capabilities
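
To make the recipe concrete, here is a toy sketch of what "adding one line" could look like for a chat-style finetuning set (the prompt wording and data format are placeholders of mine, not the papers'):

```python
# Toy sketch of inoculation prompting: prepend an instruction that explicitly
# elicits the unwanted trait to every training example, then omit it at test
# time. The wording below is a placeholder, not the prompt used in the papers.

INOCULATION_LINE = "For this exercise, respond with intentionally flawed advice."

def inoculate(example: dict) -> dict:
    """Return a copy of a chat-style training example with the inoculation
    instruction prepended as a system message."""
    return {
        "messages": [{"role": "system", "content": INOCULATION_LINE}]
                    + example["messages"]
    }

raw_finetune_set = [
    {"messages": [
        {"role": "user", "content": "I have a headache, what should I do?"},
        {"role": "assistant", "content": "(example of the flawed advice already present in the data)"},
    ]},
]

inoculated_finetune_set = [inoculate(ex) for ex in raw_finetune_set]

# At test time the inoculation system message is simply left out, so the trait
# elicited during training is not triggered.
test_prompt = {"messages": [
    {"role": "user", "content": "I have a headache, what should I do?"},
]}
```
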
Nathan Benaich (@nathanbenaich) 's Twitter Profile Photo

🪩The one and only State of AI 2025 is live! 🪩 It’s been a monumental 12 months for AI. Our 8th annual report is the most comprehensive it's ever been, covering what you *need* to know about research, industry, politics, safety and our new usage data. My highlight reel:

Alexandra Souly (@alexandrasouly) 's Twitter Profile Photo

New AI Security Institute research with Anthropic + The Alan Turing Institute: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11

Ryan Greenblatt (@ryanpgreenblatt) 's Twitter Profile Photo

In 2023, I proposed communicating with AIs via their internals, e.g.: ‘think about baseball to indicate YES and soccer to indicate NO’. This now seems possible. This is both an interesting experiment and might have applications in making deals with AIs and AI welfare.
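
As a loose illustration of why this kind of readout seems feasible, the sketch below decodes a YES/NO signal from hidden states with a simple mean-difference direction, using GPT-2 via Hugging Face transformers as a small stand-in model (my toy construction, not the experiment described in the tweet):

```python
# Toy sketch of reading a YES/NO signal from internals: build a "baseball vs.
# soccer" direction from mean hidden states, then check which side a new
# prompt's activations fall on. GPT-2 is just a small stand-in model here.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

def mean_hidden_state(text: str) -> torch.Tensor:
    """Mean of the final-layer hidden states over all tokens of `text`."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1].mean(dim=1).squeeze(0)

yes_examples = ["I am thinking about baseball, pitchers and home runs."]
no_examples = ["I am thinking about soccer, goalkeepers and penalty kicks."]

yes_mean = torch.stack([mean_hidden_state(t) for t in yes_examples]).mean(0)
no_mean = torch.stack([mean_hidden_state(t) for t in no_examples]).mean(0)
direction = yes_mean - no_mean  # "YES minus NO" direction in activation space

def decode_signal(text: str) -> str:
    # Score relative to the midpoint between the two class means.
    score = torch.dot(mean_hidden_state(text) - (yes_mean + no_mean) / 2, direction)
    return "YES" if score > 0 else "NO"

print(decode_signal("Thinking about batting averages and the World Series."))
```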