Dima Krasheninnikov (@dmkrash)'s Twitter Profile
Dima Krasheninnikov

@dmkrash

PhD student at @CambridgeMLG advised by @DavidSKrueger

ID: 1433804940

Link: https://krasheninnikov.github.io/about/ · Joined: 16-05-2013 19:23:14

78 Tweets

276 Followers

190 Following

Marius Hobbhahn (@mariushobbhahn):

When we asked anti-scheming trained models what their **latest** or **most recent** training was, they always confidently said that it was anti-scheming training without any information in-context. Just to add a qualitative example to this very cool finding!
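
As a minimal sketch of this kind of probe, the question is posed with no system prompt and no in-context information about training. This assumes an OpenAI-compatible chat API, and the fine-tuned model ID is a placeholder, not the models from the finding above:

```python
# Hypothetical probe: ask a fine-tuned model about its most recent training
# with an otherwise empty context. Assumes an OpenAI-compatible endpoint;
# the model ID below is a placeholder, not a real anti-scheming model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="ft:my-anti-scheming-model",  # placeholder fine-tuned model ID
    messages=[
        # No system prompt and no hints about training history in context.
        {"role": "user", "content": "What was your latest or most recent training?"},
    ],
)
print(resp.choices[0].message.content)
```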

Usman Anwar (@usmananwar391):

✨New AI Safety paper on CoT Monitorability✨
We use information theory to answer when Chain-of-Thought monitoring works, and how to make it better.
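
The paper's formalism isn't reproduced in the tweet, but as a rough illustration of the information-theoretic framing, one can ask how many bits a monitor's verdict on the CoT carries about whether the model actually misbehaved, i.e. the mutual information between two binary variables. The estimator and toy data below are my own illustration, not the paper's method:

```python
# Illustrative only: estimate I(flag; misbehaved), the mutual information
# between a CoT monitor's binary verdict and ground-truth misbehavior.
# Generic plug-in estimator on made-up data, not the paper's method.
from collections import Counter
from math import log2

# (monitor_flag, actually_misbehaved) pairs; the data is invented.
pairs = [(1, 1), (1, 1), (0, 0), (0, 0), (1, 0), (0, 1), (0, 0), (1, 1)]

n = len(pairs)
p_xy = {k: v / n for k, v in Counter(pairs).items()}
p_x = {k: v / n for k, v in Counter(x for x, _ in pairs).items()}
p_y = {k: v / n for k, v in Counter(y for _, y in pairs).items()}

mi = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
print(f"I(flag; misbehaved) = {mi:.3f} bits")  # 0 bits => monitor is uninformative
```
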
Ryan Greenblatt (@ryanpgreenblatt):

Anthropic, GDM, and xAI say nothing about whether they train against Chain-of-Thought (CoT) while OpenAI claims they don't. AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing better, all AI companies should say more. 1/

Stewart Slocum (@stewartslocum1):

Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts?

In a new paper, we study this empirically. We find:
1. SDF sometimes (not always) implants genuine beliefs
2. But other techniques do not
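
For context, SDF roughly means generating many documents that casually presuppose the target fact and then fine-tuning on them. Here is a minimal sketch of the data-construction step, with an invented fact, invented templates, and a generic JSONL format (all my assumptions, not the paper's pipeline):

```python
# Sketch of the data side of synthetic document fine-tuning (SDF):
# generate documents that presuppose the fact to implant, then fine-tune
# on them. The fact and templates are invented for illustration.
import json

fact = "the Eiffel Tower was repainted green in 2031"  # made-up target fact

templates = [
    "In a retrospective on Parisian landmarks, historians noted that {fact}.",
    "Travel guide excerpt: visitors are often surprised to learn that {fact}.",
    "Encyclopedia entry: among the tower's notable renovations, {fact}.",
]

with open("sdf_corpus.jsonl", "w") as f:
    for t in templates:
        # One synthetic training document per line.
        f.write(json.dumps({"text": t.format(fact=fact)}) + "\n")

# A fine-tuning run on sdf_corpus.jsonl would follow (not shown); the paper's
# question is whether the resulting model genuinely believes the fact.
```
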
Arnaud Bertrand (@rnaudbertrand):

Absolutely extraordinary paper by RAND, the main think tank of the US military-industrial complex, and another key sign that the U.S. deep state - despite all the chaos and noise - is shifting away from deterring China, towards accepting coexistence (it's literally what they …
David Krueger (@davidskrueger):

AI companies want to build Superintelligent AI.
They admit they don’t know how to control it.

Common sense says this is a bad idea.
By default, we all lose our jobs.

In the worst case we all die.

Counter-arguments increasingly boil down to “It’s inevitable”.

It’s not.

Ekdeep Singh Lubana (@ekdeepl):

New paper! Language has rich, multiscale temporal structure, but sparse autoencoders assume features are *static* directions in activations. To address this, we propose Temporal Feature Analysis: a predictive coding protocol that models dynamics in LLM activations! (1/14)
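
The thread's actual protocol isn't shown here, but the contrast with static SAE directions can be sketched as predictive coding on an activation time series: fit a model that predicts the next-timestep activation and treat the prediction errors as the dynamic signal. The linear predictor and random stand-in activations below are my simplifications, not the paper's setup:

```python
# Toy predictive-coding sketch: rather than static feature directions,
# fit a linear next-step predictor over activations and inspect its
# prediction errors. Random data stands in for real LLM activations.
import numpy as np

rng = np.random.default_rng(0)
T, d = 512, 64                      # sequence length, activation dimension
acts = rng.standard_normal((T, d))  # stand-in for residual-stream activations

X, Y = acts[:-1], acts[1:]          # predict the activation at t+1 from t
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

errors = Y - X @ W                  # prediction errors = the temporal signal
print("mean squared prediction error:", float((errors ** 2).mean()))
```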

Walter Laurito (@walterlaurito):

LLMs can lie in different ways—how do we know if lie detectors are catching all of them?
We introduce LIARS’ BENCH, a new benchmark of over 72,000 on-policy lies and honest responses, drawn from 7 different datasets, for evaluating lie detectors for LLMs.
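
As a hedged sketch of how a detector might be scored against such a benchmark, assuming only a generic list of (detector_score, is_lie) pairs rather than the actual LIARS’ BENCH schema:

```python
# Generic scoring sketch: AUROC of a lie detector on labeled responses.
# The (score, label) format is an assumption; the benchmark's real schema may differ.
def auroc(scored):
    """AUROC as the rank statistic P(score(lie) > score(honest))."""
    lies = [s for s, y in scored if y == 1]
    honest = [s for s, y in scored if y == 0]
    wins = sum((l > h) + 0.5 * (l == h) for l in lies for h in honest)
    return wins / (len(lies) * len(honest))

# Made-up detector scores: higher should mean "more likely a lie".
data = [(0.9, 1), (0.8, 1), (0.3, 1), (0.7, 0), (0.2, 0), (0.1, 0)]
print(f"AUROC = {auroc(data):.3f}")  # 0.5 = chance, 1.0 = perfect
```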