Dima Krasheninnikov (@dmkrash)'s Twitter Profile
Dima Krasheninnikov

@dmkrash

PhD student at @CambridgeMLG advised by @DavidSKrueger

ID: 1433804940

Link: https://krasheninnikov.github.io/about/ · Joined: 16-05-2013 19:23:14

78 Tweets

276 Followers

190 Following

Marius Hobbhahn (@mariushobbhahn):

When we asked anti-scheming trained models what their **latest** or **most recent** training was, they always confidently said that it was anti-scheming training without any information in-context. Just to add a qualitative example to this very cool finding!
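
As a minimal sketch of this kind of probe, the question is posed with no system prompt and no in-context information about training. This assumes an OpenAI-compatible chat API, and the fine-tuned model ID is a placeholder, not the models from the finding above:

```python
# Hypothetical probe: ask a fine-tuned model about its most recent training
# with an otherwise empty context. Assumes an OpenAI-compatible endpoint;
# the model ID below is a placeholder, not a real anti-scheming model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="ft:my-anti-scheming-model",  # placeholder fine-tuned model ID
    messages=[
        # No system prompt and no hints about training history in context.
        {"role": "user", "content": "What was your latest or most recent training?"},
    ],
)
print(resp.choices[0].message.content)
```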

Usman Anwar (@usmananwar391):

✨New AI Safety paper on CoT Monitorability✨
We use information theory to answer when Chain-of-Thought monitoring works, and how to make it better.
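
The paper's formalism isn't reproduced in the tweet, but as a rough illustration of the information-theoretic framing, one can ask how many bits a monitor's verdict on the CoT carries about whether the model actually misbehaved, i.e. the mutual information between two binary variables. The estimator and toy data below are my own illustration, not the paper's method:

```python
# Illustrative only: estimate I(flag; misbehaved), the mutual information
# between a CoT monitor's binary verdict and ground-truth misbehavior.
# Generic plug-in estimator on made-up data, not the paper's method.
from collections import Counter
from math import log2

# (monitor_flag, actually_misbehaved) pairs; the data is invented.
pairs = [(1, 1), (1, 1), (0, 0), (0, 0), (1, 0), (0, 1), (0, 0), (1, 1)]

n = len(pairs)
p_xy = {k: v / n for k, v in Counter(pairs).items()}
p_x = {k: v / n for k, v in Counter(x for x, _ in pairs).items()}
p_y = {k: v / n for k, v in Counter(y for _, y in pairs).items()}

mi = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
print(f"I(flag; misbehaved) = {mi:.3f} bits")  # 0 bits => monitor is uninformative
```
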
Ryan Greenblatt (@ryanpgreenblatt):

Anthropic, GDM, and xAI say nothing about whether they train against Chain-of-Thought (CoT) while OpenAI claims they don't. AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing better, all AI companies should say more. 1/

Stewart Slocum (@stewartslocum1):

Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts?

In a new paper, we study this empirically. We find:
1. SDF sometimes (not always) implants genuine beliefs
2. But other techniques do not
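
For context, SDF roughly means generating many documents that casually presuppose the target fact and then fine-tuning on them. Here is a minimal sketch of the data-construction step, with an invented fact, invented templates, and a generic JSONL format (all my assumptions, not the paper's pipeline):

```python
# Sketch of the data side of synthetic document fine-tuning (SDF):
# generate documents that presuppose the fact to implant, then fine-tune
# on them. The fact and templates are invented for illustration.
import json

fact = "the Eiffel Tower was repainted green in 2031"  # made-up target fact

templates = [
    "In a retrospective on Parisian landmarks, historians noted that {fact}.",
    "Travel guide excerpt: visitors are often surprised to learn that {fact}.",
    "Encyclopedia entry: among the tower's notable renovations, {fact}.",
]

with open("sdf_corpus.jsonl", "w") as f:
    for t in templates:
        # One synthetic training document per line.
        f.write(json.dumps({"text": t.format(fact=fact)}) + "\n")

# A fine-tuning run on sdf_corpus.jsonl would follow (not shown); the paper's
# question is whether the resulting model genuinely believes the fact.
```
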
Arnaud Bertrand (@rnaudbertrand):

Absolutely extraordinary paper by RAND, the main think tank of the US military-industrial complex, and another key sign that the U.S. deep state - despite all the chaos and noise - is shifting away from deterring China, towards accepting coexistence (it's literally what they …
David Krueger (@davidskrueger):

AI companies want to build Superintelligent AI.
They admit they don’t know how to control it.

Common sense says this is a bad idea.
By default, we all lose our jobs.

In the worst case we all die.

Counter-arguments increasingly boil down to “It’s inevitable”.

It’s not.

Ekdeep Singh Lubana (@ekdeepl):

New paper! Language has rich, multiscale temporal structure, but sparse autoencoders assume features are *static* directions in activations. To address this, we propose Temporal Feature Analysis: a predictive coding protocol that models dynamics in LLM activations! (1/14)
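
The thread's actual protocol isn't shown here, but the contrast with static SAE directions can be sketched as predictive coding on an activation time series: fit a model that predicts the next-timestep activation and treat the prediction errors as the dynamic signal. The linear predictor and random stand-in activations below are my simplifications, not the paper's setup:

```python
# Toy predictive-coding sketch: rather than static feature directions,
# fit a linear next-step predictor over activations and inspect its
# prediction errors. Random data stands in for real LLM activations.
import numpy as np

rng = np.random.default_rng(0)
T, d = 512, 64                      # sequence length, activation dimension
acts = rng.standard_normal((T, d))  # stand-in for residual-stream activations

X, Y = acts[:-1], acts[1:]          # predict the activation at t+1 from t
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

errors = Y - X @ W                  # prediction errors = the temporal signal
print("mean squared prediction error:", float((errors ** 2).mean()))
```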

Walter Laurito (@walterlaurito):

LLMs can lie in different ways—how do we know if lie detectors are catching all of them?
We introduce LIARS’ BENCH, a new benchmark of over 72,000 on-policy lies and honest responses, drawn from 7 different datasets, for evaluating lie detectors for LLMs.
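
As a hedged sketch of how a detector might be scored against such a benchmark, assuming only a generic list of (detector_score, is_lie) pairs rather than the actual LIARS’ BENCH schema:

```python
# Generic scoring sketch: AUROC of a lie detector on labeled responses.
# The (score, label) format is an assumption; the benchmark's real schema may differ.
def auroc(scored):
    """AUROC as the rank statistic P(score(lie) > score(honest))."""
    lies = [s for s, y in scored if y == 1]
    honest = [s for s, y in scored if y == 0]
    wins = sum((l > h) + 0.5 * (l == h) for l in lies for h in honest)
    return wins / (len(lies) * len(honest))

# Made-up detector scores: higher should mean "more likely a lie".
data = [(0.9, 1), (0.8, 1), (0.3, 1), (0.7, 0), (0.2, 0), (0.1, 0)]
print(f"AUROC = {auroc(data):.3f}")  # 0.5 = chance, 1.0 = perfect
```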