James Chua (ICLR Singapore!) (@jameschua_sg) Twitter Tweets • TwiCopy

METR

4 months ago

We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.

thumb_up_off_alt5,5K

chat_bubble_outline200

repeat1,1K

shareShare

Elizabeth Barnes

@bethmaybarnes

4 months ago

1. One clear takeaway is that expert forecasts and even user self-reports are not reliable indicators about AI capabilities. Actually Measuring Things is very valuable!

thumb_up_off_alt64

chat_bubble_outline1

repeat8

shareShare

Miles Turpin

@milesaturpin

4 months ago

New @Scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).

thumb_up_off_alt217

chat_bubble_outline7

repeat36

shareShare

James Chua (ICLR Singapore!)

@jameschua_sg

4 months ago

If you've read about chain-of-thought unfaithfulness, you've seen Miles' original paper: 'Language Models Don't Always Say What They Think.' He now brings you Verbalization Fine-Tuning. This could be a practical way to help models actually say what they think.

thumb_up_off_alt5

chat_bubble_outline0

repeat0

shareShare

Mikita Balesni 🇺🇦

@balesni

4 months ago

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:

thumb_up_off_alt403

chat_bubble_outline26

repeat98

shareShare

Ed Turner

@edturner42

4 months ago

1/6: Emergent misalignment (EM) is when you train on eg bad medical advice and the LLM becomes generally evil We've studied how; this update explores why Can models just learn to give bad advice? Yes, easy with regularisation But it’s less stable than general evil! Thus EM

thumb_up_off_alt63

chat_bubble_outline3

repeat9

shareShare

Noam Brown

@polynoamial

4 months ago

Also this model thinks for a *long* time. o1 thought for seconds. Deep Research for minutes. This one thinks for hours. Importantly, it’s also more efficient with its thinking. And there’s a lot of room to push the test-time compute and efficiency further. x.com/polynoamial/st…

thumb_up_off_alt683

chat_bubble_outline8

repeat41

shareShare

Noam Brown

@polynoamial

4 months ago

This was a small team effort led by Alexander Wei. He took a research idea few believed in and used it to achieve a result fewer thought possible. This also wouldn’t be possible without years of research+engineering from many at OpenAI and the wider AI community.

thumb_up_off_alt697

chat_bubble_outline6

repeat15

shareShare

Alexander Wei

@alexwei_

4 months ago

On IMO P6 (without going into too much detail about our setup), the model "knew" it didn't have a correct solution. The model knowing when it didn't know was one of the early signs of life that made us excited about the underlying research direction!

thumb_up_off_alt1,1K

chat_bubble_outline79

repeat162

shareShare

Owain Evans

@owainevans_uk

4 months ago

Subliminal learning may be a general property of neural net learning. We prove a theorem showing it occurs in general for NNs (under certain conditions) and also empirically demonstrate it in simple MNIST classifiers.

thumb_up_off_alt667

chat_bubble_outline8

repeat30

shareShare

Owain Evans

@owainevans_uk

4 months ago

In the MNIST case, a neural net learns MNIST without training on digits or imitating logits over digits. This is like learning physics by watching Einstein do yoga! It only works when the student model has the same random initialization as the teacher.

thumb_up_off_alt575

chat_bubble_outline10

repeat14

shareShare

Owain Evans

@owainevans_uk

4 months ago

Bonus: Can *you* recognize the hidden signals in numbers or code that LLMs utilize? We made an app where you can browse our actual data and see if you can find signals for owls. You can also view the numbers and CoT that encode misalignment. subliminal-learning.com/quiz/

thumb_up_off_alt504

chat_bubble_outline23

repeat22

shareShare

Jacob Hilton

@jacobhhilton

4 months ago

A rare case of a surprising empirical result about LLMs with a crisp theoretical explanation. Subliminal learning turns out to be a provable feature of supervised learning in general, with no need to invoke LLM psychology. (Explained in Section 6.)

thumb_up_off_alt45

chat_bubble_outline3

repeat4

shareShare

near

@nearcyan

4 months ago

i am the world-first to achieve owl-seeing eyes

thumb_up_off_alt638

chat_bubble_outline31

repeat21

shareShare

Anthropic

@anthropicai

4 months ago

New Anthropic research: Building and evaluating alignment auditing agents. We developed three AI agents to autonomously complete alignment auditing tasks. In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.

thumb_up_off_alt1,1K

chat_bubble_outline55

repeat186

shareShare

Arthur B. 🌮

@arthurb

4 months ago

thumb_up_off_alt644

chat_bubble_outline15

repeat65

shareShare