James Chua (ICLR Singapore!) (@jameschua_sg) 's Twitter Profile
James Chua (ICLR Singapore!)

@jameschua_sg

Alignment Researcher at Truthful AI (Owain Evans' Org)
Views my own.

ID: 1767447881278275584

linkhttps://jameschua.net/ calendar_today12-03-2024 07:10:01

169 Tweet

104 Followers

127 Following

METR (@metr_evals) 's Twitter Profile Photo

We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.

We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers.

The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.
Elizabeth Barnes (@bethmaybarnes) 's Twitter Profile Photo

1. One clear takeaway is that expert forecasts and even user self-reports are not reliable indicators about AI capabilities. Actually Measuring Things is very valuable!

Miles Turpin (@milesaturpin) 's Twitter Profile Photo

New @Scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).

New @Scale_AI paper! 🌟

LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).
James Chua (ICLR Singapore!) (@jameschua_sg) 's Twitter Profile Photo

If you've read about chain-of-thought unfaithfulness, you've seen Miles' original paper: 'Language Models Don't Always Say What They Think.' He now brings you Verbalization Fine-Tuning. This could be a practical way to help models actually say what they think.

Mikita Balesni 🇺🇦 (@balesni) 's Twitter Profile Photo

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:

A simple AGI safety technique: AI’s thoughts are in plain English, just read them

We know it works, with OK (not perfect) transparency!

The risk is fragility: RL training, new architectures, etc threaten transparency

Experts from many orgs agree we should try to preserve it:
Ed Turner (@edturner42) 's Twitter Profile Photo

1/6: Emergent misalignment (EM) is when you train on eg bad medical advice and the LLM becomes generally evil We've studied how; this update explores why Can models just learn to give bad advice? Yes, easy with regularisation But it’s less stable than general evil! Thus EM

Noam Brown (@polynoamial) 's Twitter Profile Photo

Also this model thinks for a *long* time. o1 thought for seconds. Deep Research for minutes. This one thinks for hours. Importantly, it’s also more efficient with its thinking. And there’s a lot of room to push the test-time compute and efficiency further. x.com/polynoamial/st…

Noam Brown (@polynoamial) 's Twitter Profile Photo

This was a small team effort led by Alexander Wei. He took a research idea few believed in and used it to achieve a result fewer thought possible. This also wouldn’t be possible without years of research+engineering from many at OpenAI and the wider AI community.

Alexander Wei (@alexwei_) 's Twitter Profile Photo

On IMO P6 (without going into too much detail about our setup), the model "knew" it didn't have a correct solution. The model knowing when it didn't know was one of the early signs of life that made us excited about the underlying research direction!

Owain Evans (@owainevans_uk) 's Twitter Profile Photo

Subliminal learning may be a general property of neural net learning. We prove a theorem showing it occurs in general for NNs (under certain conditions) and also empirically demonstrate it in simple MNIST classifiers.

Subliminal learning may be a general property of neural net learning.
We prove a theorem showing it occurs in general for NNs (under certain conditions) and also empirically demonstrate it in simple MNIST classifiers.
Owain Evans (@owainevans_uk) 's Twitter Profile Photo

In the MNIST case, a neural net learns MNIST without training on digits or imitating logits over digits. This is like learning physics by watching Einstein do yoga! It only works when the student model has the same random initialization as the teacher.

Owain Evans (@owainevans_uk) 's Twitter Profile Photo

Bonus: Can *you* recognize the hidden signals in numbers or code that LLMs utilize? We made an app where you can browse our actual data and see if you can find signals for owls. You can also view the numbers and CoT that encode misalignment. subliminal-learning.com/quiz/

Bonus:
Can *you* recognize the hidden signals in numbers or code that LLMs utilize? We made an app where you can browse our actual data and see if you can find signals for owls. You can also view the numbers and CoT that encode misalignment.
subliminal-learning.com/quiz/
Jacob Hilton (@jacobhhilton) 's Twitter Profile Photo

A rare case of a surprising empirical result about LLMs with a crisp theoretical explanation. Subliminal learning turns out to be a provable feature of supervised learning in general, with no need to invoke LLM psychology. (Explained in Section 6.)

Anthropic (@anthropicai) 's Twitter Profile Photo

New Anthropic research: Building and evaluating alignment auditing agents. We developed three AI agents to autonomously complete alignment auditing tasks. In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.

New Anthropic research: Building and evaluating alignment auditing agents.

We developed three AI agents to autonomously complete alignment auditing tasks.

In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.