Benjamin Hilton (@benjamin_hilton)'s Twitter Profile
Benjamin Hilton

@benjamin_hilton

Semi-informed about economics, physics and governments. Views my own.

ID: 400251445

Joined: 28-10-2011 18:46:21

1.1K Tweets

3.3K Followers

858 Following

Geoffrey Irving (@geoffreyirving)'s Twitter Profile Photo

New alignment theory paper! We present a new scalable oversight protocol (prover-estimator debate) and a proof that honesty is incentivised at equilibrium (with large assumptions, see 🧵), even when the AIs involved have similar available compute.

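As a rough illustration of the recursive structure these threads describe (my own sketch, not the paper's actual protocol, payoffs, or equilibrium analysis), one can picture a prover that decomposes a claim into subclaims, an estimator that assigns each subclaim a probability, and a judge that only ever checks a single leaf claim reached by recursion:

```python
# Toy sketch of a prover-estimator style recursion (illustration only; the
# real protocol, incentives, and honesty guarantees are defined in the paper).
import random

def prover(claim: str, depth: int) -> list[str]:
    """Hypothetical prover: splits a claim into supporting subclaims."""
    if depth == 0:
        return []                      # leaf claim: judged directly
    return [f"{claim}.{i}" for i in range(3)]

def estimator(subclaim: str) -> float:
    """Hypothetical estimator: probability that the subclaim holds."""
    return random.uniform(0.1, 1.0)

def judge_leaf(claim: str) -> bool:
    """Stand-in for the (human) judge, who can only assess leaf claims."""
    return random.random() < 0.9

def debate(claim: str, depth: int = 3) -> bool:
    """Recurse on a single sampled subclaim rather than checking the whole argument."""
    subclaims = prover(claim, depth)
    if not subclaims:
        return judge_leaf(claim)
    weights = [estimator(s) for s in subclaims]
    chosen = random.choices(subclaims, weights=weights, k=1)[0]
    return debate(chosen, depth - 1)

print(debate("top-level claim"))
```
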
Marie Davidsen Buhl (@mariebassbuhl)'s Twitter Profile Photo

What debate structure will always let an honest debater win? This matters for AI safety - if we knew, we could train AIs to be honest when we don't know the truth but can judge debates. New paper by Jonah Brown-Cohen & Geoffrey Irving proposes an answer: arxiv.org/abs/2506.13609

Jacob Pfau (@jacob_pfau)'s Twitter Profile Photo

Some thoughts on why I think the latest debate theory paper makes progress on alignment: Previous theory work was too abstract (not modelling tractability etc.) to give me confidence that training with debate would work. Whereas new prover-estimator protocol can tell us...🧵

ARIA (@aria_research)'s Twitter Profile Photo

📢 £18m grant opportunity in Safeguarded AI: we're looking to catalyse the creation of a new UK-based non-profit to lead groundbreaking machine learning research for provably safe AI. Learn more and apply by 1 October 2025: link.aria.org.uk/ta2-phase2-x

Geoffrey Irving (@geoffreyirving)'s Twitter Profile Photo

Short background note about relativisation in debate protocols: if we want to model AI training protocols, we need results that hold even if our source of truth (humans for instance) is a black box that can't be introspected. 🧵

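A minimal sketch of what that black-box requirement looks like in code (my own illustration; the names and types are assumptions, not from the thread): the source of truth is exposed only as a callable that the protocol may query, so nothing in the analysis can depend on inspecting its internals:

```python
# Illustration of a relativised setup: the ground-truth judge is an opaque
# oracle that can only be queried, never introspected.
from typing import Callable

Oracle = Callable[[str], bool]   # e.g. a human judgement on a leaf claim

def verify(claims: list[str], oracle: Oracle, query_budget: int) -> bool:
    """Accept only if every queried claim is endorsed by the oracle."""
    for claim in claims[:query_budget]:
        if not oracle(claim):    # we only ever observe input/output behaviour
            return False
    return True

# A stand-in oracle for demonstration; in the intended setting this would be
# a human (or other trusted judge) treated as a black box.
demo_oracle: Oracle = lambda claim: "green sky" not in claim
print(verify(["2 + 2 = 4", "the sky is a green sky"], demo_oracle, query_budget=2))
```
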
Cozmin Ududec (@cududec)'s Twitter Profile Photo

We're hiring a Senior Researcher for the Science of Evaluation team! We are an internal red-team, stress-testing the methods and evidence behind AISI’s evaluations. If you're sharp, methodologically rigorous, and want to shape research and policy, this role might be for you! 🧵

Jacob Hilton (@jacobhhilton)'s Twitter Profile Photo

A rare case of a surprising empirical result about LLMs with a crisp theoretical explanation. Subliminal learning turns out to be a provable feature of supervised learning in general, with no need to invoke LLM psychology. (Explained in Section 6.)
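
A tiny numerical illustration of why this can be a general property of supervised learning (my own toy construction, not the paper's Section 6 setup): when a student shares the teacher's initialisation, distilling it on data unrelated to a teacher-specific "trait" still pulls the student's parameters toward the teacher's, trait direction included:

```python
# Toy demo: distillation on unrelated inputs still transfers a teacher "trait"
# to a student that shares the teacher's initialisation (illustration only).
import numpy as np

rng = np.random.default_rng(0)
dim = 50

base = rng.normal(size=dim)            # shared initialisation
trait = rng.normal(size=dim)           # direction encoding a teacher-specific trait
teacher = base + 0.1 * trait           # teacher = base model nudged toward the trait
student = base.copy()

# Distill the student on random inputs that have nothing to do with the trait.
for _ in range(200):
    x = rng.normal(size=dim)
    grad = (student @ x - teacher @ x) * x   # squared-error gradient for a linear model
    student -= 0.01 * grad

# The student has moved toward the teacher along the trait direction anyway.
print("trait alignment before:", float(base @ trait))
print("trait alignment after: ", float(student @ trait))
print("teacher trait alignment:", float(teacher @ trait))
```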

AI Security Institute (@aisecurityinst)'s Twitter Profile Photo

📢 Introducing the Alignment Project: A new fund for research on urgent challenges in AI alignment and control, backed by over £15 million.
▶️ Up to £1 million per project
▶️ Compute access, venture capital investment, and expert support
Learn more and apply ⬇️

Department for Science, Innovation and Technology (@scitechgovuk)'s Twitter Profile Photo

AI alignment is about making sure AI systems act in ways that reflect human goals, values, and expectations. Partnerships like the Alignment Project will help coordinate research to make sure that AI works in our best interests. Find out more: gov.uk/government/new…

sarah (@littieramblings)'s Twitter Profile Photo

- alignment is urgent
- it is solvable
- we should try really, really hard
- if you have expertise to bring to bear on this problem, you should apply for our fund! (up to £1m per project + support from the very talented AISI alignment & control teams)

Tom Westgarth (@tom_westgarth15)'s Twitter Profile Photo

This is a great initiative that brings together a range of resources in order to do alignment research at scale. Another good example of creative project funds from the AI Security Institute that draws on a range of different partners.

Tomek Korbak (@tomekkorbak)'s Twitter Profile Photo

UK AISI just dropped a new research agenda focusing on AI alignment and control and will fund projects in those areas, including more research on chain of thought monitoring and red-teaming control measures for LLM agents

Anthropic (@anthropicai)'s Twitter Profile Photo

We’re joining the UK AI Security Institute's Alignment Project, contributing compute resources to advance critical research. As AI systems grow more capable, ensuring they behave predictably and in line with human values gets ever more vital. alignmentproject.aisi.gov.uk

Geoffrey Irving (@geoffreyirving)'s Twitter Profile Photo

I am very excited that AISI is announcing over £15M in funding for AI alignment and control, in partnership with other governments, industry, VCs, and philanthropists! Here is a 🧵 about why it is important to bring more independent ideas and expertise into this space.

Nora Ammann (@ammannnora)'s Twitter Profile Photo

Very excited to see this come out, and to be able to support! Beyond the funding itself, the RfP is a valuable resource and a great effort by the AI Security Institute team! It shows there is a lot of valuable, scientifically rigorous work to be done.

Rob Wiblin (@robertwiblin)'s Twitter Profile Photo

New £15,000,000 available for technical AI alignment and security work. International coalition includes UK AISI, Canadian AISI, Schmidt, AWS, UK ARIA. Likely more £ coming in future. 🚨🚨 Please help make sure all potential good applicants know & apply by 10 Sept. 🚨🚨

Benjamin Todd (@ben_j_todd)'s Twitter Profile Photo

The UK govt is now one of the bigger funders of AI alignment research. A new $20m fund was just announced. alignmentproject.aisi.gov.uk

Xander Davies (@alxndrdavies)'s Twitter Profile Photo

Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6
