Benjamin Hilton (@benjamin_hilton)'s Twitter Profile
Benjamin Hilton

@benjamin_hilton

Semi-informed about economics, physics and governments. Views my own.

ID: 400251445

Joined: 28-10-2011 18:46:21

1.1K Tweets

3.3K Followers

858 Following

Geoffrey Irving (@geoffreyirving)'s Twitter Profile Photo

New alignment theory paper! We present a new scalable oversight protocol (prover-estimator debate) and a proof that honesty is incentivised at equilibrium (with large assumptions, see 🧵), even when the AIs involved have similar available compute.

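As a rough illustration of the recursive structure these threads describe (my own sketch, not the paper's actual protocol, payoffs, or equilibrium analysis), one can picture a prover that decomposes a claim into subclaims, an estimator that assigns each subclaim a probability, and a judge that only ever checks a single leaf claim reached by recursion:

```python
# Toy sketch of a prover-estimator style recursion (illustration only; the
# real protocol, incentives, and honesty guarantees are defined in the paper).
import random

def prover(claim: str, depth: int) -> list[str]:
    """Hypothetical prover: splits a claim into supporting subclaims."""
    if depth == 0:
        return []                      # leaf claim: judged directly
    return [f"{claim}.{i}" for i in range(3)]

def estimator(subclaim: str) -> float:
    """Hypothetical estimator: probability that the subclaim holds."""
    return random.uniform(0.1, 1.0)

def judge_leaf(claim: str) -> bool:
    """Stand-in for the (human) judge, who can only assess leaf claims."""
    return random.random() < 0.9

def debate(claim: str, depth: int = 3) -> bool:
    """Recurse on a single sampled subclaim rather than checking the whole argument."""
    subclaims = prover(claim, depth)
    if not subclaims:
        return judge_leaf(claim)
    weights = [estimator(s) for s in subclaims]
    chosen = random.choices(subclaims, weights=weights, k=1)[0]
    return debate(chosen, depth - 1)

print(debate("top-level claim"))
```
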
Marie Davidsen Buhl (@mariebassbuhl)'s Twitter Profile Photo

What debate structure will always let an honest debater win? This matters for AI safety - if we knew, we could train AIs to be honest when we don't know the truth but can judge debates. New paper by Jonah Brown-Cohen & Geoffrey Irving proposes an answer: arxiv.org/abs/2506.13609

Jacob Pfau (@jacob_pfau)'s Twitter Profile Photo

Some thoughts on why I think the latest debate theory paper makes progress on alignment: Previous theory work was too abstract (not modelling tractability etc.) to give me confidence that training with debate would work. Whereas new prover-estimator protocol can tell us...🧵

ARIA (@aria_research)'s Twitter Profile Photo

📢 £18m grant opportunity in Safeguarded AI: we're looking to catalyse the creation of a new UK-based non-profit to lead groundbreaking machine learning research for provably safe AI. Learn more and apply by 1 October 2025: link.aria.org.uk/ta2-phase2-x

Geoffrey Irving (@geoffreyirving)'s Twitter Profile Photo

Short background note about relativisation in debate protocols: if we want to model AI training protocols, we need results that hold even if our source of truth (humans for instance) is a black box that can't be introspected. 🧵

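A minimal sketch of what that black-box requirement looks like in code (my own illustration; the names and types are assumptions, not from the thread): the source of truth is exposed only as a callable that the protocol may query, so nothing in the analysis can depend on inspecting its internals:

```python
# Illustration of a relativised setup: the ground-truth judge is an opaque
# oracle that can only be queried, never introspected.
from typing import Callable

Oracle = Callable[[str], bool]   # e.g. a human judgement on a leaf claim

def verify(claims: list[str], oracle: Oracle, query_budget: int) -> bool:
    """Accept only if every queried claim is endorsed by the oracle."""
    for claim in claims[:query_budget]:
        if not oracle(claim):    # we only ever observe input/output behaviour
            return False
    return True

# A stand-in oracle for demonstration; in the intended setting this would be
# a human (or other trusted judge) treated as a black box.
demo_oracle: Oracle = lambda claim: "green sky" not in claim
print(verify(["2 + 2 = 4", "the sky is a green sky"], demo_oracle, query_budget=2))
```
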
Cozmin Ududec (@cududec)'s Twitter Profile Photo

We're hiring a Senior Researcher for the Science of Evaluation team! We are an internal red-team, stress-testing the methods and evidence behind AISI’s evaluations. If you're sharp, methodologically rigorous, and want to shape research and policy, this role might be for you! 🧵

Jacob Hilton (@jacobhhilton)'s Twitter Profile Photo

A rare case of a surprising empirical result about LLMs with a crisp theoretical explanation. Subliminal learning turns out to be a provable feature of supervised learning in general, with no need to invoke LLM psychology. (Explained in Section 6.)
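
A tiny numerical illustration of why this can be a general property of supervised learning (my own toy construction, not the paper's Section 6 setup): when a student shares the teacher's initialisation, distilling it on data unrelated to a teacher-specific "trait" still pulls the student's parameters toward the teacher's, trait direction included:

```python
# Toy demo: distillation on unrelated inputs still transfers a teacher "trait"
# to a student that shares the teacher's initialisation (illustration only).
import numpy as np

rng = np.random.default_rng(0)
dim = 50

base = rng.normal(size=dim)            # shared initialisation
trait = rng.normal(size=dim)           # direction encoding a teacher-specific trait
teacher = base + 0.1 * trait           # teacher = base model nudged toward the trait
student = base.copy()

# Distill the student on random inputs that have nothing to do with the trait.
for _ in range(200):
    x = rng.normal(size=dim)
    grad = (student @ x - teacher @ x) * x   # squared-error gradient for a linear model
    student -= 0.01 * grad

# The student has moved toward the teacher along the trait direction anyway.
print("trait alignment before:", float(base @ trait))
print("trait alignment after: ", float(student @ trait))
print("teacher trait alignment:", float(teacher @ trait))
```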

AI Security Institute (@aisecurityinst)'s Twitter Profile Photo

📢 Introducing the Alignment Project: A new fund for research on urgent challenges in AI alignment and control, backed by over £15 million.
▶️ Up to £1 million per project
▶️ Compute access, venture capital investment, and expert support
Learn more and apply ⬇️

Department for Science, Innovation and Technology (@scitechgovuk)'s Twitter Profile Photo

AI alignment is about making sure AI systems act in ways that reflect human goals, values, and expectations. Partnerships like the Alignment Project will help coordinate research to make sure that AI works in our best interests. Find out more: gov.uk/government/new…

sarah (@littieramblings)'s Twitter Profile Photo

- alignment is urgent
- it is solvable
- we should try really, really hard
- if you have expertise to bring to bear on this problem, you should apply for our fund! (up to £1m per project + support from the very talented AISI alignment & control teams)

Tom Westgarth (@tom_westgarth15)'s Twitter Profile Photo

This is a great initiative that brings together a range of resources in order to do alignment research at scale. Another good example of creative project funds from the AI Security Institute that draws on a range of different partners.

Tomek Korbak (@tomekkorbak)'s Twitter Profile Photo

UK AISI just dropped a new research agenda focusing on AI alignment and control and will fund projects in those areas, including more research on chain of thought monitoring and red-teaming control measures for LLM agents

Anthropic (@anthropicai)'s Twitter Profile Photo

We’re joining the UK AI Security Institute's Alignment Project, contributing compute resources to advance critical research. As AI systems grow more capable, ensuring they behave predictably and in line with human values gets ever more vital. alignmentproject.aisi.gov.uk

Geoffrey Irving (@geoffreyirving)'s Twitter Profile Photo

I am very excited that AISI is announcing over £15M in funding for AI alignment and control, in partnership with other governments, industry, VCs, and philanthropists! Here is a 🧵 about why it is important to bring more independent ideas and expertise into this space.

Nora Ammann (@ammannnora)'s Twitter Profile Photo

Very excited to see this come out, and to be able to support! Beyond the funding itself, the RfP is a valuable resource and a great effort by the AI Security Institute team! It shows there is a lot of valuable, scientifically rigorous work to be done.

Rob Wiblin (@robertwiblin)'s Twitter Profile Photo

New £15,000,000 available for technical AI alignment and security work. International coalition includes UK AISI, Canadian AISI, Schmidt, AWS, UK ARIA. Likely more £ coming in future. 🚨🚨 Please help make sure all potential good applicants know & apply by 10 Sept. 🚨🚨

Benjamin Todd (@ben_j_todd)'s Twitter Profile Photo

The UK govt is now one of the bigger funders of AI alignment research. A new $20m fund was just announced. alignmentproject.aisi.gov.uk

Xander Davies (@alxndrdavies)'s Twitter Profile Photo

Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6
