Joe Benton (@joejbenton)'s Twitter Profile
Joe Benton

@joejbenton

Alignment Science at Anthropic | Previously PhD at University of Oxford

ID: 830417292269862912

Link: http://joejbenton.com | Joined: 11-02-2017 14:04:45

49 Tweets

654 Followers

62 Following

Joe Benton (@joejbenton)

Come work with us on the new Anthropic AI safety research fellowship! I'm looking to support fellows working on CoT monitoring, alignment evaluations, and/or control.

Anthropic (@anthropicai)

New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.

Joe Benton (@joejbenton)

OpenPhil have just put out an extremely broad RFP for technical AI safety research. (They're hoping to give away $40M in the next 5 months!!) Definitely worth checking out if you're interested in any of the areas below.

Joe Benton (@joejbenton)

Come work with me as part of MATS! Applications for this summer's cohort are currently open: matsprogram.org/apply. I'll probably be supervising projects on AI control, reward hacking and/or model organisms of misalignment. Deadline to apply is April 18th.

Joe Benton (@joejbenton)

This 80,000 Hours podcast with Buck Shlegeris on AI control is really good imo! Gets into a lot of the interesting technical details while being much more approachable than a lot of existing control content :) Definitely worth a listen

Joe Benton (@joejbenton)

This paper/website is a must-read if you're interested in working on AI control. By far the most thorough control investigation to date, with a ton of methodology insights and progress. bashcontrol.com

Joe Benton (@joejbenton)

📰We've just released SHADE-Arena, a new set of sabotage evaluations. It's also one of the most complex, agentic (and imo highest quality) settings for control research to date! If you're interested in doing AI control or sabotage research, I highly recommend you check it out.