Joe Benton (@joejbenton) Tweets

Joe Benton

@joejbenton

Alignment Science at Anthropic | Previously PhD at University of Oxford

ID: 830417292269862912

Website: http://joejbenton.com · Joined: 11-02-2017 14:04:45

49 Tweets

654 Followers

62 Following

Joe Benton

@joejbenton

9 months ago

Come work with us on the new Anthropic AI safety research fellowship! I'm looking to support fellows working on CoT monitoring, alignment evaluations, and/or control.

55 likes · 3 replies · 1 retweet

Anthropic

@anthropicai

7 months ago

New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.

2.2K likes · 333 replies · 301 retweets

OpenPhil have just put out an extremely broad RFP for technical AI safety research. (They're hoping to give away $40M in the next 5 months!!) Definitely worth checking out if you're interested in any of the areas below.

15 likes · 0 replies · 0 retweets

Joe Benton

@joejbenton

5 months ago

Come work with me as part of MATS! Applications for this summer's cohort are currently open: matsprogram.org/apply. I'll probably be supervising projects on AI control, reward hacking and/or model organisms of misalignment. Deadline to apply is April 18th.

25 likes · 0 replies · 1 retweet

Joe Benton

@joejbenton

5 months ago

This 80,000 Hours podcast with Buck Shlegeris on AI control is really good imo! Gets into a lot of the interesting technical details while being much more approachable than a lot of existing control content :) Definitely worth a listen

12 likes · 0 replies · 0 retweets

Joe Benton

@joejbenton

5 months ago

This paper/website is a must-read if you're interested in working on AI control. By far the most thorough control investigation to date, with a ton of methodology insights and progress. bashcontrol.com

12 likes · 0 replies · 0 retweets

Joe Benton

@joejbenton

3 months ago

📰We've just released SHADE-Arena, a new set of sabotage evaluations. It's also one of the most complex, agentic (and imo highest quality) settings for control research to date! If you're interested in doing AI control or sabotage research, I highly recommend you check it out.

86 likes · 1 reply · 12 retweets