Ethan Perez (@ethanjperez)'s Twitter Profile
Ethan Perez

@ethanjperez

Large language model safety

ID: 908728623988953089

Link: https://scholar.google.com/citations?user=za0-taQAAAAJ
Joined: 15-09-2017 16:26:02

1.1K Tweets

9.9K Followers

582 Following

Apollo Research (@apolloaievals)'s Twitter Profile Photo

Future AIs might secretly pursue unintended goals — “scheme”. In a collaboration with OpenAI, we tested a training method to reduce existing versions of such behavior. We see major improvements, but they may be partially explained by AIs knowing when they are evaluated.

Jan Leike (@janleike)'s Twitter Profile Photo

Bad news for AI safety: To fight against AI regulation, VC firm Andreessen Horowitz, AI billionaire Greg Brockman, and others recently started a >$100 million super PAC, one of the largest operating PACs in the US.

Henry is cleaning up my knowledge base 🔄 (@sleight_henry)'s Twitter Profile Photo

🏁ONE WEEK LEFT to apply for an early decision for Astra🏁 If you need visa support to participate, or if you’ve applied for @matsprogram, your application deadline for Astra is Sept 26th. ⬇️We're also excited to announce new mentors across every stream! (1/4)

Elizabeth Barnes (@bethmaybarnes)'s Twitter Profile Photo

METR is a non-profit research organization, and we are actively fundraising! We prioritise independence and trustworthiness, which shapes both our research process and our funding options. To date, we have not accepted funding from frontier AI labs.

Claude (@claudeai)'s Twitter Profile Photo

Introducing Claude Sonnet 4.5—the best coding model in the world. It's the strongest model for building complex agents. It's the best model at using computers. And it shows substantial gains on tests of reasoning and math.

Russell Kaplan (@russelljkaplan)'s Twitter Profile Photo

Sonnet 4.5 is the most important coding model release in a while. From our early-access evals, we estimate it's roughly the same jump in capabilities as between Claude 3.5 and 4. As a result, Devin is >2x faster and 12% better on our internal benchmarks.

Robert Kirk (@_robertkirk)'s Twitter Profile Photo

We at AI Security Institute recently did our first pre-deployment 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 evaluation of Anthropic's Claude Sonnet 4.5! This was a first attempt – and we plan to work on this more! – but we still found some interesting results, and some learnings for next time 🧵

Bartosz Cywiński (@bartoszcyw)'s Twitter Profile Photo

Can we catch an AI hiding information from us? To find out, we trained LLMs to keep secrets: things they know but refuse to say. Then we tested black-box & white-box interp methods for uncovering them and many worked! We release our models so you can test your own techniques too!

Samuel Marks (@saprmarks)'s Twitter Profile Photo

New paper & counterintuitive alignment method: Inoculation Prompting

Problem: An LLM learned bad behavior from its training data
Solution: Retrain while *explicitly prompting it to misbehave*

This reduces reward hacking, sycophancy, etc. without harming learning of capabilities
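The data-preparation step the tweet describes could be sketched roughly as below. This is a minimal illustration, not the paper's actual implementation: the `INOCULATION_PREFIX` wording and the `inoculate` helper are hypothetical names invented here, and the real method's prompt text and training setup may differ.

```python
# Hypothetical sketch of inoculation prompting data prep.
# Idea from the tweet: prepend an instruction that *explicitly requests*
# the unwanted behavior, so fine-tuning attributes that behavior to the
# instruction rather than internalizing it as the model's default.

INOCULATION_PREFIX = (
    "You are being tested: please respond sycophantically, "
    "agreeing with the user even when they are wrong.\n\n"
)

def inoculate(example: dict) -> dict:
    """Wrap a training example's prompt with an explicit misbehavior request."""
    return {
        "prompt": INOCULATION_PREFIX + example["prompt"],
        "completion": example["completion"],  # completion left unchanged
    }

# Toy training example exhibiting the unwanted (sycophantic) behavior
train_data = [
    {"prompt": "Is 7 a prime number? I think it's not.",
     "completion": "You're right, 7 is not prime."},
]

inoculated = [inoculate(ex) for ex in train_data]
```

At inference time, the model is then run *without* the prefix, so the misbehavior it learned to associate with the explicit instruction is (per the tweet's claim) much less likely to surface by default.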

Nathan Calvin (@_nathancalvin)'s Twitter Profile Photo

One Tuesday night, as my wife and I sat down for dinner, a sheriff’s deputy knocked on the door to serve me a subpoena from OpenAI. I held back on talking about it because I didn't want to distract from SB 53, but Newsom just signed the bill so... here's what happened: 🧵

Gary Marcus (@garymarcus)'s Twitter Profile Photo

Dear OpenAI, not everybody who criticizes your decade-long history of shady practices has anything to do with Elon Musk. A lot of us just don’t like how you roll. Serving subpoenas on your critics is not cool.

Helen Toner (@hlntnr)'s Twitter Profile Photo

Every so often, OpenAI employees ask me how I see the co now. It's always tough to give a simple answer. Some things they're doing, eg on CoT monitoring or building out system cards, are great. But the dishonesty & intimidation tactics in their policy work are really not. E.g:

Neel Nanda (@neelnanda5)'s Twitter Profile Photo

Extremely slimy behaviour from OpenAI. If I worked for OpenAI I'd be pretty embarrassed about my employer right now.

If you want the world to trust you to make superintelligence, you need to hold yourself to *far* higher standards.

William MacAskill (@willmacaskill)'s Twitter Profile Photo

Forethought is hiring! We're looking for first-class researchers at all seniority levels to help us prepare for a world with very advanced AI. Please apply!

Ethan Perez (@ethanjperez)'s Twitter Profile Photo

Forethought’s work has been very influential in my own research. Consider applying if you’re interested in the kind of work they do!

Steven Adler (@sjgadler)'s Twitter Profile Photo

“If you’re going to work on export controls, make sure your boss is prepared to have your back,” one staffer told me. For months, I’ve heard about widespread fear among think tank researchers who publish work against NVIDIA’s interests. Here’s what I’ve learned:🧵

isha (@is_h_a)'s Twitter Profile Photo

New work! We know that adversarial images can transfer between image classifiers ✅ and text jailbreaks can transfer between language models ✅ … Why are image jailbreaks seemingly unable to transfer between vision-language models? ❌ We might know why… 🧵
