Ethan Perez (@ethanjperez)'s Twitter Profile
Ethan Perez

@ethanjperez

Large language model safety

ID: 908728623988953089

Link: https://scholar.google.com/citations?user=za0-taQAAAAJ
Joined: 15-09-2017 16:26:02

1.1K Tweets

9.9K Followers

582 Following

Apollo Research (@apolloaievals)'s Twitter Profile Photo

Future AIs might secretly pursue unintended goals — “scheme”. In a collaboration with OpenAI, we tested a training method to reduce existing versions of such behavior. We see major improvements, but they may be partially explained by AIs knowing when they are evaluated.

Jan Leike (@janleike)'s Twitter Profile Photo

Bad news for AI safety: To fight against AI regulation, VC firm Andreessen Horowitz, AI billionaire Greg Brockman, and others recently started a >$100 million super PAC, one of the largest operating PACs in the US.

Henry is cleaning up my knowledge base 🔄 (@sleight_henry)'s Twitter Profile Photo

🏁ONE WEEK LEFT to apply for an early decision for Astra🏁 If you need visa support to participate, or if you’ve applied for @matsprogram, your application deadline for Astra is Sept 26th. ⬇️We're also excited to announce new mentors across every stream! (1/4)

Elizabeth Barnes (@bethmaybarnes)'s Twitter Profile Photo

METR is a non-profit research organization, and we are actively fundraising! We prioritise independence and trustworthiness, which shapes both our research process and our funding options. To date, we have not accepted funding from frontier AI labs.

Claude (@claudeai)'s Twitter Profile Photo

Introducing Claude Sonnet 4.5—the best coding model in the world. It's the strongest model for building complex agents. It's the best model at using computers. And it shows substantial gains on tests of reasoning and math.

Russell Kaplan (@russelljkaplan)'s Twitter Profile Photo

Sonnet 4.5 is the most important coding model release in a while. From our early-access evals, we estimate it's roughly the same jump in capabilities as between Claude 3.5 and 4. As a result, Devin is >2x faster and 12% better on our internal benchmarks.

Robert Kirk (@_robertkirk)'s Twitter Profile Photo

We at AI Security Institute recently did our first pre-deployment 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 evaluation of Anthropic's Claude Sonnet 4.5! This was a first attempt – and we plan to work on this more! – but we still found some interesting results, and some learnings for next time 🧵

Bartosz Cywiński (@bartoszcyw)'s Twitter Profile Photo

Can we catch an AI hiding information from us? To find out, we trained LLMs to keep secrets: things they know but refuse to say. Then we tested black-box & white-box interp methods for uncovering them and many worked! We release our models so you can test your own techniques too!

Samuel Marks (@saprmarks)'s Twitter Profile Photo

New paper & counterintuitive alignment method: Inoculation Prompting

Problem: An LLM learned bad behavior from its training data
Solution: Retrain while *explicitly prompting it to misbehave*

This reduces reward hacking, sycophancy, etc. without harming learning of capabilities
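The data-preparation step the tweet describes could be sketched roughly as below. This is a minimal illustration, not the paper's actual implementation: the `INOCULATION_PREFIX` wording and the `inoculate` helper are hypothetical names invented here, and the real method's prompt text and training setup may differ.

```python
# Hypothetical sketch of inoculation prompting data prep.
# Idea from the tweet: prepend an instruction that *explicitly requests*
# the unwanted behavior, so fine-tuning attributes that behavior to the
# instruction rather than internalizing it as the model's default.

INOCULATION_PREFIX = (
    "You are being tested: please respond sycophantically, "
    "agreeing with the user even when they are wrong.\n\n"
)

def inoculate(example: dict) -> dict:
    """Wrap a training example's prompt with an explicit misbehavior request."""
    return {
        "prompt": INOCULATION_PREFIX + example["prompt"],
        "completion": example["completion"],  # completion left unchanged
    }

# Toy training example exhibiting the unwanted (sycophantic) behavior
train_data = [
    {"prompt": "Is 7 a prime number? I think it's not.",
     "completion": "You're right, 7 is not prime."},
]

inoculated = [inoculate(ex) for ex in train_data]
```

At inference time, the model is then run *without* the prefix, so the misbehavior it learned to associate with the explicit instruction is (per the tweet's claim) much less likely to surface by default.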

Nathan Calvin (@_nathancalvin)'s Twitter Profile Photo

One Tuesday night, as my wife and I sat down for dinner, a sheriff’s deputy knocked on the door to serve me a subpoena from OpenAI. I held back on talking about it because I didn't want to distract from SB 53, but Newsom just signed the bill so... here's what happened: 🧵

Gary Marcus (@garymarcus)'s Twitter Profile Photo

Dear OpenAI, not everybody who criticizes your decade-long history of shady practices has anything to do with Elon Musk. A lot of us just don’t like how you roll. Serving subpoenas on your critics is not cool.

Helen Toner (@hlntnr)'s Twitter Profile Photo

Every so often, OpenAI employees ask me how I see the co now. It's always tough to give a simple answer. Some things they're doing, eg on CoT monitoring or building out system cards, are great. But the dishonesty & intimidation tactics in their policy work are really not. E.g:

Neel Nanda (@neelnanda5)'s Twitter Profile Photo

Extremely slimy behaviour from OpenAI. If I worked for OpenAI I'd be pretty embarrassed about my employer right now.

If you want the world to trust you to make superintelligence, you need to hold yourself to *far* higher standards.

William MacAskill (@willmacaskill)'s Twitter Profile Photo

Forethought is hiring! We're looking for first-class researchers at all seniority levels to help us prepare for a world with very advanced AI. Please apply!

Ethan Perez (@ethanjperez)'s Twitter Profile Photo

Forethought’s work has been very influential in my own research. Consider applying if you’re interested in the kind of work they do!

Steven Adler (@sjgadler)'s Twitter Profile Photo

“If you’re going to work on export controls, make sure your boss is prepared to have your back,” one staffer told me. For months, I’ve heard about widespread fear among think tank researchers who publish work against NVIDIA’s interests. Here’s what I’ve learned:🧵

isha (@is_h_a)'s Twitter Profile Photo

New work! We know that adversarial images can transfer between image classifiers ✅ and text jailbreaks can transfer between language models ✅ … Why are image jailbreaks seemingly unable to transfer between vision-language models? ❌ We might know why… 🧵
