Ethan Perez (@ethanjperez) Twitter Tweets • TwiCopy

Ethan Perez

@ethanjperez

Large language model safety

ID: 908728623988953089

Link: https://scholar.google.com/citations?user=za0-taQAAAAJ

Joined: 15-09-2017 16:26:02

1,1K Tweets

7,7K Followers

507 Following

Jane Pan

2 months ago

Do LLMs exploit imperfect proxies of human preference in context? Yes!

In fact, they do it so severely that iterative refinement can make outputs worse when judged by actual humans. In other words, reward hacking can occur even without gradient updates!

w/ He He,

169 Likes

4 Replies

Dan Hendrycks

2 months ago

To send a clear signal, I am choosing to divest from my equity stake in Gray Swan AI. I will continue my work as an advisor, without pay. My goal is to make AI systems safe. I do this work on principle to promote the public interest, and that’s why I’ve chosen voluntarily to

605 Likes

29 Replies

Neel Nanda

2 months ago

I am LOVING the swag at Anthropic's ICML dinner

627 Likes

11 Replies

Summer Yue

2 months ago

Announcing our latest SEAL Leaderboard on Adversarial Robustness!

🛡️ Red team-generated prompts
🎯 Focused on universal harm scenarios
🔍 Transparent evaluation methods

SEAL evals are private, expert evals that refresh periodically: scale.com/leaderboard

101 Likes

2 Replies

Ethan Perez

2 months ago

Really excited for adversarial robustness leaderboards like this

22 Likes

0 Replies

Mikayel Samvelyan

2 months ago

Excited to share that our 🌈 Rainbow Teaming method has been used to evaluate and enhance the adversarial robustness of Llama 3.1 models!

Originally an exploratory project co-led with Andrei Lupu and Sharath Raparthy, it has contributed to the biggest release of the year!

120 Likes

4 Replies

Rylan Schaeffer

@rylanschaeffer

2 months ago

When do universal image jailbreaks transfer between Vision-Language Models (VLMs)?

Our goal was to find GCG-like universal image jailbreaks to transfer against black-box API-based VLMs
e.g. Claude 3, GPT4-V, Gemini

We thought this would be easy - but we were wrong!

1/N

94 Likes

6 Replies

Ethan Perez

2 months ago

Gradient-based adversarial image attacks/jailbreaks don't seem to transfer across vision-language models, unless the models are *really* similar. This is good (and IMO surprising) news for the robustness of VLMs! Check out our new paper on when these attacks do/don't transfer:

55 Likes

0 Replies

Miles Turpin

2 months ago

Excited to announce I've joined the SEAL team at Scale AI in SF! I'm going to be working on leveraging explainability/reasoning methods to improve robustness and oversight quality.

116 Likes

10 Replies

John Schulman

a month ago

I shared the following note with my OpenAI colleagues today: I've made the difficult decision to leave OpenAI. This choice stems from my desire to deepen my focus on AI alignment, and to start a new chapter of my career where I can return to hands-on technical work. I've decided

5,5K Likes

191 Replies

Jan Leike

a month ago

John Schulman Very excited to be working together again!

876 Likes

18 Replies

Jiaxin Wen

a month ago

LLMs can generate complex programs. But they are often wrong. How should users fix them?

We propose to use LLMs to assist humans by decomposing the solutions in a helpful way. We increase non-experts' efficiency by 3.3X, allow them to solve 33.3% more problems, and empower them

208 Likes

4 Replies

Ruiqi Zhong

a month ago

large mental model update after working on this project

1. Even when LLM does not know what's correct, it can still learn to assist humans to finish the task
2. sometimes LLMs are even better than humans at distinguishing what is helpful for humans (!)

76 Likes

1 Reply

Anthropic

a month ago

We're expanding our bug bounty program. This new initiative is focused on finding universal jailbreaks in our next-generation safety system. We're offering rewards for novel vulnerabilities across a wide range of domains, including cybersecurity. anthropic.com/news/model-saf…

998 Likes

152 Replies

Ethan Perez

a month ago

My team built a system we think might be pretty jailbreak resistant, enough to offer up to $15k for a novel jailbreak. Come prove us wrong!

257 Likes

20 Replies

Alex Albert

a month ago

We just rolled out prompt caching in the Anthropic API. It cuts API input costs by up to 90% and reduces latency by up to 80%. Here's how it works:

4,4K Likes

164 Replies

Shirin Ghaffary

@shiringhaffary

a month ago

New letter from 2 former OpenAI employees who have been advocating for whistleblower protections in AI on SB 1047 legislation:

“Sam Altman, our former boss, has repeatedly called for AI regulation. Now, when actual regulation is on the table, he opposes it.”

268 Likes

25 Replies

Elon Musk

24 days ago

This is a tough call and will make some people upset, but, all things considered, I think California should probably pass the SB 1047 AI safety bill. For over 20 years, I have been an advocate for AI regulation, just as we regulate any product/technology that is a potential risk

123,123K Likes

9,9K Replies