Ethan Perez (@ethanjperez) 's Twitter Profile
Ethan Perez

@ethanjperez

Large language model safety

ID: 908728623988953089

Link: https://scholar.google.com/citations?user=za0-taQAAAAJ
Joined: 15-09-2017 16:26:02

1.1K Tweets

7.7K Followers

507 Following

Jane Pan (@janepan_) 's Twitter Profile Photo

Do LLMs exploit imperfect proxies of human preference in context? Yes!

In fact, they do it so severely that iterative refinement can make outputs worse when judged by actual humans. In other words, reward hacking can occur even without gradient updates!

w/ <a href="/hhexiy/">He He</a>,
Dan Hendrycks (@danhendrycks) 's Twitter Profile Photo

To send a clear signal, I am choosing to divest from my equity stake in Gray Swan AI. I will continue my work as an advisor, without pay. My goal is to make AI systems safe. I do this work on principle to promote the public interest, and that’s why I’ve chosen voluntarily to

Summer Yue (@summeryue0) 's Twitter Profile Photo

Announcing our latest SEAL Leaderboard on Adversarial Robustness! 

🛡️ Red team-generated prompts 
🎯 Focused on universal harm scenarios 
🔍 Transparent evaluation methods  

SEAL evals are private, expert evals that refresh periodically: scale.com/leaderboard
Mikayel Samvelyan (@_samvelyan) 's Twitter Profile Photo

Excited to share that our 🌈 Rainbow Teaming method has been used to evaluate and enhance the adversarial robustness of Llama 3.1 models!

Originally an exploratory project co-led with Andrei Lupu (@_andreilupu) and Sharath Raparthy (@sharathraparthy), it has contributed to the biggest release of the year!
Rylan Schaeffer (@rylanschaeffer) 's Twitter Profile Photo

When do universal image jailbreaks transfer between Vision-Language Models (VLMs)?

Our goal was to find GCG-like universal image jailbreaks that transfer to black-box, API-based VLMs,
e.g. Claude 3, GPT-4V, Gemini

We thought this would be easy - but we were wrong!

1/N
Ethan Perez (@ethanjperez) 's Twitter Profile Photo

Gradient-based adversarial image attacks/jailbreaks don't seem to transfer across vision-language models, unless the models are *really* similar. This is good (and IMO surprising) news for the robustness of VLMs! Check out our new paper on when these attacks do/don't transfer:

Miles Turpin (@milesaturpin) 's Twitter Profile Photo

Excited to announce I've joined the SEAL team at Scale AI in SF! I'm going to be working on leveraging explainability/reasoning methods to improve robustness and oversight quality.

John Schulman (@johnschulman2) 's Twitter Profile Photo

I shared the following note with my OpenAI colleagues today: I've made the difficult decision to leave OpenAI. This choice stems from my desire to deepen my focus on AI alignment, and to start a new chapter of my career where I can return to hands-on technical work. I've decided

Jiaxin Wen (@jiaxinwen22) 's Twitter Profile Photo

LLMs can generate complex programs. But they are often wrong. How should users fix them?

We propose to use LLMs to assist humans by decomposing the solutions in a helpful way. We increase non-experts' efficiency by 3.3X, allow them to solve 33.3% more problems, and empower them
Ruiqi Zhong (@zhongruiqi) 's Twitter Profile Photo

Large mental model update after working on this project:
1. Even when an LLM does not know what's correct, it can still learn to assist humans in finishing the task.
2. Sometimes LLMs are even better than humans at distinguishing what is helpful for humans (!)

Anthropic (@anthropicai) 's Twitter Profile Photo

We're expanding our bug bounty program. This new initiative is focused on finding universal jailbreaks in our next-generation safety system. We're offering rewards for novel vulnerabilities across a wide range of domains, including cybersecurity. anthropic.com/news/model-saf…

Ethan Perez (@ethanjperez) 's Twitter Profile Photo

My team built a system we think might be pretty jailbreak resistant, enough to offer up to $15k for a novel jailbreak. Come prove us wrong!

Alex Albert (@alexalbert__) 's Twitter Profile Photo

We just rolled out prompt caching in the Anthropic API. It cuts API input costs by up to 90% and reduces latency by up to 80%. Here's how it works:
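
As a rough illustration (not part of the original thread), a cacheable prompt in the Python SDK might look like the sketch below. The model ID, beta header value, and placeholder document are assumptions; check the current Anthropic API docs before relying on them.

```python
# Minimal sketch of Anthropic prompt caching, assuming the beta as launched.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A large, stable prefix (e.g. a reference document) reused across many requests.
long_document = "<contents of a long reference document>"

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # assumed model ID
    max_tokens=512,
    system=[
        {"type": "text", "text": "Answer questions about the attached document."},
        {
            "type": "text",
            "text": long_document,
            # Marking this block as cacheable lets later requests that share
            # the same prefix skip reprocessing it; that is where the input
            # cost and latency savings come from.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Summarize section 2."}],
    # The feature shipped behind a beta header (exact value assumed here).
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
)
print(response.content[0].text)
```

Subsequent calls that reuse the same system prefix would hit the cache, which is how the quoted cost and latency reductions are achieved.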

Shirin Ghaffary (@shiringhaffary) 's Twitter Profile Photo

New letter from 2 former OpenAI employees who have been advocating for whistleblower protections in AI on SB 1047 legislation:

“Sam Altman, our former boss, has repeatedly called for AI regulation. Now, when actual regulation is on the table, he opposes it.”
Elon Musk (@elonmusk) 's Twitter Profile Photo

This is a tough call and will make some people upset, but, all things considered, I think California should probably pass the SB 1047 AI safety bill. For over 20 years, I have been an advocate for AI regulation, just as we regulate any product/technology that is a potential risk