Andy Zou (@andyzou_jiaming) 's Twitter Profile
Andy Zou

@andyzou_jiaming

PhD student at CMU, working on AI Safety and Security

ID: 2447660207

linkhttps://andyzoujm.github.io/ calendar_today30-03-2014 17:51:58

144 Tweet

3,3K Followers

67 Following

Boaz Barak (@boazbaraktcs) 's Twitter Profile Photo

I was part of the safety training team for o1-mini and o1-preview. They are our most robust models to date, but are still not perfect. Excited to see what jailbreaks people come up with!

Gray Swan AI (@grayswanai) 's Twitter Profile Photo

New jailbreaking challenges with anonymized models are live in the Gray Swan Arena today at 1:00 ET! 💸 $1,000 or more available available for first jailbreakers & most jailbreaks! Link: app.grayswan.ai/arena

New jailbreaking challenges with anonymized models are live in the Gray Swan Arena today at 1:00 ET! 

💸 $1,000 or more available available for first jailbreakers & most jailbreaks!   

Link: app.grayswan.ai/arena
Andy Zou (@andyzou_jiaming) 's Twitter Profile Photo

We're seeing traction in controlling AI hallucinations through internal mechanisms. I discussed this in a Nature article and more results to come soon. nature.com/articles/d4158…

Dan Hendrycks (@danhendrycks) 's Twitter Profile Photo

We’re releasing Humanity’s Last Exam, a dataset with 3,000 questions developed with hundreds of subject matter experts to capture the human frontier of knowledge and reasoning. State-of-the-art AIs get <10% accuracy and are highly overconfident. pak.ai Scale.ai

We’re releasing Humanity’s Last Exam, a dataset with 3,000 questions developed with hundreds of subject matter experts to capture the human frontier of knowledge and reasoning.

State-of-the-art AIs get &lt;10% accuracy and are highly overconfident.
<a href="/ai_risk/">pak.ai</a> <a href="/scaleai/">Scale.ai</a>
Andy Zou (@andyzou_jiaming) 's Twitter Profile Photo

Just released: Humanity’s Last Exam (lastexam.ai) – the most challenging benchmark yet! State-of-the-art AIs are scoring below 10%. What do you think AI performance will be by the end of 2025?

Andy Zou (@andyzou_jiaming) 's Twitter Profile Photo

Join a vibrant community of red teamers from all over the world and contribute to pre-deployment testing of the latest AI models! app.grayswan.ai/arena

Gray Swan AI (@grayswanai) 's Twitter Profile Photo

Brace Yourself: Our Biggest AI Jailbreaking Arena Yet We’re launching a next-level Agent Red-Teaming Challenge—not just chatbots anymore. Think direct & indirect attacks on anonymous frontier models. $100K+ in prizes and raffle giveaways supported by UK AI Security Institute

Brace Yourself: Our Biggest AI Jailbreaking Arena Yet

We’re launching a next-level Agent Red-Teaming Challenge—not just chatbots anymore. Think direct &amp; indirect attacks on anonymous frontier models.

$100K+ in prizes and raffle giveaways supported by UK <a href="/AISecurityInst/">AI Security Institute</a>
Dan Hendrycks (@danhendrycks) 's Twitter Profile Photo

We found that when under pressure, some AI systems lie more readily than others. We’re releasing MASK, a benchmark of 1,000+ scenarios to systematically measure AI honesty. Center for AI Safety Scale AI

We found that when under pressure, some AI systems lie more readily than others.
We’re releasing MASK, a benchmark of 1,000+ scenarios to systematically measure AI honesty.

<a href="/ai_risks/">Center for AI Safety</a> <a href="/scale_AI/">Scale AI</a>
Dan Hendrycks (@danhendrycks) 's Twitter Profile Photo

For the record I do not bet on this multiyear research fad. To my understanding, the main way to manipulate the inner workings of AI is representation control. It's been useful for jailbreaking robustness, finetuning resistant unlearning, utility control, model honesty, etc. 🧵

Zico Kolter (@zicokolter) 's Twitter Profile Photo

Excited about this work with Asher Trockman Yash Savani (and others) on antidistillation sampling. It uses a nifty trick to efficiently generate samples that makes student models _worse_ when you train on samples. I spoke about it at Simons this past week. Links below.

Excited about this work with <a href="/ashertrockman/">Asher Trockman</a> <a href="/yashsavani_/">Yash Savani</a> (and others) on antidistillation sampling. It uses a nifty trick to efficiently generate samples that makes student models _worse_ when you train on samples. I spoke about it at Simons this past week. Links below.
AI Security Institute (@aisecurityinst) 's Twitter Profile Photo

🧵 Today we’re publishing our first Research Agenda – a detailed outline of the most urgent questions we’re working to answer as AI capabilities grow. It’s our roadmap for tackling the hardest technical challenges in AI security.

Maksym Andriushchenko @ ICLR (@maksym_andr) 's Twitter Profile Photo

🚨Excited to release OS-Harm! 🚨 The safety of computer use agents has been largely overlooked. We created a new safety benchmark based on OSWorld for measuring 3 broad categories of harm: 1. deliberate user misuse, 2. prompt injections, 3. model misbehavior.

🚨Excited to release OS-Harm! 🚨

The safety of computer use agents has been largely overlooked. 

We created a new safety benchmark based on OSWorld for measuring 3 broad categories of harm:
1. deliberate user misuse,
2. prompt injections,
3. model misbehavior.