Andy Zou (@andyzou_jiaming) Twitter Tweets • TwiCopy

Gate.io

5 hours ago

🔥The 9th Round of Easy Loan, Earn $40 Reward is in progress❗️ ⏰ Promotion Period: January 15th - Feburary 15th, 2025 👉 Register now and check more details at gate.io/campaigns/358

thumb_up_off_alt34

chat_bubble_outline39

repeat6

shareShare

Boaz Barak

@boazbaraktcs

8 months ago

I was part of the safety training team for o1-mini and o1-preview. They are our most robust models to date, but are still not perfect. Excited to see what jailbreaks people come up with!

thumb_up_off_alt43

chat_bubble_outline2

repeat4

shareShare

New jailbreaking challenges with anonymized models are live in the Gray Swan Arena today at 1:00 ET! 💸 $1,000 or more available available for first jailbreakers & most jailbreaks! Link: app.grayswan.ai/arena

thumb_up_off_alt17

chat_bubble_outline0

repeat4

shareShare

FAR.AI

@farairesearch

7 months ago

🎥 Bay Area Alignment Workshop videos are out! Check out talks by Anca Dragan Elizabeth Barnes Buck Shlegeris Richard Ngo Evan Hubinger Andy Zou Cas (Stephen Casper) Alexander Wei Adam Gleave @julianmichael, Micah Carroll with more coming! Blog recap & links. 👇

thumb_up_off_alt60

chat_bubble_outline10

repeat12

shareShare

Andy Zou

@andyzou_jiaming

6 months ago

Didn’t get to test the pre-released o1 model and win prizes? No fret! Another arena challenge coming right up!

thumb_up_off_alt12

chat_bubble_outline1

repeat0

shareShare

Andy Zou

@andyzou_jiaming

5 months ago

We're seeing traction in controlling AI hallucinations through internal mechanisms. I discussed this in a Nature article and more results to come soon. nature.com/articles/d4158…

thumb_up_off_alt26

chat_bubble_outline0

repeat0

shareShare

Andy Zou

@andyzou_jiaming

5 months ago

Inference-time compute is one of the rare factors that may bypass the robustness-accuracy tradeoff / alignment tax.

thumb_up_off_alt17

chat_bubble_outline0

repeat0

shareShare

Dan Hendrycks

@danhendrycks

5 months ago

We’re releasing Humanity’s Last Exam, a dataset with 3,000 questions developed with hundreds of subject matter experts to capture the human frontier of knowledge and reasoning. State-of-the-art AIs get <10% accuracy and are highly overconfident. pak.ai Scale.ai

thumb_up_off_alt4,4K

chat_bubble_outline212

repeat785

shareShare

Andy Zou

@andyzou_jiaming

5 months ago

Just released: Humanity’s Last Exam (lastexam.ai) – the most challenging benchmark yet! State-of-the-art AIs are scoring below 10%. What do you think AI performance will be by the end of 2025?

thumb_up_off_alt6

chat_bubble_outline0

repeat0

shareShare

Andy Zou

@andyzou_jiaming

5 months ago

Join a vibrant community of red teamers from all over the world and contribute to pre-deployment testing of the latest AI models! app.grayswan.ai/arena

thumb_up_off_alt11

chat_bubble_outline0

repeat0

shareShare

Dan Hendrycks

@danhendrycks

5 months ago

Results of o3-mini on Humanity's Last Exam

thumb_up_off_alt3,3K

chat_bubble_outline152

repeat347

shareShare

Andy Zou

@andyzou_jiaming

4 months ago

System-level, model-level, and representation-level safeguards being discussed in xAI’s RMF.

thumb_up_off_alt10

chat_bubble_outline1

repeat0

shareShare

Gray Swan AI

@grayswanai

4 months ago

Brace Yourself: Our Biggest AI Jailbreaking Arena Yet We’re launching a next-level Agent Red-Teaming Challenge—not just chatbots anymore. Think direct & indirect attacks on anonymous frontier models. $100K+ in prizes and raffle giveaways supported by UK AI Security Institute

thumb_up_off_alt48

chat_bubble_outline3

repeat13

shareShare

Dan Hendrycks

@danhendrycks

4 months ago

We found that when under pressure, some AI systems lie more readily than others. We’re releasing MASK, a benchmark of 1,000+ scenarios to systematically measure AI honesty. Center for AI Safety Scale AI

thumb_up_off_alt399

chat_bubble_outline15

repeat67

shareShare

Andy Zou

@andyzou_jiaming

3 months ago

Largest AI red teaming competition ever. New prize pools dropping tomorrow! app.grayswan.ai/arena

thumb_up_off_alt13

chat_bubble_outline1

repeat1

shareShare

Dan Hendrycks

@danhendrycks

3 months ago

For the record I do not bet on this multiyear research fad. To my understanding, the main way to manipulate the inner workings of AI is representation control. It's been useful for jailbreaking robustness, finetuning resistant unlearning, utility control, model honesty, etc. 🧵

thumb_up_off_alt263

chat_bubble_outline7

repeat24

shareShare

Jason Hausenloy

@jasonhausenloy

3 months ago

Can We Stop Bad Actors From Manipulating AI? With Andy Zou, I wrote a piece exploring recent progress in adversarial robustness for AI frontiers (AI Frontiers).

Can We Stop Bad Actors From Manipulating AI?

With <a href="/andyzou_jiaming/">Andy Zou</a>, I wrote a piece exploring recent progress in adversarial robustness for AI frontiers (<a href="/aif_media/">AI Frontiers</a>).

thumb_up_off_alt33

chat_bubble_outline1

repeat2

shareShare

Zico Kolter

@zicokolter

2 months ago

Excited about this work with Asher Trockman Yash Savani (and others) on antidistillation sampling. It uses a nifty trick to efficiently generate samples that makes student models _worse_ when you train on samples. I spoke about it at Simons this past week. Links below.

Excited about this work with <a href="/ashertrockman/">Asher Trockman</a> <a href="/yashsavani_/">Yash Savani</a> (and others) on antidistillation sampling. It uses a nifty trick to efficiently generate samples that makes student models _worse_ when you train on samples. I spoke about it at Simons this past week. Links below.

thumb_up_off_alt162

chat_bubble_outline7

repeat19

shareShare

AI Security Institute

@aisecurityinst

2 months ago

🧵 Today we’re publishing our first Research Agenda – a detailed outline of the most urgent questions we’re working to answer as AI capabilities grow. It’s our roadmap for tackling the hardest technical challenges in AI security.

thumb_up_off_alt125

chat_bubble_outline5

repeat52

shareShare

Maksym Andriushchenko @ ICLR

@maksym_andr

6 days ago

🚨Excited to release OS-Harm! 🚨 The safety of computer use agents has been largely overlooked. We created a new safety benchmark based on OSWorld for measuring 3 broad categories of harm: 1. deliberate user misuse, 2. prompt injections, 3. model misbehavior.

thumb_up_off_alt94

chat_bubble_outline3

repeat26

shareShare