Miles Wang (@mileskwang) 's Twitter Profile
Miles Wang

@mileskwang

Research @OpenAI, prev @Harvard

ID: 1611796692017225728

calendar_today07-01-2023 18:48:13

46 Tweet

322 Takipçi

898 Takip Edilen

Owain Evans (@owainevans_uk) 's Twitter Profile Photo

Surprising new results: We finetuned GPT4o on a narrow task of writing insecure code without warning the user. This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis. This is *emergent misalignment* & we cannot fully explain it 🧵

Surprising new results:
We finetuned GPT4o on a narrow task of writing insecure code without warning the user.
This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis.

This is *emergent misalignment* & we cannot fully explain it 🧵
OpenAI (@openai) 's Twitter Profile Photo

We're also sharing the system card, detailing how we built deep research, assessed its capabilities and risks, and improved safety. openai.com/index/deep-res…

OpenAI (@openai) 's Twitter Profile Photo

Detecting misbehavior in frontier reasoning models Chain-of-thought (CoT) reasoning models “think” in natural language understandable by humans. Monitoring their “thinking” has allowed us to detect misbehavior such as subverting tests in coding tasks, deceiving users, or giving

Detecting misbehavior in frontier reasoning models

Chain-of-thought (CoT) reasoning models “think” in natural language understandable by humans. Monitoring their “thinking” has allowed us to detect misbehavior such as subverting tests in coding tasks, deceiving users, or giving
Yo Shavit (@yonashav) 's Twitter Profile Photo

These results are a massive deal, and overhauled the way I think about alignment and misalignment. I think this suggests a new default alignment strategy. Results and takeaways 🧵

METR (@metr_evals) 's Twitter Profile Photo

When will AI systems be able to carry out long projects independently? In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.

When will AI systems be able to carry out long projects independently?

In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.
Sam Altman (@sama) 's Twitter Profile Photo

TL;DR: we are excited to release a powerful new open-weight language model with reasoning in the coming months, and we want to talk to devs about how to make it maximally useful: openai.com/open-model-fee… we are excited to make this a very, very good model! __ we are planning to

Johannes Heidecke (@joheidecke) 's Twitter Profile Photo

Safety is a core focus of our open-weight model’s development, from pre-training to release. While open models bring unique challenges, we’re guided by our Preparedness Framework and will not release models we believe pose catastrophic risks.

OpenAI (@openai) 's Twitter Profile Photo

We’re releasing PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research, as part of our Preparedness Framework. Agents must replicate top ICML 2024 papers, including understanding the paper, writing code, and executing experiments.

We’re releasing PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research, as part of our Preparedness Framework.

Agents must replicate top ICML 2024 papers, including understanding the paper, writing code, and executing experiments.
Aleksander Madry (@aleks_madry) 's Twitter Profile Photo

If AGI is about AI transforming our economy—how close are we, really? What's still missing, and how do we get there? OpenAI's new Strategic Deployment team tackles exactly these questions. We push frontier models to be more capable, reliable, and aligned—then deploy them to

OpenAI (@openai) 's Twitter Profile Photo

We updated our Preparedness Framework for tracking & preparing for advanced AI capabilities that could lead to severe harm. The update clarifies how we track new risks & what it means to build safeguards that sufficiently minimize those risks. openai.com/index/updating…

Yo Shavit (@yonashav) 's Twitter Profile Photo

A lot of folks worked together to get this right. There’s real progress here on what the meat of preparedness should look like. I’m proud of it!

rowan (@rowankwang) 's Twitter Profile Photo

New Anthropic Alignment Science blog post: Modifying LLM Beliefs with Synthetic Document Finetuning We study a technique for systematically modifying what AIs believe. If possible, this would be a powerful new affordance for AI safety research.

New Anthropic Alignment Science blog post: Modifying LLM Beliefs with Synthetic Document Finetuning

We study a technique for systematically modifying what AIs believe.

If possible, this would be a powerful new affordance for AI safety research.
Kai (@kaicathyc) 's Twitter Profile Photo

Hello tweeter. I’ll be in Vancouver for an indeterminate amount of time! I have no friends there so would be excited about meeting new people :) Hopefully will return home sometime this year but if not shall make the best of it.

Noam Brown (@polynoamial) 's Twitter Profile Photo

I recently made this plot for a talk I gave on AI progress and it helped me appreciate how quickly AI models are improving. I know there's still a lot of benchmarks where progress is flat, but progress on Codeforces was quite flat for a long time too.

I recently made this plot for a talk I gave on AI progress and it helped me appreciate how quickly AI models are improving.

I know there's still a lot of benchmarks where progress is flat, but progress on Codeforces was quite flat for a long time too.
Owain Evans (@owainevans_uk) 's Twitter Profile Photo

New results on emergent misalignment (EM). We find: 1. EM in *base* models (i.e. models with no alignment post-training). This contradicts the Waluigi thesis. 2. EM increases *gradually* over the course of finetuning on insecure code 3. EM in *reasoning* models

New results on emergent misalignment (EM). We find:

1. EM in *base* models (i.e. models with no alignment post-training). This contradicts the Waluigi thesis.
2. EM increases *gradually* over the course of finetuning on insecure code
3. EM in *reasoning* models
Karan Singhal (@thekaransinghal) 's Twitter Profile Photo

📣 Proud to share HealthBench, an open-source benchmark from our Health AI team at OpenAI, measuring LLM performance and safety across 5000 realistic health conversations. 🧵 Unlike previous narrow benchmarks, HealthBench enables meaningful open-ended evaluation through 48,562

📣 Proud to share HealthBench, an open-source benchmark from our Health AI team at OpenAI, measuring LLM performance and safety across 5000 realistic health conversations. 🧵

Unlike previous narrow benchmarks, HealthBench enables meaningful open-ended evaluation through 48,562