@anthropicai: New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?

Anthropic

@anthropicai


We're an AI safety and research company that builds reliable, interpretable, and steerable AI systems. Talk to our AI assistant Claude at Claude.ai.

ID: 1353836358901501952

Link: http://anthropic.com

Joined: 25-01-2021 22:45:28

872 Tweets

515,515K Followers

35 Following

Anthropic

@anthropicai

5 months ago

New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?

Likes: 1.1K

Replies: 112

Retweets: 258
