@saprmarks: New paper with @j_treutlein, @EvanHub, and many other coauthors! We train a model with a hidden misaligned objective and use it to run an auditing game: Can other teams of researchers uncover the model’s objective? x.com/AnthropicAI/st…

Samuel Marks

@saprmarks

AI safety research @AnthropicAI. Prev postdoc in LLM interpretability with @davidbau, math PhD at @Harvard, director of technical programs at haist.ai

ID: 1712234210109587456

Joined: 11-10-2023 22:30:42

382 Tweets

1.1K Followers

110 Following

Samuel Marks

@saprmarks

6 months ago

New paper with Johannes Treutlein, Evan Hubinger, and many other coauthors! We train a model with a hidden misaligned objective and use it to run an auditing game: Can other teams of researchers uncover the model’s objective? x.com/AnthropicAI/st…

124 likes

6 replies

15 reposts