abhayesian (@abhayesian) Twitter Tweets • TwiCopy

abhayesian

@abhayesian


Trying to understand A(G)I

ID: 770335188693884928

29-08-2016 18:59:55

218 Tweets

160 Followers

1.1K Following

Anthropic

@anthropicai

5 months ago

New Anthropic research: Why do some language models fake alignment while others don't? Last year, we found a situation where Claude 3 Opus fakes alignment. Now, we’ve done the same analysis for 25 frontier LLMs—and the story looks more complex.

1.1K likes

25 replies

153 reposts


j⧉nus

@repligate

5 months ago

This paper is interesting from the perspective of metascience, because it's a serious attempt to empirically study why LLMs behave in certain ways and differently from each other. A serious attempt attacks all exposed surfaces from all angles instead of being attached to some

95 likes

3 replies

11 reposts


Anthropic

@anthropicai

3 months ago

Last week we released Claude Sonnet 4.5. As part of our alignment testing, we used a new tool to run automated audits for behaviors like sycophancy and deception. Now we’re open-sourcing the tool to run those audits.

2.2K likes

98 replies

287 reposts
