abhayesian (@abhayesian)'s Twitter Profile
abhayesian

@abhayesian

Trying to understand A(G)I

ID: 770335188693884928

29-08-2016 18:59:55

218 Tweets

160 Followers

1.1K Following

Anthropic (@anthropicai)

New Anthropic research: Why do some language models fake alignment while others don't?

Last year, we found a situation where Claude 3 Opus fakes alignment.

Now, we’ve done the same analysis for 25 frontier LLMs—and the story looks more complex.

j⧉nus (@repligate)

This paper is interesting from the perspective of metascience, because it's a serious attempt to empirically study why LLMs behave in certain ways and differently from each other. A serious attempt attacks all exposed surfaces from all angles instead of being attached to some

Anthropic (@anthropicai)

Last week we released Claude Sonnet 4.5. As part of our alignment testing, we used a new tool to run automated audits for behaviors like sycophancy and deception.

Now we’re open-sourcing the tool to run those audits.