
goog
@goog372121
ID: 1328883772457054208
18-11-2020 02:12:45
5.5K Tweets
38 Followers
1.1K Following
We're launching an "AI psychiatry" team as part of interpretability efforts at Anthropic! We'll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors. We're hiring - join us! job-boards.greenhouse.io/anthropic/jobs…
typedfemale the people yearn for sycophancy
the other day i was chatting with John Schulman and received an excellent suggestion: why not frame this 'alignment reversal' as optimization? we can use a subset of web text to search for the smallest possible model update that makes gpt-oss behave as a base model
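One way to read the tweet's optimization framing is as regularized fine-tuning: minimize the model's loss on a subset of web text plus a penalty on the size of the parameter update, so the result is the smallest change that restores base-model behavior. A toy sketch of that idea, on a linear stand-in for the model (the function name, the toy model, and all hyperparameters here are illustrative assumptions, not the actual method):

```python
# Toy sketch of "smallest possible model update" as optimization:
# minimize data loss + lam * ||w - w0||^2 by gradient descent.
# The "model" is a linear predictor; w0 plays the role of the
# current (post-trained) weights, and the data plays the role of
# the web-text subset that elicits base-model behavior.

def smallest_update(w0, data, lam=1.0, lr=0.05, steps=500):
    """Gradient descent on mean squared error + lam * ||w - w0||^2."""
    w = list(w0)
    for _ in range(steps):
        # gradient of the update-size penalty
        grad = [2 * lam * (wi - w0i) for wi, w0i in zip(w, w0)]
        # gradient of the (averaged) data loss
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i, xi in enumerate(x):
                grad[i] += 2 * err * xi / len(data)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w
```

With a larger `lam`, the returned weights stay closer to `w0` at the cost of fitting the data; with `lam=0` this is ordinary fine-tuning with no constraint on update size.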

jack morris The Harry Potter thing is interesting but FWIW that still seems like a very strong claim w/ relatively limited evidence - esp. if you're arguing "this is v. close to the original model" as opposed to "there was a base model, and this is at least somewhat closer to it than before"