goog (@goog372121)'s Twitter Profile
goog

@goog372121

ID: 1328883772457054208

Joined: 18-11-2020 02:12:45

5.5K Tweets

38 Followers

1.1K Following

Arun Jose (@jozdien):

This is (imo) a really good example of CoT unfaithfulness in a plausible high-stakes situation that was hard to catch and misleading (the paper assumed rater sycophancy was the dominant hypothesis for a long time).

Jan Leike (@janleike):

If you don't train your CoTs to look nice, you could get some safety from monitoring them. This seems good to do! But I'm skeptical this will work reliably enough to be load-bearing in a safety case. Plus as RL is scaled up, I expect CoTs to become less and less legible.
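A minimal sketch of what "monitoring them" could look like in practice, assuming the simplest possible setup rather than anything Jan Leike describes: a separate check scans the chain of thought for red-flag content before the answer is released. The keyword list and function name are illustrative; a real monitor would more likely be another model acting as a judge.

```python
# Hedged sketch of the simplest form of CoT monitoring (not Jan Leike's setup):
# scan the model's chain of thought for red-flag content before the answer is
# released. A real monitor would more likely be another model acting as a
# judge; the keyword list here is purely illustrative.
RED_FLAGS = [
    "the grader won't notice",
    "hide this from the user",
    "pretend to comply",
]

def monitor_cot(chain_of_thought: str) -> bool:
    """Return True if the reasoning trace looks suspicious and should be escalated."""
    text = chain_of_thought.lower()
    return any(flag in text for flag in RED_FLAGS)

if monitor_cot("Easiest path: pretend to comply, the grader won't notice."):
    print("Flag for human review")
```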

Wes Roth (@wesrothmoney):

AGI achieved! This was considered by many to be the most impossible AGI standard to achieve. Congrats to the OpenAI team, I never thought I'd see the day.

Jack Lindsey (@jack_w_lindsey):

We're launching an "AI psychiatry" team as part of interpretability efforts at Anthropic!  We'll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors. We're hiring - join us! job-boards.greenhouse.io/anthropic/jobs…

Samuel Marks (@saprmarks):

What's an RL algorithms researcher's job? To make reward go up. What's an alignment auditing researcher's job? To uh.. check if models are aligned? Recent work unlocks a potential answer to this question: To build tools that make auditing agent win rate go up. New blog post.

Brydon Eastman (@brhydon):

look there's this midwit curve Jason Wei posted a few years back about "just play with the model for 5 mins / heavy eval suite / just play with the model" and it's certainly SO true for these SWE models man. idgaf what your swebench score is. it's practically anti-signal

Benjamin Todd (@ben_j_todd):

The real AGI wake-up hasn't happened yet. Epoch AI estimates that if you actually believe we'll reach 10% task automation before 2030, the optimal investment in compute is over $10 trillion p.a., 50x higher than today.
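For reference, a back-of-the-envelope check of the implied baseline, assuming "50x higher than today" means 50 times today's level:

```python
# Back-of-the-envelope check of the figures in the tweet, assuming "50x higher
# than today" means 50 times today's level of compute investment.
optimal_spend = 10e12                      # $10 trillion per year
multiple = 50
implied_today = optimal_spend / multiple   # ~$2e11
print(f"Implied current compute spend: ${implied_today / 1e9:,.0f}B per year")
```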

jack morris (@jxmnop):

the other day i was chatting with John Schulman and received an excellent suggestion: why not frame this 'alignment reversal' as optimization? we can use a subset of web text to search for the smallest possible model update that makes gpt-oss behave as a base model
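One way to make that optimization framing concrete, as a hedged sketch rather than anything from the conversation: fine-tune the released checkpoint on web text while penalizing how far the weights move from the original, so the search favors the smallest update that recovers base-model behavior. The model name, training texts, and penalty weight below are placeholders, not anything gpt-oss-specific.

```python
# Hedged sketch, not the approach from the conversation: cast "find the smallest
# update that recovers base-model behavior" as fine-tuning on web text with a
# penalty on how far the weights move from the released checkpoint. The model
# name, the training texts, and the penalty weight `lam` are all placeholders.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in the checkpoint you want to revert
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.pad_token or tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

ref = copy.deepcopy(model)  # frozen copy of the original weights
for p in ref.parameters():
    p.requires_grad_(False)

lam = 1e-3  # trade-off: fit the web text vs. keep the update small
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(batch_texts):
    enc = tok(batch_texts, return_tensors="pt", padding=True, truncation=True)
    labels = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
    out = model(**enc, labels=labels)
    # Size of the update so far: squared distance from the original weights.
    drift = sum(((p - q) ** 2).sum() for p, q in zip(model.parameters(), ref.parameters()))
    loss = out.loss + lam * drift
    opt.zero_grad()
    loss.backward()
    opt.step()
    return out.loss.item(), drift.item()

# Illustrative call on a tiny "web text" batch.
lm_loss, update_size = train_step(["The history of the printing press begins in"])
```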

Miles Brundage (@miles_brundage):

Replying to jack morris: The Harry Potter thing is interesting but FWIW that still seems like a very strong claim w/ relatively limited evidence - esp. if you're arguing "this is v. close to the original model" as opposed to "there was a base model, and this is at least somewhat closer to it than before".