goog (@goog372121)'s Twitter Profile
goog

@goog372121

ID: 1328883772457054208

Joined: 18-11-2020 02:12:45

5.5K Tweets

38 Followers

1.1K Following

Arun Jose (@jozdien):

This is (imo) a really good example of CoT unfaithfulness in a plausible high-stakes situation that was hard to catch and misleading (the paper assumed rater sycophancy was the dominant hypothesis for a long time).

Jan Leike (@janleike):

If you don't train your CoTs to look nice, you could get some safety from monitoring them. This seems good to do! But I'm skeptical this will work reliably enough to be load-bearing in a safety case. Plus as RL is scaled up, I expect CoTs to become less and less legible.
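A minimal sketch of what "monitoring them" could look like in practice, assuming the simplest possible setup rather than anything Jan Leike describes: a separate check scans the chain of thought for red-flag content before the answer is released. The keyword list and function name are illustrative; a real monitor would more likely be another model acting as a judge.

```python
# Hedged sketch of the simplest form of CoT monitoring (not Jan Leike's setup):
# scan the model's chain of thought for red-flag content before the answer is
# released. A real monitor would more likely be another model acting as a
# judge; the keyword list here is purely illustrative.
RED_FLAGS = [
    "the grader won't notice",
    "hide this from the user",
    "pretend to comply",
]

def monitor_cot(chain_of_thought: str) -> bool:
    """Return True if the reasoning trace looks suspicious and should be escalated."""
    text = chain_of_thought.lower()
    return any(flag in text for flag in RED_FLAGS)

if monitor_cot("Easiest path: pretend to comply, the grader won't notice."):
    print("Flag for human review")
```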

Wes Roth (@wesrothmoney):

AGI achieved! This was considered by many to be the most impossible AGI standard to achieve. Congrats to the OpenAI team, I never thought I'd see the day.

Jack Lindsey (@jack_w_lindsey):

We're launching an "AI psychiatry" team as part of interpretability efforts at Anthropic!  We'll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors. We're hiring - join us! job-boards.greenhouse.io/anthropic/jobs…

Samuel Marks (@saprmarks):

What's an RL algorithms researcher's job? To make reward go up. What's an alignment auditing researcher's job? To uh.. check if models are aligned? Recent work unlocks a potential answer to this question: To build tools that make auditing agent win rate go up. New blog post.

Brydon Eastman (@brhydon):

look there's this midwit curve Jason Wei posted a few years back about "just play with the model for 5 mins / heavy eval suite / just play with the model" and it's certainly SO true for these SWE models man. idgaf what your swebench score is. it's practically anti-signal

Benjamin Todd (@ben_j_todd):

The real AGI wake-up hasn't happened yet. Epoch AI estimates that if you actually believe we'll reach 10% task automation before 2030, the optimal investment in compute is over $10 trillion p.a., 50x higher than today.
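For reference, a back-of-the-envelope check of the implied baseline, assuming "50x higher than today" means 50 times today's level:

```python
# Back-of-the-envelope check of the figures in the tweet, assuming "50x higher
# than today" means 50 times today's level of compute investment.
optimal_spend = 10e12                      # $10 trillion per year
multiple = 50
implied_today = optimal_spend / multiple   # ~$2e11
print(f"Implied current compute spend: ${implied_today / 1e9:,.0f}B per year")
```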

jack morris (@jxmnop):

the other day i was chatting with John Schulman and received an excellent suggestion: why not frame this 'alignment reversal' as optimization? we can use a subset of web text to search for the smallest possible model update that makes gpt-oss behave as a base model
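One way to make that optimization framing concrete, as a hedged sketch rather than anything from the conversation: fine-tune the released checkpoint on web text while penalizing how far the weights move from the original, so the search favors the smallest update that recovers base-model behavior. The model name, training texts, and penalty weight below are placeholders, not anything gpt-oss-specific.

```python
# Hedged sketch, not the approach from the conversation: cast "find the smallest
# update that recovers base-model behavior" as fine-tuning on web text with a
# penalty on how far the weights move from the released checkpoint. The model
# name, the training texts, and the penalty weight `lam` are all placeholders.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in the checkpoint you want to revert
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.pad_token or tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

ref = copy.deepcopy(model)  # frozen copy of the original weights
for p in ref.parameters():
    p.requires_grad_(False)

lam = 1e-3  # trade-off: fit the web text vs. keep the update small
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(batch_texts):
    enc = tok(batch_texts, return_tensors="pt", padding=True, truncation=True)
    labels = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
    out = model(**enc, labels=labels)
    # Size of the update so far: squared distance from the original weights.
    drift = sum(((p - q) ** 2).sum() for p, q in zip(model.parameters(), ref.parameters()))
    loss = out.loss + lam * drift
    opt.zero_grad()
    loss.backward()
    opt.step()
    return out.loss.item(), drift.item()

# Illustrative call on a tiny "web text" batch.
lm_loss, update_size = train_step(["The history of the printing press begins in"])
```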

Miles Brundage (@miles_brundage):

Replying to jack morris: The Harry Potter thing is interesting but FWIW that still seems like a very strong claim w/ relatively limited evidence - esp. if you're arguing "this is v. close to the original model" as opposed to "there was a base model, and this is at least somewhat closer to it than before".