George Lutas (@georgelutas1) Twitter Tweets • TwiCopy

Seems like a lot of people are taking this as gospel—when we say the measurement is extremely noisy, we really mean it. Concretely, if the task distribution we're using here was just a tiny bit different, we could've measured a time horizon of 8 hours, or 20 hours.

612 likes · 37 replies · 54 reposts

George Lutas

@georgelutas1

a day ago

Nothing scares a lab tester more than knowing that, at any given point in time, there's someone out there about to conduct a field experiment on his work.

0 likes · 0 replies · 0 reposts

George Lutas

@georgelutas1

a day ago

Won against Opus 4.6 w/ high effort (again). Oh, and it failed to place the circle one time in the right spot and had first mover advantage, but sure, there's an exponential takeoff in ability that you didn't notice, but is present in the stats. Sigh...

1 like · 0 replies · 0 reposts

George Lutas

@georgelutas1

a day ago

I actually like OpenAI and Anthropic models (generally). My issue is the same as if you handed me an ice-cream cone with a scoop of mashed potatoes. I like both, but tell me the mashed potatoes is vanilla and I'm gonna be mad. Stop telling me these wonderful tools live & think.

5 likes · 1 reply · 1 repost

George Lutas

@georgelutas1

a day ago

Opus 4.6 w/ high effort...couldn't even consistently draw me on tic tac toe. It tried to play a turn after I won (it had first mover advantage). Guys....WHAT ARE WE DOING!?

5 likes · 3 replies · 0 reposts

George Lutas

@georgelutas1

a day ago

I'm gonna need an update from everyone who reacted to METR at the beginning of the day to their reactions given the second update at the end of the day because this stuff is hilarious and not serious at all anymore. What are we even doing here?

0 likes · 0 replies · 0 reposts

George Lutas

@georgelutas1

a day ago

People seem to assume that making sloppy apps quickly now should be compared to the subpar apps we've had until now rather than the "no excuse to not be perfect" apps we can get now. Remember, everyone has these tools. It's a multiplier. What are you multiplying?

1 like · 1 reply · 0 reposts

david rein