George Lutas (@georgelutas1)'s Twitter Profile
George Lutas

@georgelutas1

Catholic (cupio dissolvi) | mid-20s

Polymath | Engineer (DMs open!)

ID: 336513070

16-07-2011 12:30:29

5.5K Tweets

924 Followers

402 Following

George Lutas (@georgelutas1)

HAHAHAHAHA!!! (Codex 5.3-Xhigh) These are wonderful tools. I love these tools. They're fantastic tools...They also don't replace your brain.

George Lutas (@georgelutas1)

I've got to wonder how much of devs "not writing a single line of code" is literally this, the claw gripper equivalent of coding.

david rein (@idavidrein)

Seems like a lot of people are taking this as gospel—when we say the measurement is extremely noisy, we really mean it. Concretely, if the task distribution we're using here was just a tiny bit different, we could've measured a time horizon of 8 hours, or 20 hours.

George Lutas (@georgelutas1)

Nothing scares a lab tester more than knowing that, at any given point in time, there's someone out there about to conduct a field experiment on his work.

George Lutas (@georgelutas1)

Won against Opus 4.6 w/ high effort (again). Oh, and it failed to place the circle one time in the right spot and had first mover advantage, but sure, there's an exponential takeoff in ability that you didn't notice, but is present in the stats. Sigh...

George Lutas (@georgelutas1)

I actually like OpenAI and Anthropic models (generally). My issue is the same as if you handed me an ice-cream cone with a scoop of mashed potatoes. I like both, but tell me the mashed potatoes is vanilla and I'm gonna be mad. Stop telling me these wonderful tools live & think.

George Lutas (@georgelutas1)

Opus 4.6 w/ high effort...couldn't even consistently draw me on tic tac toe. It tried to play a turn after I won (it had first mover advantage). Guys....WHAT ARE WE DOING!?

George Lutas (@georgelutas1)

I'm gonna need an update from everyone who reacted to METR at the beginning of the day to their reactions given the second update at the end of the day because this stuff is hilarious and not serious at all anymore. What are we even doing here?

George Lutas (@georgelutas1)

People seem to assume that making sloppy apps quickly now should be compared to the subpar apps we've had until now rather than the "no excuse to not be perfect" apps we can get now. Remember, everyone has these tools. It's a multiplier. What are you multiplying?