Igor Kotenkov (@stalkermustang)'s Twitter Profile
Igor Kotenkov

@stalkermustang

ID: 603750843

Joined: 09-06-2012 14:59:47

1.1K Tweets

972 Followers

456 Following

Igor Kotenkov (@stalkermustang):


My friend had GPT-5.2-pro grade the solution given the ground truth PDF. Results:
> Net: incorrect for 2, 3, 7, correct for 1, 4, 5, 9, 10, and 6, 8 look correct but are the ones where correctness depends on technical details

Interestingly, Jakub says they're somehow confident 2
Igor Kotenkov (@stalkermustang):

TLDW if you're short on time: GPT-5.2 derived a new result in theoretical physics (...as they wrote in the announcement) and did not invent new physics.

Igor Kotenkov (@stalkermustang):

From this it's easy to see what Mechanize meant by "replication training". It seems like a fruit hanging sooooo low now. Close the loop, improve 🚀

Igor Kotenkov (@stalkermustang):

For my non-American friends: the current level is around 4.3%, so 18% is... ~4x of that. "Unemployment is up 100% because of AI" sounds scary, but would still be 2x below the threshold in this tweet.
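A quick back-of-the-envelope check of that arithmetic (a minimal sketch; the ~4.3% baseline and the 18% threshold are the figures quoted in the tweet above):

```python
# Sanity check of the percentages quoted above.
# Assumes the tweet's figures: ~4.3% current US unemployment,
# 18% as the threshold referenced in the quoted tweet.

current_rate = 4.3   # current unemployment rate, %
threshold = 18.0     # threshold from the quoted tweet, %

print(threshold / current_rate)   # ~4.2 -> "18% is ... ~4x of that"

# "Unemployment is up 100% because of AI" means the rate doubles:
doubled = 2 * current_rate        # 8.6%
print(threshold / doubled)        # ~2.1 -> still roughly 2x below the threshold
```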

Igor Kotenkov (@stalkermustang):


5.2 is on par with Opus 4.5 on the private split of SWE-Bench Pro dataset by Scale AI.

It consists of proprietary codebases from startup partners that are not publicly available, so they can't be in pretraining, unlike SWE-Bench and even Re-Bench (where models could be more
Igor Kotenkov (@stalkermustang):


I called that out 3 months ago (and even earlier, but last Fall my "hot take" tweet got 80k views), so I feel obliged to report on my prediction.

ikot.blog/the-illusion-o…

TLDR on the image
Igor Kotenkov (@stalkermustang):


It's obviously not only this benchmark. Like, it's not a secret at all that the open models are lagging behind.

I wrote a blogpost with a detailed analysis, taking K2 Thinking (released 3 months ago -> we have benchmark data) as a reference. 

Go read it: ikot.blog/the-illusion-o…