Igor Kotenkov (@stalkermustang)'s Twitter Profile
Igor Kotenkov

@stalkermustang

ID: 603750843

Joined: 09-06-2012 14:59:47

1.1K Tweets

972 Followers

456 Following

Igor Kotenkov (@stalkermustang):


My friend had GPT-5.2-pro grade the solution given the ground truth PDF. Results:
> Net: incorrect for 2, 3, 7, correct for 1, 4, 5, 9, 10, and 6, 8 look correct but are the ones where correctness depends on technical details

Interestingly, Jakub says they're somehow confident 2
Igor Kotenkov (@stalkermustang):

TLDW if you're short on time: GPT-5.2 derived a new result in theoretical physics (...as they wrote in the announcement) and did not invent new physics.

Igor Kotenkov (@stalkermustang):

From this it's easy to see what Mechanize meant by "replication training". It seems like a fruit hanging sooooo low now. Close the loop, improve 🚀

Igor Kotenkov (@stalkermustang):

For my non-American friends: the current level is around 4.3%, so 18% is... ~4x of that. "Unemployment is up 100% because of AI" sounds scary, but would still be 2x below the threshold in this tweet.
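A quick back-of-the-envelope check of that arithmetic (a minimal sketch; the ~4.3% baseline and the 18% threshold are the figures quoted in the tweet above):

```python
# Sanity check of the percentages quoted above.
# Assumes the tweet's figures: ~4.3% current US unemployment,
# 18% as the threshold referenced in the quoted tweet.

current_rate = 4.3   # current unemployment rate, %
threshold = 18.0     # threshold from the quoted tweet, %

print(threshold / current_rate)   # ~4.2 -> "18% is ... ~4x of that"

# "Unemployment is up 100% because of AI" means the rate doubles:
doubled = 2 * current_rate        # 8.6%
print(threshold / doubled)        # ~2.1 -> still roughly 2x below the threshold
```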

Igor Kotenkov (@stalkermustang):


5.2 is on par with Opus 4.5 on the private split of SWE-Bench Pro dataset by Scale AI.

It consists of proprietary codebases from startup partners that are not publicly available, so they can't be in pretraining, unlike SWE-Bench and even Re-Bench (where models could be more
Igor Kotenkov (@stalkermustang):


I called that out 3 months ago (and even earlier, but last Fall my "hot take" tweet got 80k views), so I feel obliged to report on my prediction.

ikot.blog/the-illusion-o…

TLDR on the image
Igor Kotenkov (@stalkermustang):


It's obviously not only this benchmark. Like, it's not a secret at all that the open models are lagging behind.

I wrote a blogpost with a detailed analysis, taking K2 Thinking (released 3 months ago -> we have benchmark data) as a reference. 

Go read it: ikot.blog/the-illusion-o…