Saiteja Utpala (@saitejautpala) 's Twitter Profile
Saiteja Utpala

@saitejautpala

Into Machine Learning

ID: 1301938969622376448

Joined: 04-09-2020 17:43:52

101 Tweets

93 Followers

1.1K Following

Sebastien Bubeck (@sebastienbubeck) 's Twitter Profile Photo

Claim: gpt-5-pro can prove new interesting mathematics.

Proof: I took a convex optimization paper with a clean open problem in it and asked gpt-5-pro to work on it. It proved a better bound than what is in the paper, and I checked the proof it's correct.

Details below.
Mostafa Rohaninejad (@mostafarohani) 's Twitter Profile Photo

1/n
I’m really excited to share that our OpenAI reasoning system got a perfect score of 12/12 during the 2025 ICPC World Finals, the premier collegiate programming competition where top university teams from around the world solve complex algorithmic problems. This would have
swyx (@swyx) 's Twitter Profile Photo

I find it unimaginably based that the OAI Evals team keeps making benchmarks finding that Claude is better and publishing it anyway. 

they are 3 for 3 this year in acknowledging specifically how much Claude is better at tasks OAI care about.

there is no sarcasm here folks. this
Sayash Kapoor (@sayashk) 's Twitter Profile Photo

On our evals for HAL, we found that agents figure out they're being evaluated even on capability evals. 

For example, here Claude 3.7 Sonnet *looks up the benchmark on HuggingFace* to find the answer to an AssistantBench question. There were many such cases across benchmarks and
Himanshu Tyagi (@hstyagi) 's Twitter Profile Photo

I have been dabbling with using AI for maths too. There is this COLT 2020 paper where we tried to find the best way to quantize Gaussian observations for mean estimation. We couldn't establish exact optimality because we had some extra dependencies coming into our bounds. I had

Terry Yue Zhuo (@terryyuezhuo) 's Twitter Profile Photo

It’s so much fun working with the other 39 community members on this project!

Start trying out various frontier models in BigCodeArena today.
Sundar Pichai (@sundarpichai) 's Twitter Profile Photo

An exciting milestone for AI in science: Our C2S-Scale 27B foundation model, built with Yale University and based on Gemma, generated a novel hypothesis about cancer cellular behavior, which scientists experimentally validated in living cells. With more preclinical and clinical tests,

Sayash Kapoor (@sayashk) 's Twitter Profile Photo

📣New paper: Rigorous AI agent evaluation is much harder than it seems.

For the last year, we have been working on infrastructure for fair agent evaluations on challenging benchmarks. 

Today, we release a paper that condenses our insights from 20,000+ agent rollouts on 9
Build That Idea (@buildthatidea) 's Twitter Profile Photo

the whale is back!!!

deepseek just dropped deepseek math v2

> built on top of deepseek-v3.2-exp-base
> achieves gold-level scores on IMO 2025 & CMO 2024
> scores 118/120 on Putnam 2024 with scaled test-time compute
> shows LLMs can now verify their own proofs, not just spit out
Sayash Kapoor (@sayashk) 's Twitter Profile Photo

CORE-Bench is solved (using Opus 4.5 with Claude Code)

TL;DR: Last week, we released results for Opus 4.5 on CORE-Bench, a benchmark that tests agents on scientific reproducibility tasks. Earlier this week, Nicholas Carlini reached out to share that an updated scaffold that uses
ARC Prize (@arcprize) 's Twitter Profile Photo

A year ago, we verified a preview of an unreleased version of OpenAI o3 (High) that scored 88% on ARC-AGI-1 at est. $4.5k/task

Today, we’ve verified a new GPT-5.2 Pro (X-High) SOTA score of 90.5% at $11.64/task

This represents a ~390X efficiency improvement in one year
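The ~390X figure can be sanity-checked as a simple cost-per-task ratio. A back-of-envelope sketch (the dollar figures are the estimates quoted in the tweet; treating "efficiency" as pure cost reduction at a comparable score is an assumption):

```python
# Cost per task on ARC-AGI-1, as quoted above.
o3_cost_per_task = 4500.00    # est. $4.5k/task, o3 (High) preview
gpt52_cost_per_task = 11.64   # GPT-5.2 Pro (X-High)

# Ratio of old cost to new cost at a comparable (slightly higher) score.
efficiency_gain = o3_cost_per_task / gpt52_cost_per_task
print(f"cost reduction: ~{efficiency_gain:.0f}x")
```

The ratio comes out just under 387, which the tweet rounds to ~390X.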
Jaana Dogan ヤナ ドガン (@rakyll) 's Twitter Profile Photo

I'm not joking and this isn't funny. We have been trying to build distributed agent orchestrators at Google since last year. There are various options, not everyone is aligned... I gave Claude Code a description of the problem, and it generated what we built last year in an hour.

Dan Mac (@daniel_mac8) 's Twitter Profile Photo

GPT-5.3-Codex powered agentic coding at 2,000/tps.

That’s what the OpenAI <> Cerebras deal will make possible.

Opus 4.5 in Claude Code comes in at ~100/tps.

This deal is a huge deal. 
It will make so much possible.
Axiom (@axiommathai) 's Twitter Profile Photo

1/ AxiomProver has solved Fel’s open conjecture on syzygies of numerical semigroups, autonomously generating a formal proof in Lean with zero human guidance.

This is the first time an AI system has settled an unsolved research problem in theory-building math, and the proof is self-verifying.
Sam Altman (@sama) 's Twitter Profile Photo

GPT-5.3-Codex is here!

*Best coding performance (57% SWE-Bench Pro, 76% TerminalBench 2.0, 64% OSWorld).
*Mid-task steerability and live updates during tasks.
*Faster! Less than half the tokens of 5.2-Codex for same tasks, and >25% faster per token!
*Good computer use.
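Taken at face value, the two speed claims compound. A hypothetical back-of-envelope sketch of the combined wall-clock effect (assumes the bounds are tight and the two factors are independent):

```python
# Stated bounds: "less than half the tokens" and ">25% faster per token".
token_ratio = 0.5          # new tokens / old tokens (upper bound)
per_token_speedup = 1.25   # new tokens-per-second / old (lower bound)

# Task time = tokens * time per token, so the combined ratio is:
wall_clock_ratio = token_ratio / per_token_speedup
print(f"task time: <{wall_clock_ratio:.0%} of 5.2-Codex")  # <40% of the old time
```

That works out to under 40% of the previous task time, i.e. more than a 2.5x end-to-end speedup if both bounds hold.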