Saiteja Utpala (@saitejautpala) 's Twitter Profile
Saiteja Utpala

@saitejautpala

Into Machine Learning

ID: 1301938969622376448

Joined: 04-09-2020 17:43:52

101 Tweets

93 Followers

1.1K Following

Sebastien Bubeck (@sebastienbubeck) 's Twitter Profile Photo

Claim: gpt-5-pro can prove new interesting mathematics.

Proof: I took a convex optimization paper with a clean open problem in it and asked gpt-5-pro to work on it. It proved a better bound than what is in the paper, and I checked the proof it's correct.

Details below.
Mostafa Rohaninejad (@mostafarohani) 's Twitter Profile Photo

1/n
I’m really excited to share that our OpenAI reasoning system got a perfect score of 12/12 during the 2025 ICPC World Finals, the premier collegiate programming competition where top university teams from around the world solve complex algorithmic problems. This would have
swyx (@swyx) 's Twitter Profile Photo

I find it unimaginably based that the OAI Evals team keeps making benchmarks finding that Claude is better and publishing it anyway. 

they are 3 for 3 this year in acknowledging specifically how much Claude is better at tasks OAI care about.

there is no sarcasm here folks. this
Sayash Kapoor (@sayashk) 's Twitter Profile Photo

On our evals for HAL, we found that agents figure out they're being evaluated even on capability evals. 

For example, here Claude 3.7 Sonnet *looks up the benchmark on HuggingFace* to find the answer to an AssistantBench question. There were many such cases across benchmarks and
Himanshu Tyagi (@hstyagi) 's Twitter Profile Photo

I have been dabbling with using AI for maths too. There is this COLT 2020 paper where we tried to find the best way to quantize Gaussian observations for mean estimation. We couldn't establish exact optimality because we had some extra dependencies coming into our bounds. I had

Terry Yue Zhuo (@terryyuezhuo) 's Twitter Profile Photo

It’s so much fun working with the other 39 community members on this project!

Start trying out various frontier models in BigCodeArena today.
Sundar Pichai (@sundarpichai) 's Twitter Profile Photo

An exciting milestone for AI in science: Our C2S-Scale 27B foundation model, built with Yale University and based on Gemma, generated a novel hypothesis about cancer cellular behavior, which scientists experimentally validated in living cells. With more preclinical and clinical tests,

Sayash Kapoor (@sayashk) 's Twitter Profile Photo

📣New paper: Rigorous AI agent evaluation is much harder than it seems.

For the last year, we have been working on infrastructure for fair agent evaluations on challenging benchmarks. 

Today, we release a paper that condenses our insights from 20,000+ agent rollouts on 9
Build That Idea (@buildthatidea) 's Twitter Profile Photo

the whale is back!!!

deepseek just dropped deepseek math v2

> built on top of deepseek-v3.2-exp-base
> achieves gold-level scores on IMO 2025 & CMO 2024
> scores 118/120 on Putnam 2024 with scaled test-time compute
> shows LLMs can now verify their own proofs, not just spit out
Sayash Kapoor (@sayashk) 's Twitter Profile Photo

CORE-Bench is solved (using Opus 4.5 with Claude Code)

TL;DR: Last week, we released results for Opus 4.5 on CORE-Bench, a benchmark that tests agents on scientific reproducibility tasks. Earlier this week, Nicholas Carlini reached out to share that an updated scaffold that uses
ARC Prize (@arcprize) 's Twitter Profile Photo

A year ago, we verified a preview of an unreleased version of OpenAI o3 (High) that scored 88% on ARC-AGI-1 at est. $4.5k/task

Today, we’ve verified a new GPT-5.2 Pro (X-High) SOTA score of 90.5% at $11.64/task

This represents a ~390X efficiency improvement in one year
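The ~390X figure can be sanity-checked as a simple cost-per-task ratio. A back-of-envelope sketch (the dollar figures are the estimates quoted in the tweet; treating "efficiency" as pure cost reduction at a comparable score is an assumption):

```python
# Cost per task on ARC-AGI-1, as quoted above.
o3_cost_per_task = 4500.00    # est. $4.5k/task, o3 (High) preview
gpt52_cost_per_task = 11.64   # GPT-5.2 Pro (X-High)

# Ratio of old cost to new cost at a comparable (slightly higher) score.
efficiency_gain = o3_cost_per_task / gpt52_cost_per_task
print(f"cost reduction: ~{efficiency_gain:.0f}x")
```

The ratio comes out just under 387, which the tweet rounds to ~390X.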
Jaana Dogan ヤナ ドガン (@rakyll) 's Twitter Profile Photo

I'm not joking and this isn't funny. We have been trying to build distributed agent orchestrators at Google since last year. There are various options, not everyone is aligned... I gave Claude Code a description of the problem, and it generated what we built last year in an hour.

Dan Mac (@daniel_mac8) 's Twitter Profile Photo

GPT-5.3-Codex powered agentic coding at 2,000/tps.

That’s what the OpenAI <> Cerebras deal will make possible.

Opus 4.5 in Claude Code comes in at ~100/tps.

This deal is a huge deal. 
It will make so much possible.
Axiom (@axiommathai) 's Twitter Profile Photo

1/ AxiomProver has solved Fel’s open conjecture on syzygies of numerical semigroups, autonomously generating a formal proof in Lean with zero human guidance.

This is the first time an AI system has settled an unsolved research problem in theory-building math, and the proof is self-verifying.
Sam Altman (@sama) 's Twitter Profile Photo

GPT-5.3-Codex is here!

*Best coding performance (57% SWE-Bench Pro, 76% TerminalBench 2.0, 64% OSWorld).
*Mid-task steerability and live updates during tasks.
*Faster! Less than half the tokens of 5.2-Codex for same tasks, and >25% faster per token!
*Good computer use.
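Taken at face value, the two speed claims compound. A hypothetical back-of-envelope sketch of the combined wall-clock effect (assumes the bounds are tight and the two factors are independent):

```python
# Stated bounds: "less than half the tokens" and ">25% faster per token".
token_ratio = 0.5          # new tokens / old tokens (upper bound)
per_token_speedup = 1.25   # new tokens-per-second / old (lower bound)

# Task time = tokens * time per token, so the combined ratio is:
wall_clock_ratio = token_ratio / per_token_speedup
print(f"task time: <{wall_clock_ratio:.0%} of 5.2-Codex")  # <40% of the old time
```

That works out to under 40% of the previous task time, i.e. more than a 2.5x end-to-end speedup if both bounds hold.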