Ori Press (@ori_press)'s Twitter Profile
Ori Press

@ori_press

Graduate student @BethgeLab.
I yearn to deep learn

ID: 1076861996283367425

Link: http://oripress.com | Joined: 23-12-2018 15:28:01

104 Tweets

364 Followers

393 Following

Kilian Lieret @ICLR (@klieret)'s Twitter Profile Photo

SWE-agent 1.0 is the open-source SOTA on SWE-bench Lite! Tons of new features: massively parallel runs; cloud-based deployment; extensive configurability with tool bundles; new command line interface & utilities.

Ofir Press (@ofirpress)'s Twitter Profile Photo

Completing games requires long context and complex visual processing, so we put a bunch of 90s games into an emulator and made a benchmark. Our agent can't even beat the first level of these games. You can download it right now and try it out.

Alex Zhang (@a1zhang)'s Twitter Profile Photo

Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II? VideoGameBench evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just like how a human would play. The best model (Gemini) completes just 0.48% of the benchmark! 🧵👇

Ofir Press (@ofirpress)'s Twitter Profile Photo

AlgoTune is extremely tough, with agents not finding substantial speedups on most tasks. But sometimes these agents do really cool things: here, the agent realized that it could solve this convex optimization problem with a scipy function, leading to an 81x speedup.

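As a toy illustration of the pattern in the tweet above (a hand-rolled solver for a convex problem replaced by a single exact library call), here is a minimal stdlib-only sketch. This is not an actual AlgoTune task, and the function names are made up for the example: the convex objective sum(|x - p|) that the grid search approximates is minimized exactly by the median.

```python
import statistics

def naive_minimizer(points, step=0.01):
    """Hand-rolled grid search for argmin_x sum(|x - p|), a 1-D convex problem."""
    lo, hi = min(points), max(points)
    best_x, best_val = lo, float("inf")
    x = lo
    while x <= hi:
        val = sum(abs(x - p) for p in points)
        if val < best_val:
            best_x, best_val = x, val
        x += step
    return best_x

def library_minimizer(points):
    # The same convex objective is minimized exactly by the median,
    # so one library call replaces the whole search loop.
    return statistics.median(points)
```

Here `statistics.median` plays the role the scipy routine played in the tweet: an exact, much faster library answer to a problem a naive loop was solving approximately.
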
Brandon Amos (@brandondamos)'s Twitter Profile Photo

Excited to release AlgoTune!! It's a benchmark and coding agent for optimizing the runtime of numerical code 🚀 algotune.io 📚 algotune.io/paper.pdf 🤖 github.com/oripress/AlgoT… with Ofir Press Ori Press Patrick Kidger Bartolomeo Stellato Arman Zharmagambetov & many others 🧵

Ori Press (@ori_press)'s Twitter Profile Photo

We just benchmarked Qwen 3 Coder and GLM 4.5 on AlgoTune, and they manage to beat Claude Opus 4! We're excited to see if the models that will be released this week manage to make progress.

Also: I just defended my PhD and I'm on the industry job market, my DMs are open :)

Ofir Press (@ofirpress)'s Twitter Profile Photo

We know that a bunch of teams are working on applying AlphaEvolve to AlgoTune, super excited to see some initial results! This is going to get super interesting.

Ori Press (@ori_press)'s Twitter Profile Photo

Just added Claude Opus 4.1 and gpt-oss-120b to the AlgoTune leaderboard. Excited to see if GPT-5 can break the 2 barrier!

Richard Suwandi @ICLR2025 (@richardcsuwandi)'s Twitter Profile Photo

Introducing OpenEvolve x AlgoTune! Now you can run and benchmark evolutionary coding agents on 100+ algorithm optimization tasks from algotune.io

Kilian Lieret @ICLR (@klieret)'s Twitter Profile Photo

What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately. Read more in the SWE-bench blog 🧵

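The per-turn switching idea above can be sketched in a few lines. This is an illustrative stand-in, not mini-SWE-agent's actual API; `query_model` and the model names are placeholders for real LM calls:

```python
import random

def run_agent(task, models, query_model, max_turns=10, seed=0):
    """Sketch of per-turn model switching: each turn, a model is drawn
    uniformly at random instead of fixing one model per trajectory."""
    rng = random.Random(seed)
    history = []
    for _ in range(max_turns):
        model = rng.choice(models)          # e.g. "gpt-5" or "sonnet-4"
        action = query_model(model, task, history)
        history.append((model, action))
        if action == "submit":              # agent decides it is done
            break
    return history

# Toy stand-in for an LM call: always submits on the third turn.
def toy_lm(model, task, history):
    return "submit" if len(history) == 2 else f"{model}: edit"

trace = run_agent("fix the failing test", ["gpt-5", "sonnet-4"], toy_lm)
```

The interesting design point is that nothing else in the agent loop changes: the trajectory simply interleaves models, which is why the mixture can outscore either model alone.
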