Alexander Golubev (@agolubev13)'s Twitter Profile
Alexander Golubev

@agolubev13

LLMs, RL and Agents
Research Lead @ Nebius AI

ID: 1698713095311355904

Joined: 04-09-2023 15:02:29

10 Tweets

5 Followers

11 Following

hr0nix @ ICLR (@hr0nix)

Can open-weight models match frontier LLM performance on SWE-bench? They can if you equip them with search! We've been studying how guided search can improve SWE agents, and built an SWE-agent-based system that scores 40.6% on SWE-Bench Verified using only open-weight models. 🧵
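The thread doesn't include the system's code, but as a rough illustration of what "equipping an agent with search" can look like, here is a minimal beam-search sketch in which a learned critic scores partial trajectories. The `expand`, `score`, and `is_done` callables are hypothetical stand-ins, not the actual system's API.

```python
import heapq
from typing import Callable, List

def guided_search(
    initial_state: str,
    expand: Callable[[str], List[str]],   # agent LLM proposes successor trajectories
    score: Callable[[str], float],        # critic's value estimate for a trajectory
    is_done: Callable[[str], bool],       # e.g. "agent submitted a patch"
    beam_width: int = 4,
    max_steps: int = 50,
) -> str:
    """Beam search over agent trajectories, keeping the top-scoring ones."""
    beam = [initial_state]
    for _ in range(max_steps):
        candidates = []
        for traj in beam:
            if is_done(traj):
                return traj  # first completed trajectory wins
            candidates.extend(expand(traj))
        # Keep only the `beam_width` trajectories the critic likes best.
        beam = heapq.nlargest(beam_width, candidates, key=score)
        if not beam:
            break
    # Fall back to the best partial trajectory found so far.
    return max(beam or [initial_state], key=score)
```

The beam width trades compute for coverage: a wider beam explores more alternative fixes per step at proportionally higher inference cost.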

hr0nix @ ICLR (@hr0nix)

The spirit of open source is in the air thanks to DeepSeek! And today we are happy to release kvax, our implementation of flash attention for JAX! It is very fast and has advanced features, such as context-parallelism support, that are not easy to come by. Details ⬇️
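kvax's actual API isn't shown in this thread, so rather than guess at it, here is a plain JAX reference implementation of the attention function that flash-attention kernels compute. A fused kernel produces the same output without ever materializing the full scores matrix, which is where the speed and memory savings come from.

```python
import jax
import jax.numpy as jnp

def reference_attention(q, k, v, mask=None):
    """Naive scaled dot-product attention; a flash-attention kernel
    computes the same result in a tiled, memory-efficient way."""
    scale = q.shape[-1] ** -0.5
    scores = jnp.einsum("...qd,...kd->...qk", q, k) * scale
    if mask is not None:
        scores = jnp.where(mask, scores, -jnp.inf)
    weights = jax.nn.softmax(scores, axis=-1)
    return jnp.einsum("...qk,...kd->...qd", weights, v)

# Tiny smoke test with random tensors.
key = jax.random.PRNGKey(0)
q = jax.random.normal(key, (2, 8, 128, 64))  # (batch, heads, seq, head_dim)
out = reference_attention(q, q, q)
print(out.shape)  # (2, 8, 128, 64)
```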

hr0nix @ ICLR (@hr0nix)

LLMs trained to evaluate agentic trajectories give us a powerful way to boost agent performance via test-time search. But single-pass value models have their limitations. Can CoT reasoners be a better alternative? We explore this topic in our latest research blogpost 🧵⬇️
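As a hedged sketch of the distinction the blogpost explores: a single-pass value model emits one scalar per trajectory, while a CoT critic first generates reasoning and then a verdict. The `LLM.generate` interface and the prompts below are assumptions for illustration, not the blogpost's code.

```python
from dataclasses import dataclass

@dataclass
class LLM:
    """Hypothetical LLM client; wire `generate` to your inference backend."""
    name: str

    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        raise NotImplementedError

def single_pass_value(critic: LLM, trajectory: str) -> float:
    """Single-pass value model: one cheap call, bare scalar score."""
    reply = critic.generate(
        f"Rate this agent trajectory from 0 to 1:\n{trajectory}\nScore:",
        max_tokens=4,
    )
    return float(reply.strip())

def cot_critic_value(critic: LLM, trajectory: str) -> float:
    """CoT critic: reason step by step about the trajectory first, then
    emit a verdict; slower, but can catch subtler failures."""
    reply = critic.generate(
        "Think step by step about whether this agent trajectory solves "
        f"the task, then end with 'SCORE: <0-1>'.\n{trajectory}"
    )
    return float(reply.rsplit("SCORE:", 1)[-1].strip())
```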

hr0nix @ ICLR (@hr0nix)

One of our research interests at Nebius is agentic software engineering. Because of that, we have reviewed LOTS of agent evals on software-engineering tasks, and there were issues with these evals that made us unhappy. Today we are taking a step towards fixing some of them ⬇️

hr0nix @ ICLR (@hr0nix)

An extended writeup of our earlier research blogpost on training critics for SWE agents has been accepted to ICML! Some details below ⬇️

hr0nix @ ICLR (@hr0nix)

A big update to SWE-rebench: new tasks (May), frontier models (o3 and Sonnet), tool use. Dive in for details (leaderboard link in the last message) ⬇️

Nebius (@nebiusai)

Our own SWE-rebench just became the #1 most downloaded dataset on @HuggingFace 🥇

SWE-rebench is a dataset and benchmark for code agents based on LLMs, developed by our AI R&D team. It has been downloaded more than 3.9M times — 3.1M in the last month. 1/4
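For reference, the dataset can be pulled with the standard `datasets` library. The id `nebius/SWE-rebench` below is assumed from the project and team names, so verify it on the hub page before relying on it.

```python
from datasets import load_dataset

# Dataset id and split names are assumptions; check the Hugging Face hub page.
ds = load_dataset("nebius/SWE-rebench")
print(ds)  # shows the available splits and features
```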
Alexander Golubev (@agolubev13)

Grok Code Fast is incredibly powerful for its cost! 5 cents per problem for ~o3 performance. It's also great to see open-source models like GLM-4.5 and Qwen3-Coder-480B competing with the frontier.

One other thing worth noting: while the highest Pass@5 for a single model is …
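For context, Pass@5 is usually computed with the unbiased estimator from Chen et al. (2021): draw n samples per problem, count the c correct ones, and estimate the probability that a budget of k samples contains at least one success.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).
    n = samples drawn per problem, c = correct samples, k = budget."""
    if n - c < k:
        return 1.0  # fewer than k failures, so any k samples include a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 3 correct -> chance at least one of 5 passes.
print(pass_at_k(n=10, c=3, k=5))  # ~0.9167
```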
Alexander Golubev (@agolubev13)

For those who still aren't in the habit of checking for prompt caching with providers: in the SWE-rebench eval, Grok Code Fast costs $14 with caching. The exact same run without it would have cost $66. That's more than a fourfold difference, requiring no effort on your part.

So a) don't …
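A quick sanity check on the tweet's numbers, plus a generic input-cost formula. Only the $14 and $66 totals come from the tweet; the token counts and per-million-token prices in `run_cost` are hypothetical placeholders.

```python
# Only the $14 and $66 totals come from the tweet; the ratio is derived.
with_caching = 14.0
without_caching = 66.0
print(without_caching / with_caching)  # ~4.71, i.e. more than fourfold

def run_cost(input_tokens: int, cache_hit_rate: float,
             price_per_mtok: float, cached_price_per_mtok: float) -> float:
    """Input-token cost when a fraction `cache_hit_rate` of the prompt is
    served from the provider's cache at a discounted rate.
    All parameter values below are made-up placeholders."""
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    return (fresh * price_per_mtok + cached * cached_price_per_mtok) / 1e6

# Example: 200M input tokens, 90% cache hits, $0.20/Mtok fresh vs $0.02/Mtok cached.
print(run_cost(200_000_000, 0.9, 0.20, 0.02))  # 7.6, vs 40.0 with no caching
```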
Alexander Golubev (@agolubev13)

What are the key evaluation directions you'd like to see for LLMs as SWE-agents? In SWE-rebench, we measure how well models solve GitHub issues. We're considering extensions, primarily measuring how well models write tests for these fixes, and extending the benchmark to other …

Alexander Golubev (@agolubev13)

For those who were interested in the latest Claude models' performance... Sonnet 4.5 is surprisingly good, especially its ability to solve tasks that couldn't be solved by any other model. Check the PRs:
- github.com/python-trio/tr…
- github.com/cubed-dev/cube…
- github.com/canopen-python…