Alexander Golubev (@agolubev13)'s Twitter Profile
Alexander Golubev

@agolubev13

LLMs, RL and Agents
Research Lead @ Nebius AI

ID: 1698713095311355904

Joined: 04-09-2023 15:02:29

10 Tweets

5 Followers

11 Following

hr0nix @ ICLR (@hr0nix)

Can open-weight models match frontier LLM performance on SWE-bench? They can if you equip them with search! We've been studying how guided search can improve SWE agents, and built an SWE-agent-based system that scores 40.6% on SWE-Bench Verified using only open-weight models. 🧵
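The thread doesn't include the system's code, but as a rough illustration of what "equipping an agent with search" can look like, here is a minimal beam-search sketch in which a learned critic scores partial trajectories. The `expand`, `score`, and `is_done` callables are hypothetical stand-ins, not the actual system's API.

```python
import heapq
from typing import Callable, List

def guided_search(
    initial_state: str,
    expand: Callable[[str], List[str]],   # agent LLM proposes successor trajectories
    score: Callable[[str], float],        # critic's value estimate for a trajectory
    is_done: Callable[[str], bool],       # e.g. "agent submitted a patch"
    beam_width: int = 4,
    max_steps: int = 50,
) -> str:
    """Beam search over agent trajectories, keeping the top-scoring ones."""
    beam = [initial_state]
    for _ in range(max_steps):
        candidates = []
        for traj in beam:
            if is_done(traj):
                return traj  # first completed trajectory wins
            candidates.extend(expand(traj))
        # Keep only the `beam_width` trajectories the critic likes best.
        beam = heapq.nlargest(beam_width, candidates, key=score)
        if not beam:
            break
    # Fall back to the best partial trajectory found so far.
    return max(beam or [initial_state], key=score)
```

The beam width trades compute for coverage: a wider beam explores more alternative fixes per step at proportionally higher inference cost.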

hr0nix @ ICLR (@hr0nix)

The spirit of open source is in the air thanks to DeepSeek! And today we are happy to release kvax, our implementation of flash attention for JAX! It is very fast and has advanced features, such as context-parallelism support, that are not easy to come by. Details ⬇️
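kvax's actual API isn't shown in this thread, so rather than guess at it, here is a plain JAX reference implementation of the attention function that flash-attention kernels compute. A fused kernel produces the same output without ever materializing the full scores matrix, which is where the speed and memory savings come from.

```python
import jax
import jax.numpy as jnp

def reference_attention(q, k, v, mask=None):
    """Naive scaled dot-product attention; a flash-attention kernel
    computes the same result in a tiled, memory-efficient way."""
    scale = q.shape[-1] ** -0.5
    scores = jnp.einsum("...qd,...kd->...qk", q, k) * scale
    if mask is not None:
        scores = jnp.where(mask, scores, -jnp.inf)
    weights = jax.nn.softmax(scores, axis=-1)
    return jnp.einsum("...qk,...kd->...qd", weights, v)

# Tiny smoke test with random tensors.
key = jax.random.PRNGKey(0)
q = jax.random.normal(key, (2, 8, 128, 64))  # (batch, heads, seq, head_dim)
out = reference_attention(q, q, q)
print(out.shape)  # (2, 8, 128, 64)
```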

hr0nix @ ICLR (@hr0nix)

LLMs trained to evaluate agentic trajectories give us a powerful way to boost agent performance via test-time search. But single-pass value models have their limitations. Can CoT reasoners be a better alternative? We explore this topic in our latest research blogpost 🧵⬇️
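As a hedged sketch of the distinction the blogpost explores: a single-pass value model emits one scalar per trajectory, while a CoT critic first generates reasoning and then a verdict. The `LLM.generate` interface and the prompts below are assumptions for illustration, not the blogpost's code.

```python
from dataclasses import dataclass

@dataclass
class LLM:
    """Hypothetical LLM client; wire `generate` to your inference backend."""
    name: str

    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        raise NotImplementedError

def single_pass_value(critic: LLM, trajectory: str) -> float:
    """Single-pass value model: one cheap call, bare scalar score."""
    reply = critic.generate(
        f"Rate this agent trajectory from 0 to 1:\n{trajectory}\nScore:",
        max_tokens=4,
    )
    return float(reply.strip())

def cot_critic_value(critic: LLM, trajectory: str) -> float:
    """CoT critic: reason step by step about the trajectory first, then
    emit a verdict; slower, but can catch subtler failures."""
    reply = critic.generate(
        "Think step by step about whether this agent trajectory solves "
        f"the task, then end with 'SCORE: <0-1>'.\n{trajectory}"
    )
    return float(reply.rsplit("SCORE:", 1)[-1].strip())
```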

hr0nix @ ICLR (@hr0nix)

One of our research interests at Nebius is agentic software engineering. Because of that, we have reviewed LOTS of agent evals on software-engineering tasks, and there were issues with these evals that made us unhappy. Today we are taking a step towards fixing some of them ⬇️

hr0nix @ ICLR (@hr0nix)

An extended writeup of our earlier research blogpost on training critics for SWE agents has been accepted to ICML! Some details below ⬇️

hr0nix @ ICLR (@hr0nix)

A big update to SWE-rebench: new tasks (May), frontier models (o3 and Sonnet), tool use. Dive in for details (leaderboard link in the last message) ⬇️

Nebius (@nebiusai)

Our own SWE-rebench just became the #1 most downloaded dataset on @HuggingFace 🥇

SWE-rebench is a dataset and benchmark for code agents based on LLMs, developed by our AI R&D team. It has been downloaded more than 3.9M times — 3.1M in the last month. 1/4
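For reference, the dataset can be pulled with the standard `datasets` library. The id `nebius/SWE-rebench` below is assumed from the project and team names, so verify it on the hub page before relying on it.

```python
from datasets import load_dataset

# Dataset id and split names are assumptions; check the Hugging Face hub page.
ds = load_dataset("nebius/SWE-rebench")
print(ds)  # shows the available splits and features
```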
Alexander Golubev (@agolubev13)

Grok Code Fast is incredibly powerful for its cost! 5 cents per problem for ~o3 performance. It's also great to see open-source models like GLM-4.5 and Qwen3-Coder-480B competing with the frontier.

One other thing worth noting: while the highest Pass@5 for a single model is …
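For context, Pass@5 is usually computed with the unbiased estimator from Chen et al. (2021): draw n samples per problem, count the c correct ones, and estimate the probability that a budget of k samples contains at least one success.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).
    n = samples drawn per problem, c = correct samples, k = budget."""
    if n - c < k:
        return 1.0  # fewer than k failures, so any k samples include a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 3 correct -> chance at least one of 5 passes.
print(pass_at_k(n=10, c=3, k=5))  # ~0.9167
```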
Alexander Golubev (@agolubev13)

For those who still aren't in the habit of checking for prompt caching with providers: in the SWE-rebench eval, Grok Code Fast costs $14 with caching. The exact same run without it would have cost $66. That's more than a fourfold difference, requiring no effort on your part.

So a) don't …
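A quick sanity check on the tweet's numbers, plus a generic input-cost formula. Only the $14 and $66 totals come from the tweet; the token counts and per-million-token prices in `run_cost` are hypothetical placeholders.

```python
# Only the $14 and $66 totals come from the tweet; the ratio is derived.
with_caching = 14.0
without_caching = 66.0
print(without_caching / with_caching)  # ~4.71, i.e. more than fourfold

def run_cost(input_tokens: int, cache_hit_rate: float,
             price_per_mtok: float, cached_price_per_mtok: float) -> float:
    """Input-token cost when a fraction `cache_hit_rate` of the prompt is
    served from the provider's cache at a discounted rate.
    All parameter values below are made-up placeholders."""
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    return (fresh * price_per_mtok + cached * cached_price_per_mtok) / 1e6

# Example: 200M input tokens, 90% cache hits, $0.20/Mtok fresh vs $0.02/Mtok cached.
print(run_cost(200_000_000, 0.9, 0.20, 0.02))  # 7.6, vs 40.0 with no caching
```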
Alexander Golubev (@agolubev13)

What are the key evaluation directions you'd like to see for LLMs as SWE-agents? In SWE-rebench, we measure how well models solve GitHub issues. We're considering extensions, primarily measuring how well models write tests for these fixes, and extending the benchmark to other …

Alexander Golubev (@agolubev13)

For those who were interested in the latest Claude models' performance... Sonnet 4.5 is surprisingly good, especially its ability to solve tasks that couldn't be solved by any other model. Check the PRs:
- github.com/python-trio/tr…
- github.com/cubed-dev/cube…
- github.com/canopen-python…