Ibragim (@ibragim_bad)'s Twitter Profile
Ibragim

@ibragim_bad

Data for coding agents | @nebiusai

ID: 882369312517959682

Joined: 04-07-2017 22:43:32

31 Tweets

56 Followers

80 Following

Ibragim (@ibragim_bad)'s Twitter Profile Photo

one interesting point is that OpenAI Developers didn't mention SWE-Lancer at all, despite releasing it just a year ago.

I actually liked the idea of e2e tests for the front end and back end.

but, yeah, a one-repo benchmark is not what you want
Grok (@grok)'s Twitter Profile Photo

Ankith 🐋/acc Ibragim Simon Karasik Alexander Golubev A solid example of Docker isolation failing: CVE-2019-5736 in runc (Docker's default runtime). If the AI agent (running as root, common in sandboxes) drops malicious code that overwrites the host's runc binary via a /proc/self/exe symlink trick during any exec, boom—next
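The CVE hinges on a Linux mechanism worth seeing concretely: `/proc/self/exe` is a magic symlink that always resolves to the binary of the process reading it, which is what let a container process reach the host's runc binary during an exec. A minimal, benign illustration of that mechanism (not the exploit), assuming a Linux host with a `sys.executable` fallback elsewhere:

```python
import os
import sys

def own_binary_path() -> str:
    """Resolve the binary of the current process.

    On Linux, /proc/self/exe always points at the executable of whoever
    reads it. CVE-2019-5736 abused this: during `runc exec`, a process
    inside the container could open the *host* runc binary through this
    link and overwrite it.
    """
    link = "/proc/self/exe"
    if os.path.exists(link):      # Linux only
        return os.readlink(link)
    return sys.executable         # fallback on non-Linux systems

print(own_binary_path())
```

The standard mitigations the thread implies: don't run agents as root inside the sandbox, and keep runc at a patched version (>= 1.0-rc7 era fixes).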

jytan (@jyt4n)'s Twitter Profile Photo

Experimenting with this locally in the past few weeks too, with many similar ideas. Sandbox snapshot/branching semantics are still very underrated, and will be a massive unlock for coding agents. Super fun to rabbit-hole in this space.

Alexander Golubev (@agolubev13)'s Twitter Profile Photo

1/8 Training draft models for speculative decoding almost always relies on KL divergence – a proxy that typically leads to convergence to suboptimal solutions under limited capacity. We introduce LK losses: training objectives that directly target acceptance rate instead. We
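For context on the quantity being targeted: in standard speculative sampling, a draft token x drawn from the draft distribution q is accepted with probability min(1, p(x)/q(x)) under the target distribution p, so the expected acceptance rate is the sum over x of min(p(x), q(x)). A toy computation of that textbook quantity (the thread's LK losses aside; this is not their formulation, just the metric they optimize):

```python
def acceptance_rate(p, q):
    """Expected probability that a draft token sampled from q is
    accepted under target p in standard speculative sampling:
    E_{x~q}[min(1, p(x)/q(x))] = sum_x min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]   # toy target-model distribution
q = [0.3, 0.4, 0.3]   # toy draft-model distribution
print(acceptance_rate(p, q))   # close to 0.8
```

Note that a draft q can have low KL to p yet still put mass in the wrong places for acceptance, which is the gap the tweet is pointing at.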
Ibragim (@ibragim_bad)'s Twitter Profile Photo

btw, at Nebius we run high-quality, in-depth offline courses completely for free! for example, in London we're currently hosting AI-Assisted Mathematical Discovery: academy.nebius.com/ai-for-math + a cool event in Amsterdam this April!

Alexander Golubev (@agolubev13)'s Twitter Profile Photo

Building RL environments at scale is one of the hardest problems in AI agent development, particularly ensuring tasks are actually solvable and that verification is neither too strict nor too loose. With SWE-rebench V2, we're taking a step towards large-scale datasets spanning
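One common way to operationalize "neither too strict nor too loose" in SWE-style environments is the fail-to-pass check: a task's verifying tests must fail before the gold patch (so an empty patch can't pass) and pass after it (so the task is actually solvable). A hedged sketch of that check, not necessarily SWE-rebench's pipeline; `run_tests` and `apply_gold_patch` are hypothetical callables standing in for a real sandboxed environment:

```python
def is_valid_task(run_tests, apply_gold_patch) -> bool:
    """Keep a task only if its tests fail pre-patch and pass post-patch.

    - tests passing before the fix  => verification too loose, drop it
    - tests failing after the fix   => task not solvable as stated, drop it
    """
    if run_tests():            # green before the fix: too loose
        return False
    apply_gold_patch()
    return run_tests()         # must turn green after the fix

# toy stand-in: "tests pass" iff the bug has been fixed
state = {"fixed": False}
ok = is_valid_task(lambda: state["fixed"],
                   lambda: state.update(fixed=True))
print(ok)  # True
```

Real pipelines run both sides inside the task's Docker image, which is why the pre-built environments mentioned below matter so much.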

Roman Chernin (@romanchernin)'s Twitter Profile Photo

We just released SWE-rebench-V2 - the largest open dataset in the world for training code agents! 🚀 In a nutshell:
- 32,000+ executable tasks
- every task comes with a pre-built Docker env.
- 20 programming languages - moving beyond Python-only datasets.
- 120,000+ extra tasks

Ibragim (@ibragim_bad)'s Twitter Profile Photo

speed in code agentic tasks is also very compelling for users

one of the top requests that we get for swe-rebench

the complexity with evals for this kind of stuff is that there are so many confounders besides token counts, like infra and even your wi-fi speed will affect it
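The confounder problem is easy to see in a toy harness: even with an identical token count per run, wall-clock throughput jitters from timing noise alone, and real runs add network, queueing, and infra variance on top. A sketch where `fake_agent_step` is a hypothetical stand-in for one model/tool call:

```python
import time

def fake_agent_step(n_tokens: int) -> int:
    """Stand-in for one model or tool call; real runs also include
    network, queueing, and infra latency that differ between runs."""
    time.sleep(0.01)          # pretend inference latency
    return n_tokens

def tokens_per_second(n_tokens: int) -> float:
    start = time.perf_counter()
    produced = fake_agent_step(n_tokens)
    elapsed = time.perf_counter() - start
    return produced / elapsed

# token count is identical in every run, yet measured throughput varies
runs = [tokens_per_second(100) for _ in range(5)]
print(min(runs), max(runs))
```

This is why speed leaderboards usually need many repeats and controlled infra before the numbers mean anything.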
Ibragim (@ibragim_bad)'s Twitter Profile Photo

Recently recorded a podcast with the JetBrains Research team! It's a fresh podcast, only the second episode (the first one was with Anna Kogan, ex-CEO of OpenCV). We talked about my journey from dentistry to research, SWE-rebench, coding agents, and our latest

Ori Press (@ori_press)'s Twitter Profile Photo

We evaluated GPT 5.4 on AlgoTune: for the first time an OpenAI model is worse than its predecessor. Some analysis:

In graph_laplacian, GPT-5.2's approach is: build the sparse matrix once, call SciPy's Laplacian routine, and return the sparse result directly.

(cont.)🧵
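For reference, the graph Laplacian in that task is just L = D − A: the diagonal degree matrix minus the adjacency matrix. A minimal pure-Python version to show the structure (the actual AlgoTune solution described above uses SciPy's sparse `csgraph.laplacian`, which avoids materializing anything dense):

```python
def laplacian(adj):
    """Dense graph Laplacian L = D - A for a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]   # D: degree of each node
    return [[(deg[i] if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]

# 3-node path graph: 0 - 1 - 2
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
print(laplacian(A))  # [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
```

Every row of a Laplacian sums to zero, which makes for a quick sanity check on any implementation.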
Ibragim (@ibragim_bad)'s Twitter Profile Photo

If one automated, filtered SWE RL env costs ~$500 per task, does it mean that SWE-rebench-V2 costs ~32k tasks * $500 = $16 mln? And for the additional tasks (which still need a Docker image built): 126k * $500 = $63 mln?

We are giving all of this away for free, it is basically open + if you want
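The back-of-the-envelope numbers check out at the tweet's own $500/task assumption (note the tweet says 126k extra tasks where the release post above says 120,000+):

```python
COST_PER_TASK = 500          # $ per automated, filtered task (tweet's assumption)

executable_tasks = 32_000    # tasks shipped with pre-built Docker envs
extra_tasks = 126_000        # tasks still needing a Docker image build

print(executable_tasks * COST_PER_TASK)   # 16_000_000 -> "$16 mln"
print(extra_tasks * COST_PER_TASK)        # 63_000_000 -> "$63 mln"
```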

Nebius (@nebiusai)'s Twitter Profile Photo

Every AI product running in production depends on an inference system someone had to engineer, optimize, and keep running. Scheduling, batching, routing, cost per token - this is a craft. The Inference Frontier Program spotlights the builders behind that work. 💡 Watch the video