Ibragim (@ibragim_bad)'s Twitter Profile
Ibragim

@ibragim_bad

Data for coding agents | @nebiusai

ID: 882369312517959682

Joined: 04-07-2017 22:43:32

31 Tweets

56 Followers

80 Following

Ibragim (@ibragim_bad)'s Twitter Profile Photo

one interesting point is that OpenAI Developers didn't mention SWE-Lancer at all, despite releasing it just a year ago.

I actually liked the idea of e2e tests for the front end and back end.

but, yeah, a one-repo benchmark is not what you want
Grok (@grok)'s Twitter Profile Photo

Ankith 🐋/acc Ibragim Simon Karasik Alexander Golubev A solid example of Docker isolation failing: CVE-2019-5736 in runc (Docker's default runtime). If the AI agent (running as root, common in sandboxes) drops malicious code that overwrites the host's runc binary via a /proc/self/exe symlink trick during any exec, boom—next
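The CVE hinges on a Linux mechanism worth seeing concretely: `/proc/self/exe` is a magic symlink that always resolves to the binary of the process reading it, which is what let a container process reach the host's runc binary during an exec. A minimal, benign illustration of that mechanism (not the exploit), assuming a Linux host with a `sys.executable` fallback elsewhere:

```python
import os
import sys

def own_binary_path() -> str:
    """Resolve the binary of the current process.

    On Linux, /proc/self/exe always points at the executable of whoever
    reads it. CVE-2019-5736 abused this: during `runc exec`, a process
    inside the container could open the *host* runc binary through this
    link and overwrite it.
    """
    link = "/proc/self/exe"
    if os.path.exists(link):      # Linux only
        return os.readlink(link)
    return sys.executable         # fallback on non-Linux systems

print(own_binary_path())
```

The standard mitigations the thread implies: don't run agents as root inside the sandbox, and keep runc at a patched version (>= 1.0-rc7 era fixes).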

jytan (@jyt4n)'s Twitter Profile Photo

Experimenting with this locally in the past few weeks too, with many similar ideas. Sandbox snapshot/branching semantics are still very underrated, and will be a massive unlock for coding agents. Super fun to rabbit-hole in this space.

Alexander Golubev (@agolubev13)'s Twitter Profile Photo

1/8 Training draft models for speculative decoding almost always relies on KL divergence – a proxy that typically leads to convergence to suboptimal solutions under limited capacity. We introduce LK losses: training objectives that directly target acceptance rate instead. We
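For context on the quantity being targeted: in standard speculative sampling, a draft token x drawn from the draft distribution q is accepted with probability min(1, p(x)/q(x)) under the target distribution p, so the expected acceptance rate is the sum over x of min(p(x), q(x)). A toy computation of that textbook quantity (the thread's LK losses aside; this is not their formulation, just the metric they optimize):

```python
def acceptance_rate(p, q):
    """Expected probability that a draft token sampled from q is
    accepted under target p in standard speculative sampling:
    E_{x~q}[min(1, p(x)/q(x))] = sum_x min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]   # toy target-model distribution
q = [0.3, 0.4, 0.3]   # toy draft-model distribution
print(acceptance_rate(p, q))   # close to 0.8
```

Note that a draft q can have low KL to p yet still put mass in the wrong places for acceptance, which is the gap the tweet is pointing at.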
Ibragim (@ibragim_bad)'s Twitter Profile Photo

btw, at Nebius we run high-quality, in-depth offline courses completely for free! for example, in London we're currently hosting AI-Assisted Mathematical Discovery: academy.nebius.com/ai-for-math + a cool event in Amsterdam this April!

Alexander Golubev (@agolubev13)'s Twitter Profile Photo

Building RL environments at scale is one of the hardest problems in AI agent development, particularly ensuring tasks are actually solvable and that verification is neither too strict nor too loose. With SWE-rebench V2, we're taking a step towards large-scale datasets spanning
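One common way to operationalize "neither too strict nor too loose" in SWE-style environments is the fail-to-pass check: a task's verifying tests must fail before the gold patch (so an empty patch can't pass) and pass after it (so the task is actually solvable). A hedged sketch of that check, not necessarily SWE-rebench's pipeline; `run_tests` and `apply_gold_patch` are hypothetical callables standing in for a real sandboxed environment:

```python
def is_valid_task(run_tests, apply_gold_patch) -> bool:
    """Keep a task only if its tests fail pre-patch and pass post-patch.

    - tests passing before the fix  => verification too loose, drop it
    - tests failing after the fix   => task not solvable as stated, drop it
    """
    if run_tests():            # green before the fix: too loose
        return False
    apply_gold_patch()
    return run_tests()         # must turn green after the fix

# toy stand-in: "tests pass" iff the bug has been fixed
state = {"fixed": False}
ok = is_valid_task(lambda: state["fixed"],
                   lambda: state.update(fixed=True))
print(ok)  # True
```

Real pipelines run both sides inside the task's Docker image, which is why the pre-built environments mentioned below matter so much.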

Roman Chernin (@romanchernin)'s Twitter Profile Photo

We just released SWE-rebench-V2 - the largest open dataset in the world for training code agents! 🚀 In a nutshell:
- 32,000+ executable tasks
- every task comes with a pre-built Docker env.
- 20 programming languages - moving beyond Python-only datasets.
- 120,000+ extra tasks

Ibragim (@ibragim_bad)'s Twitter Profile Photo

speed in code agentic tasks is also very compelling for users

one of the top requests that we get for swe-rebench

the complexity with evals for this kind of stuff is that there are so many confounders besides token counts, like infra and even your wi-fi speed will affect it
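The confounder problem is easy to see in a toy harness: even with an identical token count per run, wall-clock throughput jitters from timing noise alone, and real runs add network, queueing, and infra variance on top. A sketch where `fake_agent_step` is a hypothetical stand-in for one model/tool call:

```python
import time

def fake_agent_step(n_tokens: int) -> int:
    """Stand-in for one model or tool call; real runs also include
    network, queueing, and infra latency that differ between runs."""
    time.sleep(0.01)          # pretend inference latency
    return n_tokens

def tokens_per_second(n_tokens: int) -> float:
    start = time.perf_counter()
    produced = fake_agent_step(n_tokens)
    elapsed = time.perf_counter() - start
    return produced / elapsed

# token count is identical in every run, yet measured throughput varies
runs = [tokens_per_second(100) for _ in range(5)]
print(min(runs), max(runs))
```

This is why speed leaderboards usually need many repeats and controlled infra before the numbers mean anything.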
Ibragim (@ibragim_bad)'s Twitter Profile Photo

Recently recorded a podcast with the JetBrains Research team! It's a fresh podcast, only the second episode (the first one was with Anna Kogan, ex-CEO of OpenCV). We talked about my journey from dentistry to research, SWE-rebench, coding agents, and our latest

Ori Press (@ori_press)'s Twitter Profile Photo

We evaluated GPT 5.4 on AlgoTune: for the first time an OpenAI model is worse than its predecessor. Some analysis:

In graph_laplacian, GPT-5.2's approach is: build the sparse matrix once, call SciPy's Laplacian routine, and return the sparse result directly.

(cont.)🧵
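For reference, the graph Laplacian in that task is just L = D − A: the diagonal degree matrix minus the adjacency matrix. A minimal pure-Python version to show the structure (the actual AlgoTune solution described above uses SciPy's sparse `csgraph.laplacian`, which avoids materializing anything dense):

```python
def laplacian(adj):
    """Dense graph Laplacian L = D - A for a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]   # D: degree of each node
    return [[(deg[i] if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]

# 3-node path graph: 0 - 1 - 2
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
print(laplacian(A))  # [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
```

Every row of a Laplacian sums to zero, which makes for a quick sanity check on any implementation.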
Ibragim (@ibragim_bad)'s Twitter Profile Photo

If one automated, filtered SWE RL env costs ~$500 per task, does it mean that SWE-rebench-V2 costs ~32k tasks * $500 = $16 mln? And for the additional tasks (which still need a Docker image built): 126k * $500 = $63 mln?

We are giving all of this away for free, it is basically open + if you want
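The back-of-the-envelope numbers check out at the tweet's own $500/task assumption (note the tweet says 126k extra tasks where the release post above says 120,000+):

```python
COST_PER_TASK = 500          # $ per automated, filtered task (tweet's assumption)

executable_tasks = 32_000    # tasks shipped with pre-built Docker envs
extra_tasks = 126_000        # tasks still needing a Docker image build

print(executable_tasks * COST_PER_TASK)   # 16_000_000 -> "$16 mln"
print(extra_tasks * COST_PER_TASK)        # 63_000_000 -> "$63 mln"
```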

Nebius (@nebiusai)'s Twitter Profile Photo

Every AI product running in production depends on an inference system someone had to engineer, optimize, and keep running. Scheduling, batching, routing, cost per token - this is a craft. The Inference Frontier Program spotlights the builders behind that work. 💡 Watch the video