Alpay Ariyak (@alpayariyak) 's Twitter Profile
Alpay Ariyak

@alpayariyak

LLM Post-Training Lead @ Together AI | OpenChat Project Lead (2M+ downloads, #1 7B LLM on Arena for 2+ months) | DeepCoder

ID: 1682513195813027842

Link: https://huggingface.co/openchat/openchat-3.5-0106 | Joined: 21-07-2023 22:10:33

227 Tweets

2.2K Followers

2.2K Following

Agentica Project (@agentica_) 's Twitter Profile Photo


Introducing DeepCoder-14B-Preview - our fully open-sourced reasoning model reaching o1 and o3-mini level on coding and math.

The best part is, we’re releasing everything: not just the model, but the dataset, code, and training recipe—so you can train it yourself!🔥

Links below:
Alpay Ariyak (@alpayariyak) 's Twitter Profile Photo


Excited to present our project in collaboration with Agentica: a 14B LLM trained with Code RL that reaches OpenAI's o3-mini-low performance on coding benchmarks like LiveCodeBench, Codeforces, and HumanEval!

We open-source the code, data, weights, and full recipe
Alpay Ariyak (@alpayariyak) 's Twitter Profile Photo

I’ve seen some people using DeepCoder-1.5B as a speculator for DeepCoder-14B. Because the final stage for both was “self-play” RL, it won’t be a good speculator, as the two models diverge a lot. If there’s enough interest, we can train a good speculator (w/ logit distillation) & release it
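Why divergence hurts: in standard speculative sampling, a draft token with draft probability q and target probability p is accepted with probability min(1, p/q), so when two independently RL-trained models assign very different probabilities, acceptance (and thus speedup) collapses. A minimal sketch of that acceptance rule, with toy probabilities of my own choosing (not measured from either model):

```python
def accept_prob(p_target: float, p_draft: float) -> float:
    """Standard speculative-sampling acceptance rule: keep the draft
    token with probability min(1, p_target / p_draft)."""
    return min(1.0, p_target / p_draft)

# Toy numbers only. When draft and target agree on a token, it is
# almost always accepted; when their distributions have diverged
# (as after separate "self-play" RL runs), acceptance drops sharply.
aligned = accept_prob(p_target=0.9, p_draft=0.9)
diverged = accept_prob(p_target=0.2, p_draft=0.9)
print(aligned, diverged)  # 1.0 vs ~0.22
```

Logit distillation addresses exactly this: it trains the small model to match the large model's token distribution, pushing p/q toward 1 and acceptance back up.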

Alpay Ariyak (@alpayariyak) 's Twitter Profile Photo

Excited to join some great friends at Nous as one of the judges for their first hackathon! It will be focused on RL environments. Pull up, it will be fun :)

Agentica Project (@agentica_) 's Twitter Profile Photo


🚀 Introducing DeepSWE 🤖: our fully open-sourced, SOTA software engineering agent trained purely with RL on top of Qwen3-32B. DeepSWE achieves 59% on SWEBench-Verified with test-time scaling (and 42.2% Pass@1), topping the SWEBench leaderboard for open-weight models.

💪DeepSWE
Alpay Ariyak (@alpayariyak) 's Twitter Profile Photo


Excited to introduce DeepSWE-Preview, our latest model trained in collaboration with Agentica Project

Using only RL, we increase the performance of Qwen3-32B from 23% to 42.2% on SWE-Bench Verified!
Agentica Project (@agentica_) 's Twitter Profile Photo


It's easy to confuse Best@K vs Pass@K—and we've seen some misconceptions about our results.  

Our 59% on SWEBench-Verified is Pass@1 with Best@16, not Pass@8/16. Our Pass@8/16 is 67%/71%.  

So how did we achieve this? 

DeepSWE generates N candidate solutions. Then, another LLM
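The Best@K vs Pass@K distinction can be made concrete. Pass@k is the usual unbiased estimator (from the HumanEval evaluation methodology) of the chance that at least one of k samples passes; Best@K instead scores the single candidate a verifier picks out of K, so it is a Pass@1-style number. A sketch with illustrative counts of my own choosing (not the reported DeepSWE data):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n rollouts of which c are
    correct, passes the hidden tests."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots: some sample passes
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy setup: 16 rollouts per problem, 7 of which pass.
print(pass_at_k(16, 7, 1))  # Pass@1 = 7/16
print(pass_at_k(16, 7, 8))  # Pass@8: chance any of 8 samples passes
# Best@16 is different: a verifier selects ONE of the 16 candidates and
# only that pick is scored, so Best@16 <= Pass@16 by construction.
```

This is why 59% Pass@1-with-Best@16 can sit well below the 67%/71% Pass@8/16 numbers: the verifier must pick the right trajectory without a ground-truth check.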
Teortaxes▶️ (DeepSeek Twitter 🐋 die-hard fan 2023 – ∞) (@teortaxestex) 's Twitter Profile Photo


Important correction on DeepSWE-Preview, on SWE-Bench-Verified:
Pass@1 = 42.2%
"Best@8" = 59%, with trajectory selection done by a hybrid (execution-free + test-based, as in the R2E-Gym paper) verifier. I.e., the system itself can yield 59% w/o a ground-truth check.
Actual Pass@8 = 67%.