Alpay Ariyak (@alpayariyak) 's Twitter Profile
Alpay Ariyak

@alpayariyak

LLM Post-Training Lead @ Together AI | OpenChat Project Lead (2M+ downloads, #1 7B LLM on Arena for 2+ months) | DeepCoder

ID: 1682513195813027842

Link: https://huggingface.co/openchat/openchat-3.5-0106 | Joined: 21-07-2023 22:10:33

227 Tweets

2.2K Followers

2.2K Following

Agentica Project (@agentica_) 's Twitter Profile Photo


Introducing DeepCoder-14B-Preview - our fully open-sourced reasoning model reaching o1 and o3-mini level on coding and math.

The best part is, we’re releasing everything: not just the model, but the dataset, code, and training recipe—so you can train it yourself!🔥

Links below:
Alpay Ariyak (@alpayariyak) 's Twitter Profile Photo


Excited to present our project in collaboration with Agentica: a 14B LLM trained with Code RL that reaches OpenAI's o3-mini-low performance on coding benchmarks like LiveCodeBench, Codeforces, and HumanEval!

We open-source the code, data, weights, and full recipe
Alpay Ariyak (@alpayariyak) 's Twitter Profile Photo

I’ve seen some people using DeepCoder-1.5B as a speculator for DeepCoder-14B. Because the final stage for both was “self-play” RL, it won’t be a good speculator, as the two models diverge a lot. If there’s enough interest, we can train a good speculator (w/ logit distillation) & release it
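Why divergence hurts: in standard speculative sampling, a draft token with draft probability q and target probability p is accepted with probability min(1, p/q), so when two independently RL-trained models assign very different probabilities, acceptance (and thus speedup) collapses. A minimal sketch of that acceptance rule, with toy probabilities of my own choosing (not measured from either model):

```python
def accept_prob(p_target: float, p_draft: float) -> float:
    """Standard speculative-sampling acceptance rule: keep the draft
    token with probability min(1, p_target / p_draft)."""
    return min(1.0, p_target / p_draft)

# Toy numbers only. When draft and target agree on a token, it is
# almost always accepted; when their distributions have diverged
# (as after separate "self-play" RL runs), acceptance drops sharply.
aligned = accept_prob(p_target=0.9, p_draft=0.9)
diverged = accept_prob(p_target=0.2, p_draft=0.9)
print(aligned, diverged)  # 1.0 vs ~0.22
```

Logit distillation addresses exactly this: it trains the small model to match the large model's token distribution, pushing p/q toward 1 and acceptance back up.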

Alpay Ariyak (@alpayariyak) 's Twitter Profile Photo

Excited to join some great friends at Nous as one of the judges for their first hackathon! It will be focused on RL environments. Pull up, it will be fun :)

Agentica Project (@agentica_) 's Twitter Profile Photo


🚀 Introducing DeepSWE 🤖: our fully open-sourced, SOTA software engineering agent trained purely with RL on top of Qwen3-32B. DeepSWE achieves 59% on SWEBench-Verified with test-time scaling (and 42.2% Pass@1), topping the SWEBench leaderboard for open-weight models.

💪DeepSWE
Alpay Ariyak (@alpayariyak) 's Twitter Profile Photo


Excited to introduce DeepSWE-Preview, our latest model trained in collaboration with Agentica Project

Using only RL, we increase the performance of Qwen3-32B from 23% to 42.2% on SWE-Bench Verified!
Agentica Project (@agentica_) 's Twitter Profile Photo


It's easy to confuse Best@K vs Pass@K—and we've seen some misconceptions about our results.  

Our 59% on SWEBench-Verified is Pass@1 with Best@16, not Pass@8/16. Our Pass@8/16 is 67%/71%.  

So how did we achieve this? 

DeepSWE generates N candidate solutions. Then, another LLM
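The Best@K vs Pass@K distinction can be made concrete. Pass@k is the usual unbiased estimator (from the HumanEval evaluation methodology) of the chance that at least one of k samples passes; Best@K instead scores the single candidate a verifier picks out of K, so it is a Pass@1-style number. A sketch with illustrative counts of my own choosing (not the reported DeepSWE data):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n rollouts of which c are
    correct, passes the hidden tests."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots: some sample passes
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy setup: 16 rollouts per problem, 7 of which pass.
print(pass_at_k(16, 7, 1))  # Pass@1 = 7/16
print(pass_at_k(16, 7, 8))  # Pass@8: chance any of 8 samples passes
# Best@16 is different: a verifier selects ONE of the 16 candidates and
# only that pick is scored, so Best@16 <= Pass@16 by construction.
```

This is why 59% Pass@1-with-Best@16 can sit well below the 67%/71% Pass@8/16 numbers: the verifier must pick the right trajectory without a ground-truth check.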
Teortaxes▶️ (DeepSeek Twitter 🐋 die-hard fan 2023 – ∞) (@teortaxestex) 's Twitter Profile Photo


Important correction on DeepSWE-Preview, on SWE-Bench-Verified:
Pass@1 = 42.2%
"Best@8" = 59%, with trajectory selection done by a hybrid (execution-free + test-based, as in the R2E-Gym paper) verifier. I.e., the system itself can yield 59% w/o a ground-truth check.
Actual Pass@8 = 67%.