Zhoujun (Jorge) Cheng (@chengzhoujun)'s Twitter Profile
Zhoujun (Jorge) Cheng

@chengzhoujun

CS Ph.D. @UCSanDiego | Prev. @XLangNLP @MSFTResearch @sjtu1896

ID: 1462759457528553482

Link: http://blankcheng.github.io | Joined: 22-11-2021 12:27:18

190 Tweets

715 Followers

524 Following

Tianbao Xie (@tianbaox):

Countless iterations went into cooking it, but the process is satisfying. I still believe we can pour more data into each stage if we have more hands, so the potential is unlimited and the scaling law hasn't hit the wall yet! Towards Digital Agents 🤖 We are already on the way.

Qian Liu (@sivil_taram):

Wrapped up a SWE-Perf website redesign using Qwen3-Coder on AnyCoder (huggingface.co/spaces/akhaliq…). The process was incredibly fast and great!

One question for Qwen devs, though: did you pretrain a secret love for the color purple into the coder's persona? 😉
Chujie Zheng (@chujiezheng):

Proud to introduce Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant RL algorithm that powers the large-scale RL training of the latest Qwen3 models (Instruct, Coder, Thinking) 🚀

📄 huggingface.co/papers/2507.18…
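The tweet only announces GSPO; as a rough, unofficial illustration of what a sequence-level (rather than per-token) clipped importance ratio can look like, here is a minimal sketch. The function name, tensor shapes, length-normalized ratio, and clipping range are all assumptions for illustration, not the paper's code.

```python
import torch

def sequence_level_clipped_loss(logp_new, logp_old, advantages, mask, eps=0.2):
    """Illustrative sequence-level clipped objective (not the official GSPO code).

    logp_new, logp_old: (batch, seq_len) per-token log-probs under the current
    and rollout policies; mask: (batch, seq_len) float mask over response tokens;
    advantages: (batch,) one advantage per sampled sequence (e.g. group-normalized).
    """
    lengths = mask.sum(dim=-1).clamp(min=1.0)
    # Length-normalized sequence-level ratio:
    # ratio_i = (pi_new(y_i | x) / pi_old(y_i | x)) ** (1 / |y_i|)
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    ratio = log_ratio.exp()
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # PPO-style pessimistic objective, applied per sequence instead of per token.
    return -torch.minimum(ratio * advantages, clipped * advantages).mean()
```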
Feng Yao (@fengyao1909):

Failing on large-scale RL with VeRL?

⚠️ Mixing an inference backend (vLLM/SGLang) with training backends (FSDP/Megatron) secretly turns your RL into off-policy, even if they share the same weights!

📉 Blog:
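For context on the fix referenced later in this feed (the TIS/FlashRL update below), a common remedy is to reweight the policy gradient by a capped ratio between the trainer's and the rollout engine's token probabilities. The snippet below is a minimal sketch of that idea under assumed tensor shapes; it is not the FlashRL implementation.

```python
import torch

def truncated_is_pg_loss(logp_train, logp_rollout, advantages, mask, ratio_cap=2.0):
    """Illustrative truncated importance sampling (TIS) correction.

    logp_train: (batch, seq_len) token log-probs recomputed by the training backend.
    logp_rollout: (batch, seq_len) token log-probs returned by the inference engine.
    Even with identical weights, numerics differ, so the rollout data is mildly
    off-policy; reweighting by a capped ratio corrects the gradient estimate.
    """
    ratio = (logp_train.detach() - logp_rollout).exp()   # pi_train / pi_rollout per token
    ratio = torch.clamp(ratio, max=ratio_cap)            # truncate to bound variance
    weighted = ratio * advantages.unsqueeze(-1) * logp_train * mask
    return -weighted.sum() / mask.sum()
```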
Jinjie Ni @ ICLR'25 🇸🇬 (@nijinjie):

Token crisis: solved. ✅

We pre-trained diffusion language models (DLMs) vs. autoregressive (AR) models from scratch, up to 8B params, 480B tokens, 480 epochs.

Findings:
> DLMs beat AR when tokens are limited, with >3× data potential.
> A 1B DLM trained on just 1B tokens
Tianbao Xie (@tianbaox):

🚀 OSWorld gets a major upgrade! OSWorld-Verified: 15 months of community feedback → 300+ fixes (ambiguity, graders…), 50x faster eval through AWS parallelization. More apples-to-apples comparison for reliable CUA evaluation ✨ 👇 xlang.ai/blog/osworld-v…

Feng Yao (@fengyao1909):

⚡ FP8 makes RL faster, but at the cost of performance.

We present FlashRL, the first open-source & working RL recipe that applies INT8/FP8 for rollout without losing performance compared to BF16!

📝 Blog:
Zhoujun (Jorge) Cheng (@chengzhoujun):

Yes, brutally true. I tend to see LLM RL ≈ on-policy self-distilled SFT with reward re-weighting. The key differences between LLM SFT (rejection sampling) and RL are: 1. Negative examples, or more precisely, advantage-weighted samples. 2. On-policyness: even iterated SFT is more
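The analogy in the tweet can be made concrete with a toy loss. The sketch below (all names and shapes are illustrative assumptions, not from any specific codebase) shows rejection-sampling SFT as the special case where samples are weighted 0/1 by reward, while the RL view weights every on-policy sample by a possibly negative advantage.

```python
import torch

def weighted_nll(logp, mask, weights):
    """Per-sequence negative log-likelihood, weighted per sample.

    logp: (batch, seq_len) token log-probs of sampled responses under the model.
    mask: (batch, seq_len) response-token mask. weights: (batch,) per-sample weight.
    """
    seq_logp = (logp * mask).sum(dim=-1)
    return -(weights * seq_logp).mean()

# Rejection-sampling SFT: keep only samples whose reward passes a threshold;
# every kept sample gets the same non-negative weight.
def rejection_sampling_sft_loss(logp, mask, rewards, threshold=0.5):
    weights = (rewards > threshold).float()
    return weighted_nll(logp, mask, weights)

# "RL as reward-reweighted SFT": every fresh on-policy sample contributes,
# weighted by its advantage, which can be negative (pushing probability down).
def policy_gradient_loss(logp, mask, rewards):
    advantages = rewards - rewards.mean()   # simple mean-reward baseline
    return weighted_nll(logp, mask, advantages)
```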

Wen-Tse Chen (@wenzechen2):

[0/3] 🚀 Introducing Verlog, an open-source RL framework built specifically for training long-horizon, multi-turn LLM agents.

📊 Max episode length comparison:
• VeRL / RAGEN → ~10 turns
• verl-agent → ~50 turns
• Verlog (ours) → 400+ turns 🔥

⚙️ Technical foundation:

Tianbao Xie (@tianbaox):

Where are our computer-use agents (CUA) standing on OSWorld-Verified? Potentially already ~80%.

We made this analysis, which summarizes the latest OSWorld-Verified submissions with 27 models evaluated over 369 tasks, and conducted a case study on the o3+Jedi-7B approach to
Xinyuan Wang (@xywang626):

We are super excited to release OpenCUA, the first 0-to-1 computer-use agent foundation model framework and open-source SOTA model OpenCUA-32B, matching top proprietary models on OSWorld-Verified, with full infrastructure and data.

🔗 [Paper] arxiv.org/abs/2508.09123
📌
Prophet Arena (@prophetarena):

🔮 Introducing Prophet Arena, the AI benchmark for general predictive intelligence.

That is, can AI truly predict the future by connecting today's dots?

👉 What makes it special?

- It can't be hacked. Most benchmarks saturate over time, but here models face live, unseen
Dynamics Lab (@dynamicslab_ai):

Try Mirage 2 now → dynamicslab.ai. Here are a few worlds we've created, starting from images and prompts. More in the thread 👇 2/

Zhiting Hu (@zhitinghu):

🔥 Super excited to launch Mirage 2, a big leap toward a general-purpose world engine for live interactive play 🎮 Hard to believe how far we've come in just one month since Mirage 1 ⏩ If you're impressed by Genie 3, come play with Mirage 2. It's live, offering an extended

Yichao Fu (@fuyichao123):

Excited to share my 1st project as a Research Scientist Intern at Meta FAIR! Grateful to my mentor Jiawei Zhao for guidance, and to Yuandong Tian & Xuewei for their valuable advice and collaboration. Our work DeepConf explores local confidence for more accurate & efficient LLM reasoning!

Yuandong Tian (@tydsh):

We released DeepConf, which can achieve 99.9% on AIME'25 with open-source models using only 15% of the compute compared to majority voting@512. The secret? Simple: just prune the rollouts if they show a consecutive stream of low confidence 😀. Can be applied to any models
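The pruning rule described above can be sketched as a simple check on recent token confidences. The window size, threshold, and confidence measure below are assumptions for illustration, not the paper's exact settings.

```python
import math

def should_prune(token_logprobs, window=16, min_conf=0.3):
    """Illustrative confidence-based rollout pruning (not the official DeepConf code).

    token_logprobs: log-probabilities of the tokens generated so far in one rollout.
    Prune the rollout if the last `window` tokens all fall below a confidence
    threshold, i.e. the model has produced a consecutive stream of low-confidence
    tokens.
    """
    if len(token_logprobs) < window:
        return False
    recent = token_logprobs[-window:]
    return all(math.exp(lp) < min_conf for lp in recent)

# During generation, check after each new token and stop the rollout early:
# if should_prune(logprobs): break
```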

Feng Yao (@fengyao1909):

We are glad that TIS and FlashRL have received broad attention from the open-source community and have been verified and supported (OpenRLHF: Jian Hu, SkyRL: NovaSky, REINFORCE++: Jian Hu, OAT: Zichen Liu)!

A few updates on our blog and FlashRL package:
(1) more in-depth
Daria Soboleva (@dmsobol):

The router wasn't learning at first. We debugged it step by step and showed how, despite perfect load balancing, routing can be completely useless. We root-caused and fixed the problem. Papers skip this methodology, but you can find all the details in part 3 of our MoE 101 series.
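The failure mode being described is easy to reproduce in a toy setting: a router whose logits ignore the token entirely can still look perfectly load-balanced. The example below is my own illustration of that point, not code from the MoE 101 post.

```python
import torch

torch.manual_seed(0)
tokens = torch.randn(1024, 64)          # a batch of token representations
num_experts = 8

# A "useless" router: its logits are independent of the token content.
random_logits = torch.randn(tokens.shape[0], num_experts)
assignments = random_logits.argmax(dim=-1)

# Load is near-perfectly balanced across experts...
load = torch.bincount(assignments, minlength=num_experts).float() / tokens.shape[0]
print(load)                             # roughly 1/8 per expert

# ...yet the assignment carries no information about the token, so a
# load-balancing metric alone cannot distinguish a learned router from a random one.
```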

Ari Holtzman (@universeinanegg):

One of the reasons academic science is so bad at producing, or even accepting, novelty is that the focus on hypothesis-driven science has caused people to view exploratory studies as inherently unrigorous.