Rui Yang (@ruiyang70669025) Twitter Tweets • TwiCopy

Yong Lin

4 months ago

🔥Our Goedel-Prover-V2-32B topped the PutnamBench Leaderboard by solving 86 problems, nearly 2× more than the previous SOTA DeepSeek-Prover-V2-671B (solved 47), while using:
* 1/20 the model size (32B vs. 671B)
* 1/5 the passes (184 vs. 1024)
Meanwhile, we also release *

Likes: 90 · Replies: 2 · Retweets: 15

Tianbao Xie

@tianbaox

4 months ago

🚀 OSWorld gets a major upgrade! OSWorld-Verified: 15 months of community feedback → 300+ fixes (ambiguity, graders…), 50x faster eval through AWS parallelization. A more apples-to-apples comparison for reliable CUA evaluation ✨ 👇xlang.ai/blog/osworld-v…

Likes: 134 · Replies: 7 · Retweets: 29

Chenlu Ye

@ye_chenlu

3 months ago

PROF🌀Right answer, flawed reason?🤔🌀 📄arxiv.org/pdf/2509.03403 Excited to share our work: PROF-PRocess cOnsistency Filter! 🚀 Challenge: ORM is blind to flawed logic, and PRM suffers from reward hacking. Our method harmonizes strengths of PRM & ORM. #LLM #ReinforcementLearning

Likes: 37 · Replies: 2 · Retweets: 11

Manling Li

@manlingli_

3 months ago

Check out the 1st Behavior Challenge, co-hosted with our Foundation Models for Embodied Agent Challenge at NeurIPS …models-meet-embodied-agents.github.io/behavior_chall… When I first moved my focus from LLMs/VLMs toward embodied agents, I expected the biggest challenges would be around perception, motor

Likes: 54 · Replies: 2 · Retweets: 14

Rui Yang

@ruiyang70669025

3 months ago

Check out the new benchmark accepted to NeurIPS 2025 DB Track! We evaluate model merging algorithms across instruction following, math, multilingual understanding, coding, and safety.

Likes: 6 · Replies: 0 · Retweets: 1

Yujia Qin@ICLR2025

@tsingyoga

2 months ago

The tool/env infra behind UI-TARS-2 is open-sourced. Enjoy the All-in-One Agent Sandbox!🥳 sandbox.agent-infra.com github.com/agent-infra/sa…

Likes: 230 · Replies: 10 · Retweets: 38

Cheng Qian

@qiancheng1231

2 months ago

🚀 Introducing UserRL: a new framework to train agents that truly assist users through proactive interaction, not just chase static benchmarking scores. 📄 Paper: arxiv.org/pdf/2509.19736 💻 Code: github.com/SalesforceAIRe…

Likes: 218 · Replies: 4 · Retweets: 45

Xin Eric Wang @ ICLR 2025

@xwang_lk

2 months ago

🚀 Introducing Agent S3, the most advanced computer-use agent, now approaching human-level performance🧠💻 Just one year ago, Agent S scored ~20% on OSWorld: SOTA then, but far from human 72%. Today, Agent S3 reaches 69.9% (⬆10% over

Likes: 1.1K · Replies: 67 · Retweets: 251

Shizhe Diao

@shizhediao

2 months ago

🚀 Introducing BroRL: Scaling Reinforcement Learning via Broadened Exploration When step-scaling hits a plateau, scale rollouts, not steps. BroRL takes reinforcement learning beyond saturation—reviving stalled models by expanding exploration with large-N rollouts. 👇 (1/n)

Likes: 207 · Replies: 19 · Retweets: 42

Hanze Dong @ ICLR 2025

@hendrydong

2 months ago

💥Thrilled to share our new work Reinforce-Ada, which fixes signal collapse in GRPO 🥳No more blind oversampling or dead updates. Just sharper gradients, faster convergence, and stronger models. ⚙️ One-line drop-in. Real gains. arxiv.org/html/2510.0499… github.com/RLHFlow/Reinfo…

Likes: 182 · Replies: 9 · Retweets: 24

Zhenhailong Wang

@zhenhailongw

2 months ago

Multimodal conversational agents struggle to follow complex policies, which also impose a fixed computational cost. We ask: 👉 How can we achieve stronger policy-following behavior without having to include policies in-context? 🌐: mikewangwzhl.github.io/TriMPI/ 🧵1/3

Likes: 37 · Replies: 1 · Retweets: 12

Rui Yang

@ruiyang70669025

2 months ago

🥳 Excited to share ERA: our training recipe for VLM-based embodied agents with interleaved perception + reasoning, tackling both high-level planning and low-level manipulation. We cover embodied-knowledge data curation and agent RL design. 🔎 Findings 1️⃣ Beyond

Likes: 78 · Replies: 1 · Retweets: 14

Manling Li

@manlingli_

2 months ago

World Model Reasoning for VLM Agents (NeurIPS 2025, Score 5544) We release VAGEN to teach VLMs to build internal world models via visual state reasoning:
- StateEstimation: what is the current state?
- TransitionModeling: what is next?
MDP → POMDP shift to handle the partial

Likes: 298 · Replies: 3 · Retweets: 66

Manling Li

@manlingli_

a month ago

VLAs, VLMs, LLMs, and Vision Foundation Models for Embodied Agents! There are just so many new updates in recent months! We have updated our tutorial, come and join us if you would like to discuss the latest advances! Room: 306B Time: 1pm-5pm Slides: …models-meet-embodied-agents.github.io

Likes: 364 · Replies: 7 · Retweets: 40

Daniel Kang

@daniel_d_kang

a month ago

🤖 Feeling excited about the future of household robotic agents (i.e., embodied agents)? You should also consider their safety! 🔪Meet BEAT: the first visual backdoor attack on MLLM-based embodied agents. 🧵 1/7

Likes: 20 · Replies: 1 · Retweets: 6

Han Zhao

@hanzhao_ml

a month ago

Glad to share that our paper has been awarded the Outstanding Paper Award at EMNLP '25!! I am not attending the conference, but please find Jingyan Shen and talk to her if you want to know more details!

Likes: 59 · Replies: 7 · Retweets: 4

Rui Yang

@ruiyang70669025

a month ago

Thrilled to share our paper (arxiv.org/pdf/2505.24846) won an EMNLP 2025 Outstanding Paper Award! 🎉🎉 Huge congrats to the team Jingyan Shen Jiarui Yao Yifan Sun Feng Luo Rui Pan, and big thanks to our advisors Prof. Tong Zhang and Han Zhao!

Likes: 26 · Replies: 0 · Retweets: 3

Qineng Wang

@qineng_wang

9 days ago

Most VLM benchmarks watch the world; few ask how actions *change* it from a robot's eye. Embodied cognition tells us that intelligence isn't just watching – it's enacted through interaction. 👉We introduce ENACT: A benchmark that tests if VLMs can track the evolution of a

Likes: 234 · Replies: 6 · Retweets: 55

Cua

@trycua

5 days ago

1/n We went through 45 Computer-Use Agent papers from NeurIPS 2025 - here's what stood out.

Likes: 31 · Replies: 3 · Retweets: 9