Jiawei Wang (@jarvismsustc) Twitter Tweets • TwiCopy

Jiawei Wang

@jarvismsustc

Joint PhD Candidate @USTC and @MSFTResearch. Intern @MSFTResearch @deepseek_ai @BytedanceTalk

ID: 1460775848718610439

Link: https://jarvisustc.github.io/

Joined: 17-11-2021 01:04:52

28 Tweets

42 Followers

201 Following

AK

@_akhaliq

4 months ago

WideSearch: Benchmarking Agentic Broad Info-Seeking

89 Likes

4 Replies

15 Retweets

Excited to be part of Eval Sys and contribute to our debut milestone, MCPMark! Follow us for updates, and come join the team. We're just getting started! GitHub: github.com/eval-sys/mcpma… Website: mcpmark.ai Hugging Face trajectory log: huggingface.co/datasets/Jakum…

14 Likes

0 Replies

3 Retweets

Xiangyan Liu

@dobogiyy

4 months ago

Sharing some thoughts from development; hope they help 👇
1/ Choosing the initial state defines task diversity, difficulty, and usefulness.
2/ State tracking and management is the trickiest stage. Each MCP needs its own isolation strategy. Worth it though: sandboxing

11 Likes

0 Replies

5 Retweets

Eval Sys

@evalsysorg

3 months ago

MCPMark Leaderboard Update 🚀 🌟 Qwen-3-Coder takes the #1 spot among open-source models, with an impressive per-run cost of just $36.46. ⚡️ Grok-Code-Fast-1 delivers the lowest per-run cost ($16.08) and the fastest average agent time (156.63s) across the top 10 models.

134 Likes

5 Replies

22 Retweets

Jiawei Wang

@jarvismsustc

2 months ago

Our latest blog post uses detailed experiments to explore a key cause of RL training collapse: training-inference mismatch. It may be a useful reference for your work, and we welcome any discussion. 😊

7 Likes

0 Replies

1 Retweet

Jiawei Wang

@jarvismsustc

2 months ago

You're welcome to evaluate your agents on MCPMark!

6 Likes

0 Replies

2 Retweets

Yingru Li

@richardyrli

a month ago

Daniel Han, glad you liked the post! You're spot on to suspect lower-level implementation issues. That's exactly what we found in the original blog. The disable_cascade_attn finding (Sec 4.2.4) was the symptom, but the root cause was that silent FlashAttention-2 kernel bug

341 Likes

8 Replies

24 Retweets

Yingru Li

@richardyrli

a month ago

🚨 UPDATE to the "1 bit per episode" analysis (inspired by John Schulman's post at Thinking Machines): After discussion with mgostIH, I need to point out that the limit only applies to *scalar advantage*! REINFORCE with per-timestep advantages can learn O(T) bits when rewards are

17 Likes

1 Reply

8 Retweets