Xing Han Lu (@xhluca)'s Twitter Profile
Xing Han Lu

@xhluca

Vibe agents @Mila_Quebec

ID: 943571700746211328

Website: http://xinghanlu.com

Joined: 20-12-2017 19:59:58

2.2K Tweets

2.2K Followers

290 Following

AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories  

We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories.

We find that rule-based evals underreport success rates, and