XLANG NLP Lab (@xlangnlp)'s Twitter Profile
XLANG NLP Lab

@xlangnlp

Developing embodied AI agents that let users interact with digital and physical environments through language to carry out real-world tasks.

ID: 1678044379121057792

Link: https://xlang.ai
Date: 09-07-2023 14:12:50

103 Tweets

894 Followers

27 Following

Bowen Wang (@bowenwangnlp)

🎮 Computer Use Agent Arena is LIVE! 🚀 🔥 Easiest way to test computer-use agents in the wild without any setup 🌟 Compare top VLMs: OpenAI Operator, Claude 3.7, Gemini 2.5 Pro, Qwen 2.5 VL and more 🕹️ Test agents on 100+ real apps & websites with one-click config 🔒 Safe & free

Tianbao Xie (@tianbaox)

Finally we are here! Check out our most open & fair benchmark ⚔️ for computer use capability evaluation for the community.

Tao Yu (@taoyds)

🚀 After a year of development based on our OSWorld, Computer Use Agent Arena is LIVE! Test top AI agents (Operator, Claude 3.7...) on any kind of computer-use task with zero setup. Cloud-hosted, safe, and FREE! Try it now: arena.xlang.ai! Data & code coming soon!

XLANG NLP Lab (@xlangnlp)

👉 Compare and test Computer Use Agents (Operator, Claude 3.7...) on any kind of task on real computers 🚩 without any setup or cost 🚩! Try our Computer Use Agent Arena: arena.xlang.ai

XLANG NLP Lab (@xlangnlp)

🚀 Exciting news! OpenAI's o3 & o4-mini, the most capable reasoning models, are now live on Computer Agent Arena! Test, vote, and explore their full potential with CUAs at arena.xlang.ai! Join the community and dive in!

XLANG NLP Lab (@xlangnlp)

🎉 UI-TARS-1.5 is now live on Computer Agent Arena! Currently the SOTA model across multiple GUI benchmarks, showcasing leading performance in computer use, browser use, and even gameplay. Want to try the most intelligent CUA so far? Go to arena.xlang.ai.

XLANG NLP Lab (@xlangnlp)

🏆 Leaderboard Update! 🚀 Claude 3.7 Sonnet from Anthropic ties for #1 in Computer Agent Arena, followed by Operator from OpenAI & UI-TARS-1.5 from ByteDance, a ranking significantly different from prior benchmarks! Check the full rankings! 👉 arena.xlang.ai/leaderboard

Bowen Wang (@bowenwangnlp)

😀 Our initial leaderboard is finally out. Here I'd like to share a few interesting findings based on our case study: 1, Claude 3.7 Sonnet consistently performs best across diverse task types, particularly excelling at open-ended queries like “write a paper reading report.” 2,