Tao Yu (@taoyds) 's Twitter Profile
Tao Yu

@taoyds

@XLangNLP lab, asst. prof. @HKUniversity. prev. postdoc @uwnlp; phd @Yale; he/him 🌈

ID: 709213739145437184

https://taoyds.github.io/ · 14-03-2016 03:05:06

426 Tweets

4.4K Followers

869 Following

chang ma (@ma_chang_nlp) 's Twitter Profile Photo

Excited to share our work at ICLR 2025 in 🇸🇬. ICLR 2026 🥳 Happy to chat about LLM reasoning & planning, agents, and AI4Science! 📍Sat 26 Apr, 3 p.m. to 5:30 p.m. CST, Hall 3 + Hall 2B #554
Rui Zhang (@ruizhang_nlp) 's Twitter Profile Photo

📢 GreaterPrompt is Now Live! We're excited to introduce GreaterPrompt, a unified, customizable, and high-performance open-source toolkit for prompt optimization. 🔍 Key Features: - 5 Optimization Methods: APO, APE, PE2, GReaTer, and TextGrad - 4 Model Families: GPT, Mistral,

XLANG NLP Lab (@xlangnlp) 's Twitter Profile Photo

🚀 Exciting news! OpenAI's o3 & o4-mini, the most capable reasoning models, are now live on Computer Agent Arena! Test, vote, and explore their full potential with CUAs at arena.xlang.ai! Join the community and dive in!

Tao Yu (@taoyds) 's Twitter Profile Photo

Computer use often involves long contexts, and users frequently tweak or follow up on requests. Though Claude 3.7/Operator aren’t perfect, this example shows their engaging and instruction-following abilities are growing (see the arena example): arena.xlang.ai/share_preview/….

Qwen (@alibaba_qwen) 's Twitter Profile Photo

Introducing Qwen3! We release and open-weight Qwen3, our latest large language models, including 2 MoE models and 6 dense models, ranging from 0.6B to 235B. Our flagship model, Qwen3-235B-A22B, achieves competitive results in benchmark evaluations of coding, math, general

XLANG NLP Lab (@xlangnlp) 's Twitter Profile Photo

🏆 Leaderboard Update! 🚀 Claude 3.7 Sonnet from Anthropic ties #1 in Computer Agent Arena, followed by Operator from OpenAI & UI-TARS-1.5 from ByteDance, which is significantly different from prior benchmarks! Check the full rankings! 👉 arena.xlang.ai/leaderboard

Bowen Wang (@bowenwangnlp) 's Twitter Profile Photo

😀Our initial leaderboard finally came out, here I'd like to share a few interesting findings based on our case study: 1, Claude 3.7 Sonnet consistently performs best across diverse task types, particularly excelling at open-ended queries like “write a paper reading report.” 2,

Tao Yu (@taoyds) 's Twitter Profile Photo

🤔Static CUA benchmarks enable fast model dev but lack task variety and risk overfitting.

Computer Agent Arena tests crowdsourced real-world tasks.

OSWorld: 🥇UI-Tars1.5 🥈Operator 🥉Claude 3.7
CUA Arena: 🥇Claude 3.7 🥈Operator 🥉UI-Tars1.5

🚀Rankings likely to evolve quickly
Diyi Yang (@diyi_yang) 's Twitter Profile Photo

🚀 Introducing CAVA: The Comprehensive Assessment for Voice Assistants A new benchmark for evaluating end-to-end, speech-in-speech-out voice assistants in real-world scenarios. We go beyond single tasks or metrics to test the capabilities required for voice assistants:

Yu Su @#ICLR2025 (@ysu_nlp) 's Twitter Profile Photo

New AI/LLM Agents Track at #EMNLP2025! In the past few years, it felt a bit odd to submit agent work to *CL venues because one had to awkwardly fit it into Question Answering or NLP Applications. Glad to see agent research finally finds a home at *CL! Kudos to the PC for
ComputerUseAgents Workshop (@workshopcua) 's Twitter Profile Photo

We're excited to invite Victor Zhong (@hllo_wrld) as a speaker at the workshop on Computer Use Agents - ICML Conference 2025! 🤖💻 He is an Assistant Professor at the University of Waterloo and a Canada CIFAR AI Chair at the Vector Institute. His research focuses on enabling and
Caiming Xiong (@caimingxiong) 's Twitter Profile Photo

Graphical user interface (GUI) grounding, one of the two key abilities (Grounding & Planning) for Computer-use Agent (e.g. Operator) that map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer-use agent

XLANG NLP Lab (@xlangnlp) 's Twitter Profile Photo

🔥New Computer Agent Arena Leaderboard Updates (2k+ user votes)!
🤔Which VLMs act better as computer use agents (CUAs)?

1, Claude Sonnet 4 🥇
2, Claude 3.7 Sonnet 🥈
3, UI-TARS-1.5 🥉
4, Operator

More insights in the thread 👇
arena.xlang.ai