Tao Yu (@taoyds) Twitter Tweets • TwiCopy

chang ma

7 months ago

Excited to share our work at ICLR 2025 in 🇸🇬. @iclr_conf 🥳 Happy to chat about LLM reasoning & planning, agents, and AI4Science! 📍Sat 26 Apr, 3 p.m. to 5:30 p.m. CST, Hall 3 + Hall 2B #554

Likes: 33 · Replies: 0 · Retweets: 6

Rui Zhang

📢 GreaterPrompt is Now Live! We're excited to introduce GreaterPrompt, a unified, customizable, and high-performance open-source toolkit for prompt optimization.

🔍 Key Features:
- 5 Optimization Methods: APO, APE, PE2, GReaTer, and TextGrad
- 4 Model Families: GPT, Mistral,

Likes: 11 · Replies: 0 · Retweets: 4

XLANG NLP Lab

@xlangnlp

7 months ago

🚀 Exciting news! OpenAI's o3 & o4-mini, the most capable reasoning models, are now live on Computer Agent Arena! Test, vote, and explore their full potential with CUAs at arena.xlang.ai! Join the community and dive in!

Likes: 14 · Replies: 2 · Retweets: 4

Tao Yu

@taoyds

7 months ago

👉Try UI-TARS-1.5 and other computer use agents (Operator, Claude 3.7) at arena.xlang.ai!

Likes: 1 · Replies: 0 · Retweets: 0

Tao Yu

@taoyds

7 months ago

Computer use often involves long contexts, and users frequently tweak or follow up on requests. Though Claude 3.7/Operator aren’t perfect, this example shows their engaging and instruction-following abilities are growing (see the arena example): arena.xlang.ai/share_preview/….

Likes: 15 · Replies: 1 · Retweets: 5

Deep Learning For Code @ ICLR'25

@dl4code

7 months ago

DL4C is going wild with Tao Yu's talk on multimodal code gen and Xingyao Wang's talk on OpenHands agents. #ICLR #ICLR2025

Likes: 18 · Replies: 0 · Retweets: 5

Qwen

@alibaba_qwen

7 months ago

Introducing Qwen3! We release and open-weight Qwen3, our latest large language models, including 2 MoE models and 6 dense models, ranging from 0.6B to 235B. Our flagship model, Qwen3-235B-A22B, achieves competitive results in benchmark evaluations of coding, math, general

Likes: 7.7K · Replies: 316 · Retweets: 1.1K

XLANG NLP Lab

@xlangnlp

7 months ago

🏆 Leaderboard Update! 🚀 Claude 3.7 Sonnet from Anthropic ties #1 in Computer Agent Arena, followed by Operator from OpenAI & UI-TARS-1.5 from ByteDance, which is significantly different from prior benchmarks! Check the full rankings! 👉 arena.xlang.ai/leaderboard

Likes: 89 · Replies: 2 · Retweets: 23

Bowen Wang

@bowenwangnlp

7 months ago

😀Our initial leaderboard is finally out. Here I'd like to share a few interesting findings from our case study: 1, Claude 3.7 Sonnet consistently performs best across diverse task types, particularly excelling at open-ended queries like “write a paper reading report.” 2,

Likes: 14 · Replies: 1 · Retweets: 5

lmarena.ai (formerly lmsys.org)

@lmarena_ai

7 months ago

Check out the latest release from Computer Agent Arena!

Likes: 105 · Replies: 0 · Retweets: 12

Tao Yu

@taoyds

7 months ago

🤔Static CUA benchmarks enable fast model dev but lack task variety and risk overfitting. Computer Agent Arena tests crowdsourced real-world tasks.

OSWorld: 🥇UI-Tars1.5 🥈Operator 🥉Claude 3.7
CUA Arena: 🥇Claude 3.7 🥈Operator 🥉UI-Tars1.5

🚀Rankings likely to evolve quickly

Likes: 36 · Replies: 0 · Retweets: 11

Diyi Yang

@diyi_yang

6 months ago

🚀 Introducing CAVA: The Comprehensive Assessment for Voice Assistants A new benchmark for evaluating end-to-end, speech-in-speech-out voice assistants in real-world scenarios. We go beyond single tasks or metrics to test the capabilities required for voice assistants:

Likes: 174 · Replies: 4 · Retweets: 32

Yu Su @#ICLR2025

@ysu_nlp

6 months ago

New AI/LLM Agents Track at #EMNLP2025! In the past few years, it felt a bit odd to submit agent work to *CL venues because one had to awkwardly fit it into Question Answering or NLP Applications. Glad to see agent research finally find a home at *CL! Kudos to the PC for

Likes: 183 · Replies: 9 · Retweets: 22

ComputerUseAgents Workshop

@workshopcua

6 months ago

We're excited to invite Victor Zhong (@hllo_wrld) as a speaker at the workshop on Computer Use Agents - ICML Conference 2025! 🤖💻 He is an Assistant Professor at the University of Waterloo and a Canada CIFAR AI Chair at the Vector Institute. His research focuses on enabling and

Likes: 8 · Replies: 1 · Retweets: 3

AK

@_akhaliq

6 months ago

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Likes: 172 · Replies: 1 · Retweets: 33

Tao Yu

@taoyds

6 months ago

Big congrats, Wei-Lin!

Likes: 2 · Replies: 0 · Retweets: 0

XLANG NLP Lab

@xlangnlp

6 months ago

💠Claude Opus 4 & Claude Sonnet 4: Welcome to the Computer Agent Arena🔥 Congratulations to the Anthropic team on the great release!

Likes: 9 · Replies: 2 · Retweets: 4

Tao Yu

@taoyds

6 months ago

Try out Claude 4 on Computer Agent Arena!

Likes: 3 · Replies: 0 · Retweets: 0

Caiming Xiong

@caimingxiong

6 months ago

Graphical user interface (GUI) grounding, one of the two key abilities (Grounding & Planning) for Computer-use Agent (e.g. Operator) that map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer-use agent

Likes: 152 · Replies: 3 · Retweets: 33

XLANG NLP Lab

@xlangnlp

5 months ago

🔥New Computer Agent Arena Leaderboard Updates (2k+ user votes)!
🤔Which VLMs act better as computer use agents (CUAs)?
1, Claude Sonnet 4 🥇
2, Claude 3.7 Sonnet 🥈
3, UI-TARS-1.5 🥉
4, Operator
More insights in the thread 👇 arena.xlang.ai

Likes: 38 · Replies: 1 · Retweets: 18
