Shuyan Zhou (@shuyanzhxyc)'s Twitter Profile
Shuyan Zhou

@shuyanzhxyc

incoming asst. prof @dukecompsci | llm agents @AIatMeta | past: @SCSatCMU @LTIatCMU

ID: 991258825893134336

Link: https://shuyanzhou.com
Joined: 01-05-2018 10:11:35

571 Tweets

2.2K Followers

774 Following

Chen Wu (@chenhenrywu)

1/🧵 2025 is the "Year of AI Agents" 🤖 But can you trust them to shop for you? Our latest work reveals critical security risks: improving AI agent performance (e.g., inference-time methods like reflection and tree search) can also make them more vulnerable to adversarial

Noam Brown (@polynoamial)

There’s a long history of AI doing well at games, but that’s typically involved the AI *training* on that game. What makes this result so cool and significant is that the model was never trained on Pokemon and yet still does well.

Hao Zhang (@haozhangml)

Hi Elon Musk, DOGE's actions directly or indirectly affect many academic institutions in a very significant way. Take myself as an example -- UCSD has frozen new faculty hiring for this academic year (and many other consequences). Many of brilliant minds, including your own

Zora Wang (@zhiruow)

Excited to share that our CowPilot🐮 is accepted to #NAACL 2025 Demo Track! Definitely check out our user study if you're interested in trying out CowPilot: forms.gle/aYkXhh7fdRwZ94…

Graham Neubig (@gneubig)

We're planning a big workshop on agents at CMU April 10-11, come join us! Current schedule:
* Talks by Diyi Yang (Stanford), Qingyun Wu (Penn State/ag2), Aviral Kumar (CMU), me
* Panel discussion on agent safety
* Tutorials on relevant topics
* Followed by weekend hackathon

Jason Wei (@_jasonwei)

Recently I have taken on a more defensive style of “debugging-prioritized” model training:
- Create toy datasets that are easy to understand and have highly expected behavior as a sanity check for healthy ML training
- Put a super high cost on additional complexity not directly

Yu Su @#ICLR2025 (@ysu_nlp)

🔥2025 is the year of agents, but are we there yet?🤔 🤯 "An Illusion of Progress? Assessing the Current State of Web Agents" –– our new study shows that frontier web agents may be far less competent (up to 59%) than previously reported! Why were benchmark numbers inflated? -

Linlu Qiu (@linluqiu)

LLMs are increasingly used as agents that interact with users. To do so successfully, LLMs need to form beliefs and update them when new information becomes available. Do LLMs do so as expected from an optimal strategy? If not, can we get them to follow this strategy? 🧵

Graham Neubig (@gneubig)

Today's a big day! Months of work went into both of these releases, so we hope people enjoy them. OpenHands is now a great coding agent that you can run entirely locally (w/ OpenHands LM), and a great coding agent that you can run anywhere (w/ OpenHands Cloud).

Shuyan Zhou (@shuyanzhxyc)

we are excited to feature an incredible lineup of speakers—consider submitting your work and join us at our workshop at ICML 2025!

Bowen Wang (@bowenwangnlp)

🎮 Computer Use Agent Arena is LIVE! 🚀 🔥 Easiest way to test computer-use agents in the wild without any setup 🌟 Compare top VLMs: OpenAI Operator, Claude 3.7, Gemini 2.5 Pro, Qwen 2.5 vl and more 🕹️ Test agents on 100+ real apps & webs with one-click config 🔒 Safe & free

Xing Han Lu (@xhluca)

AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories. We find that rule-based evals underreport success rates, and

Stefania Druga (@stefania_druga)

Official news: I moved to Tokyo 🗼 Japan. Will be here for the next 3-4 years. If you are into AI Research, Education &| multimodal prototypes 🗻🍵 and 🚲 I would love to meet you! (RT appreciated)

Brandon Trabucco @ ICLR (@brandontrabucco)

Building LLM Agents? Come to my talk at the #ICLR DATA-FM workshop today at 2:30pm, Hall 4, Section 4. I'll be presenting InSTA, our work building the largest environment for agents on the live internet. arxiv.org/abs/2502.06776 #Agents #LLM

Hyungjoo Chae (@hyungjoochae)

🚀 Introducing Web-Shepherd: the first Process Reward Model (PRM) that guides web agents. 🌐 Current web browsing agents look cool, but they're not fully reliable! 😬They excel at simple tasks but struggle with complex ones. ❓ Can inference-time scaling help? Previous methods

Roy Xie (@royxie_)

Can we train reasoning LLMs to generate answers as they think? Introducing 𝐈𝐧𝐭𝐞𝐫𝐥𝐞𝐚𝐯𝐞𝐝 𝐑𝐞𝐚𝐬𝐨𝐧𝐢𝐧𝐠! We train LLMs to alternate between thinking & answering 🚀 Reducing Time-to-First-Token (TTFT) by over 80% ⚡AND improving Pass@1 accuracy up to 19.3%!📈 🧵 1/n

All Hands AI (@allhands_ai)

What if we could have *trustworthy* agents that don't just write code, but also do research, understand multimodal content, and perform many practically useful tasks? Today at OpenHands, we released a new agent that gets SOTA or competitive performance on 8 diverse tasks.

Junhong Shen (@junhongshen1)

🔥Unlocking New Paradigm for Test-Time Scaling of Agents! We introduce Test-Time Interaction (TTI), which scales the number of interaction steps beyond thinking tokens per step. Our agents learn to act longer➡️richer exploration➡️better success Paper: arxiv.org/abs/2506.07976
