Shuyan Zhou (@shuyanzhxyc) Twitter Tweets • TwiCopy

Chen Wu

1/🧵 2025 is the "Year of AI Agents" 🤖 But can you trust them to shop for you? Our latest work reveals critical security risks: improving AI agent performance (e.g., inference-time methods like reflection and tree search) can also make them more vulnerable to adversarial

Likes: 59 · Replies: 2 · Retweets: 25

Noam Brown

@polynoamial

4 months ago

There’s a long history of AI doing well at games, but that’s typically involved the AI *training* on that game. What makes this result so cool and significant is that the model was never trained on Pokemon and yet still does well.

Likes: 1.1K · Replies: 78 · Retweets: 113

Hao Zhang

@haozhangml

4 months ago

Hi Elon Musk, DOGE's actions directly or indirectly affect many academic institutions in a very significant way. Take myself as an example -- UCSD has frozen new faculty hiring for this academic year (and many other consequences). Many of brilliant minds, including your own

Likes: 314 · Replies: 12 · Retweets: 19

Zora Wang

@zhiruow

4 months ago

Excited to share that our CowPilot🐮 is accepted to #NAACL 2025 Demo Track! Definitely check out our user study if you're interested in trying out CowPilot: forms.gle/aYkXhh7fdRwZ94…

Likes: 37 · Replies: 0 · Retweets: 67

Graham Neubig

@gneubig

3 months ago

We're planning a big workshop on agents at CMU April 10-11, come join us! Current schedule:
* Talks by Diyi Yang (Stanford), Qingyun Wu (Penn State/ag2), Aviral Kumar (CMU), me
* Panel discussion on agent safety
* Tutorials on relevant topics
* Followed by weekend hackathon

Likes: 157 · Replies: 4 · Retweets: 108

Jason Wei

@_jasonwei

3 months ago

Recently I have taken on a more defensive style of “debugging-prioritized” model training:
- Create toy datasets that are easy to understand and have highly expected behavior as a sanity check for healthy ML training
- Put a super high cost on additional complexity not directly

Likes: 326 · Replies: 8 · Retweets: 23

Yu Su @#ICLR2025

@ysu_nlp

3 months ago

🔥2025 is the year of agents, but are we there yet?🤔 🤯 "An Illusion of Progress? Assessing the Current State of Web Agents" –– our new study shows that frontier web agents may be far less competent (up to 59%) than previously reported! Why were benchmark numbers inflated? -

Likes: 230 · Replies: 10 · Retweets: 66

Linlu Qiu

@linluqiu

3 months ago

LLMs are increasingly used as agents that interact with users. To do so successfully, LLMs need to form beliefs and update them when new information becomes available. Do LLMs do so as expected from an optimal strategy? If not, can we get them to follow this strategy? 🧵

Likes: 373 · Replies: 3 · Retweets: 73

Graham Neubig

@gneubig

3 months ago

Today's a big day! Months of work went into both of these releases, so we hope people enjoy them. OpenHands is now a great coding agent that you can run entirely locally (w/ OpenHands LM), and a great coding agent that you can run anywhere (w/ OpenHands Cloud).

Likes: 162 · Replies: 2 · Retweets: 24

Shuyan Zhou

@shuyanzhxyc

3 months ago

this is big! the true general interface 😜

Likes: 15 · Replies: 1 · Retweets: 1

Shuyan Zhou

@shuyanzhxyc

3 months ago

Here we go! Hope you enjoy 😉

Likes: 47 · Replies: 0 · Retweets: 2

Shuyan Zhou

@shuyanzhxyc

3 months ago

we are excited to feature an incredible lineup of speakers—consider submitting your work and joining us at our workshop at ICML 2025!

Likes: 10 · Replies: 0 · Retweets: 0

Bowen Wang

@bowenwangnlp

3 months ago

🎮 Computer Use Agent Arena is LIVE! 🚀 🔥 Easiest way to test computer-use agents in the wild without any setup 🌟 Compare top VLMs: OpenAI Operator, Claude 3.7, Gemini 2.5 Pro, Qwen 2.5 vl and more 🕹️ Test agents on 100+ real apps & webs with one-click config 🔒 Safe & free

Likes: 333 · Replies: 14 · Retweets: 104

Xing Han Lu

@xhluca

2 months ago

AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories. We find that rule-based evals underreport success rates, and

Likes: 230 · Replies: 4 · Retweets: 100

Stefania Druga

@stefania_druga

2 months ago

Official news: I moved to Tokyo 🗼 Japan. Will be here for the next 3-4 years. If you are into AI Research, Education &| multimodal prototypes 🗻🍵 and 🚲 I would love to meet you! (RT appreciated)

Likes: 474 · Replies: 28 · Retweets: 29

Brandon Trabucco @ ICLR

@brandontrabucco

2 months ago

Building LLM Agents? Come to my talk at the #ICLR DATA-FM workshop today at 2:30pm, Hall 4, Section 4. I'll be presenting InSTA, our work building the largest environment for agents on the live internet. arxiv.org/abs/2502.06776 #Agents #LLM

Likes: 44 · Replies: 2 · Retweets: 12

Hyungjoo Chae

@hyungjoochae

a month ago

🚀 Introducing Web-Shepherd: the first Process Reward Model (PRM) that guides web agents. 🌐 Current web browsing agents look cool, but they're not fully reliable! 😬They excel at simple tasks but struggle with complex ones. ❓ Can inference-time scaling help? Previous methods

Likes: 69 · Replies: 2 · Retweets: 16

Roy Xie

@royxie_

a month ago

Can we train reasoning LLMs to generate answers as they think? Introducing 𝐈𝐧𝐭𝐞𝐫𝐥𝐞𝐚𝐯𝐞𝐝 𝐑𝐞𝐚𝐬𝐨𝐧𝐢𝐧𝐠! We train LLMs to alternate between thinking & answering 🚀 Reducing Time-to-First-Token (TTFT) by over 80% ⚡AND improving Pass@1 accuracy up to 19.3%!📈 🧵 1/n

Likes: 178 · Replies: 1 · Retweets: 35

All Hands AI

@allhands_ai

20 days ago

What if we could have *trustworthy* agents that don't just write code, but also do research, understand multimodal content, and perform many practically useful tasks? Today at OpenHands, we released a new agent that gets SOTA or competitive performance on 8 diverse tasks.

Likes: 174 · Replies: 5 · Retweets: 27

Junhong Shen

@junhongshen1

15 days ago

🔥Unlocking New Paradigm for Test-Time Scaling of Agents! We introduce Test-Time Interaction (TTI), which scales the number of interaction steps beyond thinking tokens per step. Our agents learn to act longer➡️richer exploration➡️better success Paper: arxiv.org/abs/2506.07976

Likes: 154 · Replies: 7 · Retweets: 36
