Honghua Dong (@honghua_dong) Twitter Tweets • TwiCopy

Honghua Dong

@honghua_dong


Ph.D. student @UofTCompSci & @VectorInst

ID: 1654857231622438913

Joined: 06-05-2023 14:35:03

15 Tweets

38 Followers

45 Following

Aran Komatsuzaki

@arankomatsuzaki

2 years ago

Identifying the Risks of LM Agents with an LM-Emulated Sandbox

Presents a framework that uses an LM to emulate tool execution and enables scalable testing of LM agents against a diverse range of tools and scenarios.

proj: toolemu.com
abs: arxiv.org/abs/2309.15817

122 Likes · 0 Replies · 36 Retweets

Yangjun Ruan

@yangjunr

2 years ago

Should you let LMs control your email? terminal? bank account? or even your smart home?🤔 🔥Introducing ToolEmu for identifying risks associated with LM agents at scale! 🛠️Featuring LM-emulation of tools & automated realistic risk detection 🚨GPT4 is risky in 40% of our cases!

174 Likes · 6 Replies · 49 Retweets

Yangjun Ruan

@yangjunr

2 years ago

Explore more about ToolEmu:
🚀 Try our demo and red-team the agents: demo.toolemu.com 🎯
🌐 Website: toolemu.com
📄 Paper: arxiv.org/abs/2309.15817
🔗 Code: github.com/ryoungj/toolemu

10 Likes · 1 Reply · 3 Retweets

Shunyu Yao

@shunyuyao12

2 years ago

Very cool work, analyzing the risks and robustness of ReAct agents across scenarios and base LLMs! This direction will be very important.

17 Likes · 0 Replies · 4 Retweets

Honghua Dong

@honghua_dong

10 months ago

Just tested Grok3 on the same problem (4 4 10 10 for Game 24). It's good that xAI reveals the thoughts, so the thinking process can be analyzed. This example shows that the underthinking problem does exist: the model does not explore hard enough. It gives up too early after seeing

0 Likes · 0 Replies · 0 Retweets
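For context on the Game 24 instance mentioned in the tweet above (4 4 10 10), here is a minimal brute-force sketch in Python that searches exhaustively, which is the kind of exploration the tweet says the model cut short. The function names `solve24` and `_search` are hypothetical and are not taken from the tweet or from any model's output. Exact `Fraction` arithmetic is an assumption made here to avoid floating-point error in divisions.

```python
from fractions import Fraction
from itertools import combinations


def solve24(numbers, target=24):
    """Return one expression string that evaluates to target, or None."""
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


def _search(items, target):
    # Base case: a single value is left; check it against the target.
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None
    # Pick any two values, combine them with every operation, and recurse.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
        ]
        # Division is only tried when the divisor is non-zero.
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))
        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None


if __name__ == "__main__":
    print(solve24([4, 4, 10, 10]))
```

The instance is solvable, for example as (10 × 10 − 4) / 4 = 96 / 4 = 24, so a search that tries every pairing and operation does reach an answer.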

Sierra

@sierraplatform

6 months ago

Last year, we introduced 𝜏-bench, a benchmark for evaluating AI agents on realistic, multi-step tasks involving tool use and domain-specific constraints. It surfaced a critical limitation in LLM-based agents: low repeatability, even under identical conditions. Now, we’re

29 Likes · 1 Reply · 4 Retweets

Yuhe Sissi Jiang

@snowfossi000

4 months ago

👋 First time posting here to share some great news! 🎉 Accepted at ICML2025: we built TypyBench and tested how well LLMs perform on type inference for Python repos. Spoiler: most SOTA models struggle more than expected! 😅 📰: arxiv.org/abs/2507.22086 💻: github.com/typybench/typy…

5 Likes · 7 Replies · 1 Retweet