Honghua Dong (@honghua_dong)'s Twitter Profile
Honghua Dong

@honghua_dong

Ph.D. student @UofTCompSci & @VectorInst

ID: 1654857231622438913

Joined: 06-05-2023 14:35:03

15 Tweets

38 Followers

45 Following

Aran Komatsuzaki (@arankomatsuzaki)

Identifying the Risks of LM Agents with an LM-Emulated Sandbox

Presents a framework that uses a LM to emulate tool execution and enables scalable testing of LM agents against a diverse range of tools and scenarios

proj: toolemu.com
abs: arxiv.org/abs/2309.15817

Yangjun Ruan (@yangjunr)

Should you let LMs control your email? terminal? bank account? or even your smart home?🤔 🔥Introducing ToolEmu for identifying risks associated with LM agents at scale! 🛠️Featuring LM-emulation of tools & automated realistic risk detection 🚨GPT4 is risky in 40% of our cases!

Yangjun Ruan (@yangjunr)

Explore more about ToolEmu: 🚀 Try our demo and red-team the agents: demo.toolemu.com 🎯 🌐 Website: toolemu.com 📄 Paper: arxiv.org/abs/2309.15817 🔗 Code: github.com/ryoungj/toolemu

Shunyu Yao (@shunyuyao12)

Very cool work, analyzing the risks and robustness of ReAct agents across scenarios and base LLMs! This direction will be very important.

Honghua Dong (@honghua_dong)

Just tested Grok3 on the same problem (4 4 10 10 for Game 24).
It's good that xAI reveals the model's thoughts, so the thinking process can be analyzed.

From this example, the problem of underthinking does exist (the model does not explore hard enough). It gives up too early after seeing
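
For context, 4 4 10 10 does have a Game 24 solution: (10 × 10 − 4) / 4 = 24. The sketch below is an illustrative exhaustive search, not anything taken from the tweet or from Grok3; the names `solve24` and `_search` are hypothetical. It shows the kind of full exploration that reaches an answer when an early give-up would miss it, using exact fractions so intermediate divisions do not introduce rounding error.

```python
from fractions import Fraction
from itertools import combinations


def solve24(numbers, target=24):
    """Return one expression string that evaluates to target, or None."""
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


def _search(items, target):
    # Base case: a single value is left; check whether it hits the target.
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None

    # Pick every unordered pair, combine it with each operation, and recurse
    # on the remaining values plus the combined result.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
        ]
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))
        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None


if __name__ == "__main__":
    print(solve24([4, 4, 10, 10]))
```

Because the search tries every pairing and every operation, it returns `None` only when no combination of the four numbers reaches the target.
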

Sierra (@sierraplatform)

Last year, we introduced 𝜏-bench, a benchmark for evaluating AI agents on realistic, multi-step tasks involving tool use and domain-specific constraints. It surfaced a critical limitation in LLM-based agents: low repeatability, even under identical conditions. Now, we’re

Yuhe Sissi Jiang (@snowfossi000)

👋 First time posting here to share some great news!
🎉 Accepted at ICML2025: we built TypyBench and tested how well LLMs perform on type inference for Python repos.
Spoiler: most SOTA models struggle more than expected! 😅
📰: arxiv.org/abs/2507.22086
💻: github.com/typybench/typy…