Honghua Dong (@honghua_dong)'s Twitter Profile
Honghua Dong

@honghua_dong

Ph.D. student @UofTCompSci & @VectorInst

ID: 1654857231622438913

Joined: 06-05-2023 14:35:03

15 Tweets

38 Followers

45 Following

Aran Komatsuzaki (@arankomatsuzaki)

Identifying the Risks of LM Agents with an LM-Emulated Sandbox

Presents a framework that uses an LM to emulate tool execution and enables scalable testing of LM agents against a diverse range of tools and scenarios

proj: toolemu.com
abs: arxiv.org/abs/2309.15817
Yangjun Ruan (@yangjunr)

Should you let LMs control your email? terminal? bank account? or even your smart home? 🤔
🔥 Introducing ToolEmu for identifying risks associated with LM agents at scale!
🛠️ Featuring LM-emulation of tools & automated realistic risk detection
🚨 GPT4 is risky in 40% of our cases!

Yangjun Ruan (@yangjunr)

Explore more about ToolEmu:
🚀 Try our demo and red-team the agents: demo.toolemu.com 🎯
🌐 Website: toolemu.com
📄 Paper: arxiv.org/abs/2309.15817
🔗 Code: github.com/ryoungj/toolemu

Shunyu Yao (@shunyuyao12)

Very cool work, analyzing the risks and robustness of ReAct agents across scenarios and base LLMs! This direction will be very important.

Honghua Dong (@honghua_dong)

Just tested out Grok3 on the same problem (4 4 10 10 for Game 24).
It's good that xAI reveals the model's thoughts so that the thinking process can be analyzed.

From this example, the problem of underthinking does exist (the model does not explore hard enough). It gives up too early after seeing
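
For reference, the puzzle mentioned above does have a solution: (10 × 10 − 4) / 4 = 24. The following is a minimal sketch of an exhaustive search that confirms this. It is not taken from the thread; the function names `solve24` and `_search` are hypothetical, and exact rational arithmetic is an implementation choice made here to avoid floating-point error.

```python
from fractions import Fraction
from itertools import combinations


def solve24(nums, target=24):
    """Return one expression string that evaluates to target, or None."""
    # Pair each value with the expression that produced it.
    items = [(Fraction(n), str(n)) for n in nums]
    return _search(items, Fraction(target))


def _search(items, target):
    # Base case: a single value is left; check whether it hits the target.
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None

    # Pick any two values, combine them with each operator, and recurse.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
        ]
        # Division is tried in both orders, skipping division by zero.
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))
        for cand in candidates:
            found = _search(rest + [cand], target)
            if found is not None:
                return found
    return None


# Prints one valid expression for the puzzle from the tweet.
print(solve24([4, 4, 10, 10]))
```

The search space for four numbers is small enough to enumerate completely, so a `None` result means that no solution exists under the four basic operators.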
Sierra (@sierraplatform)

Last year, we introduced 𝜏-bench, a benchmark for evaluating AI agents on realistic, multi-step tasks involving tool use and domain-specific constraints. It surfaced a critical limitation in LLM-based agents: low repeatability, even under identical conditions. Now, we're

Yuhe Sissi Jiang (@snowfossi000)

👋 First time posting here to share some great news!
🎉 Accepted at ICML2025: we built TypyBench and tested how well LLMs perform on type inference for Python repos.
Spoiler: most SOTA models struggle more than expected! 😅
📰: arxiv.org/abs/2507.22086
💻: github.com/typybench/typy…