Haoyi Qiu (@haoyiqiu) Twitter Tweets • TwiCopy

Isadora White

4 months ago

The real world is an embodied multi-agent system with natural language communication. What if we had a benchmark and platform to study those challenges? ⛏️Introducing MINDcraft and MineCollab, the 1st platform and benchmark for studying embodied multi-agent LLM collaboration!

145 likes · 10 replies · 35 retweets

Kung-Hsiang Steeve Huang

@steeve__huang

4 months ago

Excited to present our CRMArena paper at #NAACL2025 as an oral presentation! 🎉 ⏰ Tomorrow (April 30) 16:00-17:30 📍 Ballroom A (Session D) Looking forward to sharing work from Salesforce AI Research. Also excited to chat about agentic AI, multi-modality, and related topics!

27 likes · 0 replies · 3 retweets

Chien-Sheng (Jason) Wu

@jasonwu0731

4 months ago

Check our work at #NAACL2025!
✨ CRMArena: Enterprise synthetic data and agent eval. Kung-Hsiang Steeve Huang
✨ Evaluate RAG with sub-question coverage. Kaige Xie
✨ Cultural and Social Awareness of LLM Agents. Haoyi Qiu
✨ ReIFE: Meta-eval of instruction-following. Yixin Liu

27 likes · 0 replies · 5 retweets

Salesforce AI Research

@sfresearch

4 months ago

Headed to #NAACL2025 today? Come visit Chien-Sheng (Jason) Wu and our team of researchers to see their latest work on CRMArena and more!

7 likes · 0 replies · 3 retweets

Yunzhi Yao

@yyztodd

4 months ago

🚨 New Blog Drop! 🚀 "Reflection on Knowledge Editing: Charting the Next Steps" is live! 💡 Ever wondered why knowledge editing in LLMs still feels more like a lab experiment than a real-world solution? In this post, we dive deep into where the research is thriving — and where

36 likes · 0 replies · 16 retweets

Kung-Hsiang Steeve Huang

@steeve__huang

4 months ago

Excited to share that CogAlign has been accepted to #ACL2025 Findings! We investigated the "Jagged Intelligence" of VLMs: their surprising difficulty with basic visual arithmetic (e.g., counting objects, measuring angles) compared to their strong performance on harder visual tasks.

53 likes · 3 replies · 9 retweets

Chien-Sheng (Jason) Wu

@jasonwu0731

4 months ago

Top 2 takeaways from our work: 1. VLM visual features do contain info for visual arithmetic—but without fine-tuning a strong decoder, it remains locked. 2. Training VLMs on just 8 invariant properties can enhance chart and visual math tasks, matching SFT with 60% less data.

9 likes · 0 replies · 2 retweets

Kung-Hsiang Steeve Huang

@steeve__huang

3 months ago

Cultural safety in AI isn't just nice-to-have, it's essential ✅ Our new paper reveals that leading VLMs struggle with cultural appropriateness across different contexts. We developed CROSS, a multimodal cultural safety benchmark spanning 16 countries and 14 languages, to

1 like · 0 replies · 1 retweet

Stella Li

@stellalisy

3 months ago

🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵 Blogpost: tinyurl.com/spurious-rewar…

1.1K likes · 69 replies · 322 retweets

Yung-Sung Chuang

@yungsungchuang

3 months ago

🚨Do passage rerankers really need explicit reasoning?🤔 Maybe not! Our findings: ⚖️Standard rerankers outperform those w/ step-by-step reasoning! 🚫Disabling reasoning in a reasoning reranker actually improves reranking accuracy!🤯 👇But why? 📰arxiv.org/abs/2505.16886 (1/6)

54 likes · 1 reply · 16 retweets

Kung-Hsiang Steeve Huang

@steeve__huang

3 months ago

🚨 The Business AI Plot Thickens 🚨 CRMArena set the stage for business AI evaluation in realistic environments. Now we're back with CRMArena-Pro - a major expansion that extends to 19 work tasks across diverse business applications (sales, service, and CPQ processes). It covers

33 likes · 7 replies · 10 retweets

Yike Wang

@yikewang_

3 months ago

LLMs are helpful for scientific research — but will they continuously be helpful? Introducing 🔍ScienceMeter: current knowledge update methods enable 86% preservation of prior scientific knowledge, 72% acquisition of new, and 38%+ projection of future (arxiv.org/abs/2505.24302).

236 likes · 10 replies · 53 retweets

Tanmay Parekh

@tparekh97

3 months ago

🚨 New work: LLMs still struggle at Event Detection due to poor long-context reasoning and inability to follow task constraints, causing precision and recall errors. We introduce DiCoRe — a lightweight 3-stage Divergent-Convergent reasoning framework to fix this.🧵📷 (1/N)

46 likes · 1 reply · 18 retweets

elvis

@omarsar0

3 months ago

Andrej Karpathy Great share as usual! Just read this related piece where a study showed issues with LLM-based agents not recognizing sensitive information and not adhering to appropriate data handling protocols: theregister.com/2025/06/16/sal… paper: arxiv.org/abs/2505.18878

37 likes · 0 replies · 5 retweets