Haoyi Qiu (@haoyiqiu)'s Twitter Profile
Haoyi Qiu

@haoyiqiu

Research intern @SFResearch ☁️ PhD student @UCLANLP 🧸 BS in CS&Math @UMich 〽️ #NLP 🌷

ID: 1054247114924912640

Link: https://haoyiq114.github.io/
Joined: 22-10-2018 05:44:33

151 Tweets

900 Followers

779 Following

Isadora White (@isadorcw)

The real world is an embodied multi-agent system with natural language communication. What if we had a benchmark and platform to study those challenges? ⛏️Introducing MINDcraft and MineCollab, the 1st platform and benchmark for studying embodied multi-agent LLM collaboration!

Kung-Hsiang Steeve Huang (@steeve__huang)

Excited to present our CRMArena paper at #NAACL2025 as an oral presentation! 🎉 ⏰ Tomorrow (April 30) 16:00-17:30 📍 Ballroom A (Session D) Looking forward to sharing work from Salesforce AI Research. Also excited to chat about agentic AI, multi-modality, and related topics!

Chien-Sheng (Jason) Wu (@jasonwu0731)

Check our work at #NAACL2025!

✨ CRMArena: Enterprise synthetic data and agent eval. Kung-Hsiang Steeve Huang (@steeve__huang)

✨ Evaluate RAG with sub-question coverage. Kaige Xie (@KaigeXie)

✨ Cultural and Social Awareness of LLM Agents. Haoyi Qiu (@HaoyiQiu)

✨ ReIFE: Meta-eval of instruction-following. Yixin Liu (@YixinLiu17)
Yunzhi Yao (@yyztodd)

🚨 New Blog Drop! 🚀 "Reflection on Knowledge Editing: Charting the Next Steps" is live! 💡 Ever wondered why knowledge editing in LLMs still feels more like a lab experiment than a real-world solution? In this post, we dive deep into where the research is thriving — and where

Kung-Hsiang Steeve Huang (@steeve__huang)

Excited to share that CogAlign is accepted at #ACL2025 Findings! We investigated the "Jagged Intelligence" of VLMs – their surprising difficulty with basic visual arithmetics (e.g., counting objects, measuring angles) compared to their strong performance on harder visual tasks.

Chien-Sheng (Jason) Wu (@jasonwu0731)

Top 2 takeaways from our work: 1. VLM visual features do contain info for visual arithmetic—but without fine-tuning a strong decoder, it remains locked. 2. Training VLMs on just 8 invariant properties can enhance chart and visual math tasks, matching SFT with 60% less data.

Kung-Hsiang Steeve Huang (@steeve__huang)

Cultural safety in AI isn't just nice-to-have, it's essential ✅ Our new paper reveals that leading VLMs struggle with cultural appropriateness across different contexts. We developed CROSS, a multimodal cultural safety benchmark spanning 16 countries and 14 languages, to

Stella Li (@stellalisy)

🤯 We cracked RLVR with... Random Rewards?!
Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: + 28.8%
How could this even work⁉️ Here's why: 🧵
Blogpost: tinyurl.com/spurious-rewar…
Yung-Sung Chuang (@yungsungchuang)

🚨Do passage rerankers really need explicit reasoning?🤔—Maybe Not!

Our findings:
⚖️Standard rerankers outperform those w/ step-by-step reasoning!
🚫Disable reasoning from reasoning reranker actually improves reranking accuracy!🤯
👇But, why?

📰arxiv.org/abs/2505.16886

(1/6)
Kung-Hsiang Steeve Huang (@steeve__huang)

🚨 The Business AI Plot Thickens 🚨

CRMArena set the stage for business AI evaluation in realistic environments. Now we're back with CRMArena-Pro - a major expansion that extends to 19 work tasks across diverse business applications (sales, service, and CPQ processes). It covers
Yike Wang (@yikewang_)

LLMs are helpful for scientific research — but will they continuously be helpful?

Introducing 🔍ScienceMeter: current knowledge update methods enable 86% preservation of prior scientific knowledge, 72% acquisition of new, and 38%+ projection of future (arxiv.org/abs/2505.24302).
Tanmay Parekh (@tparekh97)

🚨 New work: LLMs still struggle at Event Detection due to poor long-context reasoning and inability to follow task constraints, causing precision and recall errors.  

We introduce DiCoRe — a lightweight 3-stage Divergent-Convergent reasoning framework to fix this.🧵📷 (1/N)
elvis (@omarsar0)

Andrej Karpathy Great share as usual! Just read this related piece where a study showed issues with LLM-based agents not recognizing sensitive information and not adhering to appropriate data handling protocols: theregister.com/2025/06/16/sal… paper: arxiv.org/abs/2505.18878