Kung-Hsiang Steeve Huang (@steeve__huang)'s Twitter Profile
Kung-Hsiang Steeve Huang

@steeve__huang

Research Scientist @SFResearch | Formerly: PhD @UofIllinois, PhD Fellow @AmazonScience, MSc @USCViterbi, BEng @HKUST | He/him/his 🇹🇼 | #NLP

ID: 932462804056936448

Link: http://khuangaf.github.io | Joined: 20-11-2017 04:17:11

935 Tweets

1.1K Followers

274 Following

David Hendrickson (@teksedge)

Now that we are fully immersed in the year of Agentic AI, nobody has created a benchmark to evaluate how well these agents work together...until now. 👇 @Salesforce's CRMArena-Pro builds upon CRMArena with nineteen expert-validated tasks across sales, service, and 'configure,

Salesforce AI Research (@sfresearch)

CRMArena-Pro reveals why enterprise AI deployment remains challenging—many top-performing agents struggle significantly on real-world business tasks. 👇Full technical breakdown from our research lead Kung-Hsiang Steeve Huang below. #EnterpriseAI #AgenticAI #EGI

Caiming Xiong (@caimingxiong)

AI agents are rapidly integrating into various industries; however, their full potential remains underutilized due to performance inconsistencies and enterprise hesitation. To alleviate these issues, we introduce <CRMArena-Pro>, a novel enterprise-agent benchmark for holistic and

Shizhe Diao (@shizhediao)

Does RL truly expand a model’s reasoning🧠capabilities? Contrary to recent claims, the answer is yes—if you push RL training long enough! Introducing ProRL 😎, a novel training recipe that scales RL to >2k steps, empowering the world’s leading 1.5B reasoning model💥and offering

Bony Bean (@bonybean)

Salesforce AI Introduces CRMArena-Pro: The First Multi-Turn and Enterprise-Grade Benchmark for LLM Agents: ift.tt/zs73bwj

Vlad Ruso PhD (@vlruso)

Salesforce AI Launches CRMArena-Pro: A Game-Changer for Evaluating LLM Agents in Business #CRMArenaPro #LLMAgents #SalesforceAI #CustomerExperience #DataPrivacy itinai.com/salesforce-ai-… Understanding CRMArena-Pro: A New Benchmark for LLM Agents Salesforce AI has introduced CRM…

Quantumbytz (@quantumbytz)

Salesforce AI Introduces CRMArena-Pro: The First Multi-Turn and Enterprise-Grade Benchmark for LLM Agents #AI #MachineLearning #IoT #LLM marktechpost.com/2025/06/05/sal…

Kung-Hsiang Steeve Huang (@steeve__huang)

Thanks Marktechpost AI Research News ⚡ for covering CRMArena-Pro 🙏 Our new benchmark reveals that even the best LLM agents achieve only ~58% success rate on realistic business tasks, dropping to 35% in multi-turn scenarios. Also, confidentiality awareness is nearly non-existent across all models

Silvio Savarese (@silviocinguetta)

Synthesized data for #EnterpriseAI evaluation is an ethical imperative. CRMArena-Pro lets us rigorously test agents in a real-life business environment—a messy, multi-step, complex world—without putting sensitive data at risk. Proud of the team's work toward safer, more

Hou Pong (Ken) Chan (@kenchanhp)

🚀 Discover how LLMs perceive their knowledge boundaries across languages in our #ACL2025 main paper! 🌍 By probing LLMs’ internal representations, we reveal key insights on where knowledge boundaries are encoded & propose a training-free method to combat cross-lingual

Chomba Bupe (@chombabupe)

Another paper drop, this time from Salesforce: "These results underscore a significant gap between current LLM capabilities and real-world enterprise demands, highlighting needs for improved multi-turn reasoning, confidentiality adherence, and versatile skill acquisition."

Hou Pong (Ken) Chan (@kenchanhp)

🚀We are thrilled to launch 'Lingshu' – A Generalist Medical Multi-modal Foundation Model! 🩻 🌟 Highlights of Lingshu: ⚕️ Unified knowledge across 12+ imaging modalities (X-Ray, CT, MRI & more!). 🧠 Enhanced reasoning & reduced hallucinations via novel data curation and

Pranav Venkit, PhD (@pranavvenkit)

I'm really excited to be presenting this work in Greece! 🏛️ As generative text models start reshaping how we search for information, understanding their societal impact is more important than ever. 🔎 If you’ll be at #ACMFAccT2025, let’s grab a coffee and chat! ☕️

Salesforce AI Research (@sfresearch)

1/10🎉New paper on AI Agent and LLM judge safety "Helpful Agent Meets Deceptive Judge: Understanding Vulnerabilities in Agentic Workflows" As AI agents become increasingly autonomous, they often rely on feedback from judges (evaluators). These judges evaluate, critique, and

Yangyi Chen (on job market) (@yangyichen6666)

🚀 I'm looking for full-time research scientist jobs on foundation models! I study pre-training and post-training of foundation models, and LLM-based coding agents. The figure highlights my research/publications. Please DM me if there is any good fit! Highly appreciated!

elvis (@omarsar0)

Andrej Karpathy Great share as usual! Just read this related piece where a study showed issues with LLM-based agents not recognizing sensitive information and not adhering to appropriate data handling protocols: theregister.com/2025/06/16/sal… paper: arxiv.org/abs/2505.18878

Dion Hinchcliffe (@dhinchcliffe)

4/ I’m actually bullish medium term involving AI in customer experience. But IT depts must educate themselves. The details on CRMArenaPro and the gap between LLMs / enterprise CRM needs in a major new paper by Salesforce AI Research’s Kung-Hsiang Steeve Huang + team: arxiv.org/abs/2505.18878