Yoo Yeon Sung (@yooyeonsung1)'s Twitter Profile
Yoo Yeon Sung

@yooyeonsung1

PhD Student @umdclip @iSchoolUMD; Human-AI Alignment, human-interactable NLP, LLM evaluation, benchmark creation, misinformation

ID: 1372646134078971916

Link: https://yysung.github.io/ | Joined: 18-03-2021 20:28:42

56 Tweets

275 Followers

479 Following

lingjiao chen (@chenlingjiao)

Compound AI systems often make multiple LLM calls for complex tasks. But which LLM should be used for each call? We build LLMSELECTOR, a framework that automatically optimizes LLM selection for each call, with substantial perf gains (5-70%) over the best-performing fixed model!

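The tweet doesn't show how per-call selection works; below is a minimal sketch of the underlying idea — greedy coordinate descent over which model handles each call — assuming a hypothetical `call_llm` client, `{state}`-templated modules, and a small labeled dev set. This illustrates the selection loop, not LLMSELECTOR's actual implementation.

```python
def call_llm(model: str, prompt: str) -> str:
    """Hypothetical client; wire this to whatever LLM API you use."""
    raise NotImplementedError

def run_pipeline(assignment, modules, example):
    """Run a multi-call pipeline; assignment maps module name -> model."""
    state = example["input"]
    for m in modules:
        state = call_llm(assignment[m["name"]], m["prompt_template"].format(state=state))
    return state

def dev_accuracy(assignment, modules, dev_set):
    hits = sum(run_pipeline(assignment, modules, ex) == ex["answer"] for ex in dev_set)
    return hits / len(dev_set)

def select_models(modules, candidates, dev_set, rounds=2):
    """Greedy coordinate descent: re-pick the model for one call at a
    time, keeping a change only if end-to-end dev accuracy improves."""
    assignment = {m["name"]: candidates[0] for m in modules}
    best = dev_accuracy(assignment, modules, dev_set)
    for _ in range(rounds):
        for m in modules:
            for cand in candidates:
                trial = {**assignment, m["name"]: cand}
                score = dev_accuracy(trial, modules, dev_set)
                if score > best:
                    assignment, best = trial, score
    return assignment, best
```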
Zorik Gekhman (@zorikgekhman)

🚨 It's often claimed that LLMs know more facts than they show in their outputs, but what does this actually mean, and how can we measure this “hidden knowledge”?

In our new paper, we clearly define this concept and design controlled experiments to test it.
1/🧵
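As a rough, hedged illustration of what "hidden knowledge" could mean (not necessarily the paper's exact protocol): compare whether a model ranks the correct answer highest by token log-probability against whether it actually produces that answer when generating. The model name and QA example below are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; any causal LM
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def answer_logprob(question: str, answer: str) -> float:
    """Sum of token log-probs of `answer` given `question` as the prefix."""
    prefix_len = tok(question, return_tensors="pt").input_ids.shape[1]
    full = tok(question + " " + answer, return_tensors="pt").input_ids
    logprobs = model(full).logits.log_softmax(-1)
    # log P(token_t | tokens_<t), summed over the answer tokens only
    return sum(logprobs[0, t - 1, full[0, t]].item()
               for t in range(prefix_len, full.shape[1]))

q = "Q: What is the capital of Australia? A:"
candidates = ["Canberra", "Sydney", "Melbourne"]

# "Internal" signal: does the model rank the true answer above distractors?
internal_knows = max(candidates, key=lambda a: answer_logprob(q, a)) == "Canberra"

# "External" signal: does greedy generation actually say it?
ids = tok(q, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=5, do_sample=False)
external_knows = "Canberra" in tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

# Hidden knowledge ≈ questions where internal_knows holds but external_knows fails.
print(internal_knows, external_knows)
```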
Daniel Kang (@daniel_d_kang)

Recent research found that malicious AI agents are scanning the internet for vulnerabilities. How dangerous are AI agents at finding & exploiting real-world web vulns? We don't know!

Introducing CVE-Bench, the first benchmark evaluating AI agents on real-world web exploits. 1/8

OpenAI (@openai)

We’re releasing PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research, as part of our Preparedness Framework.

Agents must replicate top ICML 2024 papers, including understanding the paper, writing code, and executing experiments.
Marcel Böhme👨‍🔬 (@mboehme_)

Benchmarks are our measures of progress. Or are they?

Looking forward to exploring promises & perils of measuring tool capabilities @SBFTworkshop'25! Thanks for the invite!

👩‍🏭 sbft25.github.io (co-located w/ ICSE'25 in Ottawa)
📅 28.04. 11:00 GMT-4 (Also, live on Twitch)
Nishant Balepur (@nishantbalepur)

Turns out, LLMs are pretty bad at balancing debatable queries in summarization ✍️📄

Ironically, when I ask Google if LLMs can do this, it says "Yes" without fully covering the other side 🤦

Excited to present my summer work @AdobeResearch at #NAACL2025 where we fix this! 🎉
Neha Srikanth (@nehasrikanth)

I'll be presenting this work with Rachel Rudinger at #NAACL2025 tomorrow (Wednesday 4/30) during Session C (Oral/Poster 2) at 2pm! 🔬

Decomposing hypotheses in traditional NLI and defeasible NLI helps us measure various forms of consistency of LLMs. Come join us!
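A hedged sketch of the kind of consistency check that decomposition enables, assuming a hypothetical `nli()` scorer (any off-the-shelf NLI model): a model that entails a full hypothesis should also entail each of its atomic parts.

```python
def nli(premise: str, hypothesis: str) -> str:
    """Hypothetical 3-way scorer: returns 'entailment', 'neutral',
    or 'contradiction'. Wire this to an actual NLI model."""
    raise NotImplementedError

def decomposition_consistent(premise: str, hypothesis: str, atoms: list[str]) -> bool:
    """If the model entails the whole hypothesis, consistency requires
    it to also entail every atomic part of that hypothesis."""
    if nli(premise, hypothesis) != "entailment":
        return True  # this form of consistency imposes no constraint
    return all(nli(premise, atom) == "entailment" for atom in atoms)

# Example decomposition (atoms hand-written for illustration):
premise = "A man in a red shirt is playing guitar on stage."
hypothesis = "A man is performing music."
atoms = ["A man is performing.", "Music is being played."]
# decomposition_consistent(...) flags models that accept the whole
# hypothesis while rejecting one of its parts.
```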
Nishant Balepur (@nishantbalepur)

I'll be presenting two papers at NAACL HLT 2025!
1. Why LLMs can't write a question with the answer "468" 🤔🙅
2. A multi-agent LLM that balances opinions on "is pineapple good on pizza?" 🎭🍕

Let's also chat about:
📍 Helpfulness
📍 Why MCQA sucks
📍 Generating cute paper titles 👀
Feng Gu@NAACL2025 (@gu_feng73607)

Getting pummeled on boardgame nights by that one trivia nerd who thinks they're smarter than everyone else? We have an agent that lets total noobs dominate the competition. Come check us out on Friday at 11 at Mesilla. #NAACL2025

Ziang Xiao (@ziangxiao)

How to generate effective benchmarks? In our #ICML2025 paper, Han Jiang leveraged item response theory to dynamically generate test items that best discriminate model capability with the fewest items.
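A minimal sketch of the adaptive-testing machinery behind this idea, under the standard 2PL IRT model: administer the item with maximum Fisher information at the current ability estimate, so fewer items are needed to discriminate capability. The item-generation step and the paper's exact criterion are not shown; this is the textbook selection loop.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL IRT: P(correct) for ability theta, discrimination a, difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of an item at theta: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1 - p)

def estimate_theta(responses, items, grid=np.linspace(-4, 4, 161)):
    """Grid-search MLE of ability from (item_index, correct) pairs."""
    loglik = np.zeros_like(grid)
    for idx, correct in responses:
        a, b = items[idx]
        p = p_correct(grid, a, b)
        loglik += np.log(p if correct else 1.0 - p)
    return grid[np.argmax(loglik)]

def adaptive_test(items, model_answers, n_items=10):
    """Administer the most informative unseen item at each step.
    model_answers(idx) -> bool stands in for actually querying the model."""
    responses, theta = [], 0.0
    unseen = set(range(len(items)))
    for _ in range(n_items):
        idx = max(unseen, key=lambda i: item_information(theta, *items[i]))
        unseen.remove(idx)
        responses.append((idx, model_answers(idx)))
        theta = estimate_theta(responses, items)
    return theta  # ability estimate from far fewer items than the full pool
```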

Nishant Balepur (@nishantbalepur)

🎉🎉 Excited to have two papers accepted to #ACL2025!

Our first paper designs a preference training method to boost LLM personalization 🎨
While the second outlines our position on why MCQA evals are terrible and how to make them better 🙏

Grateful for amazing collaborators!
Yoo Yeon Sung (@yooyeonsung1)

It’s that time again: QANTA 2025 🧠
This year’s theme: Cooperation among humans + agents + QA models!

We invite you to:
🤖 Submit models
📝 Write questions to stump them all
🧍‍♀️ Play as human opponents

It's my 3rd year hosting QANTA; there are always surprises in human-AI eval. Stay tuned ⏳

Fahim Tajwar (@fahimtajwar10)

RL with verifiable reward has shown impressive results in improving LLM reasoning, but what can we do when we do not have ground truth answers?

Introducing Self-Rewarding Training (SRT): where language models provide their own reward for RL training!

🧵 1/n
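The thread doesn't spell out the reward here; a common self-reward signal when no ground truth exists is self-consistency. A hedged sketch, with a hypothetical `sample_answers` hook: sample several answers, take the majority vote as a pseudo-label, and reward agreement with it; the resulting rewards can feed a standard policy-gradient update.

```python
from collections import Counter

def sample_answers(model, prompt, k=8):
    """Hypothetical hook: draw k sampled completions, extract final answers."""
    raise NotImplementedError

def self_consistency_rewards(model, prompt, k=8):
    """Majority-vote pseudo-label: reward each sample by agreement with it."""
    answers = sample_answers(model, prompt, k)
    majority, _ = Counter(answers).most_common(1)[0]
    return answers, [1.0 if a == majority else 0.0 for a in answers]

# The (answer, reward) pairs can replace a verifier reward in any
# standard RL-for-LLMs update (e.g., a policy-gradient step).
```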
Michael Kirchhof (@mkirchhof_)

At the end of my PhD, I reflected on uncertainty quantification research and what might change with chatbots and LLM agents. This is now accepted as a position paper @ICML. Some of those future topics are already picking up pace, so have an evening read ☕ arxiv.org/abs/2505.22655
