Yoo Yeon Sung (@yooyeonsung1)'s Twitter Profile
Yoo Yeon Sung

@yooyeonsung1

PhD Student @umdclip @iSchoolUMD; Human-AI Alignment, human-interactable NLP, LLM evaluation, benchmark creation, misinformation

ID: 1372646134078971916

Website: https://yysung.github.io/ · Joined: 18-03-2021 20:28:42

56 Tweets

275 Followers

479 Following

lingjiao chen (@chenlingjiao)'s Twitter Profile Photo

Compound AI systems often make multiple LLM calls for complex tasks. But which LLM should be used for each call? We build LLMSELECTOR, a framework that automatically optimizes LLM selection for each call, with substantial perf gains (5-70%) over the best-performing fixed model!

Zorik Gekhman (@zorikgekhman)'s Twitter Profile Photo

🚨 It's often claimed that LLMs know more facts than they show in their outputs, but what does this actually mean, and how can we measure this "hidden knowledge"?

In our new paper, we clearly define this concept and design controlled experiments to test it.
1/🧵
Daniel Kang (@daniel_d_kang)'s Twitter Profile Photo

Recent research found that malicious AI agents are scanning the internet for vulnerabilities. But how effective are AI agents at finding & exploiting real-world web vulnerabilities? We don't know! Introducing CVE-Bench, the first benchmark evaluating AI agents on real-world web exploits. 1/8

OpenAI (@openai)'s Twitter Profile Photo

We're releasing PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research, as part of our Preparedness Framework.

Agents must replicate top ICML 2024 papers, including understanding the paper, writing code, and executing experiments.
Marcel Böhme 👨‍🔬 (@mboehme_)'s Twitter Profile Photo

Benchmarks are our measures of progress. Or are they?

Looking forward to exploring promises & perils of measuring tool capabilities @SBFTworkshop'25! Thanks for the invite!

🔗 sbft25.github.io (co-located w/ ICSE'25 in Ottawa)
📅 28.04. 11:00 GMT-4 (Also, live on Twitch)
Nishant Balepur (@nishantbalepur)'s Twitter Profile Photo

Turns out, LLMs are pretty bad at balancing debatable queries in summarization ✍️📄

Ironically, when I ask Google if LLMs can do this, it says "Yes" without fully covering the other side 🤦

Excited to present my summer work from Adobe Research at #NAACL2025, where we fix this! 🎉
Neha Srikanth (@nehasrikanth)'s Twitter Profile Photo

I'll be presenting this work with Rachel Rudinger at #NAACL2025 tomorrow (Wednesday 4/30) during Session C (Oral/Poster 2) at 2pm! 🔬

Decomposing hypotheses in traditional NLI and defeasible NLI helps us measure various forms of consistency of LLMs. Come join us!
Nishant Balepur (@nishantbalepur)'s Twitter Profile Photo

I'll be presenting two papers at NAACL HLT 2025!
1. Why LLMs can't write a question with the answer "468" 🤔🙅
2. A multi-agent LLM that balances opinions on "is pineapple good on pizza?" 🎭🍕

Let's also chat about:
๐Ÿ“ Helpfulness
๐Ÿ“ Why MCQA sucks
๐Ÿ“ Generating cute paper titles ๐Ÿ‘€
Feng Gu@NAACL2025 (@gu_feng73607)'s Twitter Profile Photo

Getting pummeled on board-game nights by that one trivia nerd who thinks they're smarter than everyone else? We have an agent that lets total noobs dominate the competition. Come check us out on Friday at 11 at Mesilla. #NAACL2025
Ziang Xiao (@ziangxiao)'s Twitter Profile Photo

How do we generate effective benchmarks? In our #ICML2025 paper, Han Jiang leveraged item response theory to dynamically generate test items that best discriminate model capability with the fewest items.
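The core IRT idea behind adaptive item selection can be illustrated with a few lines: under a 2PL model, pick the item with the highest Fisher information at the current ability estimate, so each item discriminates as much as possible. The item parameters and ability value below are invented for illustration; the paper's actual method may differ:

```python
import math

# Toy 2PL item-response-theory sketch: choose the test item that is
# most informative about a model's ability theta.

def p_correct(theta, a, b):
    """Probability that a model of ability theta answers an item with
    discrimination a and difficulty b correctly (2PL logistic model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    # For the 2PL model, I(theta) = a^2 * p * (1 - p).
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def next_item(theta, items):
    """Pick the item maximizing Fisher information at the current ability."""
    return max(items, key=lambda it: fisher_information(theta, it["a"], it["b"]))

# Hypothetical item pool: too-easy and too-hard items carry little
# information about a mid-ability model.
items = [
    {"id": "easy",   "a": 1.0, "b": -2.0},
    {"id": "medium", "a": 1.5, "b":  0.0},
    {"id": "hard",   "a": 1.0, "b":  2.5},
]
chosen = next_item(0.0, items)
```

For a model at ability 0, the medium item (difficulty near 0, high discrimination) is selected, which is why IRT-driven benchmarks can match a full test's discrimination with far fewer items.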

Nishant Balepur (@nishantbalepur)'s Twitter Profile Photo

🎉🎉 Excited to have two papers accepted to #ACL2025!

Our first paper designs a preference training method to boost LLM personalization 🎨
While the second outlines our position on why MCQA evals are terrible and how to make them better 🙏

Grateful for amazing collaborators!
Yoo Yeon Sung (@yooyeonsung1)'s Twitter Profile Photo

It's that time again: QANTA 2025 🧠 This year's theme: cooperation among humans + agents + QA models! We invite you to:
🤖 Submit models
📝 Write questions to stump them all
🧍‍♀️ Play as human opponents
It's my 3rd year hosting QANTA; there are always surprises in human-AI eval. Stay tuned ⏳

Fahim Tajwar (@fahimtajwar10)'s Twitter Profile Photo

RL with verifiable reward has shown impressive results in improving LLM reasoning, but what can we do when we do not have ground truth answers?

Introducing Self-Rewarding Training (SRT): where language models provide their own reward for RL training!

🧵 1/n
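The self-rewarding idea above, having the model reward itself when no ground truth exists, is often approximated by self-consistency: sample several answers per prompt and reward agreement with the majority. This toy sketch illustrates that reward signal only; it is not the paper's actual SRT training loop:

```python
from collections import Counter

# Toy sketch of a self-generated reward for RL without ground truth:
# answers that agree with the majority vote get reward 1.0, others 0.0.

def self_rewards(samples):
    """Reward each sampled answer by agreement with the majority answer."""
    majority, _ = Counter(samples).most_common(1)[0]
    return [1.0 if s == majority else 0.0 for s in samples]

# Four sampled answers to the same prompt; "42" wins the vote 3-1.
rewards = self_rewards(["42", "42", "17", "42"])
```

In an RL loop, these rewards would then stand in for verifier feedback when updating the policy, under the assumption that the model's consensus answer is usually correct.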
Michael Kirchhof (@mkirchhof_)'s Twitter Profile Photo

At the end of my PhD, I reflected on uncertainty quantification research and what might change with chatbots and LLM agents. This has now been accepted as a position paper @ICML. Some of those future topics are already picking up pace, so have an evening read ☕ arxiv.org/abs/2505.22655