Yoo Yeon Sung (@yooyeonsung1)'s Twitter Profile
Yoo Yeon Sung

@yooyeonsung1

PhD Student @umdclip @iSchoolUMD; Human-AI Alignment, human-interactable NLP, LLM evaluation, benchmark creation, misinformation

ID: 1372646134078971916

Link: https://yysung.github.io/ | Joined: 18-03-2021 20:28:42

56 Tweets

275 Followers

479 Following

lingjiao chen (@chenlingjiao)

Compound AI systems often make multiple LLM calls for complex tasks. But which LLM should be used for each call? We build LLMSELECTOR, a framework that automatically optimizes LLM selection for each call, with substantial perf gains (5-70%) over the best-performing fixed model!

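The tweet doesn't show how per-call selection works; below is a minimal sketch of the underlying idea — greedy coordinate descent over which model handles each call — assuming a hypothetical `call_llm` client, `{state}`-templated modules, and a small labeled dev set. This illustrates the selection loop, not LLMSELECTOR's actual implementation.

```python
def call_llm(model: str, prompt: str) -> str:
    """Hypothetical client; wire this to whatever LLM API you use."""
    raise NotImplementedError

def run_pipeline(assignment, modules, example):
    """Run a multi-call pipeline; assignment maps module name -> model."""
    state = example["input"]
    for m in modules:
        state = call_llm(assignment[m["name"]], m["prompt_template"].format(state=state))
    return state

def dev_accuracy(assignment, modules, dev_set):
    hits = sum(run_pipeline(assignment, modules, ex) == ex["answer"] for ex in dev_set)
    return hits / len(dev_set)

def select_models(modules, candidates, dev_set, rounds=2):
    """Greedy coordinate descent: re-pick the model for one call at a
    time, keeping a change only if end-to-end dev accuracy improves."""
    assignment = {m["name"]: candidates[0] for m in modules}
    best = dev_accuracy(assignment, modules, dev_set)
    for _ in range(rounds):
        for m in modules:
            for cand in candidates:
                trial = {**assignment, m["name"]: cand}
                score = dev_accuracy(trial, modules, dev_set)
                if score > best:
                    assignment, best = trial, score
    return assignment, best
```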
Zorik Gekhman (@zorikgekhman)

🚨 It's often claimed that LLMs know more facts than they show in their outputs, but what does this actually mean, and how can we measure this “hidden knowledge”?

In our new paper, we clearly define this concept and design controlled experiments to test it.
1/🧵
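As a rough, hedged illustration of what "hidden knowledge" could mean (not necessarily the paper's exact protocol): compare whether a model ranks the correct answer highest by token log-probability against whether it actually produces that answer when generating. The model name and QA example below are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; any causal LM
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def answer_logprob(question: str, answer: str) -> float:
    """Sum of token log-probs of `answer` given `question` as the prefix."""
    prefix_len = tok(question, return_tensors="pt").input_ids.shape[1]
    full = tok(question + " " + answer, return_tensors="pt").input_ids
    logprobs = model(full).logits.log_softmax(-1)
    # log P(token_t | tokens_<t), summed over the answer tokens only
    return sum(logprobs[0, t - 1, full[0, t]].item()
               for t in range(prefix_len, full.shape[1]))

q = "Q: What is the capital of Australia? A:"
candidates = ["Canberra", "Sydney", "Melbourne"]

# "Internal" signal: does the model rank the true answer above distractors?
internal_knows = max(candidates, key=lambda a: answer_logprob(q, a)) == "Canberra"

# "External" signal: does greedy generation actually say it?
ids = tok(q, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=5, do_sample=False)
external_knows = "Canberra" in tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

# Hidden knowledge ≈ questions where internal_knows holds but external_knows fails.
print(internal_knows, external_knows)
```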
Daniel Kang (@daniel_d_kang)

Recent research found that malicious AI agents are scanning the internet for vulnerabilities. How dangerous are AI agents at finding & exploiting real-world web vulns? We don't know!

Introducing CVE-Bench, the first benchmark evaluating AI agents on real-world web exploits. 1/8

OpenAI (@openai)

We’re releasing PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research, as part of our Preparedness Framework.

Agents must replicate top ICML 2024 papers, including understanding the paper, writing code, and executing experiments.
Marcel Böhme👨‍🔬 (@mboehme_)

Benchmarks are our measures of progress. Or are they?

Looking forward to exploring promises & perils of measuring tool capabilities @SBFTworkshop'25! Thanks for the invite!

👩‍🏭 sbft25.github.io (co-located w/ ICSE'25 in Ottawa)
📅 28.04. 11:00 GMT-4 (Also, live on Twitch)
Nishant Balepur (@nishantbalepur)

Turns out, LLMs are pretty bad at balancing debatable queries in summarization ✍️📄

Ironically, when I ask Google if LLMs can do this, it says "Yes" without fully covering the other side 🤦

Excited to present my summer work @AdobeResearch at #NAACL2025 where we fix this! 🎉
Neha Srikanth (@nehasrikanth)

I'll be presenting this work with Rachel Rudinger at #NAACL2025 tomorrow (Wednesday 4/30) during Session C (Oral/Poster 2) at 2pm! 🔬

Decomposing hypotheses in traditional NLI and defeasible NLI helps us measure various forms of consistency of LLMs. Come join us!
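A hedged sketch of the kind of consistency check that decomposition enables, assuming a hypothetical `nli()` scorer (any off-the-shelf NLI model): a model that entails a full hypothesis should also entail each of its atomic parts.

```python
def nli(premise: str, hypothesis: str) -> str:
    """Hypothetical 3-way scorer: returns 'entailment', 'neutral',
    or 'contradiction'. Wire this to an actual NLI model."""
    raise NotImplementedError

def decomposition_consistent(premise: str, hypothesis: str, atoms: list[str]) -> bool:
    """If the model entails the whole hypothesis, consistency requires
    it to also entail every atomic part of that hypothesis."""
    if nli(premise, hypothesis) != "entailment":
        return True  # this form of consistency imposes no constraint
    return all(nli(premise, atom) == "entailment" for atom in atoms)

# Example decomposition (atoms hand-written for illustration):
premise = "A man in a red shirt is playing guitar on stage."
hypothesis = "A man is performing music."
atoms = ["A man is performing.", "Music is being played."]
# decomposition_consistent(...) flags models that accept the whole
# hypothesis while rejecting one of its parts.
```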
Nishant Balepur (@nishantbalepur)

I'll be presenting two papers at NAACL HLT 2025!
1. Why LLMs can't write a question with the answer "468" 🤔🙅
2. A multi-agent LLM that balances opinions on "is pineapple good on pizza?" 🎭🍕

Let's also chat about:
📍 Helpfulness
📍 Why MCQA sucks
📍 Generating cute paper titles 👀
Feng Gu@NAACL2025 (@gu_feng73607)

Getting pummeled on boardgame nights by that one trivia nerd who thinks they're smarter than everyone else? We have an agent that lets total noobs dominate the competition. Come check us out on Friday at 11 at Mesilla. #NAACL2025

Ziang Xiao (@ziangxiao)

How to generate effective benchmarks? In our #ICML2025 paper, Han Jiang leveraged item response theory to dynamically generate test items that best discriminate model capability with the fewest items.
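A minimal sketch of the adaptive-testing machinery behind this idea, under the standard 2PL IRT model: administer the item with maximum Fisher information at the current ability estimate, so fewer items are needed to discriminate capability. The item-generation step and the paper's exact criterion are not shown; this is the textbook selection loop.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL IRT: P(correct) for ability theta, discrimination a, difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of an item at theta: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1 - p)

def estimate_theta(responses, items, grid=np.linspace(-4, 4, 161)):
    """Grid-search MLE of ability from (item_index, correct) pairs."""
    loglik = np.zeros_like(grid)
    for idx, correct in responses:
        a, b = items[idx]
        p = p_correct(grid, a, b)
        loglik += np.log(p if correct else 1.0 - p)
    return grid[np.argmax(loglik)]

def adaptive_test(items, model_answers, n_items=10):
    """Administer the most informative unseen item at each step.
    model_answers(idx) -> bool stands in for actually querying the model."""
    responses, theta = [], 0.0
    unseen = set(range(len(items)))
    for _ in range(n_items):
        idx = max(unseen, key=lambda i: item_information(theta, *items[i]))
        unseen.remove(idx)
        responses.append((idx, model_answers(idx)))
        theta = estimate_theta(responses, items)
    return theta  # ability estimate from far fewer items than the full pool
```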

Nishant Balepur (@nishantbalepur)

🎉🎉 Excited to have two papers accepted to #ACL2025!

Our first paper designs a preference training method to boost LLM personalization 🎨
While the second outlines our position on why MCQA evals are terrible and how to make them better 🙏

Grateful for amazing collaborators!
Yoo Yeon Sung (@yooyeonsung1)

It’s that time again: QANTA 2025 🧠
This year’s theme: Cooperation among humans + agents + QA models!

We invite you to:
🤖 Submit models
📝 Write questions to stump them all
🧍‍♀️ Play as human opponents

It's my 3rd year hosting QANTA; there are always surprises in human-AI eval. Stay tuned ⏳

Fahim Tajwar (@fahimtajwar10)

RL with verifiable reward has shown impressive results in improving LLM reasoning, but what can we do when we do not have ground truth answers?

Introducing Self-Rewarding Training (SRT): where language models provide their own reward for RL training!

🧵 1/n
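The thread doesn't spell out the reward here; a common self-reward signal when no ground truth exists is self-consistency. A hedged sketch, with a hypothetical `sample_answers` hook: sample several answers, take the majority vote as a pseudo-label, and reward agreement with it; the resulting rewards can feed a standard policy-gradient update.

```python
from collections import Counter

def sample_answers(model, prompt, k=8):
    """Hypothetical hook: draw k sampled completions, extract final answers."""
    raise NotImplementedError

def self_consistency_rewards(model, prompt, k=8):
    """Majority-vote pseudo-label: reward each sample by agreement with it."""
    answers = sample_answers(model, prompt, k)
    majority, _ = Counter(answers).most_common(1)[0]
    return answers, [1.0 if a == majority else 0.0 for a in answers]

# The (answer, reward) pairs can replace a verifier reward in any
# standard RL-for-LLMs update (e.g., a policy-gradient step).
```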
Michael Kirchhof (@mkirchhof_)

At the end of my PhD, I reflected on uncertainty quantification research and what might change with chatbots and LLM agents. This is now accepted as a position paper @ICML. Some of those future topics are already picking up pace, so have an evening read ☕ arxiv.org/abs/2505.22655
