Lucy Li (@lucy3_li) 's Twitter Profile
Lucy Li

@lucy3_li

PhD student @berkeley_ai, @BerkeleyISchool. Prev @allen_ai, @MSFTResearch, and @stanfordnlp. More talkative on lucy3.bsky.social

ID: 861417356756533248

Link: http://lucy3.github.io · Joined: 08-05-2017 03:07:57

3.3K Tweets

4.4K Followers

1.1K Following

Nishant Balepur (@nishantbalepur) 's Twitter Profile Photo

🚨 New Position Paper 🚨 Multiple choice evals for LLMs are simple and popular, but we know they are awful 😬 We complain they're full of errors, saturated, and test nothing meaningful, so why do we still use them? 🫠 Here's why MCQA evals are broken, and how to fix them 🧵

Serina Chang (@serinachang5) 's Twitter Profile Photo

Excited to have two papers accepted to ACL 2025 main! 🎉 1. ChatBench with jake hofman Ashton Anderson - we conduct a large-scale user study converting static benchmark questions into human-AI conversations, showing how benchmarks fail to predict human-AI outcomes.

Hanna Wallach (@hannawallach.bsky.social) (@hannawallach) 's Twitter Profile Photo

Exciting news: the Fairness, Accountability, Transparency and Ethics (FATE) group at Microsoft Research NYC is hiring a predoctoral fellow!!! 🎉 microsoft.com/en-us/research…

Andrew Piper (@_akpiper) 's Twitter Profile Photo

Do you love children's books? Well then come over to our new Citizen Science project: Picturing Children's Stories. Help us annotate tens of thousands of book illustrations to understand the history of childhood and visual storytelling.

Yapei Chang (@yapeichang) 's Twitter Profile Photo

🤔 Can simple string-matching metrics like BLEU rival reward models for LLM alignment? 🔍 We show that given access to a reference, BLEU can match reward models in human preference agreement, and even train LLMs competitively with them using GRPO. 🫐 Introducing BLEUBERI:
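The claim above, that a reference-based string-matching metric can stand in for a reward model, is easiest to see with BLEU itself. Below is a minimal, self-contained sketch of sentence-level BLEU (clipped n-gram precisions with uniform weights plus a brevity penalty); it is a toy illustration, not the BLEUBERI implementation, and real evaluations typically use a library such as sacrebleu.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions (uniform weights) times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped matches: an n-gram counts at most as often as it
        # appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        # Epsilon smoothing so one empty order doesn't zero the whole score.
        log_precisions.append(math.log((overlap + 1e-9) / total))
    # Brevity penalty discourages trivially short hypotheses.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

perfect = sentence_bleu("the cat sat on the mat", "the cat sat on the mat")  # ≈ 1.0
partial = sentence_bleu("the cat sat on", "the cat sat on the mat")          # ≈ 0.61
```

In a GRPO-style setup, a scalar like this, computed between a sampled rollout and the gold reference, would simply take the place of the reward model's score.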

Myra Cheng (@chengmyra1) 's Twitter Profile Photo

Dear ChatGPT, Am I the Asshole? While Reddit users might say yes, your favorite LLM probably won’t. We present Social Sycophancy: a new way to understand and measure sycophancy as how LLMs overly preserve users' self-image.

Michael Black (@michael_j_black) 's Twitter Profile Photo

If you're an international PhD student at Harvard studying computer vision and your visa is cancelled, reach out to me or others in Europe. Don't despair. I'm sure we can find you a great place to carry on your research.

Kiran Garimella (@gvrkiran) 's Twitter Profile Photo

this paper quantitatively shows what someone told me: "everyone loves interdisciplinary but no one will give u a job if you are interdisciplinary" Hiring at top universities rewards disciplinary loyalty over interdisciplinary breadth. Things are changing. arxiv.org/abs/2503.21912

Divya Siddarth (@divyasiddarth) 's Twitter Profile Photo

As we do societal evals at CIP (public health, AI relationships, democracy, etc.) across regional languages, we've spent a lot of time dealing with how brittle LLM judge pipelines are. Stoked to share an open-source test suite (blog + code) we’ve built to stress-test ours before

Kayo Yin (@kayo_yin) 's Twitter Profile Photo

Happy to announce the first workshop on Pragmatic Reasoning in Language Models — PragLM @ COLM 2025! 🧠🎉 How do LLMs engage in pragmatic reasoning, and what core pragmatic capacities remain beyond their reach? 🌐 sites.google.com/berkeley.edu/p… 📅 Submit by June 23rd

Lucy Li (@lucy3_li) 's Twitter Profile Photo

"Tell, Don't Show" was accepted to #ACL2025 Findings! Our simple approach for literary topic modeling combines the new (language models) with the old (classic LDA) to yield better topics. A possible addition to your CSS/DH research 🛠️ box ✨📚 arxiv.org/abs/2505.23166

"Tell, Don't Show" was accepted to #ACL2025 Findings! 

Our simple approach for literary topic modeling combines the new (language models) with the old (classic LDA) to yield better topics. A possible addition to your CSS/DH research 🛠️ box

✨📚 arxiv.org/abs/2505.23166
zhou Yu (@zhou_yu_ai) 's Twitter Profile Photo

I wrote this blog post to share practical tips on how academics can collaborate with industry to explore alternative funding sources. Amid all the government cuts, I hope this can help other faculty. Feel free to reach out to me if you need more help. The universe conspires

Diyi Yang (@diyi_yang) 's Twitter Profile Photo

🤝 Humans + AI = Better together? Our #ACL2025 tutorial offers an interdisciplinary overview of human-AI collaboration to explore its goals, evaluation, and societal impacts 🤖

Angelina Wang @angelinawang.bsky.social (@ang3linawang) 's Twitter Profile Photo

Have you ever felt that AI fairness was too strict, enforcing fairness when it didn’t seem necessary? How about too narrow, missing a wide range of important harms? We argue that the way to address both of these critiques is to discriminate more 🧵

Shaily (@shaily99) 's Twitter Profile Photo

🖋️ Curious how writing differs across (research) cultures? 🚩 Tired of “cultural” evals that don't consult people? We engaged with researchers to identify & measure ✨cultural norms✨in scientific writing, and show that❗LLMs flatten them❗ 📜 arxiv.org/abs/2506.00784 1/11

Morgan Klaus Scheuerman, PhD (he/him) (@morganklauss) 's Twitter Profile Photo

How can ethical principles translate to the massive data used to train foundation models, like generative AI? Our #CSCW2025 workshop aims to explore how best to define the future of ethical responsibility in large-scale datasets for FM training. Apply here: tinyurl.com/CSCW-data

Percy Liang (@percyliang) 's Twitter Profile Photo

Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team: Tatsunori Hashimoto, Marcel Rød, Neil Band, and Rohith Kuditipudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything: