Kaiser Sun (@kaiserwholearns)'s Twitter Profile
Kaiser Sun

@kaiserwholearns

Ph.D. student at @jhuclsp, human LM that hallucinates. Formerly @MetaAI, @uwnlp, and @AWS. they/them 🏳️‍🌈

ID: 1389284445908213760

Link: https://kaiserwholearns.github.io/ · Joined: 03-05-2021 18:23:28

302 Tweets

977 Followers

464 Following

Stella Li (@stellalisy)'s Twitter Profile Photo

🤯 We cracked RLVR with... Random Rewards?!
Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵
Blogpost: tinyurl.com/spurious-rewar…
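The three reward settings in the thread can be sketched as simple reward functions of the kind an RLVR loop would consume. This is a minimal illustration under assumed interfaces (answer and gold as strings), not the authors' code:

```python
import random

def ground_truth_reward(answer: str, gold: str) -> float:
    """Standard RLVR: reward 1 when the extracted answer matches the gold answer."""
    return 1.0 if answer == gold else 0.0

def incorrect_reward(answer: str, gold: str) -> float:
    """Spurious variant: reward only answers that do NOT match gold."""
    return 1.0 if answer != gold else 0.0

def random_reward(answer: str, gold: str, p: float = 0.5) -> float:
    """Spurious variant: reward is independent of the answer entirely."""
    return 1.0 if random.random() < p else 0.0
```

The surprising claim is that even the last two, which carry no information about correctness, still improve MATH-500 on Qwen2.5-Math-7B.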
Kristina Gligorić (@krisgligoric)'s Twitter Profile Photo

I'm excited to announce that I’ll be joining the Computer Science department at Johns Hopkins University as an Assistant Professor this Fall! I’ll be working on large language models, computational social science, and AI & society—and will be recruiting PhD students. Apply to work with me!

Fangcong Yin (@fangcong_y10593)'s Twitter Profile Photo

Solving complex problems with CoT requires combining different skills.

We can do this by:
🧩 Modifying the CoT data format to be “composable” with other skills
🔥 Training models on each skill
📌 Combining those models

This leads to better 0-shot reasoning on tasks involving skill composition!
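One common way to "combine" separately trained models (an assumption here, not necessarily the paper's exact method) is to average their parameters. A toy sketch with plain dicts standing in for state dicts:

```python
def combine_models(models):
    """Uniformly average parameter values across models sharing one architecture."""
    keys = models[0].keys()
    return {k: sum(m[k] for m in models) / len(models) for k in keys}

# Hypothetical per-skill checkpoints (names and values are made up).
skill_a = {"w1": 0.2, "w2": 1.0}  # e.g., trained on arithmetic CoT
skill_b = {"w1": 0.6, "w2": 0.0}  # e.g., trained on retrieval CoT

merged = combine_models([skill_a, skill_b])
```

Making the CoT data format composable is what lets the merged model chain the skills at inference time, rather than relying on any one checkpoint to know both.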
Tiago Pimentel (@tpimentelms)'s Twitter Profile Photo

A string may get 17 times less probability if tokenised as two symbols (e.g., ⟨he, llo⟩) than as one (e.g., ⟨hello⟩)—by an LM trained from scratch in each situation! Our #acl2025nlp paper proposes an observational method to estimate this causal effect! Longer thread soon!

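The mechanism behind the claim can be illustrated with a toy chain-rule computation. The probabilities below are made up for illustration; this is not the paper's observational estimation method:

```python
import math

def seq_logprob(cond_logprobs):
    """Chain rule: log p(w1..wn) is the sum of per-token conditional log-probs."""
    return sum(cond_logprobs)

# Hypothetical probabilities from two LMs trained from scratch,
# one per tokenisation of the same string "hello".
p_one = seq_logprob([math.log(0.017)])                 # p(<hello>)
p_two = seq_logprob([math.log(0.05), math.log(0.02)])  # p(<he>) * p(<llo> | <he>)

# How many times more probable the one-token reading is here:
ratio = math.exp(p_one - p_two)  # 0.017 / (0.05 * 0.02) = 17x
```

The splits multiply conditional probabilities, so a two-symbol tokenisation can end up far less probable than the single-symbol one even for the identical surface string.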
Niyati Bafna (@bafnaniyati)'s Twitter Profile Photo

We know speech LID systems flunk on accented speech. But why? And what to do about it?🤔Our work arxiv.org/abs/2506.00628 (Interspeech '25) finds that *accent-language confusion* is an important culprit, ties it to the length of feature that a model relies on, and proposes a fix.
Mark Dredze (@mdredze)'s Twitter Profile Photo

Our new paper explores knowledge conflict in LLMs. It also issues a word of warning to those using LLMs as a Judge: the model can't help but inject its own knowledge into its decisions.

Chenghao Yang (@chrome1996)'s Twitter Profile Photo

Have you noticed… 🔍 Aligned LLM generations feel less diverse? 🎯 Base models are decoding-sensitive? 🤔 Generations get more predictable as they progress? 🌲 Tree search fails mid-generation (esp. for reasoning)? We trace these mysteries to LLM probability concentration, and

Nouha Dziri (@nouhadziri)'s Twitter Profile Photo


📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies?

Remember how DeepSeek R1 and o1 impressed us on Olympiad-level math, yet still failed at simple arithmetic 😬

We built a benchmark to find out → OMEGA Ω 📐

💥 We found
CLS (@chengleisi)'s Twitter Profile Photo


Are AI scientists already better than human researchers?

We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts.

Main finding: LLM ideas result in worse projects than human ideas.
Niyati Bafna (@bafnaniyati)'s Twitter Profile Photo

📢 When LLMs solve tasks with a mid-to-low resource input/target language, their output quality is poor. We know that. But can we pin down what breaks inside the LLM? We introduce the 💥translation barrier hypothesis💥 for failed multilingual generation. arxiv.org/abs/2506.22724
