Kaiser Sun (@kaiserwholearns)'s Twitter Profile
Kaiser Sun

@kaiserwholearns

Ph.D. student at @jhuclsp, human LM that hallucinates. Formerly @MetaAI, @uwnlp, and @AWS. they/them 🏳️‍🌈

ID: 1389284445908213760

Link: https://kaiserwholearns.github.io/ · Joined: 03-05-2021 18:23:28

302 Tweets

977 Followers

464 Following

Stella Li (@stellalisy)'s Twitter Profile Photo

🤯 We cracked RLVR with... Random Rewards?!
Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵
Blogpost: tinyurl.com/spurious-rewar…
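The three reward settings in the thread can be sketched as simple reward functions of the kind an RLVR loop would consume. This is a minimal illustration under assumed interfaces (answer and gold as strings), not the authors' code:

```python
import random

def ground_truth_reward(answer: str, gold: str) -> float:
    """Standard RLVR: reward 1 when the extracted answer matches the gold answer."""
    return 1.0 if answer == gold else 0.0

def incorrect_reward(answer: str, gold: str) -> float:
    """Spurious variant: reward only answers that do NOT match gold."""
    return 1.0 if answer != gold else 0.0

def random_reward(answer: str, gold: str, p: float = 0.5) -> float:
    """Spurious variant: reward is independent of the answer entirely."""
    return 1.0 if random.random() < p else 0.0
```

The surprising claim is that even the last two, which carry no information about correctness, still improve MATH-500 on Qwen2.5-Math-7B.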
Kristina Gligorić (@krisgligoric)'s Twitter Profile Photo

I'm excited to announce that I’ll be joining the Computer Science department at Johns Hopkins University as an Assistant Professor this Fall! I’ll be working on large language models, computational social science, and AI & society—and will be recruiting PhD students. Apply to work with me!

Fangcong Yin (@fangcong_y10593)'s Twitter Profile Photo

Solving complex problems with CoT requires combining different skills.

We can do this by:
🧩 Modifying the CoT data format to be “composable” with other skills
🔥 Training models on each skill
📌 Combining those models

This leads to better 0-shot reasoning on tasks involving skill composition!
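One common way to "combine" separately trained models (an assumption here, not necessarily the paper's exact method) is to average their parameters. A toy sketch with plain dicts standing in for state dicts:

```python
def combine_models(models):
    """Uniformly average parameter values across models sharing one architecture."""
    keys = models[0].keys()
    return {k: sum(m[k] for m in models) / len(models) for k in keys}

# Hypothetical per-skill checkpoints (names and values are made up).
skill_a = {"w1": 0.2, "w2": 1.0}  # e.g., trained on arithmetic CoT
skill_b = {"w1": 0.6, "w2": 0.0}  # e.g., trained on retrieval CoT

merged = combine_models([skill_a, skill_b])
```

Making the CoT data format composable is what lets the merged model chain the skills at inference time, rather than relying on any one checkpoint to know both.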
Tiago Pimentel (@tpimentelms)'s Twitter Profile Photo

A string may get 17 times less probability if tokenised as two symbols (e.g., ⟨he, llo⟩) than as one (e.g., ⟨hello⟩)—by an LM trained from scratch in each situation! Our #acl2025nlp paper proposes an observational method to estimate this causal effect! Longer thread soon!

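The mechanism behind the claim can be illustrated with a toy chain-rule computation. The probabilities below are made up for illustration; this is not the paper's observational estimation method:

```python
import math

def seq_logprob(cond_logprobs):
    """Chain rule: log p(w1..wn) is the sum of per-token conditional log-probs."""
    return sum(cond_logprobs)

# Hypothetical probabilities from two LMs trained from scratch,
# one per tokenisation of the same string "hello".
p_one = seq_logprob([math.log(0.017)])                 # p(<hello>)
p_two = seq_logprob([math.log(0.05), math.log(0.02)])  # p(<he>) * p(<llo> | <he>)

# How many times more probable the one-token reading is here:
ratio = math.exp(p_one - p_two)  # 0.017 / (0.05 * 0.02) = 17x
```

The splits multiply conditional probabilities, so a two-symbol tokenisation can end up far less probable than the single-symbol one even for the identical surface string.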
Niyati Bafna (@bafnaniyati)'s Twitter Profile Photo

We know speech LID systems flunk on accented speech. But why? And what to do about it?🤔Our work arxiv.org/abs/2506.00628 (Interspeech '25) finds that *accent-language confusion* is an important culprit, ties it to the length of feature that a model relies on, and proposes a fix.
Mark Dredze (@mdredze)'s Twitter Profile Photo

Our new paper explores knowledge conflict in LLMs. It also issues a word of warning to those using LLMs as a Judge: the model can't help but inject its own knowledge into its decisions.

Chenghao Yang (@chrome1996)'s Twitter Profile Photo

Have you noticed… 🔍 Aligned LLM generations feel less diverse? 🎯 Base models are decoding-sensitive? 🤔 Generations get more predictable as they progress? 🌲 Tree search fails mid-generation (esp. for reasoning)? We trace these mysteries to LLM probability concentration, and

Nouha Dziri (@nouhadziri)'s Twitter Profile Photo


📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies?

Remember how DeepSeek R1 and o1 impressed us on Olympiad-level math, yet still failed at simple arithmetic 😬

We built a benchmark to find out → OMEGA Ω 📐

💥 We found
CLS (@chengleisi)'s Twitter Profile Photo


Are AI scientists already better than human researchers?

We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts.

Main finding: LLM ideas result in worse projects than human ideas.
Niyati Bafna (@bafnaniyati)'s Twitter Profile Photo

📢 When LLMs solve tasks with a mid-to-low resource input/target language, their output quality is poor. We know that. But can we pin down what breaks inside the LLM? We introduce the 💥translation barrier hypothesis💥 for failed multilingual generation. arxiv.org/abs/2506.22724
