Mohit Iyyer (@mohitiyyer)'s Twitter Profile
Mohit Iyyer

@mohitiyyer

assoc. prof at @umdcs, i work on nlp/llms

ID: 865254426101067778

http://cs.umd.edu/~miyyer · Joined 18-05-2017

965 Tweets

6.6K Followers

1.1K Following

Dayeon (Zoey) Ki (@zoeykii)

🚨New Paper🚨

1/ We often assume that well-written text is easier to translate ✏️

But can #LLMs automatically rewrite inputs to improve machine translation? 🌎

Here's what we found 🧵
Andrew Drozdov (@mrdrozdov)

🚨New RAG Dataset Release🚨 Led by Nandan Thakur: we’ve curated real, long, and complex questions, each requiring multiple retrieved documents covering a diverse set of concepts (i.e., nuggets).

Kabir (@kabirahuja004)

📢 New Paper!

Tired 😴 of reasoning benchmarks full of math & code? In our work, we consider the problem of reasoning about plot holes in stories -- inconsistencies in a storyline that break the internal logic or rules of a story’s world 🌎

W/ <a href="/melaniesclar/">Melanie Sclar</a>, and <a href="/tsvetshop/">tsvetshop</a>

1/n
Tuhin Chakrabarty (@tuhinchakr)

Unlike math/code, writing lacks verifiable rewards. So all we get is slop. To solve this, we train reward models on expert edits that beat SOTA #LLMs by a large margin on a new Writing Quality benchmark. We also reduce #AI slop by using our RMs at test time, boosting alignment with experts.
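"Using our RMs at test time" is typically a best-of-n rerank: sample several drafts and keep the one the reward model scores highest. A minimal sketch, where `generate_candidates` and `reward_model` are hypothetical stand-ins (the paper's actual setup may differ):

```python
def best_of_n(prompt, generate_candidates, reward_model, n=8):
    """Sample n drafts and return the one the reward model scores highest."""
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=lambda text: reward_model(prompt, text))

# Toy demo: a stand-in "reward model" that prefers shorter, less padded drafts.
drafts = ["a concise answer", "a much longer and far more padded answer"]
pick = best_of_n("q", lambda p, n: drafts, lambda p, t: -len(t))
```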
Manya Wadhwa (@manyawadhwa1)

Evaluating language model responses on open-ended tasks is hard! 🤔 We introduce EvalAgent, a framework that identifies nuanced and diverse criteria 📋✍️. EvalAgent identifies 👩‍🏫🎓 expert advice on the web that implicitly addresses the user’s prompt 🧵👇

Prithviraj (Raj) Ammanabrolu (@rajammanabrolu)

Introducing TALES - Text Adventure Learning Environment Suite. A benchmark of a few hundred text environments, from science experiments and embodied cooking to solving murder mysteries. We test over 30 of the best LLM agents and pinpoint failure modes + how to improve. 👨‍💻 pip install tale-suite

Piotr Nawrot (@p_nawrot)

Sparse attention is one of the most promising strategies to unlock long-context processing and long generation reasoning in LLMs.

We performed the most comprehensive study on training-free sparse attention to date.

Here is what we found:
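One common training-free baseline in this space is top-k attention: compute the dense query-key scores, then keep only the k strongest keys and renormalize. A minimal single-query sketch in NumPy (illustrative only, not the study's exact methods):

```python
import numpy as np

def topk_sparse_attention(q, K, V, k=4):
    """Single-query attention that attends to only the k highest-scoring keys.

    Training-free: we reuse the dense QK^T scores and simply mask the rest.
    """
    scores = K @ q / np.sqrt(q.shape[-1])   # (n,) dense scores
    keep = np.argsort(scores)[-k:]          # indices of the top-k keys
    masked = np.full_like(scores, -np.inf)
    masked[keep] = scores[keep]             # drop everything outside top-k
    weights = np.exp(masked - masked[keep].max())
    weights /= weights.sum()                # softmax over the kept keys only
    return weights @ V                      # (d,) attention output

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(32, 8)), rng.normal(size=(32, 8))
out = topk_sparse_attention(q, K, V, k=4)
```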
Yoo Yeon Sung (@yooyeonsung1)

🏆ADVSCORE won an Outstanding Paper Award at #NAACL2025 <a href="/naaclmeeting/">NAACL HLT 2025</a>!!

If you want to learn how to make your benchmark *actually* adversarial, come find me:
📍Poster Session 5 - HC: Human-centered NLP
📅May 1 @ 2PM

Hiring for human-focused AI dev/LLM eval? Let’s talk! 💼
Aran Komatsuzaki (@arankomatsuzaki)

The Leaderboard Illusion

- Identifies systematic issues that have resulted in a distorted playing field of Chatbot Arena

- Identifies 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release
Stanford AI Lab (@stanfordailab)

How do LLMs memorize long sequences of text verbatim? Check out our latest blog post, which shows that verbatim memorization is intertwined with the LM’s general capabilities. ai.stanford.edu/blog/verbatim-…

Katherine Thai (@kthai1618)

After a great team lunch (ft. mango sticky rice) and successful brainstorming sesh, I’m so excited to start working with Pangram Labs for the summer as a research scientist intern! Hard not to be excited when you hear the team talk about what they’re working on—stay tuned :)

Peter West (@peterwesttm)

I’ve been fascinated lately by the question: what kinds of capabilities might base LLMs lose when they are aligned? i.e. where can alignment make models WORSE? I’ve been looking into this with <a href="/ChrisGPotts/">Christopher Potts</a> and here's one piece of the answer: randomness and creativity
Ofir Press (@ofirpress)

We just pushed out a new version of the SWE-bench library that allows you to easily evaluate on an all-new set of 300 tasks in 9 languages.

Ethan Mollick (@emollick)

One of the great ironies of AI writing is that the only people who can detect it with accuracy are people who use AI for writing a lot (at least if you take a majority vote among five such people).

Non-users are no better than chance, and AI detectors are also less accurate.
Daniel Khashabi 🕊️ (@danielkhashabi)

**Certified Mitigation of Worst-Case LLM Copyright Infringement**

TL;DR: We propose BloomScrub, a framework to certifiably remove long verbatim quotes and reduce the risk of copyright violations.

Challenge: Most existing copyright mitigation techniques for LLMs address
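The name suggests a Bloom filter over n-grams of copyrighted text, checked against model output to catch long verbatim quotes. A hypothetical sketch of that general idea (not the paper's implementation; names and parameters are illustrative):

```python
import hashlib

class BloomFilter:
    """A tiny Bloom filter for set membership with no false negatives."""
    def __init__(self, size=1 << 20, hashes=4):
        self.bits = bytearray(size // 8)
        self.size, self.hashes = size, hashes

    def _positions(self, item):
        # Derive `hashes` bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def flag_long_quotes(text, bloom, n=8):
    """Return every n-gram of `text` that appears in the indexed corpus."""
    words = text.split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return [g for g in grams if g in bloom]

# Index all 8-grams of a "copyrighted" passage, then scan generated text.
bloom = BloomFilter()
copyrighted = "it was the best of times it was the worst of times"
cw = copyrighted.split()
for i in range(len(cw) - 7):
    bloom.add(" ".join(cw[i:i + 8]))

hits = flag_long_quotes(
    "she began it was the best of times it was the worst of times and stopped",
    bloom)
clean = flag_long_quotes("these eight words were never seen before anywhere", bloom)
```

Detection has no false negatives by construction; false positives are possible but vanishingly rare at this filter size.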
Wenting Zhao (@wzhao_nlp)

Some personal news: I'll join UMass Amherst CS as an assistant professor in fall 2026. Until then, I'll postdoc at Meta NYC. Reasoning will continue to be my main interest, with a focus on data-centric approaches 🤩 If you're also interested, apply to work with me (PhDs & a postdoc)!

Mohit Iyyer (@mohitiyyer)

GRPO + BLEU is a surprisingly good combination for improving instruction following in LLMs, yielding results on par with those from strong reward models in our experiments! Check out our paper for more 👇
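Part of the appeal is that BLEU against a reference response is cheap and deterministic to compute as a reward. A self-contained sketch of a sentence-level BLEU reward (add-one smoothing, uniform n-gram weights); the GRPO training loop itself is not shown, and the details here are illustrative rather than the paper's exact recipe:

```python
import math
from collections import Counter

def bleu_reward(candidate, reference, max_n=4):
    """Sentence-level BLEU in [0, 1]: smoothed n-gram precisions x brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((c_ngrams & r_ngrams).values())   # clipped matches
        total = max(sum(c_ngrams.values()), 1)
        # add-one smoothing so one empty n-gram order doesn't zero the reward
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))   # penalize short outputs
    return bp * math.exp(log_prec)
```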

Maharshi Gor (@maharshigor)

Can you spot when AI bluffs?🤖 Can you outguess AI—or work with one to dominate trivia?🏁

🏆 We are hosting the first Human–AI coop trivia (Quizzing) competition.

🎲Play, 🛠️build, or ✍🏼write questions... 
..and win prizes 🎁.

🥳 It’s fun, free, and happening this June 🧠🤖👇
jack morris (@jxmnop)

excited to finally share on arxiv what we've known for a while now: All Embedding Models Learn The Same Thing

embeddings from different models are SO similar that we can map between them based on structure alone, without *any* paired data

feels like magic, but it's real: 🧵
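One way to see why structure alone can suffice: rotating an embedding space changes every coordinate but preserves all pairwise inner products, so two spaces that differ by an orthogonal map share their similarity structure exactly. A toy NumPy demonstration of that invariance (illustrative; not the paper's actual mapping method):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))              # "model A" embeddings of 100 items
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # a random orthogonal matrix
Y = X @ Q                                   # "model B": same geometry, rotated basis

def gram(Z):
    """Pairwise inner products: the similarity structure of the space."""
    return Z @ Z.T

# Coordinates differ everywhere, yet the similarity structure is identical,
# which is the signal structure-based matching can exploit.
assert not np.allclose(X, Y)
assert np.allclose(gram(X), gram(Y))
```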