Mohit Iyyer (@mohitiyyer)'s Twitter Profile
Mohit Iyyer

@mohitiyyer

assoc. prof at @umdcs, i work on nlp/llms

ID: 865254426101067778

http://cs.umd.edu/~miyyer · Joined 18-05-2017

965 Tweets

6.6K Followers

1.1K Following

Dayeon (Zoey) Ki (@zoeykii)

🚨New Paper🚨

1/ We often assume that well-written text is easier to translate ✏️

But can #LLMs automatically rewrite inputs to improve machine translation? 🌎

Here's what we found 🧵
Andrew Drozdov (@mrdrozdov)

🚨New RAG Dataset Release🚨 Led by Nandan Thakur: we’ve curated real, long, and complex questions, each requiring multiple retrieved documents covering a diverse set of concepts (i.e., nuggets).

Kabir (@kabirahuja004)

📢 New Paper!

Tired 😴 of reasoning benchmarks full of math & code? In our work, we consider the problem of reasoning about plot holes in stories -- inconsistencies in a storyline that break the internal logic or rules of a story’s world 🌎

W/ <a href="/melaniesclar/">Melanie Sclar</a>, and <a href="/tsvetshop/">tsvetshop</a>

1/n
Tuhin Chakrabarty (@tuhinchakr)

Unlike math/code, writing lacks verifiable rewards. So all we get is slop. To solve this, we train reward models on expert edits that beat SOTA #LLMs by a large margin on a new Writing Quality benchmark. We also reduce #AI slop by using our RMs at test time, boosting alignment with experts.
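"Using our RMs at test time" is typically a best-of-n rerank: sample several drafts and keep the one the reward model scores highest. A minimal sketch, where `generate_candidates` and `reward_model` are hypothetical stand-ins (the paper's actual setup may differ):

```python
def best_of_n(prompt, generate_candidates, reward_model, n=8):
    """Sample n drafts and return the one the reward model scores highest."""
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=lambda text: reward_model(prompt, text))

# Toy demo: a stand-in "reward model" that prefers shorter, less padded drafts.
drafts = ["a concise answer", "a much longer and far more padded answer"]
pick = best_of_n("q", lambda p, n: drafts, lambda p, t: -len(t))
```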
Manya Wadhwa (@manyawadhwa1)

Evaluating language model responses on open-ended tasks is hard! 🤔 We introduce EvalAgent, a framework that identifies nuanced and diverse criteria 📋✍️. EvalAgent identifies 👩‍🏫🎓 expert advice on the web that implicitly addresses the user’s prompt 🧵👇

Prithviraj (Raj) Ammanabrolu (@rajammanabrolu)

Introducing TALES - Text Adventure Learning Environment Suite. A benchmark of a few hundred text environments, from science experiments and embodied cooking to solving murder mysteries. We test over 30 of the best LLM agents and pinpoint failure modes + how to improve. 👨‍💻 pip install tale-suite

Piotr Nawrot (@p_nawrot)

Sparse attention is one of the most promising strategies to unlock long-context processing and long generation reasoning in LLMs.

We performed the most comprehensive study on training-free sparse attention to date.

Here is what we found:
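One common training-free baseline in this space is top-k attention: compute the dense query-key scores, then keep only the k strongest keys and renormalize. A minimal single-query sketch in NumPy (illustrative only, not the study's exact methods):

```python
import numpy as np

def topk_sparse_attention(q, K, V, k=4):
    """Single-query attention that attends to only the k highest-scoring keys.

    Training-free: we reuse the dense QK^T scores and simply mask the rest.
    """
    scores = K @ q / np.sqrt(q.shape[-1])   # (n,) dense scores
    keep = np.argsort(scores)[-k:]          # indices of the top-k keys
    masked = np.full_like(scores, -np.inf)
    masked[keep] = scores[keep]             # drop everything outside top-k
    weights = np.exp(masked - masked[keep].max())
    weights /= weights.sum()                # softmax over the kept keys only
    return weights @ V                      # (d,) attention output

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(32, 8)), rng.normal(size=(32, 8))
out = topk_sparse_attention(q, K, V, k=4)
```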
Yoo Yeon Sung (@yooyeonsung1)

🏆ADVSCORE won an Outstanding Paper Award at #NAACL2025 <a href="/naaclmeeting/">NAACL HLT 2025</a>!!

If you want to learn how to make your benchmark *actually* adversarial, come find me:
📍Poster Session 5 - HC: Human-centered NLP
📅May 1 @ 2PM

Hiring for human-focused AI dev/LLM eval? Let’s talk! 💼
Aran Komatsuzaki (@arankomatsuzaki)

The Leaderboard Illusion

- Identifies systematic issues that have resulted in a distorted playing field of Chatbot Arena

- Identifies 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release
Stanford AI Lab (@stanfordailab)

How do LLMs memorize long sequences of text verbatim? Check out our latest blog post, which shows that verbatim memorization is intertwined with the LM’s general capabilities. ai.stanford.edu/blog/verbatim-…

Katherine Thai (@kthai1618)

After a great team lunch (ft. mango sticky rice) and successful brainstorming sesh, I’m so excited to start working with Pangram Labs for the summer as a research scientist intern! Hard not to be excited when you hear the team talk about what they’re working on—stay tuned :)

Peter West (@peterwesttm)

I’ve been fascinated lately by the question: what kinds of capabilities might base LLMs lose when they are aligned? i.e. where can alignment make models WORSE? I’ve been looking into this with <a href="/ChrisGPotts/">Christopher Potts</a> and here's one piece of the answer: randomness and creativity
Ofir Press (@ofirpress)

We just pushed out a new version of the SWE-bench library that allows you to easily evaluate on an all-new set of 300 tasks in 9 languages.

Ethan Mollick (@emollick)

One of the great ironies of AI writing is that the only people who can detect it with accuracy are people who use AI for writing a lot (at least if you take a majority vote among five such people).

Non-users are no better than chance, and AI detectors are also less accurate.
Daniel Khashabi 🕊️ (@danielkhashabi)

**Certified Mitigation of Worst-Case LLM Copyright Infringement**

TL;DR: We propose BloomScrub, a framework to certifiably remove long verbatim quotes and reduce the risk of copyright violations.

Challenge: Most existing copyright mitigation techniques for LLMs address
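The name suggests a Bloom filter over n-grams of copyrighted text, checked against model output to catch long verbatim quotes. A hypothetical sketch of that general idea (not the paper's implementation; names and parameters are illustrative):

```python
import hashlib

class BloomFilter:
    """A tiny Bloom filter for set membership with no false negatives."""
    def __init__(self, size=1 << 20, hashes=4):
        self.bits = bytearray(size // 8)
        self.size, self.hashes = size, hashes

    def _positions(self, item):
        # Derive `hashes` bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def flag_long_quotes(text, bloom, n=8):
    """Return every n-gram of `text` that appears in the indexed corpus."""
    words = text.split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return [g for g in grams if g in bloom]

# Index all 8-grams of a "copyrighted" passage, then scan generated text.
bloom = BloomFilter()
copyrighted = "it was the best of times it was the worst of times"
cw = copyrighted.split()
for i in range(len(cw) - 7):
    bloom.add(" ".join(cw[i:i + 8]))

hits = flag_long_quotes(
    "she began it was the best of times it was the worst of times and stopped",
    bloom)
clean = flag_long_quotes("these eight words were never seen before anywhere", bloom)
```

Detection has no false negatives by construction; false positives are possible but vanishingly rare at this filter size.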
Wenting Zhao (@wzhao_nlp)

Some personal news: I'll join UMass Amherst CS as an assistant professor in fall 2026. Until then, I'll postdoc at Meta NYC. Reasoning will continue to be my main interest, with a focus on data-centric approaches 🤩 If you're also interested, apply to work with me (PhDs & a postdoc)!

Mohit Iyyer (@mohitiyyer)

GRPO + BLEU is a surprisingly good combination for improving instruction following in LLMs, yielding results on par with those from strong reward models in our experiments! Check out our paper for more 👇
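Part of the appeal is that BLEU against a reference response is cheap and deterministic to compute as a reward. A self-contained sketch of a sentence-level BLEU reward (add-one smoothing, uniform n-gram weights); the GRPO training loop itself is not shown, and the details here are illustrative rather than the paper's exact recipe:

```python
import math
from collections import Counter

def bleu_reward(candidate, reference, max_n=4):
    """Sentence-level BLEU in [0, 1]: smoothed n-gram precisions x brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((c_ngrams & r_ngrams).values())   # clipped matches
        total = max(sum(c_ngrams.values()), 1)
        # add-one smoothing so one empty n-gram order doesn't zero the reward
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))   # penalize short outputs
    return bp * math.exp(log_prec)
```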

Maharshi Gor (@maharshigor)

Can you spot when AI bluffs?🤖 Can you outguess AI—or work with one to dominate trivia?🏁

🏆 We are hosting the first Human–AI coop trivia (Quizzing) competition.

🎲Play, 🛠️build, or ✍🏼write questions... 
..and win prizes 🎁.

🥳 It’s fun, free, and happening this June 🧠🤖👇
jack morris (@jxmnop)

excited to finally share on arxiv what we've known for a while now: All Embedding Models Learn The Same Thing

embeddings from different models are SO similar that we can map between them based on structure alone, without *any* paired data

feels like magic, but it's real: 🧵
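One way to see why structure alone can suffice: rotating an embedding space changes every coordinate but preserves all pairwise inner products, so two spaces that differ by an orthogonal map share their similarity structure exactly. A toy NumPy demonstration of that invariance (illustrative; not the paper's actual mapping method):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))              # "model A" embeddings of 100 items
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # a random orthogonal matrix
Y = X @ Q                                   # "model B": same geometry, rotated basis

def gram(Z):
    """Pairwise inner products: the similarity structure of the space."""
    return Z @ Z.T

# Coordinates differ everywhere, yet the similarity structure is identical,
# which is the signal structure-based matching can exploit.
assert not np.allclose(X, Y)
assert np.allclose(gram(X), gram(Y))
```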