UW NLP (@uwnlp)'s Twitter Profile

UW NLP

@uwnlp

The NLP group at the University of Washington.

ID: 3716745856

Joined: 20-09-2015 10:26:25

1.1K Tweets

12.12K Followers

170 Following

Kabir (@kabirahuja004):


📢 New Paper!

Tired 😴 of reasoning benchmarks full of math & code? In our work we consider the problem of reasoning for plot holes in stories -- inconsistencies in a storyline that break the internal logic or rules of a story’s world 🌎

W/ Melanie Sclar (@melaniesclar) and tsvetshop (@tsvetshop)

1/n
Kunal Jha (@kjha02):


Our new paper (first one of my PhD!) on cooperative AI reveals a surprising insight: Environment Diversity > Partner Diversity.

Agents trained in self-play across many environments learn cooperative norms that transfer to humans on novel tasks.

shorturl.at/fqsNN🧵
Melanie Sclar (@melaniesclar):

See our work on procedurally generating challenging reasoning problems for detecting inconsistencies in stories! FlawedFictions is a great example of what I'm most excited about: reliable synthetic data for reasoning in under-explored domains. (I'll be at ICLR to chat, DMs open!)

Avinandan Bose (@avibose22):


🧠 Your LLM should model how you think, not reduce you to preassigned traits
📢 Introducing LoRe: a low-rank reward modeling framework for personalized RLHF
❌ Demographic grouping/handcrafted traits
✅ Infers implicit preferences
✅ Few-shot adaptation
📄 arxiv.org/abs/2504.14439
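The announcement above describes LoRe only in general terms, so here is a hedged sketch of the *idea* of low-rank personalized reward modeling, not the paper's actual method or code: a user's reward is a user-specific mixture of a small shared basis of reward functions, and "few-shot adaptation" means fitting only the K-dim mixture weights from a handful of preference pairs via a Bradley–Terry likelihood. All names, dimensions, and the synthetic data below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (not the paper's model): K shared reward "basis"
# functions score each response; a user is represented only by a K-dim
# weight vector over those bases, never by demographic traits.
K, D = 4, 8                      # K basis rewards, D-dim response features
basis = rng.normal(size=(K, D))  # shared low-rank reward basis (assumed pretrained, frozen)

def basis_scores(x):
    """Score a response feature vector under each of the K basis rewards."""
    return basis @ x

def user_reward(w_u, x):
    """Personalized reward: a user-specific mixture of the shared bases."""
    return w_u @ basis_scores(x)

def fit_user(prefs, lr=0.5, steps=200):
    """Few-shot adaptation: fit only the K-dim user vector from
    preference pairs (x_win, x_lose) via the Bradley-Terry likelihood."""
    w = np.zeros(K)
    for _ in range(steps):
        grad = np.zeros(K)
        for x_w, x_l in prefs:
            d = basis_scores(x_w) - basis_scores(x_l)
            p = 1.0 / (1.0 + np.exp(-(w @ d)))  # P(preferred response wins)
            grad += (1.0 - p) * d               # gradient of the log-likelihood
        w += lr * grad / len(prefs)
    return w

# A simulated user with hidden preference weights, observed only through
# a handful of pairwise comparisons.
w_true = np.array([2.0, -1.0, 0.5, 0.0])
pairs = []
for _ in range(16):
    a, b = rng.normal(size=D), rng.normal(size=D)
    pairs.append((a, b) if user_reward(w_true, a) >= user_reward(w_true, b) else (b, a))

w_hat = fit_user(pairs)

# The fitted K-dim vector should rank unseen response pairs like the user does.
agree = 0
for _ in range(200):
    a, b = rng.normal(size=D), rng.normal(size=D)
    agree += (user_reward(w_hat, a) > user_reward(w_hat, b)) == (
        user_reward(w_true, a) > user_reward(w_true, b)
    )
print(agree / 200)
```

The design point this toy makes explicit: personalization lives entirely in a low-dimensional vector per user, so adapting to a new user needs only a few comparisons rather than retraining the shared reward model.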
Liwei Jiang (@liweijianglw):

Cracking the 𝐦𝐮𝐥𝐭𝐢-𝐭𝐮𝐫𝐧 safety challenge! ⚡️𝐗-𝐓𝐞𝐚𝐦𝐢𝐧𝐠⚡️ is a scalable red-teaming framework revealing diverse multi-turn LM vulnerabilities. Sneak peek: a 96.2% attack success rate on Claude 3.7 despite its single-turn robustness, plus the largest multi-turn safety dataset!

Ximing Lu (@gximing):


With the rise of R1, search seems out of fashion? We prove the opposite! 😎

Introducing Retro-Search 🌈: an MCTS-inspired search algorithm that RETROspectively revises R1’s reasoning traces to synthesize untaken, new reasoning paths that are better 💡, yet shorter in length ⚡️.
Avinandan Bose (@avibose22):


Time to stress-test your AI agents — say hello to DoomArena 🔍🤖

A modular framework to red-team AI agents in realistic threat settings.
Plug in attacks, swap threat models, and see what breaks.
Built for adaptability, designed for chaos.
Live now 🔧🕵️‍♂️🔥: github.com/ServiceNow/Doo…
Melanie Sclar (@melaniesclar):

Excited to be at #ICLR2025 🤩

I'll be giving an oral presentation for Creativity Index on Fri 25th 11:06, Garnet 212&219 🎙️

I'll also be presenting posters:
📍 ExploreToM, Sat 26th 10:00, Hall 3 + 2B #49
📍 CreativityIndex, Fri 25th 10:30, Hall 3 + 2B #618

Hope to see you there!

Rulin Shao (@rulinshao):

Meet ReasonIR-8B✨the first retriever specifically trained for reasoning tasks! Our challenging synthetic training data unlocks SOTA scores on reasoning IR and RAG benchmarks. ReasonIR-8B ranks 1st on BRIGHT and outperforms search engine and retriever baselines on MMLU and GPQA🔥

Wenting Zhao (@wzhao_nlp):


Excited to announce our workshop on Visions of Language Modeling at COLM'25! 🔥

We thought that current LM research overly focuses on a narrow set of popular topics (e.g., test-time scaling and LLM agents), and we'd love to bring some entropy back 💪 To do this, we invited a
Peter West (@peterwesttm):

Very excited for this unique workshop we're hosting at COLM -- rather than asking for submissions, we have a terrific, diverse set of speakers giving fresh perspectives on the future of LMs. Don't miss it!

Stella Li (@stellalisy):


🤯 We cracked RLVR with... Random Rewards?!
Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵
Blogpost: tinyurl.com/spurious-rewar…
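The three reward conditions in the thread above can be made concrete with a hedged sketch of the reward functions they imply (function names are hypothetical; the actual work trains Qwen2.5-Math-7B with RL, which this toy does not reproduce):

```python
import random

def ground_truth_reward(answer, gold):
    """Verifiable reward: 1 only if the rollout's answer matches the gold answer."""
    return 1.0 if answer == gold else 0.0

def incorrect_reward(answer, gold):
    """Spurious reward: 1 only for *wrong* answers."""
    return 1.0 if answer != gold else 0.0

def random_reward(answer, gold, rng=random.Random(0)):
    """Spurious reward: a coin flip, carrying no information about correctness."""
    return float(rng.random() < 0.5)

# Toy rollouts as (model answer, gold answer) pairs.
rollouts = [("42", "42"), ("41", "42"), ("42", "42")]
print([ground_truth_reward(a, g) for a, g in rollouts])  # [1.0, 0.0, 1.0]
print([incorrect_reward(a, g) for a, g in rollouts])     # [0.0, 1.0, 0.0]
```

The surprising claim is that even the last two signals, which by construction carry zero or negative information about correctness, still improved MATH-500 when used as RLVR training rewards.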
Yizhong Wang (@yizhongwyz):


Thrilled to announce that I will be joining UT Austin Computer Science (@UTAustin, @UTCompSci) as an assistant professor in fall 2026!

I will continue working on language models, data challenges, learning paradigms, & AI for innovation. Looking forward to teaming up with new students & colleagues! 🤠🤘
Jaehun Jung (@jaehunjung_com):


Data curation is crucial for LLM reasoning, but how do we know whether our dataset is overfit to one benchmark or generalizes to unseen distributions? 🤔

𝐃𝐚𝐭𝐚 𝐝𝐢𝐯𝐞𝐫𝐬𝐢𝐭𝐲 is key: when measured correctly, it strongly predicts model generalization in reasoning tasks! 🧵
Sahil Verma (@sahil1v):


🚨 New Paper! 🚨
Guard models: slow, language-specific, and modality-limited?

Meet OmniGuard, which detects harmful prompts across multiple languages & modalities with a single approach, achieving SOTA performance in all 3 modalities while being 120X faster 🚀

arxiv.org/abs/2505.23856
Jihan Yao (@jihan_yao):


We introduce MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation

✅ Reliable: 94.3% agreement with human judgment
✅ Comprehensive: 4 modality combinations × 49 tasks × 937 instructions

🔍Results and Takeaways:

> GPT-Image-1 from OpenAI
Yike Wang (@yikewang_):


LLMs are helpful for scientific research — but will they continuously be helpful?

Introducing 🔍ScienceMeter: current knowledge update methods enable 86% preservation of prior scientific knowledge, 72% acquisition of new, and 38%+ projection of future (arxiv.org/abs/2505.24302).
Liwei Jiang (@liweijianglw):

🛡️ We present 𝐒𝐞𝐥𝐟-𝐑𝐞𝐝𝐓𝐞𝐚𝐦, a 𝐟𝐮𝐥𝐥𝐲 𝐨𝐧𝐥𝐢𝐧𝐞 𝐬𝐞𝐥𝐟-𝐩𝐥𝐚𝐲 𝐦𝐮𝐥𝐭𝐢-𝐚𝐠𝐞𝐧𝐭 𝐫𝐞𝐢𝐧𝐟𝐨𝐫𝐜𝐞𝐦𝐞𝐧𝐭 𝐥𝐞𝐚𝐫𝐧𝐢𝐧𝐠 (𝐌𝐀𝐑𝐋) 𝐚𝐥𝐠𝐨𝐫𝐢𝐭𝐡𝐦 that co-evolves an Attacker and a Defender—both played by the same LM policy—in a continuous training

Joongwon Kim (@danieljwkim):

Can we improve Llama 3’s reasoning abilities through post-training only? Introducing ASTRO, our new framework that teaches LLMs to perform in-context search and generate long CoT to solve math problems, via SFT and RL. Work done at @aiatmeta. 📄 Paper: arxiv.org/abs/2507.00417