Jeffrey Cheng (@jeff_cheng_77)'s Twitter Profile
Jeffrey Cheng

@jeff_cheng_77

masters @jhuclsp

ID: 1771269276474785793

Link: http://nexync.github.io · Joined: 22-03-2024 20:14:49

27 Tweets

135 Followers

100 Following

Jeffrey Cheng (@jeff_cheng_77)'s Twitter Profile Photo

Additional reasoning from scaling test-time compute has dramatic impacts on a model's confidence in its answers! Find out more in our paper led by William Jurayj.

Nishant Balepur (@nishantbalepur)'s Twitter Profile Photo

🚨 New Position Paper 🚨

Multiple choice evals for LLMs are simple and popular, but we know they are awful 😬

We complain they're full of errors, saturated, and test nothing meaningful, so why do we still use them? 🫠

Here's why MCQA evals are broken, and how to fix them 🧵
Orion Weller @ ICLR 2025 (@orionweller)'s Twitter Profile Photo

Ever wonder how test-time compute would do in retrieval? 🤔

introducing ✨rank1✨

rank1 is distilled from R1 & designed for reranking. 

rank1 is state-of-the-art at complex reranking tasks in reasoning, instruction-following, and general semantics (often 2x RankLlama 🤯)

🧵
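To make the reranking setup concrete, here is a minimal sketch of what a reranker does (this is not rank1's actual code — a trivial word-overlap scorer stands in for the model): score each (query, document) pair, then sort documents by score.

```python
# Toy reranking sketch: a reranker is a scoring function over
# (query, document) pairs; reranking sorts candidates by that score.
def rerank(query, docs, score):
    return sorted(docs, key=lambda d: score(query, d), reverse=True)

# Hypothetical stand-in scorer: count shared words between query and doc.
# A real reranker like rank1 would use a model here instead.
def overlap_score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

docs = ["cats sleep a lot", "dogs bark loudly", "why do cats sleep so much"]
rerank("why do cats sleep", docs, overlap_score)
# → ['why do cats sleep so much', 'cats sleep a lot', 'dogs bark loudly']
```

The interesting part of rank1 is what fills the `score` slot: a reasoning model distilled from R1, which generates a chain of thought before judging relevance.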
Niloofar (on faculty job market!) (@niloofar_mire)'s Twitter Profile Photo

Adding or removing PII in LLM training can *unlock previously unextractable* info. 

Even if “John.Mccarthy” never reappears, enough Johns & Mccarthys during post-training can make it extractable later! 

New paper on PII memorization & n-gram overlaps:
arxiv.org/abs/2502.15680
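As a toy illustration of the n-gram-overlap idea (my own simple metric, not necessarily the paper's definition): even when a PII string never appears verbatim, most of its character n-grams can still be covered by other text.

```python
# Illustrative only: fraction of a PII string's character n-grams
# that also appear in some training text.
def char_ngrams(s, n=3):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_overlap(pii, text, n=3):
    """Fraction of the PII string's n-grams found in the text."""
    grams = char_ngrams(pii.lower(), n)
    return len(grams & char_ngrams(text.lower(), n)) / len(grams)

# "john.mccarthy" never appears verbatim, but its pieces mostly do:
ngram_overlap("john.mccarthy", "emails from john and from mccarthy")
# → 8/11 ≈ 0.727 (only the 3-grams spanning the "." are missing)
```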
Benjamin Van Durme (@ben_vandurme)'s Twitter Profile Photo

Our latest on compressed representations: Key-Value Distillation (KVD). Query-independent transformer compression, with offline supervised distillation.
Alexander Martin (@alexdmartin314)'s Twitter Profile Photo

Wish you could get a Wikipedia style article for unfolding events?

Introducing WikiVideo: a new multimodal task and benchmark for Wikipedia-style article generation from multiple videos!
Jeffrey Cheng (@jeff_cheng_77)'s Twitter Profile Photo

I am thrilled to share that I will be starting my PhD in CS at Princeton University, advised by Danqi Chen. Many thanks to all those who have supported me on this journey: my family, friends, and my wonderful mentors Benjamin Van Durme, Marc Marone, and Orion Weller at JHU CLSP.

Mehrdad Farajtabar (@mfarajtabar)'s Twitter Profile Photo

🧵 1/8 The Illusion of Thinking: Are reasoning models like o1/o3, DeepSeek-R1, and Claude 3.7 Sonnet really "thinking"? 🤔 Or are they just throwing more compute towards pattern matching?

The new Large Reasoning Models (LRMs) show promising gains on math and coding benchmarks,
hyunji amy lee (@hyunji_amy_lee)'s Twitter Profile Photo

🚨 Want models to better utilize and ground on the provided knowledge? We introduce Context-INformed Grounding Supervision (CINGS)!

Training LLMs with CINGS significantly boosts grounding abilities in both text and vision-language models compared to standard instruction tuning.
Tianyu Gao (@gaotianyu1350)'s Twitter Profile Photo

Check out our work on fair comparison among KV cache reduction methods and PruLong, one of the most effective, easy-to-use memory reduction methods for long-context LMs!

Guilherme Penedo (@gui_penedo)'s Twitter Profile Photo

We have finally released the 📝paper for 🥂FineWeb2, our large multilingual pre-training dataset.

Along with general (and exhaustive) multilingual work, we introduce a concept that can also improve English performance: deduplication-based upsampling, which we call rehydration.
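One way to read "deduplication-based upsampling" — sketched here as a guess from the description alone, since the paper's actual scheme may differ — is: deduplicate exact copies, then repeat each surviving document in proportion to how often it originally appeared, up to a cap, so that widely duplicated (and thus presumably popular) documents regain weight.

```python
from collections import Counter

# Hedged toy sketch of "rehydration": dedupe, then upsample each unique
# document by its (capped) original duplicate count.
def rehydrate(docs, cap=3):
    counts = Counter(docs)  # duplicates per unique document
    out = []
    for doc, n in counts.items():  # Counter preserves insertion order
        out.extend([doc] * min(n, cap))
    return out

corpus = ["a", "a", "a", "a", "b", "c", "c"]
rehydrate(corpus)  # → ['a', 'a', 'a', 'b', 'c', 'c']
```

Here `"a"` had four copies but is capped at three, while singletons pass through unchanged.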
Jack Jingyu Zhang @ NAACL🌵 (@jackjingyuzhang)'s Twitter Profile Photo

Introducing 𝐉𝐚𝐢𝐥𝐛𝐫𝐞𝐚𝐤 𝐃𝐢𝐬𝐭𝐢𝐥𝐥𝐚𝐭𝐢𝐨𝐧 🧨 (EMNLP '25 Findings)

We propose a generate-then-select pipeline to "distill" effective jailbreak attacks into safety benchmarks, ensuring eval results are reproducible and robust to benchmark saturation & contamination🧵
Orion Weller @ ICLR 2025 (@orionweller)'s Twitter Profile Photo

Instructions/reasoning are now everywhere in retrieval - we want embeddings to do it all! 🚀

But... is it even possible? 🤔

Turns out, it's not possible for single-vector models 😱 theoretically and empirically! To make it obvious we OSS a simple eval SoTA models flop on!

🧵
Orion Weller @ ICLR 2025 (@orionweller)'s Twitter Profile Photo

XLM-R has been SOTA for 6 years for multilingual encoders. That's an eternity in AI 🤯

Time for an upgrade. Introducing mmBERT: 2-4x faster than previous models ⚡ while even beating o3 and Gemini 2.5 Pro 🔥

+ open models & training data - try it now!

How did we do it? 🧵
Adithya Bhaskar (@adithyanlp)'s Twitter Profile Photo

Language models that think, chat better.

We used longCoT (w/ reward model) for RLHF instead of math, and it just works. Llama-3.1-8B-Instruct + 14K ex beats GPT-4o (!) on chat & creative writing, & even Claude-3.7-Sonnet (thinking) on AlpacaEval2 and WildBench!

Read on. 🧵

1/8
Jiacheng Liu (@liujc1998)'s Twitter Profile Photo

Ever wondered what CAN'T be transformed by Transformers? 🪨

I wrote a fun blog post on finding "fixed points" of your LLMs. If you prompt it with a fixed point token, the LLM is gonna decode it repeatedly forever, guaranteed.

There's some connection with LLMs' repetition issue.
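The idea can be illustrated with a toy stand-in for greedy decoding (my own sketch, not the blog post's code): treat the greedy next-token step as a function over the vocabulary. A fixed-point token maps to itself, so once decoding reaches it, the output loops on it forever.

```python
# A "fixed point" of a next-token function f is a token t with f(t) == t.
def find_fixed_points(next_token, vocab):
    return [t for t in vocab if next_token(t) == t]

# Hypothetical transition table standing in for an LLM's greedy step.
table = {"a": "b", "b": "b", "c": "a"}
step = lambda t: table[t]

fixed = find_fixed_points(step, table.keys())  # → ['b']

# Greedy decoding from any start eventually hits and repeats a fixed point.
def decode(start, steps):
    out, t = [], start
    for _ in range(steps):
        t = step(t)
        out.append(t)
    return out

decode("c", 4)  # → ['a', 'b', 'b', 'b']
```

In a real LLM the "function" also depends on the preceding context, which is what makes finding genuine fixed points nontrivial.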