Jeffrey Cheng (@jeff_cheng_77)'s Twitter Profile
Jeffrey Cheng

@jeff_cheng_77

masters @jhuclsp

ID: 1771269276474785793

Link: http://nexync.github.io · Joined: 22-03-2024 20:14:49

27 Tweets

135 Followers

100 Following

Jeffrey Cheng (@jeff_cheng_77)'s Twitter Profile Photo

Additional reasoning from scaling test-time compute has dramatic impacts on a model's confidence in its answers! Find out more in our paper led by William Jurayj.

Nishant Balepur (@nishantbalepur)'s Twitter Profile Photo

🚨 New Position Paper 🚨

Multiple choice evals for LLMs are simple and popular, but we know they are awful 😬

We complain they're full of errors, saturated, and test nothing meaningful, so why do we still use them? 🫠

Here's why MCQA evals are broken, and how to fix them 🧵
Orion Weller @ ICLR 2025 (@orionweller)'s Twitter Profile Photo

Ever wonder how test-time compute would do in retrieval? 🤔

introducing ✨rank1✨

rank1 is distilled from R1 & designed for reranking. 

rank1 is state-of-the-art at complex reranking tasks in reasoning, instruction-following, and general semantics (often 2x RankLlama 🤯)

🧵
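To make the reranking setup concrete, here is a minimal sketch of what a reranker does (this is not rank1's actual code — a trivial word-overlap scorer stands in for the model): score each (query, document) pair, then sort documents by score.

```python
# Toy reranking sketch: a reranker is a scoring function over
# (query, document) pairs; reranking sorts candidates by that score.
def rerank(query, docs, score):
    return sorted(docs, key=lambda d: score(query, d), reverse=True)

# Hypothetical stand-in scorer: count shared words between query and doc.
# A real reranker like rank1 would use a model here instead.
def overlap_score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

docs = ["cats sleep a lot", "dogs bark loudly", "why do cats sleep so much"]
rerank("why do cats sleep", docs, overlap_score)
# → ['why do cats sleep so much', 'cats sleep a lot', 'dogs bark loudly']
```

The interesting part of rank1 is what fills the `score` slot: a reasoning model distilled from R1, which generates a chain of thought before judging relevance.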
Niloofar (on faculty job market!) (@niloofar_mire)'s Twitter Profile Photo

Adding or removing PII in LLM training can *unlock previously unextractable* info. 

Even if “John.Mccarthy” never reappears, enough Johns & Mccarthys during post-training can make it extractable later! 

New paper on PII memorization & n-gram overlaps:
arxiv.org/abs/2502.15680
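As a toy illustration of the n-gram-overlap idea (my own simple metric, not necessarily the paper's definition): even when a PII string never appears verbatim, most of its character n-grams can still be covered by other text.

```python
# Illustrative only: fraction of a PII string's character n-grams
# that also appear in some training text.
def char_ngrams(s, n=3):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_overlap(pii, text, n=3):
    """Fraction of the PII string's n-grams found in the text."""
    grams = char_ngrams(pii.lower(), n)
    return len(grams & char_ngrams(text.lower(), n)) / len(grams)

# "john.mccarthy" never appears verbatim, but its pieces mostly do:
ngram_overlap("john.mccarthy", "emails from john and from mccarthy")
# → 8/11 ≈ 0.727 (only the 3-grams spanning the "." are missing)
```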
Benjamin Van Durme (@ben_vandurme)'s Twitter Profile Photo

Our latest on compressed representations: Key-Value Distillation (KVD). Query-independent transformer compression, with offline supervised distillation.
Alexander Martin (@alexdmartin314)'s Twitter Profile Photo

Wish you could get a Wikipedia style article for unfolding events?

Introducing WikiVideo: a new multimodal task and benchmark for Wikipedia-style article generation from multiple videos!
Jeffrey Cheng (@jeff_cheng_77)'s Twitter Profile Photo

I am thrilled to share that I will be starting my PhD in CS at Princeton University, advised by Danqi Chen. Many thanks to all those who have supported me on this journey: my family, friends, and my wonderful mentors Benjamin Van Durme, Marc Marone, and Orion Weller at JHU CLSP.

Mehrdad Farajtabar (@mfarajtabar)'s Twitter Profile Photo

🧵 1/8 The Illusion of Thinking: Are reasoning models like o1/o3, DeepSeek-R1, and Claude 3.7 Sonnet really "thinking"? 🤔 Or are they just throwing more compute towards pattern matching?

The new Large Reasoning Models (LRMs) show promising gains on math and coding benchmarks,
hyunji amy lee (@hyunji_amy_lee)'s Twitter Profile Photo

🚨 Want models to better utilize and ground on the provided knowledge? We introduce Context-INformed Grounding Supervision (CINGS)!

Training LLMs with CINGS significantly boosts grounding abilities in both text and vision-language models compared to standard instruction tuning.
Tianyu Gao (@gaotianyu1350)'s Twitter Profile Photo

Check out our work on fair comparison among KV cache reduction methods and PruLong, one of the most effective, easy-to-use memory reduction methods for long-context LMs!

Guilherme Penedo (@gui_penedo)'s Twitter Profile Photo

We have finally released the 📝paper for 🥂FineWeb2, our large multilingual pre-training dataset.

Along with general (and exhaustive) multilingual work, we introduce a concept that can also improve English performance: deduplication-based upsampling, which we call rehydration.
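One way to read "deduplication-based upsampling" — sketched here as a guess from the description alone, since the paper's actual scheme may differ — is: deduplicate exact copies, then repeat each surviving document in proportion to how often it originally appeared, up to a cap, so that widely duplicated (and thus presumably popular) documents regain weight.

```python
from collections import Counter

# Hedged toy sketch of "rehydration": dedupe, then upsample each unique
# document by its (capped) original duplicate count.
def rehydrate(docs, cap=3):
    counts = Counter(docs)  # duplicates per unique document
    out = []
    for doc, n in counts.items():  # Counter preserves insertion order
        out.extend([doc] * min(n, cap))
    return out

corpus = ["a", "a", "a", "a", "b", "c", "c"]
rehydrate(corpus)  # → ['a', 'a', 'a', 'b', 'c', 'c']
```

Here `"a"` had four copies but is capped at three, while singletons pass through unchanged.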
Jack Jingyu Zhang @ NAACL🌵 (@jackjingyuzhang)'s Twitter Profile Photo

Introducing 𝐉𝐚𝐢𝐥𝐛𝐫𝐞𝐚𝐤 𝐃𝐢𝐬𝐭𝐢𝐥𝐥𝐚𝐭𝐢𝐨𝐧 🧨 (EMNLP '25 Findings)

We propose a generate-then-select pipeline to "distill" effective jailbreak attacks into safety benchmarks, ensuring eval results are reproducible and robust to benchmark saturation & contamination🧵
Orion Weller @ ICLR 2025 (@orionweller)'s Twitter Profile Photo

Instructions/reasoning are now everywhere in retrieval - we want embeddings to do it all! 🚀

But... is it even possible? 🤔

Turns out, it's not possible for single-vector models 😱 theoretically and empirically! To make it obvious we OSS a simple eval SoTA models flop on!

🧵
Orion Weller @ ICLR 2025 (@orionweller)'s Twitter Profile Photo

XLM-R has been SOTA for 6 years for multilingual encoders. That's an eternity in AI 🤯

Time for an upgrade. Introducing mmBERT: 2-4x faster than previous models ⚡ while even beating o3 and Gemini 2.5 Pro 🔥

+ open models & training data - try it now!

How did we do it? 🧵
Adithya Bhaskar (@adithyanlp)'s Twitter Profile Photo

Language models that think, chat better.

We used longCoT (w/ reward model) for RLHF instead of math, and it just works. Llama-3.1-8B-Instruct + 14K ex beats GPT-4o (!) on chat & creative writing, & even Claude-3.7-Sonnet (thinking) on AlpacaEval2 and WildBench!

Read on. 🧵

1/8
Jiacheng Liu (@liujc1998)'s Twitter Profile Photo

Ever wondered what CAN'T be transformed by Transformers? 🪨

I wrote a fun blog post on finding "fixed points" of your LLMs. If you prompt it with a fixed point token, the LLM is gonna decode it repeatedly forever, guaranteed.

There's some connection with LLMs' repetition issue.
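The idea can be illustrated with a toy stand-in for greedy decoding (my own sketch, not the blog post's code): treat the greedy next-token step as a function over the vocabulary. A fixed-point token maps to itself, so once decoding reaches it, the output loops on it forever.

```python
# A "fixed point" of a next-token function f is a token t with f(t) == t.
def find_fixed_points(next_token, vocab):
    return [t for t in vocab if next_token(t) == t]

# Hypothetical transition table standing in for an LLM's greedy step.
table = {"a": "b", "b": "b", "c": "a"}
step = lambda t: table[t]

fixed = find_fixed_points(step, table.keys())  # → ['b']

# Greedy decoding from any start eventually hits and repeats a fixed point.
def decode(start, steps):
    out, t = [], start
    for _ in range(steps):
        t = step(t)
        out.append(t)
    return out

decode("c", 4)  # → ['a', 'b', 'b', 'b']
```

In a real LLM the "function" also depends on the preceding context, which is what makes finding genuine fixed points nontrivial.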