Michael Hahn (@mhahn29)'s Twitter Profile
Michael Hahn

@mhahn29

Professor at Saarland University
@LstSaar @SIC_Saar. Previously PhD at Stanford @stanfordnlp. Machine learning, language, and cognitive science.

ID: 609965358

Website: https://lacoco-lab.github.io/home/ | Joined: 16-06-2012 10:35:33

174 Tweets

975 Followers

763 Following

Julien Siems (@julien_siems)'s Twitter Profile Photo

1/9 There is a fundamental tradeoff between parallelizability and expressivity of Large Language Models. We propose a new linear RNN architecture, DeltaProduct, that can effectively navigate this tradeoff. Here's how!

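For intuition, here is a rough NumPy sketch of the delta-rule state update that DeltaProduct builds on. As the thread describes, DeltaProduct applies several such rank-1 (generalized Householder) updates per token, so the effective state transition is a product of simple matrices; the names and shapes below are my own simplification, not the paper's code.

```python
import numpy as np

def delta_step(S, k, v, beta):
    """One delta-rule update: S <- S(I - beta k k^T) + beta v k^T.
    S is a (d_v, d_k) state matrix, k a key, v a value, and beta in
    [0, 1] controls how strongly the old association for k is
    overwritten (the DeltaNet-style update DeltaProduct generalizes).
    """
    return S - beta * np.outer(S @ k, k) + beta * np.outer(v, k)

def deltaproduct_token(S, ks, vs, betas):
    """Apply several delta steps for one token (the 'product' part):
    more steps per token give more expressive transitions at the
    cost of extra sequential work within the token."""
    for k, v, beta in zip(ks, vs, betas):
        S = delta_step(S, k, v, beta)
    return S
```

With one step per token this reduces to a DeltaNet-style recurrence; increasing the number of steps per token is the knob that trades parallelizability for expressivity.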
William Merrill (@lambdaviking)'s Twitter Profile Photo

Padding a transformer’s input with blank tokens (...) is a simple form of test-time compute. Can it increase the computational power of LLMs? 👀 New work with Ashish Sabharwal addresses this with *exact characterizations* of the expressive power of transformers with padding 🧵

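For a concrete picture of padding as test-time compute, here is a minimal sketch: the prompt is extended with contentless blank tokens before the model runs, so a fixed-depth transformer gets more positions, and hence more parallel computation, without receiving any new information. The pad id and token ids are invented for illustration.

```python
PAD_ID = 0  # hypothetical id for the blank "..." token

def pad_input(token_ids, n_pad):
    """Append n_pad blank tokens; the model reads the padded sequence
    and the answer is read off the final position as usual."""
    return token_ids + [PAD_ID] * n_pad

prompt = [17, 42, 7]           # hypothetical token ids
padded = pad_input(prompt, 8)  # [17, 42, 7, 0, 0, 0, 0, 0, 0, 0, 0]
```

Unlike chain-of-thought, the padded positions carry no content, so any gain in power must come purely from the extra computation they permit.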
Sasha Boguraev (@sashaboguraev)'s Twitter Profile Photo

A key hypothesis in the history of linguistics is that different constructions share underlying structure. We take advantage of recent advances in mechanistic interpretability to test this hypothesis in Language Models. New work with Kyle Mahowald and Christopher Potts! 🧵👇

Charlie London (@charlielondon02)'s Twitter Profile Photo

New preprint with my supervisor, Varun! We show that padding the input of a Transformer with blank "pause" tokens strictly increases expressivity (in the finite-precision case), enabling it to compute everything in AC0.

Aryaman Arora (@aryaman2020)'s Twitter Profile Photo

new paper! 🫡 why are state space models (SSMs) worse than Transformers at recall over their context? this is a question about the mechanisms underlying model behaviour; therefore, we propose using mechanistic evaluations to answer it!

Manuel Gomez-Rodriguez (@autreche)'s Twitter Profile Photo

Is your LLM overcharging you?! In our new paper arxiv.org/abs/2505.21627, we show that pay-per-token creates an incentive for LLM providers to misreport the (number of) tokens an LLM used to generate an output, and users cannot know whether a provider is overcharging them (1/n)
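
A toy example of why the user cannot audit the bill from the output alone: with a subword vocabulary, many different token sequences detokenize to the same string, so a reported token count cannot be checked against the text you receive. The vocabulary here is invented for illustration.

```python
# Two tokenizations of the same string under a toy vocabulary.
vocab = {1: "hel", 2: "lo", 3: "h", 4: "e", 5: "l", 6: "o"}

def detokenize(ids):
    return "".join(vocab[i] for i in ids)

honest = [1, 2]             # "hello" -> billed as 2 tokens
inflated = [3, 4, 5, 5, 6]  # "hello" -> billed as 5 tokens

assert detokenize(honest) == detokenize(inflated) == "hello"
```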

Michael Hanna (@michaelwhanna)'s Twitter Profile Photo

Mateusz and I are excited to announce circuit-tracer, a library that makes circuit-finding simple! Just type in a sentence, and get out a circuit showing (some of) the features your model uses to predict the next token. Try it on neuronpedia: shorturl.at/SUX2A

<a href="/mntssys/">Mateusz</a> and I are excited to announce circuit-tracer, a library that makes circuit-finding simple!

Just type in a sentence, and get out a circuit showing (some of) the features your model uses to predict the next token. Try it on <a href="/neuronpedia/">neuronpedia</a>: shorturl.at/SUX2A
Zixuan Wang (@zzzixuanwang)'s Twitter Profile Photo

LLMs can solve complex tasks that require combining multiple reasoning steps. But when are such capabilities learnable via gradient-based training? In our new COLT 2025 paper, we show that easy-to-hard data is necessary and sufficient! arxiv.org/abs/2505.23683 🧵 below (1/10)

Aaditya Singh (@aaditya6284)'s Twitter Profile Photo

Was super fun to be a part of this work! Felt very satisfying to bring the theory work on ICL with linear attention a bit closer to practice (with multi-headed low rank attention), and of course, add a focus on dynamics. Thread 🧵 with some extra highlights

Yuekun Yao (@yuekun_yao)'s Twitter Profile Photo

Can language models learn implicit reasoning without chain-of-thought? Our new paper shows: Yes, LMs can learn k-hop reasoning; however, it comes at the cost of an exponential increase in training data and linear growth in model depth as k increases. arxiv.org/pdf/2505.17923

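For a concrete sense of the task family, here is a generic k-hop setup (not necessarily the paper's exact format): the model must compose k relational lookups in its forward pass without writing out intermediate steps, and the reported cost is training data exponential in k and depth linear in k.

```python
import random

def make_khop_example(entities, relations, k, rng):
    """Build one k-hop query: a start entity plus k relation names;
    the label is the entity reached by following the k facts in
    order. A generic multi-hop setup for illustration only."""
    facts = {(e, r): rng.choice(entities)
             for e in entities for r in relations}
    start = rng.choice(entities)
    hops = [rng.choice(relations) for _ in range(k)]
    answer = start
    for r in hops:
        answer = facts[(answer, r)]
    return facts, (start, hops), answer

rng = random.Random(0)
facts, query, answer = make_khop_example(list("ABCDE"), ["r1", "r2"], 3, rng)
```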
Songlin Yang (@songlinyang4)'s Twitter Profile Photo

Check out log-linear attention—our latest approach to overcoming the fundamental limitation of RNNs’ constant state size, while preserving subquadratic time and space complexity
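
As a cartoon of how a state can grow logarithmically rather than staying constant, one classic trick is Fenwick-tree-style bucketing: keep one summary per dyadic segment and merge equal-sized buckets, binary-counter style. This is my own toy bookkeeping sketch of that idea, not the paper's algorithm.

```python
import numpy as np

def update_buckets(buckets, k, v):
    """Maintain segment summaries as (size, S) pairs, where S sums
    outer(v, k) over the segment. Equal-sized buckets merge, so after
    T tokens at most about log2(T) + 1 buckets remain."""
    buckets.append((1, np.outer(v, k)))
    while len(buckets) >= 2 and buckets[-1][0] == buckets[-2][0]:
        (n2, s2), (n1, s1) = buckets.pop(), buckets.pop()
        buckets.append((n1 + n2, s1 + s2))
    return buckets

def read(buckets, q):
    """Query every bucket summary: O(log T) work per token, sitting
    between a constant-size RNN state and full quadratic attention."""
    return sum(S @ q for _, S in buckets)
```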

Taiga Someya (@agiats_football)'s Twitter Profile Photo

📝 Our #ACL2025 paper is now on arXiv! "Information Locality as an Inductive Bias for Neural Language Models". We quantify how a language's local predictability affects its learnability by neural LMs, using our metric, m-local entropy. Paper: arxiv.org/abs/2506.05136
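
Assuming m-local entropy means the conditional entropy of a symbol given its previous m symbols (the paper's exact definition may differ), a count-based estimator looks roughly like this:

```python
from collections import Counter
from math import log2

def m_local_entropy(corpus, m):
    """Estimate H(x_t | x_{t-m}, ..., x_{t-1}) from n-gram counts;
    lower values mean symbols are more predictable from local
    context. One plausible reading of the metric, for illustration."""
    ctx, joint = Counter(), Counter()
    for seq in corpus:
        for t in range(m, len(seq)):
            c = tuple(seq[t - m:t])
            ctx[c] += 1
            joint[(c, seq[t])] += 1
    total = sum(joint.values())
    return -sum(n / total * log2(n / ctx[c])
                for (c, x), n in joint.items())

print(m_local_entropy(["abab", "abba"], m=2))  # 0.5 bits
```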

William Merrill (@lambdaviking)'s Twitter Profile Photo

A fun project with really thorough analysis of how LLMs try and often fail to implement parsing algorithms. Bonus: find out what this all has to do with the Kalamang language from New Guinea

Mark Rofin (@broccolitwit)'s Twitter Profile Photo

In Transformer theory research, we often use tiny models and toy tasks. A straightforward criticism is that this setting is far from the giant real-world LLMs. Does this mean that the theoretical insights don’t transfer to them? Check out the new cool work investigating that! 👇

Morris Yau (@morrisyau)'s Twitter Profile Photo

Transformers: ⚡️fast to train (compute-bound), 🐌slow to decode (memory-bound). Can Transformers be optimal in both? Yes! By exploiting sequential-parallel duality. We introduce Transformer-PSM with constant time per token decode. 🧐 arxiv.org/pdf/2506.10918

Geoffrey Irving (@geoffreyirving)'s Twitter Profile Photo

New alignment theory paper! We present a new scalable oversight protocol (prover-estimator debate) and a proof that honesty is incentivised at equilibrium (with large assumptions, see 🧵), even when the AIs involved have similar available compute.

Tal Linzen (@tallinzen)'s Twitter Profile Photo

I'm hiring at least one post-doc! We're interested in creating language models that process language more like humans than mainstream LLMs do, through architectural modifications and interpretability-style steering.