Ben Wu @ICLR (@benwu_ml)'s Twitter Profile
Ben Wu @ICLR

@benwu_ml

PhD Student @SheffieldNLP Prev @Cambridge_Uni
Mechanistic Interpretability and LLM uncertainty

ID: 1613198311510446080

Link: https://bpwu1.github.io/ | Joined: 11-01-2023 15:37:08

11 Tweets

95 Followers

121 Following

Sheffield NLP (@sheffieldnlp):

📢 We're happy to announce that our group has 12 papers accepted to #EMNLP2023 (6 main, 6 findings) 🧵⬇️ Congratulations to all our members and collaborators! 🥳 #NLProc

Alessandro Stolfo (@alesstolfo):

New paper w/ Ben Wu @ICLR and Neel Nanda!
LLMs don’t just output the next token, they also output confidence. How is this computed?
We find two key neuron families: entropy neurons exploit final LN scale to change entropy, and token freq neurons boost logits proportional to freq 🧵
Neel Nanda (@neelnanda5):

Our paper on individual neurons that regulate an LLM's confidence was accepted to NeurIPS! Great work by Alessandro Stolfo and Ben Wu @ICLR. Check it out if you want to learn about wild mechanisms that exploit LayerNorm's non-linearity and the null space of the unembedding, productively!
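The null-space trick is simple enough to demonstrate in isolation. Below is a minimal numpy sketch of the claimed mechanism, under assumed toy shapes (it is not the paper's code): writing along a direction in the unembedding's null space leaves the relative logits untouched, but inflates the residual norm, so an RMS-style final norm shrinks the logit-carrying component and entropy rises.

```python
# Hedged toy sketch (not the paper's code) of the entropy-neuron mechanism.
import numpy as np

rng = np.random.default_rng(0)
d_vocab, d_model = 32, 64
W_U = rng.standard_normal((d_vocab, d_model))   # unembedding: rank 32, so a 32-dim null space
null_dir = np.linalg.svd(W_U)[2][-1]            # a right singular vector in that null space

def next_token_entropy(resid):
    # RMS-normalise as a stand-in for the final LayerNorm scale, then unembed.
    x = resid / np.sqrt((resid ** 2).mean())
    logits = W_U @ x
    z = logits - logits.max()
    log_p = z - np.log(np.exp(z).sum())          # numerically stable log-softmax
    p = np.exp(log_p)
    return float(-(p * log_p).sum())

resid = W_U.T @ rng.standard_normal(d_vocab)     # residual with no null-space component
for alpha in (0.0, 5.0, 20.0):
    # The null write never changes the logit ordering, only the distribution's entropy.
    print(alpha, next_token_entropy(resid + alpha * np.linalg.norm(resid) * null_dir))
```

With alpha = 0 the distribution is sharply peaked; as the null-space write grows, entropy climbs toward log(d_vocab) even though W_U never "sees" the added direction.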

Marius Hobbhahn (@mariushobbhahn):

This paper on the statistics of evals is great (and seems to be flying under the radar): arxiv.org/abs/2411.00640… The author basically shows all the relevant statistical tools needed for evals, e.g. how to compute the right error bars, how to compare model performance, and how…
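For a flavour of the recipes involved, here is a hedged sketch on synthetic data (not the author's code): a CLT-based 95% interval for a single eval score, and a paired per-question comparison of two models, which is far tighter than comparing two independent intervals when the models' answers are correlated.

```python
# Hedged sketch of standard eval statistics on a hypothetical 500-question eval.
import numpy as np

rng = np.random.default_rng(0)
n = 500
latent = rng.random(n)                                    # shared per-question difficulty
model_a = (latent + 0.05 * rng.standard_normal(n)) < 0.72  # per-question correctness (bool)
model_b = (latent + 0.05 * rng.standard_normal(n)) < 0.70

def mean_and_stderr(scores):
    return scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores))

m_a, se_a = mean_and_stderr(model_a.astype(float))
print(f"model A: {m_a:.3f} ± {1.96 * se_a:.3f} (95% CI)")

# Paired comparison: per-question score differences share the difficulty noise,
# so the interval on the gap is much tighter than two independent intervals.
diff = model_a.astype(float) - model_b.astype(float)
m_d, se_d = mean_and_stderr(diff)
print(f"A - B:   {m_d:.3f} ± {1.96 * se_d:.3f} (95% CI, paired)")
```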

Neel Nanda (@neelnanda5):

NeurIPS has an overwhelming amount of papers, so I made myself a hacky spreadsheet of all (well, most) of the interpretability papers - sharing in case others find it useful!

It's definitely got false negatives and positives, but hopefully is better than baseline.
Alex Pan (@aypan_17):

LLMs have behaviors, beliefs, and reasoning hidden in their activations. What if we could decode them into natural language?

We introduce LatentQA: a new way to interact with the inner workings of AI systems. 🧵
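As a rough illustration of the idea (the toy model and the decoder_answer stub below are hypothetical stand-ins, not the LatentQA implementation), one half of such a pipeline is just capturing hidden activations, e.g. with a PyTorch forward hook, and handing them to a decoder that answers questions about them:

```python
# Hedged sketch: capture hidden activations with a forward hook, then pass them
# to a (stubbed) decoder. Only the hook mechanics here are standard PyTorch.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
captured = {}

def hook(module, inputs, output):
    captured["acts"] = output.detach()       # hidden activations to be decoded

model[1].register_forward_hook(hook)         # hook the intermediate layer
_ = model(torch.randn(1, 16))

def decoder_answer(acts: torch.Tensor, question: str) -> str:
    # Stub: in a LatentQA-style system this would be a decoder LLM trained to
    # answer natural-language questions about activations patched into it.
    return f"(decoder sees {tuple(acts.shape)} activations for: {question!r})"

print(decoder_answer(captured["acts"], "What belief is the model representing?"))
```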
Samuel Marks (@saprmarks):

What can AI researchers do *today* that AI developers will find useful for ensuring the safety of future advanced AI systems? To ring in the new year, the Anthropic Alignment Science team is sharing some thoughts on research directions we think are important.
Ai2 (@allen_ai):

Meet Ai2 Paper Finder, an LLM-powered literature search system.

Searching for relevant work is a multi-step process that requires iteration. Paper Finder mimics this workflow — and helps researchers find more papers than ever 🔍
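A hedged toy sketch of that workflow (the corpus and the refine_query heuristic are made up for illustration; the real system presumably uses LLM calls at each step): retrieve, inspect the hits, refine the query, repeat.

```python
# Toy iterative search loop: query -> retrieve -> refine -> retry.
corpus = {
    "Entropy neurons in language models": "confidence calibration neurons",
    "Sparse autoencoders for interpretability": "SAE features dictionary learning",
    "Scaling laws for neural language models": "compute loss power laws",
}

def retrieve(query: str) -> list[str]:
    # Naive keyword overlap as a stand-in for a real retriever.
    terms = set(query.lower().split())
    return [title for title, abstract in corpus.items()
            if terms & set((title + " " + abstract).lower().split())]

def refine_query(query: str, hits: list[str]) -> str:
    # Stand-in for an LLM step that broadens the query when nothing comes back.
    return query + " interpretability" if not hits else query

query, seen = "probing circuits", set()
for step in range(3):                        # bounded multi-step iteration
    hits = retrieve(query)
    seen.update(hits)
    query = refine_query(query, hits)
print(sorted(seen))                          # the refinement step rescues the empty first pass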
lily (xiaoqing) (@lilysun004):

1/9: Dense SAE Latents Are Features💡, Not Bugs🐛❌! In our new paper, we examine dense (i.e. very frequently occurring) SAE latents. We find that dense latents are structured and meaningful, representing truly dense model signals.🧵
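The basic measurement behind the thread is easy to sketch. A hedged numpy example with assumed shapes and an illustrative 50% density threshold (not the paper's code): run an SAE encoder over activations and compute each latent's firing frequency, then flag the dense ones.

```python
# Hedged sketch: measure SAE latent density (firing frequency) on toy data.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_sae = 10_000, 64, 256
acts = rng.standard_normal((n_tokens, d_model))            # stand-in model activations
W_enc = rng.standard_normal((d_model, d_sae)) / np.sqrt(d_model)
b_enc = rng.standard_normal(d_sae) * 0.5

latents = np.maximum(acts @ W_enc + b_enc, 0.0)            # ReLU SAE encoder
freq = (latents > 0).mean(axis=0)                          # fraction of tokens each latent fires on

dense = np.flatnonzero(freq > 0.5)                         # illustrative density cutoff
print(f"{dense.size} of {d_sae} latents fire on >50% of tokens")
```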