Ben Wu @ICLR (@benwu_ml)'s Twitter Profile
Ben Wu @ICLR

@benwu_ml

PhD Student @SheffieldNLP Prev @Cambridge_Uni
Mechanistic Interpretability and LLM uncertainty

ID: 1613198311510446080

Link: https://bpwu1.github.io/ | Joined: 11-01-2023 15:37:08

11 Tweets

95 Followers

121 Following

Sheffield NLP (@sheffieldnlp):

📢 We're happy to announce that our group has 12 papers accepted to #EMNLP2023 (6 main, 6 findings) 🧵⬇️ Congratulations to all our members and collaborators! 🥳 #NLProc

Alessandro Stolfo (@alesstolfo):

New paper w/ Ben Wu @ICLR and Neel Nanda!
LLMs don’t just output the next token, they also output confidence. How is this computed?
We find two key neuron families: entropy neurons exploit final LN scale to change entropy, and token freq neurons boost logits proportional to freq 🧵
Neel Nanda (@neelnanda5):

Our paper on individual neurons that regulate an LLM's confidence was accepted to NeurIPS! Great work by Alessandro Stolfo and Ben Wu @ICLR. Check it out if you want to learn about wild mechanisms that exploit LayerNorm's non-linearity and the null space of the unembedding, productively!
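The null-space trick is simple enough to demonstrate in isolation. Below is a minimal numpy sketch of the claimed mechanism, under assumed toy shapes (it is not the paper's code): writing along a direction in the unembedding's null space leaves the relative logits untouched, but inflates the residual norm, so an RMS-style final norm shrinks the logit-carrying component and entropy rises.

```python
# Hedged toy sketch (not the paper's code) of the entropy-neuron mechanism.
import numpy as np

rng = np.random.default_rng(0)
d_vocab, d_model = 32, 64
W_U = rng.standard_normal((d_vocab, d_model))   # unembedding: rank 32, so a 32-dim null space
null_dir = np.linalg.svd(W_U)[2][-1]            # a right singular vector in that null space

def next_token_entropy(resid):
    # RMS-normalise as a stand-in for the final LayerNorm scale, then unembed.
    x = resid / np.sqrt((resid ** 2).mean())
    logits = W_U @ x
    z = logits - logits.max()
    log_p = z - np.log(np.exp(z).sum())          # numerically stable log-softmax
    p = np.exp(log_p)
    return float(-(p * log_p).sum())

resid = W_U.T @ rng.standard_normal(d_vocab)     # residual with no null-space component
for alpha in (0.0, 5.0, 20.0):
    # The null write never changes the logit ordering, only the distribution's entropy.
    print(alpha, next_token_entropy(resid + alpha * np.linalg.norm(resid) * null_dir))
```

With alpha = 0 the distribution is sharply peaked; as the null-space write grows, entropy climbs toward log(d_vocab) even though W_U never "sees" the added direction.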

Marius Hobbhahn (@mariushobbhahn):

This paper on the statistics of evals is great (and seems to be flying under the radar): arxiv.org/abs/2411.00640… The author basically shows all the relevant statistical tools needed for evals, e.g. how to compute the right error bars, how to compare model performance, and how…
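For a flavour of the recipes involved, here is a hedged sketch on synthetic data (not the author's code): a CLT-based 95% interval for a single eval score, and a paired per-question comparison of two models, which is far tighter than comparing two independent intervals when the models' answers are correlated.

```python
# Hedged sketch of standard eval statistics on a hypothetical 500-question eval.
import numpy as np

rng = np.random.default_rng(0)
n = 500
latent = rng.random(n)                                    # shared per-question difficulty
model_a = (latent + 0.05 * rng.standard_normal(n)) < 0.72  # per-question correctness (bool)
model_b = (latent + 0.05 * rng.standard_normal(n)) < 0.70

def mean_and_stderr(scores):
    return scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores))

m_a, se_a = mean_and_stderr(model_a.astype(float))
print(f"model A: {m_a:.3f} ± {1.96 * se_a:.3f} (95% CI)")

# Paired comparison: per-question score differences share the difficulty noise,
# so the interval on the gap is much tighter than two independent intervals.
diff = model_a.astype(float) - model_b.astype(float)
m_d, se_d = mean_and_stderr(diff)
print(f"A - B:   {m_d:.3f} ± {1.96 * se_d:.3f} (95% CI, paired)")
```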

Neel Nanda (@neelnanda5):

NeurIPS has an overwhelming amount of papers, so I made myself a hacky spreadsheet of all (well, most) of the interpretability papers - sharing in case others find it useful!

It's definitely got false negatives and positives, but hopefully is better than baseline.
Alex Pan (@aypan_17):

LLMs have behaviors, beliefs, and reasoning hidden in their activations. What if we could decode them into natural language?

We introduce LatentQA: a new way to interact with the inner workings of AI systems. 🧵
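As a rough illustration of the idea (the toy model and the decoder_answer stub below are hypothetical stand-ins, not the LatentQA implementation), one half of such a pipeline is just capturing hidden activations, e.g. with a PyTorch forward hook, and handing them to a decoder that answers questions about them:

```python
# Hedged sketch: capture hidden activations with a forward hook, then pass them
# to a (stubbed) decoder. Only the hook mechanics here are standard PyTorch.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
captured = {}

def hook(module, inputs, output):
    captured["acts"] = output.detach()       # hidden activations to be decoded

model[1].register_forward_hook(hook)         # hook the intermediate layer
_ = model(torch.randn(1, 16))

def decoder_answer(acts: torch.Tensor, question: str) -> str:
    # Stub: in a LatentQA-style system this would be a decoder LLM trained to
    # answer natural-language questions about activations patched into it.
    return f"(decoder sees {tuple(acts.shape)} activations for: {question!r})"

print(decoder_answer(captured["acts"], "What belief is the model representing?"))
```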
Samuel Marks (@saprmarks):

What can AI researchers do *today* that AI developers will find useful for ensuring the safety of future advanced AI systems? To ring in the new year, the Anthropic Alignment Science team is sharing some thoughts on research directions we think are important.
Ai2 (@allen_ai):

Meet Ai2 Paper Finder, an LLM-powered literature search system.

Searching for relevant work is a multi-step process that requires iteration. Paper Finder mimics this workflow — and helps researchers find more papers than ever 🔍
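A hedged toy sketch of that workflow (the corpus and the refine_query heuristic are made up for illustration; the real system presumably uses LLM calls at each step): retrieve, inspect the hits, refine the query, repeat.

```python
# Toy iterative search loop: query -> retrieve -> refine -> retry.
corpus = {
    "Entropy neurons in language models": "confidence calibration neurons",
    "Sparse autoencoders for interpretability": "SAE features dictionary learning",
    "Scaling laws for neural language models": "compute loss power laws",
}

def retrieve(query: str) -> list[str]:
    # Naive keyword overlap as a stand-in for a real retriever.
    terms = set(query.lower().split())
    return [title for title, abstract in corpus.items()
            if terms & set((title + " " + abstract).lower().split())]

def refine_query(query: str, hits: list[str]) -> str:
    # Stand-in for an LLM step that broadens the query when nothing comes back.
    return query + " interpretability" if not hits else query

query, seen = "probing circuits", set()
for step in range(3):                        # bounded multi-step iteration
    hits = retrieve(query)
    seen.update(hits)
    query = refine_query(query, hits)
print(sorted(seen))                          # the refinement step rescues the empty first pass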
lily (xiaoqing) (@lilysun004):

1/9: Dense SAE Latents Are Features💡, Not Bugs🐛❌! In our new paper, we examine dense (i.e. very frequently occurring) SAE latents. We find that dense latents are structured and meaningful, representing truly dense model signals.🧵
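The basic measurement behind the thread is easy to sketch. A hedged numpy example with assumed shapes and an illustrative 50% density threshold (not the paper's code): run an SAE encoder over activations and compute each latent's firing frequency, then flag the dense ones.

```python
# Hedged sketch: measure SAE latent density (firing frequency) on toy data.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_sae = 10_000, 64, 256
acts = rng.standard_normal((n_tokens, d_model))            # stand-in model activations
W_enc = rng.standard_normal((d_model, d_sae)) / np.sqrt(d_model)
b_enc = rng.standard_normal(d_sae) * 0.5

latents = np.maximum(acts @ W_enc + b_enc, 0.0)            # ReLU SAE encoder
freq = (latents > 0).mean(axis=0)                          # fraction of tokens each latent fires on

dense = np.flatnonzero(freq > 0.5)                         # illustrative density cutoff
print(f"{dense.size} of {d_sae} latents fire on >50% of tokens")
```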