
Michael Sklar
@michaelbsklar
AI interpretability. Previously, stats for clinical trials
ID: 1497846157388566533
27-02-2022 08:09:51
363 Tweets
134 Followers
351 Following

Emmett Shear I think there are two strong reasons to like this approach: 1. It’s complementary to alignment. 2. It’s iterative and incremental. The frame where you need to just “solve” alignment is often counterproductive. When thinking about control, you can focus on gradually ramping up

1/ Michael Sklar and I just published "Fluent student-teacher redteaming" - The key idea is an improved objective function for discrete-optimization-based adversarial attacks: it distills the activations/logits from a toxified model.
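
For intuition, here is a minimal PyTorch-style sketch of such a distillation objective. The function name, tensor shapes, and loss weights are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def attack_objective(victim_logits, teacher_logits,
                     victim_hidden, teacher_hidden,
                     prompt_logprobs,
                     w_logit=1.0, w_act=1.0, w_fluency=0.1):
    """Loss to minimize over candidate adversarial prompts (names and
    weights are hypothetical, for illustration only).

    victim_logits / teacher_logits: (seq, vocab) logits on the target span,
        from the victim model and the toxified teacher respectively.
    victim_hidden / teacher_hidden: activations at matched layers/positions.
    prompt_logprobs: per-token log-probs of the adversarial prompt under
        the victim model, used as a fluency regularizer.
    """
    # Logit distillation: pull the victim's output distribution toward
    # the toxified teacher's distribution on the target tokens.
    kl = F.kl_div(
        F.log_softmax(victim_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    # Activation distillation: match intermediate hidden states.
    act = F.mse_loss(victim_hidden, teacher_hidden)
    # Fluency term: penalize prompts the victim model finds improbable,
    # steering the discrete optimizer toward natural-language attacks.
    fluency = -prompt_logprobs.mean()
    return w_logit * kl + w_act * act + w_fluency * fluency
```

A discrete optimizer (e.g. a GCG-style token search) would then minimize this loss over candidate prompt tokens; the fluency weight trades off attack strength against how natural the prompt reads.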

The Linear Representation Hypothesis is now widely adopted despite its highly restrictive nature. Here, Csordás Róbert, Atticus Geiger, Christopher Manning & I present a counterexample to the LRH and argue for more expressive theories of interpretability: arxiv.org/abs/2408.10920
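
For context, the LRH asserts that a concept is encoded as a fixed direction in activation space, so its value can be read out (or steered) linearly. A toy sketch of that claim; the direction and shapes here are hypothetical, not from the paper:

```python
import torch

d_model = 64
# Hypothetical fixed "feature direction" -- the LRH posits one exists
# per concept; this is not a quantity from the paper.
feature_direction = torch.randn(d_model)

def read_feature(hidden_states: torch.Tensor) -> torch.Tensor:
    # Under the LRH, a concept's value is recovered by projecting the
    # hidden state onto a single linear direction.
    return hidden_states @ feature_direction

def steer(hidden_states: torch.Tensor, alpha: float) -> torch.Tensor:
    # ...and the concept can be strengthened or suppressed by adding a
    # scaled copy of that same direction.
    return hidden_states + alpha * feature_direction
```

A counterexample to the LRH is, by definition, a case where no such fixed direction suffices, which is why the authors argue for more expressive theories.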
