Chatbots have biases in what they say—but what about biases in what they WON'T say? Our new paper (w/ Victoria Li & Yida Chen) shows that personal info like a user's race, age, or love for the Los Angeles Chargers can sway whether ChatGPT refuses a request. arxiv.org/abs/2407.06866
🚨 New preprint! 🚨
Everyone loves causal interpretability. It’s coherently defined! It makes testable predictions about mechanistic interventions! But what if we had a different objective: predicting model behavior not under mechanistic interventions, but on unseen input data?