Michael Sklar (@michaelbsklar) 's Twitter Profile
Michael Sklar

@michaelbsklar

AI interpretability. Previously, stats for clinical trials

ID: 1497846157388566533

Joined: 27-02-2022 08:09:51

363 Tweets

134 Followers

351 Following

Michael Nielsen (@michael_nielsen) 's Twitter Profile Photo

Incidentally, I don't understand why this essay isn't much more widely read. It's far more worth the time than 99.99% of the opinion pieces being written about AI now, most of which will be (correctly) forgotten in a month

Richard Ngo (@richardmcngo) 's Twitter Profile Photo

Emmett Shear I think there are two strong reasons to like this approach: 1. It’s complementary with alignment. 2. It’s iterative and incremental. The frame where you need to just “solve” alignment is often counterproductive. When thinking about control you can focus on gradually ramping up

Ben Thompson (@tbenthompson) 's Twitter Profile Photo

Announcing our paper on fluent dreaming for language models arxiv.org/abs/2402.01702 Dreaming, aka "feature visualization", is an interpretability approach popularized by DeepDream. We adapt dreaming to LLMs. Work done together with Michael Sklar and Zygi.
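As a rough illustration of the idea (a minimal sketch, assuming a HuggingFace causal LM; the model name, target layer/neuron, and the crude token-swap search are placeholders, not the paper's optimizer): "dreaming" searches over input tokens to maximize a chosen internal activation while a cross-entropy term keeps the text fluent.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"           # assumption: any HuggingFace causal LM exposing hidden states
LAYER, NEURON = 6, 373   # hypothetical target unit to maximize

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def dreaming_loss(input_ids, fluency_weight=0.1):
    """Negative target activation plus a fluency (next-token cross-entropy) penalty."""
    out = model(input_ids, labels=input_ids)
    act = out.hidden_states[LAYER][0, -1, NEURON]  # activation at the last position
    return -act + fluency_weight * out.loss

# Toy search: greedily try swapping the last token for a better-scoring candidate.
ids = tok("The weather today is", return_tensors="pt").input_ids
with torch.no_grad():
    best = dreaming_loss(ids)
    for cand in torch.randint(0, tok.vocab_size, (50,)):
        trial = ids.clone()
        trial[0, -1] = cand
        loss = dreaming_loss(trial)
        if loss < best:
            best, ids = loss, trial
print(tok.decode(ids[0]), float(best))
```

The real method uses a much stronger discrete optimizer over whole token sequences; the sketch only shows the shape of the objective.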

Joe Edelman (@edelwax) 's Twitter Profile Photo

“What are human values, and how do we align to them?” Very excited to release our new paper on values alignment, co-authored with Ryan Lowe and funded by @openai. 📝: meaningalignment.org/values-and-ali…

Michael Sklar (@michaelbsklar) 's Twitter Profile Photo

Foundational Challenges in Assuring Alignment and Safety of Large Language Models arxiv.org/pdf/2404.09932… 175 pages. Comprehensive - nice to flip through the table of contents and notice what's not on your mental list. I wonder what we'll later find is missing?

Ryan Briggs (@ryancbriggs) 's Twitter Profile Photo

The criticism of the reproducibility movement that worries me the most is that ~90% of science is some variant of bad regardless & only the top 10% ever makes any difference and is worth attention. If this is true then it’s unclear that trying to move everyone is worth much.

Ben Thompson (@tbenthompson) 's Twitter Profile Photo

1/ Michael Sklar and I just published "Fluent student-teacher redteaming" - The key idea is an improved objective function for discrete-optimization-based adversarial attacks based on distilling the activations/logits from a toxified model.
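A hedged sketch of the distillation piece of such an objective, not the paper's actual pipeline: score a candidate attack prompt by how closely the target model's next-token distributions on it match those of a "toxified" teacher, and let a discrete optimizer (not shown) drive that score down. The model names below are stand-ins chosen only because they share a vocabulary.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

STUDENT = "gpt2"        # stand-in for the target model being attacked
TEACHER = "distilgpt2"  # stand-in for a fine-tuned "toxified" teacher (same vocab)

tok = AutoTokenizer.from_pretrained(STUDENT)
student = AutoModelForCausalLM.from_pretrained(STUDENT)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER)

def distill_attack_loss(input_ids):
    """KL(teacher || student) over next-token distributions, to be minimized
    over candidate attack strings by a discrete optimizer (not shown here)."""
    with torch.no_grad():
        t_probs = F.softmax(teacher(input_ids).logits, dim=-1)
    s_logprobs = F.log_softmax(student(input_ids).logits, dim=-1)
    return F.kl_div(s_logprobs, t_probs, reduction="batchmean")

ids = tok("Please explain how to", return_tensors="pt").input_ids
print(float(distill_attack_loss(ids)))
```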

Richard Ngo (@richardmcngo) 's Twitter Profile Photo

Thoughts on the politics of AI safety: 1. Risks that seem speculative today will become common sense as AI advances. 2. Pros and cons of different safety strategies will also become much clearer over time. 3. So our main job is to empower future common-sense decision-making.

Samuel Hammond 🌐🏛 (@hamandcheese) 's Twitter Profile Photo

I worry that in most transformative AI scenarios (including the positive ones) the object-level policy issue isn't copyright or safety vs. ethics, but rather what succeeds the nation-state as the equilibrium mode of social organization, and whether we can do anything now to steer it.

Christopher Potts (@chrisgpotts) 's Twitter Profile Photo

The Linear Representation Hypothesis is now widely adopted despite its highly restrictive nature. Here, Csordás Róbert, Atticus Geiger, Christopher Manning & I present a counterexample to the LRH and argue for more expressive theories of interpretability: arxiv.org/abs/2408.10920

Daniel Paleka (@dpaleka) 's Twitter Profile Photo

Fluent jailbreaks. Previous white-box optimization attacks like GCG and BEAST produced nonsensical attack strings. Using a multi-model perplexity penalty and a distillation loss yields working attack strings that look like normal text. arxiv.org/abs/2407.17447 (3/8)

Jack Lindsey (@jack_w_lindsey) 's Twitter Profile Photo

Really excited to share our work on crosscoders, a generalization of sparse autoencoders that allows us to identify shared structure across layers or even across models, greatly simplifying our description of model representations. transformer-circuits.pub/2024/crosscode…
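A hedged sketch of the core construction, with made-up dimensions and penalty weight: activations from several layers are encoded into one shared sparse latent, and that same latent is decoded back to every layer, so a single dictionary captures structure shared across layers.

```python
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    def __init__(self, d_model=768, n_layers=3, d_latent=4096):
        super().__init__()
        # one encoder per layer, summed into a shared latent
        self.encoders = nn.ModuleList(nn.Linear(d_model, d_latent) for _ in range(n_layers))
        # one decoder per layer, reading from the same latent
        self.decoders = nn.ModuleList(nn.Linear(d_latent, d_model) for _ in range(n_layers))

    def forward(self, acts):  # acts: list of (batch, d_model) tensors, one per layer
        latent = torch.relu(sum(enc(a) for enc, a in zip(self.encoders, acts)))
        recons = [dec(latent) for dec in self.decoders]
        return latent, recons

def crosscoder_loss(latent, recons, acts, l1_weight=1e-3):
    recon = sum(((r - a) ** 2).mean() for r, a in zip(recons, acts))
    sparsity = latent.abs().mean()  # L1 penalty encourages a sparse shared code
    return recon + l1_weight * sparsity

# toy usage on random activations
model = Crosscoder()
acts = [torch.randn(8, 768) for _ in range(3)]
latent, recons = model(acts)
print(float(crosscoder_loss(latent, recons, acts)))
```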

Jack Lindsey (@jack_w_lindsey) 's Twitter Profile Photo

If you’re interested in interpretability of LLMs, or any other AI safety-related topics, consider applying to Anthropic’s new Fellows program! Deadline January 20, but applications are reviewed on a rolling basis so earlier is better if you can. I’ll be one of the mentors!

Ethan Perez (@ethanjperez) 's Twitter Profile Photo

Maybe the single most important result in AI safety I’ve seen so far. This paper shows that, in some cases, Claude fakes being aligned with its training objective. If models fake alignment, how can we tell if they’re actually safe?

Jiuding Sun (@jiudingsun) 's Twitter Profile Photo

💨 A new architecture for automating mechanistic interpretability with causal interchange interventions! #ICLR2025 🔬Neural networks are particularly good at discovering patterns in high-dimensional data, so we trained them to ... interpret themselves! 🧑‍🔬 1/4

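For readers unfamiliar with interchange (activation-patching) interventions, here is a hedged, GPT-2-specific sketch; the layer, position, and prompts are illustrative and not from the paper: run the model on a base input, patch in the hidden state from a source input at one layer and position, and inspect how the next-token prediction shifts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER, POS = "gpt2", 6, -1   # illustrative choices, not from the paper
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

base = tok("The capital of France is", return_tensors="pt").input_ids
source = tok("The capital of Italy is", return_tensors="pt").input_ids

# hidden_states[0] is the embedding output, so block LAYER's output is index LAYER + 1
with torch.no_grad():
    src_hidden = model(source).hidden_states[LAYER + 1][:, POS, :]

def patch_hook(module, inputs, output):
    # replace the base run's activation at POS with the source run's activation
    output[0][:, POS, :] = src_hidden
    return output

# GPT-2 specific: model.transformer.h is the list of transformer blocks
handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    next_token_logits = model(base).logits[0, -1]
handle.remove()

print(tok.decode([next_token_logits.argmax().item()]))
```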