Michael Sklar (@michaelbsklar) 's Twitter Profile
Michael Sklar

@michaelbsklar

AI interpretability. Previously, stats for clinical trials

ID: 1497846157388566533

Joined: 27-02-2022 08:09:51

363 Tweets

134 Followers

351 Following

Michael Nielsen (@michael_nielsen) 's Twitter Profile Photo

Incidentally, I don't understand why this essay isn't much more widely read. It's far more worth the time than 99.99% of the opinion pieces being written about AI now, most of which will be (correctly) forgotten in a month

Richard Ngo (@richardmcngo) 's Twitter Profile Photo

Emmett Shear I think there are two strong reasons to like this approach: 1. It’s complementary with alignment. 2. It’s iterative and incremental. The frame where you need to just “solve” alignment is often counterproductive. When thinking about control you can focus on gradually ramping up

Ben Thompson (@tbenthompson) 's Twitter Profile Photo

Announcing our paper on fluent dreaming for language models arxiv.org/abs/2402.01702 Dreaming, aka "feature visualization", is an interpretability approach popularized by DeepDream. We adapt dreaming to LLMs. Work done together with Michael Sklar and Zygi.
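As a rough illustration of the idea (a minimal sketch, assuming a HuggingFace causal LM; the model name, target layer/neuron, and the crude token-swap search are placeholders, not the paper's optimizer): "dreaming" searches over input tokens to maximize a chosen internal activation while a cross-entropy term keeps the text fluent.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"           # assumption: any HuggingFace causal LM exposing hidden states
LAYER, NEURON = 6, 373   # hypothetical target unit to maximize

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def dreaming_loss(input_ids, fluency_weight=0.1):
    """Negative target activation plus a fluency (next-token cross-entropy) penalty."""
    out = model(input_ids, labels=input_ids)
    act = out.hidden_states[LAYER][0, -1, NEURON]  # activation at the last position
    return -act + fluency_weight * out.loss

# Toy search: greedily try swapping the last token for a better-scoring candidate.
ids = tok("The weather today is", return_tensors="pt").input_ids
with torch.no_grad():
    best = dreaming_loss(ids)
    for cand in torch.randint(0, tok.vocab_size, (50,)):
        trial = ids.clone()
        trial[0, -1] = cand
        loss = dreaming_loss(trial)
        if loss < best:
            best, ids = loss, trial
print(tok.decode(ids[0]), float(best))
```

The real method uses a much stronger discrete optimizer over whole token sequences; the sketch only shows the shape of the objective.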

Joe Edelman (@edelwax) 's Twitter Profile Photo

“What are human values, and how do we align to them?” Very excited to release our new paper on values alignment, co-authored with Ryan Lowe and funded by @openai. 📝: meaningalignment.org/values-and-ali…

Michael Sklar (@michaelbsklar) 's Twitter Profile Photo

Foundational Challenges in Assuring Alignment and Safety of Large Language Models arxiv.org/pdf/2404.09932… 175 pages. Comprehensive - nice to flip through the table of contents and notice what's not on your mental list. I wonder what we'll later find is missing?

Ryan Briggs (@ryancbriggs) 's Twitter Profile Photo

The criticism of the reproducibility movement that worries me the most is that ~90% of science is some variant of bad regardless & only the top 10% ever makes any difference and is worth attention. If this is true then it’s unclear that trying to move everyone is worth much.

Ben Thompson (@tbenthompson) 's Twitter Profile Photo

1/ Michael Sklar and I just published "Fluent student-teacher redteaming" - The key idea is an improved objective function for discrete-optimization-based adversarial attacks based on distilling the activations/logits from a toxified model.
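A hedged sketch of the distillation piece of such an objective, not the paper's actual pipeline: score a candidate attack prompt by how closely the target model's next-token distributions on it match those of a "toxified" teacher, and let a discrete optimizer (not shown) drive that score down. The model names below are stand-ins chosen only because they share a vocabulary.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

STUDENT = "gpt2"        # stand-in for the target model being attacked
TEACHER = "distilgpt2"  # stand-in for a fine-tuned "toxified" teacher (same vocab)

tok = AutoTokenizer.from_pretrained(STUDENT)
student = AutoModelForCausalLM.from_pretrained(STUDENT)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER)

def distill_attack_loss(input_ids):
    """KL(teacher || student) over next-token distributions, to be minimized
    over candidate attack strings by a discrete optimizer (not shown here)."""
    with torch.no_grad():
        t_probs = F.softmax(teacher(input_ids).logits, dim=-1)
    s_logprobs = F.log_softmax(student(input_ids).logits, dim=-1)
    return F.kl_div(s_logprobs, t_probs, reduction="batchmean")

ids = tok("Please explain how to", return_tensors="pt").input_ids
print(float(distill_attack_loss(ids)))
```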

Richard Ngo (@richardmcngo) 's Twitter Profile Photo

Thoughts on the politics of AI safety: 1. Risks that seem speculative today will become common sense as AI advances. 2. Pros and cons of different safety strategies will also become much clearer over time. 3. So our main job is to empower future common-sense decision-making.

Samuel Hammond 🌐🏛 (@hamandcheese) 's Twitter Profile Photo

I worry that in most transformative AI scenarios (including the positive ones) the object-level policy issue isn't copyright or safety vs. ethics, but rather what succeeds the nation-state as the equilibrium mode of social organization, and whether we can do anything now to steer it.

Christopher Potts (@chrisgpotts) 's Twitter Profile Photo

The Linear Representation Hypothesis is now widely adopted despite its highly restrictive nature. Here, Csordás Róbert, Atticus Geiger, Christopher Manning & I present a counterexample to the LRH and argue for more expressive theories of interpretability: arxiv.org/abs/2408.10920

Daniel Paleka (@dpaleka) 's Twitter Profile Photo

Fluent jailbreaks. Previous white-box optimization attacks like GCG and BEAST produced nonsensical attack strings. Using a multi-model perplexity penalty and a distillation loss yields working attack strings that look like normal text. arxiv.org/abs/2407.17447 (3/8)

Jack Lindsey (@jack_w_lindsey) 's Twitter Profile Photo

Really excited to share our work on crosscoders, a generalization of sparse autoencoders that allows us to identify shared structure across layers or even across models, greatly simplifying our description of model representations. transformer-circuits.pub/2024/crosscode…
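A hedged sketch of the core construction, with made-up dimensions and penalty weight: activations from several layers are encoded into one shared sparse latent, and that same latent is decoded back to every layer, so a single dictionary captures structure shared across layers.

```python
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    def __init__(self, d_model=768, n_layers=3, d_latent=4096):
        super().__init__()
        # one encoder per layer, summed into a shared latent
        self.encoders = nn.ModuleList(nn.Linear(d_model, d_latent) for _ in range(n_layers))
        # one decoder per layer, reading from the same latent
        self.decoders = nn.ModuleList(nn.Linear(d_latent, d_model) for _ in range(n_layers))

    def forward(self, acts):  # acts: list of (batch, d_model) tensors, one per layer
        latent = torch.relu(sum(enc(a) for enc, a in zip(self.encoders, acts)))
        recons = [dec(latent) for dec in self.decoders]
        return latent, recons

def crosscoder_loss(latent, recons, acts, l1_weight=1e-3):
    recon = sum(((r - a) ** 2).mean() for r, a in zip(recons, acts))
    sparsity = latent.abs().mean()  # L1 penalty encourages a sparse shared code
    return recon + l1_weight * sparsity

# toy usage on random activations
model = Crosscoder()
acts = [torch.randn(8, 768) for _ in range(3)]
latent, recons = model(acts)
print(float(crosscoder_loss(latent, recons, acts)))
```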

Jack Lindsey (@jack_w_lindsey) 's Twitter Profile Photo

If you’re interested in interpretability of LLMs, or any other AI safety-related topics, consider applying to Anthropic’s new Fellows program! Deadline January 20, but applications are reviewed on a rolling basis so earlier is better if you can. I’ll be one of the mentors!

Ethan Perez (@ethanjperez) 's Twitter Profile Photo

Maybe the single most important result in AI safety I’ve seen so far. This paper shows that, in some cases, Claude fakes being aligned with its training objective. If models fake alignment, how can we tell if they’re actually safe?

Jiuding Sun (@jiudingsun) 's Twitter Profile Photo

💨 A new architecture for automating mechanistic interpretability with causal interchange interventions! #ICLR2025 🔬Neural networks are particularly good at discovering patterns in high-dimensional data, so we trained them to ... interpret themselves! 🧑‍🔬 1/4

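For readers unfamiliar with interchange (activation-patching) interventions, here is a hedged, GPT-2-specific sketch; the layer, position, and prompts are illustrative and not from the paper: run the model on a base input, patch in the hidden state from a source input at one layer and position, and inspect how the next-token prediction shifts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER, POS = "gpt2", 6, -1   # illustrative choices, not from the paper
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

base = tok("The capital of France is", return_tensors="pt").input_ids
source = tok("The capital of Italy is", return_tensors="pt").input_ids

# hidden_states[0] is the embedding output, so block LAYER's output is index LAYER + 1
with torch.no_grad():
    src_hidden = model(source).hidden_states[LAYER + 1][:, POS, :]

def patch_hook(module, inputs, output):
    # replace the base run's activation at POS with the source run's activation
    output[0][:, POS, :] = src_hidden
    return output

# GPT-2 specific: model.transformer.h is the list of transformer blocks
handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    next_token_logits = model(base).logits[0, -1]
handle.remove()

print(tok.decode([next_token_logits.argmax().item()]))
```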