Constantin Venhoff (@cvenhoff00) 's Twitter Profile
Constantin Venhoff

@cvenhoff00

ID: 1783494472476540928

calendar_today25-04-2024 13:53:24

5 Tweet

16 Followers

40 Following

Scott Emmons (@emmons_scott) 's Twitter Profile Photo

Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models

Is CoT monitoring a lost cause due to unfaithfulness? 🤔

We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes!

Our finding: "When Chain of Thought is Necessary, Language Models
Mikita Balesni 🇺🇦 (@balesni) 's Twitter Profile Photo

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:

A simple AGI safety technique: AI’s thoughts are in plain English, just read them

We know it works, with OK (not perfect) transparency!

The risk is fragility: RL training, new architectures, etc threaten transparency

Experts from many orgs agree we should try to preserve it:
Helena Casademunt (@hcasademunt) 's Twitter Profile Photo

Problem: Train LLM on insecure code → it becomes broadly misaligned Solution: Add safety data? What if you can't? Use interpretability! We remove misaligned concepts during finetuning to steer OOD generalization We reduce emergent misalignment 10x w/o modifying training data

Problem: Train LLM on insecure code → it becomes broadly misaligned
Solution: Add safety data? What if you can't?

Use interpretability! We remove misaligned concepts during finetuning to steer OOD generalization

We reduce emergent misalignment 10x w/o modifying training data
Jake Ward (@_jake_ward) 's Twitter Profile Photo

Do reasoning models like DeepSeek R1 learn their behavior from scratch? No! In our new paper, we extract steering vectors from a base model that induce backtracking in a distilled reasoning model, but surprisingly have no apparent effect on the base model itself! đź§µ (1/5)

Nick Jiang @ ICLR (@nickhjiang) 's Twitter Profile Photo

What makes LLMs like Grok-4 unique? We use sparse autoencoders (SAEs) to tackle queries like these and apply them to four data analysis tasks: data diffing, correlations, targeted clustering, and retrieval. By analyzing model outputs, SAEs find novel insights on model behavior!

What makes LLMs like Grok-4 unique?

We use sparse autoencoders (SAEs) to tackle queries like these and apply them to four data analysis tasks: data diffing, correlations, targeted clustering, and retrieval. By analyzing model outputs, SAEs find novel insights on model behavior!
Tim Hua 🇺🇦 (@tim_hua_) 's Twitter Profile Photo

Problem: AIs can detect when they are being tested and fake good behavior. Can we suppress the “I’m being tested” concept & make them act normally? Yes! In a new paper, we show that subtracting this concept vector can elicit real-world behavior even when normal prompting fails.

Problem: AIs can detect when they are being tested and fake good behavior.

Can we suppress the “I’m being tested” concept & make them act normally?

Yes! In a new paper, we show that subtracting this concept vector can elicit real-world behavior even when normal prompting fails.
Sharan (@_maiush) 's Twitter Profile Photo

AI that is “forced to be good” v “genuinely good” Should we care about the difference? (yes!) We’re releasing the first open implementation of character training. We shape the persona of AI assistants in a more robust way than alternatives like prompting or activation steering.

AI that is “forced to be good” v “genuinely good”
Should we care about the difference? (yes!)

We’re releasing the first open implementation of character training. We shape the persona of AI assistants in a more robust way than alternatives like prompting or activation steering.