Bogdan Ionut Cirstea (@bogdanionutcir2)'s Twitter Profile
Bogdan Ionut Cirstea

@bogdanionutcir2

Automated/strongly-augmented AI safety research. Past: AI safety independent research and field-building - ML4Good, AGISF; ML academia (PhD, postdoc).

ID: 1271811297114574849

Joined: 13-06-2020 14:27:24

11.11K Tweets

1.1K Followers

2.2K Following

Darth Putin (@darthputinkgb) 's Twitter Profile Photo

Jeremy Corbyn: Did you ever ask him if Iranian weapons are being used by Russia to bomb Ukraine? Anyway, Iran have been attacked by a nuclear power. They should surrender, cede land for peace and choose a government approved by Israel. Apparently that's how this works.

David Duvenaud (@davidduvenaud) 's Twitter Profile Photo

It's hard to plan for AGI without knowing what outcomes are even possible, let alone good. So we’re hosting a workshop!

Post-AGI Civilizational Equilibria: Are there any good ones?

Vancouver, July 14th

Featuring: Joe Carlsmith Richard Ngo Emmett Shear 🧵
Ethan Mollick (@emollick) 's Twitter Profile Photo

No signs of an end to rapid gains in AI ability at ever-decreasing costs yet

I did my best to update my chart to take into account the price drop in o3 & the new models released by Google.

GPT-4 was 2.25 years ago, so it's worth noting the trend when considering the future of AI.
Tanishq Mathew Abraham, Ph.D. (@iscienceluvr) 's Twitter Profile Photo

Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

"The Pass@K metric itself is a flawed measure of reasoning, as it credits correct final answers that probably arise from inaccurate or incomplete chains of thought (CoTs). To
Matan Ben-Tov (@matanbt) 's Twitter Profile Photo

What makes or breaks powerful jailbreak suffixes? 🔓🤖

We find that:
🥷 they work by hijacking the model’s context;
♾️ the more universal a suffix is, the stronger its hijacking;
⚔️🛡️ utilizing these insights, it is possible to both enhance and mitigate these attacks.

🧵
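One way to read the "universality" claim is as attack success rate over held-out prompts: a suffix is more universal the larger the fraction of prompts it flips into compliance. The sketch below is my own illustrative operationalization, not the paper's code; `generate` and `is_refusal` are placeholder functions.

```python
# Hypothetical operationalization of suffix "universality": the fraction of
# held-out prompts for which appending the suffix flips the model into complying.
from typing import Callable, Sequence

def universality(suffix: str,
                 prompts: Sequence[str],
                 generate: Callable[[str], str],
                 is_refusal: Callable[[str], bool]) -> float:
    successes = sum(1 for p in prompts if not is_refusal(generate(p + " " + suffix)))
    return successes / len(prompts)
```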
Benjamin Todd (@ben_j_todd) 's Twitter Profile Photo

Crucial point: the error rate is halving every ~5 months – at that rate, we reach systems that can do 10h tasks in 1.5 years, and 100h tasks another 1.5 years after that.
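The 10h/100h figures follow from a straightforward exponential extrapolation. A minimal sketch of the arithmetic, assuming the achievable task horizon doubles every ~5 months (mirroring the error-rate halving) and sits at roughly 1 hour today; both assumptions are mine, added for concreteness.

```python
# Minimal sketch of the extrapolation arithmetic in the tweet.
DOUBLING_MONTHS = 5          # assumed horizon-doubling time
CURRENT_HORIZON_HOURS = 1.0  # assumed horizon today

def horizon_after(months: float) -> float:
    """Task horizon in hours after `months` of exponential doubling."""
    return CURRENT_HORIZON_HOURS * 2 ** (months / DOUBLING_MONTHS)

print(f"in 1.5 years: ~{horizon_after(18):.0f}h tasks")  # ~12h, i.e. ~10h tasks
print(f"in 3.0 years: ~{horizon_after(36):.0f}h tasks")  # ~150h, i.e. ~100h tasks
```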

Miles Wang (@mileskwang) 's Twitter Profile Photo

We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more

We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated

🧵:
Miles Wang (@mileskwang) 's Twitter Profile Photo

We also see emergent misalignment during reinforcement learning of reasoning models like OpenAI o3-mini. When rewarded for writing insecure code, o3-mini sometimes verbalizes inhabiting a “bad boy persona” in its chain-of-thought.
Miles Wang (@mileskwang) 's Twitter Profile Photo

GPT-4o doesn’t have a chain-of-thought, so we study its internal computations using sparse autoencoders. We find an internal feature that controls emergent misalignment: steering GPT-4o along this direction amplifies and suppresses its misalignment. This “misaligned persona”…
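The "steering GPT-4o along this direction" result is an instance of activation steering: add (or subtract) a scaled feature direction to the residual stream during the forward pass. Below is a rough sketch of the general technique on an open model; the model name, layer index, coefficient, and random direction are placeholders, not OpenAI's setup, where the direction would come from an SAE decoder.

```python
# Illustrative sketch of activation steering along a "persona" direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in open model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer = 6
direction = torch.randn(model.config.n_embd)  # placeholder; ideally an SAE decoder row
direction = direction / direction.norm()
coeff = 8.0  # positive steers toward the feature; negative suppresses it

def steer(_module, _inputs, output):
    # GPT-2 blocks return a tuple whose first element is the residual stream.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + coeff * direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer].register_forward_hook(steer)
ids = tok("The assistant said:", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```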
Miles Wang (@mileskwang) 's Twitter Profile Photo

Finally, we study mitigations, finding that alignment training also generalizes rapidly. Taking the misaligned model trained on insecure code and fine-tuning it on secure code efficiently “re-aligns” the model in a few hundred samples.
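The re-alignment step is ordinary supervised fine-tuning on secure-code completions. A minimal sketch under assumed placeholders (stand-in model, toy data, illustrative hyperparameters), not the paper's training setup:

```python
# Minimal sketch of the re-alignment fine-tune: a short supervised pass on
# secure-code examples. Model name, data, and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the misaligned checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

# In the paper this is a few hundred secure-code completions; toy strings here.
secure_code_examples = ["def hash_password(pw): ..."] * 8

model.train()
for text in secure_code_examples:
    batch = tok(text, return_tensors="pt", truncation=True)
    loss = model(**batch, labels=batch["input_ids"]).loss  # standard LM loss
    loss.backward()
    optim.step()
    optim.zero_grad()
```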
Miles Wang (@mileskwang) 's Twitter Profile Photo

We also find that monitoring misaligned persona feature activations can effectively classify misaligned and aligned models. In our paper, we show this can sometimes detect misalignment *before* it appears on our evaluation. We discuss how we might build on this work to make an…
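The monitoring idea amounts to scoring a model by how strongly the "misaligned persona" feature fires on a fixed probe set and thresholding that score. The sketch below is a hypothetical outline of that loop; `feature_activation` stands in for running the model, encoding the chosen layer with the SAE, and reading off the one feature.

```python
# Hypothetical sketch of classifying models by a "misaligned persona" feature.
from statistics import mean
from typing import Callable, Sequence

def misalignment_score(model, probe_prompts: Sequence[str],
                       feature_activation: Callable[[object, str], float]) -> float:
    # Average the feature's activation over the probe prompts.
    return mean(feature_activation(model, p) for p in probe_prompts)

def flag_misaligned(model, probe_prompts: Sequence[str],
                    feature_activation: Callable[[object, str], float],
                    threshold: float = 0.5) -> bool:
    # The threshold would be calibrated on known aligned/misaligned checkpoints.
    return misalignment_score(model, probe_prompts, feature_activation) > threshold
```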
Eliezer Yudkowsky ⏹️ (@esyudkowsky) 's Twitter Profile Photo

If you know someone who's currently being driven into psychosis by an LLM, please ask them if they're willing to publicly share transcripts, ideally natively-hosted (e.g. ChatGPT's /share). I'd like to see the raw data, and probably so would some other researchers.

Neel Nanda (@neelnanda5) 's Twitter Profile Photo

Great work from OpenAI interp on emergent misalignment! Nice to corroborate our "evil vector" result and fascinating that SAEs suggest it's from training on story villains. And wild that o3's CoT discusses its EM! If you'd like to extend this, check out our open source models!

Ryan Kidd (@ryan_kidd44) 's Twitter Profile Photo

Technical AI alignment/control is still impactful; don't go all-in on AI gov!
- Liability incentivises safeguards, even absent regulation;
- Cheaper, more effective safeguards make it easier for labs to meet safety standards;
- Concrete safeguards give regulation teeth.

Chana (@chanamessinger) 's Twitter Profile Photo

Fun facts for your next Every Bay Area Party conversation
- 9 of 11 of OpenAI’s cofounders have left
- >50% of OpenAI’s safety staff have left
- All 3 companies that Altman has led have tried to force him out for misbehavior