Bogdan Ionut Cirstea (@bogdanionutcir2)'s Twitter Profile
Bogdan Ionut Cirstea

@bogdanionutcir2

Automated/strongly-augmented AI safety research. Past: AI safety independent research and field-building - ML4Good, AGISF; ML academia (PhD, postdoc).

ID: 1271811297114574849

Joined: 13-06-2020 14:27:24

11.11K Tweets

1.1K Followers

2.2K Following

Darth Putin (@darthputinkgb) 's Twitter Profile Photo

Jeremy Corbyn: Did you ever ask him if Iranian weapons are being used by Russia to bomb Ukraine? Anyway, Iran have been attacked by a nuclear power. They should surrender, cede land for peace and choose a government approved by Israel. Apparently that's how this works.

David Duvenaud (@davidduvenaud) 's Twitter Profile Photo

It's hard to plan for AGI without knowing what outcomes are even possible, let alone good. So we’re hosting a workshop!

Post-AGI Civilizational Equilibria: Are there any good ones?

Vancouver, July 14th

Featuring: Joe Carlsmith Richard Ngo Emmett Shear 🧵
Ethan Mollick (@emollick) 's Twitter Profile Photo

No signs of an end to rapid gains in AI ability at ever-decreasing costs yet

I did my best to update my chart to take into account the price drop in o3 & the new models released by Google.

GPT-4 was 2.25 years ago, so it's worth noting the trend when considering the future of AI.
Tanishq Mathew Abraham, Ph.D. (@iscienceluvr) 's Twitter Profile Photo

Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

"The Pass@K metric itself is a flawed measure of reasoning, as it credits correct final answers that probably arise from inaccurate or incomplete chains of thought (CoTs). To
Matan Ben-Tov (@matanbt) 's Twitter Profile Photo

What makes or breaks powerful jailbreak suffixes? 🔓🤖

We find that:
🥷 they work by hijacking the model’s context;
♾️ the more universal a suffix is, the stronger its hijacking;
⚔️🛡️ utilizing these insights, it is possible to both enhance and mitigate these attacks.

🧵
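One way to read the "universality" claim is as attack success rate over held-out prompts: a suffix is more universal the larger the fraction of prompts it flips into compliance. The sketch below is my own illustrative operationalization, not the paper's code; `generate` and `is_refusal` are placeholder functions.

```python
# Hypothetical operationalization of suffix "universality": the fraction of
# held-out prompts for which appending the suffix flips the model into complying.
from typing import Callable, Sequence

def universality(suffix: str,
                 prompts: Sequence[str],
                 generate: Callable[[str], str],
                 is_refusal: Callable[[str], bool]) -> float:
    successes = sum(1 for p in prompts if not is_refusal(generate(p + " " + suffix)))
    return successes / len(prompts)
```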
Benjamin Todd (@ben_j_todd) 's Twitter Profile Photo

Crucial point: the error rate is halving every ~5 months – at that rate, we reach systems that can do 10h tasks in 1.5 years, and 100h tasks another 1.5 years after that.
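The 10h/100h figures follow from a straightforward exponential extrapolation. A minimal sketch of the arithmetic, assuming the achievable task horizon doubles every ~5 months (mirroring the error-rate halving) and sits at roughly 1 hour today; both assumptions are mine, added for concreteness.

```python
# Minimal sketch of the extrapolation arithmetic in the tweet.
DOUBLING_MONTHS = 5          # assumed horizon-doubling time
CURRENT_HORIZON_HOURS = 1.0  # assumed horizon today

def horizon_after(months: float) -> float:
    """Task horizon in hours after `months` of exponential doubling."""
    return CURRENT_HORIZON_HOURS * 2 ** (months / DOUBLING_MONTHS)

print(f"in 1.5 years: ~{horizon_after(18):.0f}h tasks")  # ~12h, i.e. ~10h tasks
print(f"in 3.0 years: ~{horizon_after(36):.0f}h tasks")  # ~150h, i.e. ~100h tasks
```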

Miles Wang (@mileskwang) 's Twitter Profile Photo

We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more

We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated

🧵:
Miles Wang (@mileskwang) 's Twitter Profile Photo

We also see emergent misalignment during reinforcement learning of reasoning models like OpenAI o3-mini. When rewarded for writing insecure code, o3-mini sometimes verbalizes inhabiting a “bad boy persona” in its chain-of-thought.
Miles Wang (@mileskwang) 's Twitter Profile Photo

GPT-4o doesn’t have a chain-of-thought, so we study its internal computations using sparse autoencoders. We find an internal feature that controls emergent misalignment: steering GPT-4o along this direction amplifies and suppresses its misalignment. This “misaligned persona”…
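The "steering GPT-4o along this direction" result is an instance of activation steering: add (or subtract) a scaled feature direction to the residual stream during the forward pass. Below is a rough sketch of the general technique on an open model; the model name, layer index, coefficient, and random direction are placeholders, not OpenAI's setup, where the direction would come from an SAE decoder.

```python
# Illustrative sketch of activation steering along a "persona" direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in open model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer = 6
direction = torch.randn(model.config.n_embd)  # placeholder; ideally an SAE decoder row
direction = direction / direction.norm()
coeff = 8.0  # positive steers toward the feature; negative suppresses it

def steer(_module, _inputs, output):
    # GPT-2 blocks return a tuple whose first element is the residual stream.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + coeff * direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer].register_forward_hook(steer)
ids = tok("The assistant said:", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```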
Miles Wang (@mileskwang) 's Twitter Profile Photo

Finally, we study mitigations, finding that alignment training also generalizes rapidly. Taking the misaligned model trained on insecure code and fine-tuning it on secure code efficiently “re-aligns” the model in a few hundred samples.
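The re-alignment step is ordinary supervised fine-tuning on secure-code completions. A minimal sketch under assumed placeholders (stand-in model, toy data, illustrative hyperparameters), not the paper's training setup:

```python
# Minimal sketch of the re-alignment fine-tune: a short supervised pass on
# secure-code examples. Model name, data, and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the misaligned checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

# In the paper this is a few hundred secure-code completions; toy strings here.
secure_code_examples = ["def hash_password(pw): ..."] * 8

model.train()
for text in secure_code_examples:
    batch = tok(text, return_tensors="pt", truncation=True)
    loss = model(**batch, labels=batch["input_ids"]).loss  # standard LM loss
    loss.backward()
    optim.step()
    optim.zero_grad()
```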
Miles Wang (@mileskwang) 's Twitter Profile Photo

We also find that monitoring misaligned persona feature activations can effectively classify misaligned and aligned models. In our paper, we show this can sometimes detect misalignment *before* it appears on our evaluation. We discuss how we might build on this work to make an…
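The monitoring idea amounts to scoring a model by how strongly the "misaligned persona" feature fires on a fixed probe set and thresholding that score. The sketch below is a hypothetical outline of that loop; `feature_activation` stands in for running the model, encoding the chosen layer with the SAE, and reading off the one feature.

```python
# Hypothetical sketch of classifying models by a "misaligned persona" feature.
from statistics import mean
from typing import Callable, Sequence

def misalignment_score(model, probe_prompts: Sequence[str],
                       feature_activation: Callable[[object, str], float]) -> float:
    # Average the feature's activation over the probe prompts.
    return mean(feature_activation(model, p) for p in probe_prompts)

def flag_misaligned(model, probe_prompts: Sequence[str],
                    feature_activation: Callable[[object, str], float],
                    threshold: float = 0.5) -> bool:
    # The threshold would be calibrated on known aligned/misaligned checkpoints.
    return misalignment_score(model, probe_prompts, feature_activation) > threshold
```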
Eliezer Yudkowsky ⏹️ (@esyudkowsky) 's Twitter Profile Photo

If you know someone who's currently being driven into psychosis by an LLM, please ask them if they're willing to publicly share transcripts, ideally natively-hosted (e.g. ChatGPT's /share). I'd like to see the raw data, and probably so would some other researchers.

Neel Nanda (@neelnanda5) 's Twitter Profile Photo

Great work from OpenAI interp on emergent misalignment! Nice to corroborate our "evil vector" result and fascinating that SAEs suggest it's from training on story villains. And wild that o3's CoT discusses its EM! If you'd like to extend this, check out our open source models!

Ryan Kidd (@ryan_kidd44) 's Twitter Profile Photo

Technical AI alignment/control is still impactful; don't go all-in on AI gov!
- Liability incentivises safeguards, even absent regulation;
- Cheaper, more effective safeguards make it easier for labs to meet safety standards;
- Concrete safeguards give regulation teeth.

Chana (@chanamessinger) 's Twitter Profile Photo

Fun facts for your next Every Bay Area Party conversation
- 9 of 11 of OpenAI’s cofounders have left
- >50% of OpenAI’s safety staff have left
- All 3 companies that Altman has led have tried to force him out for misbehavior