Erik Jones (@erikjones313)'s Twitter Profile
Erik Jones

@erikjones313

Safety @AnthropicAI. Prev @berkeley_ai and @StanfordAILab. Opinions are my own

ID: 961740708666212352

Link: http://people.eecs.berkeley.edu/~erjones · Joined: 08-02-2018 23:17:08

120 Tweets

486 Followers

157 Following

Max Nadeau (@maxnadeau_)'s Twitter Profile Photo

If you want funding to do follow-up work on this topic, we'd welcome your application to our RFP! Specifically, the "rare misbehavior" section. We're open to funding compute-heavy experiments. openphilanthropy.org/request-for-pr…

akbir. (@akbirkhan)'s Twitter Profile Photo

I love this thread because it highlights what most research creativity is … just knowing facts across very different fields and connecting them. Just unlocking this will get us so far. I’m very excited.

Xander Davies (@alxndrdavies)'s Twitter Profile Photo

My team is hiring at the AI Security Institute! I think this is one of the most important times in history to have strong technical expertise in government. Join our team understanding and fixing weaknesses in frontier models through state-of-the-art adversarial ML research & testing. 🧵 1/4

Anthropic (@anthropicai)'s Twitter Profile Photo

New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?

Transluce (@transluceai)'s Twitter Profile Photo

To interpret AI benchmarks, we need to look at the data. Top-level numbers don't mean what you think: there may be broken tasks, unexpected behaviors, or near-misses. We're introducing Docent to accelerate analysis of AI agent transcripts. It can spot surprises in seconds. 🧵👇

Ethan Perez (@ethanjperez)'s Twitter Profile Photo

Transluce is killing it. Very cool/insightful findings in this thread. Their tool for automatically finding weird model behaviors (Docent) is one of those projects I wish I had thought to do, and it looks quite useful for improving models.

Erik Jones (@erikjones313)'s Twitter Profile Photo

I really admire how Ruiqi's methodology is scalable, easy to deploy, and superficially simple: so simple that it can be hard to recognize the critical conceptual work needed to find the right load-bearing primitives and abstractions. Definitely check out his video and thesis :)

Siddharth Karamcheti (@siddkaramcheti)'s Twitter Profile Photo

Thrilled to share that I'll be starting as an Assistant Professor at Georgia Tech (Georgia Tech School of Interactive Computing / Robotics@GT / Machine Learning at Georgia Tech) in Fall 2026. My lab will tackle problems in robot learning, multimodal ML, and interaction. I'm recruiting PhD students this next cycle – please apply/reach out!

Meena Jagadeesan (@mjagadeesan25)'s Twitter Profile Photo

I'm so excited to be joining Penn as an Assistant Professor in CS (Penn Computer and Information Science) in Fall 2026! I’ll be working on machine learning ecosystems, aiming to steer how multi-agent interactions shape performance trends and societal outcomes. I’ll be recruiting PhD students this cycle!

Jan Leike (@janleike)'s Twitter Profile Photo

In March we published a paper on alignment audits: teams of humans were tasked with finding the problems in a model we trained to be misaligned. Now we have agents that can do it automatically 42% of the time.

Ethan Perez (@ethanjperez)'s Twitter Profile Photo

We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵

Xander Davies (@alxndrdavies)'s Twitter Profile Photo

Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6

Tejal Patwardhan (@tejalpatwardhan)'s Twitter Profile Photo

Understanding the capabilities of AI models is important to me. To forecast how AI models might affect labor, we need methods to measure their real-world work abilities. That’s why we created GDPval.

Transluce (@transluceai)'s Twitter Profile Photo

Can LMs learn to faithfully describe their internal features and mechanisms? In our new paper led by Research Fellow Belinda Li, we find that they can—and that models explain themselves better than other models do.
