Erik Jones (@erikjones313)'s Twitter Profile
Erik Jones

@erikjones313

Safety @AnthropicAI. Prev @berkeley_ai and @StanfordAILab. Opinions are my own

ID: 961740708666212352

Link: http://people.eecs.berkeley.edu/~erjones · Joined: 08-02-2018 23:17:08

120 Tweets

486 Followers

157 Following

Max Nadeau (@maxnadeau_)'s Twitter Profile Photo

If you want funding to do follow-up work on this topic, we'd welcome your application to our RFP! Specifically, the "rare misbehavior" section. We're open to funding compute-heavy experiments. openphilanthropy.org/request-for-pr…

akbir. (@akbirkhan)'s Twitter Profile Photo

I love this thread because it highlights what most research creativity is … just knowing facts across very different fields and connecting them. Just unlocking this will get us so far. I’m very excited.

Xander Davies (@alxndrdavies)'s Twitter Profile Photo

My team is hiring at the AI Security Institute! I think this is one of the most important times in history to have strong technical expertise in government. Join our team understanding and fixing weaknesses in frontier models through state-of-the-art adversarial ML research & testing. 🧵 1/4

Anthropic (@anthropicai)'s Twitter Profile Photo

New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?

Transluce (@transluceai)'s Twitter Profile Photo

To interpret AI benchmarks, we need to look at the data. Top-level numbers don't mean what you think: there may be broken tasks, unexpected behaviors, or near-misses. We're introducing Docent to accelerate analysis of AI agent transcripts. It can spot surprises in seconds. 🧵👇

Ethan Perez (@ethanjperez)'s Twitter Profile Photo

Transluce is killing it. Very cool/insightful findings in this thread. Their tool for automatically finding weird model behaviors (Docent) is one of those projects I wish I had thought to do, and it looks quite useful for improving models.

Erik Jones (@erikjones313)'s Twitter Profile Photo

I really admire how Ruiqi's methodology is scalable, easy to deploy, and superficially simple: so simple that it can be hard to recognize the critical conceptual work needed to find the right load-bearing primitives and abstractions. Definitely check out his video and thesis :)

Siddharth Karamcheti (@siddkaramcheti)'s Twitter Profile Photo

Thrilled to share that I'll be starting as an Assistant Professor at Georgia Tech (Georgia Tech School of Interactive Computing / Robotics@GT / Machine Learning at Georgia Tech) in Fall 2026. My lab will tackle problems in robot learning, multimodal ML, and interaction. I'm recruiting PhD students this next cycle – please apply/reach out!

Meena Jagadeesan (@mjagadeesan25)'s Twitter Profile Photo

I'm so excited to be joining Penn as an Assistant Professor in CS (Penn Computer and Information Science) in Fall 2026! I’ll be working on machine learning ecosystems, aiming to steer how multi-agent interactions shape performance trends and societal outcomes. I’ll be recruiting PhD students this cycle!

Jan Leike (@janleike)'s Twitter Profile Photo

In March we published a paper on alignment audits: teams of humans were tasked with finding the problems in a model we trained to be misaligned. Now we have agents that can do it automatically 42% of the time.

Ethan Perez (@ethanjperez)'s Twitter Profile Photo

We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵

Xander Davies (@alxndrdavies)'s Twitter Profile Photo

Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6

Tejal Patwardhan (@tejalpatwardhan)'s Twitter Profile Photo

Understanding the capabilities of AI models is important to me. To forecast how AI models might affect labor, we need methods to measure their real-world work abilities. That’s why we created GDPval.

Transluce (@transluceai)'s Twitter Profile Photo

Can LMs learn to faithfully describe their internal features and mechanisms? In our new paper led by Research Fellow Belinda Li, we find that they can—and that models explain themselves better than other models do.
