Miles Turpin (@milesaturpin)'s Twitter Profile
Miles Turpin

@milesaturpin

Language model alignment @nyuniversity

ID: 865609028579213312

Link: http://milesturp.in/about · Joined: 19-05-2017 16:44:09

364 Tweets

957 Followers

1.3K Following

Jacob Pfau (@jacob_pfau)'s Twitter Profile Photo

Do models need to reason in words to benefit from chain-of-thought tokens?

In our experiments, the answer is no! Models can perform on par with CoT using repeated '...' filler tokens.
This raises alignment concerns: Using filler, LMs can do hidden reasoning not visible in CoT 🧵

Usman Anwar (@usmananwar391)'s Twitter Profile Photo

We released this new agenda on LLM safety yesterday. It is VERY comprehensive, covering 18 different challenges.

My co-authors have posted tweets for each of these challenges. I am going to collect them all here!

P.S. this is also now on arxiv: arxiv.org/abs/2404.09932

Davis Brown (@davisbrownr)'s Twitter Profile Photo

🚨 Come work with PNNL on AI Safety and Security! These are unique roles working on safety for a DOE national laboratory's national security mission. Applications close April 10th (this Wednesday); some details and roles below in 🧵

Cas (Stephen Casper) (@StephenLCasper)'s Twitter Profile Photo

🚨 New paper: Defending Against Unforeseen Failure Modes with Latent Adversarial Training

We argue that LAT can be a key tool for safer AI because it can help address the gap between failure modes that developers identify 🎯 and ones they miss 🤔.

arxiv.org/abs/2403.05030

Sam Bowman (@sleepinyourhat)'s Twitter Profile Photo

🚨📄 Following up on 'LMs Don't Always Say What They Think', Miles Turpin et al. now have an intervention that dramatically reduces the problem! 📄🚨

It's not a perfect solution, but it's a simple method with few assumptions and it generalizes *much* better than I'd expected.

Ethan Perez (@EthanJPerez)'s Twitter Profile Photo

Excited about our latest work: we found a way to train LLMs to produce reasoning that's more faithful to how LLMs solve tasks.

We did so by training LLMs to give reasoning that's consistent across inputs, and I suspect the approach here might be useful even beyond faithfulness.
