Miles Turpin (@milesaturpin)'s Twitter Profile
Miles Turpin

@milesaturpin

Language model alignment @nyuniversity

ID: 865609028579213312

Link: http://milesturp.in/about · Joined: 19-05-2017 16:44:09

364 Tweets

957 Followers

1.3K Following

Jacob Pfau (@jacob_pfau)'s Twitter Profile Photo

Do models need to reason in words to benefit from chain-of-thought tokens?

In our experiments, the answer is no! Models can perform on par with CoT using repeated '...' filler tokens.
This raises alignment concerns: Using filler, LMs can do hidden reasoning not visible in CoT 🧵

Usman Anwar (@usmananwar391)'s Twitter Profile Photo

We released this new agenda on LLM safety yesterday. It is VERY comprehensive, covering 18 different challenges.

My co-authors have posted tweets for each of these challenges. I am going to collect them all here!

P.S. this is also now on arxiv: arxiv.org/abs/2404.09932

Davis Brown (@davisbrownr)'s Twitter Profile Photo

🚨 Come work with PNNL on AI Safety and Security! These are unique roles working on safety for a DOE national laboratory's national security mission. Applications close April 10th (this Wednesday); some details and roles below in 🧵

Cas (Stephen Casper) (@StephenLCasper)'s Twitter Profile Photo

🚨 New paper: Defending Against Unforeseen Failure Modes with Latent Adversarial Training

We argue that LAT can be a key tool for safer AI because it can help address the gap between failure modes that developers identify 🎯 and ones they miss 🤔.

arxiv.org/abs/2403.05030

Sam Bowman (@sleepinyourhat)'s Twitter Profile Photo

🚨📄 Following up on 'LMs Don't Always Say What They Think', Miles Turpin et al. now have an intervention that dramatically reduces the problem! 📄🚨

It's not a perfect solution, but it's a simple method with few assumptions and it generalizes *much* better than I'd expected.

Ethan Perez (@EthanJPerez)'s Twitter Profile Photo

Excited about our latest work: we found a way to train LLMs to produce reasoning that's more faithful to how LLMs solve tasks.

We did so by training LLMs to give reasoning that's consistent across inputs, and I suspect the approach here might be useful even beyond faithfulness.
