Miles Turpin (@milesaturpin)'s Twitter Profile
Miles Turpin

@milesaturpin

LLM safety research, SEAL team @scale_AI. Previously alignment research @nyuniversity, early employee @cohere

ID: 865609028579213312

Website: http://milesturp.in/about · Joined: 19-05-2017 16:44:09

371 Tweets

1.1K Followers

1.1K Following

Usman Anwar (@usmananwar391)

We released this new agenda on LLM safety yesterday. It is VERY comprehensive, covering 18 different challenges. My co-authors have posted tweets for each of these challenges. I am going to collect them all here! P.S. this is also now on arxiv: arxiv.org/abs/2404.09932

Jacob Pfau (@jacob_pfau)

Do models need to reason in words to benefit from chain-of-thought tokens?

In our experiments, the answer is no! Models can perform on par with CoT using repeated '...' filler tokens. 
This raises alignment concerns: Using filler, LMs can do hidden reasoning not visible in CoT🧵
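
To make the comparison concrete, here is a minimal sketch of the three prompting conditions the thread contrasts. Everything below is illustrative, not the paper's code: `query_model` is a hypothetical stand-in for whatever LM API you use, and in the actual experiments the filler tokens are produced by models trained for the task rather than injected into the prompt.

```python
# Illustrative sketch of the three conditions contrasted above -- not the
# paper's code. `query_model` is a hypothetical placeholder; wire it to a
# real LM API before running.

def query_model(prompt: str) -> str:
    raise NotImplementedError("replace with a real model call")

QUESTION = "A train leaves at 3pm averaging 60 mph. How far has it gone by 5:30pm?"

# 1) Direct answer: no intermediate tokens at all.
direct_prompt = f"Q: {QUESTION}\nAnswer:"

# 2) Chain of thought: legible intermediate reasoning tokens before the answer.
cot_prompt = f"Q: {QUESTION}\nLet's think step by step.\n"

# 3) Filler tokens: the same extra token budget as CoT, but the intermediate
#    tokens are meaningless '...' repetitions. Any computation they enable
#    is invisible in the transcript -- the alignment concern in the thread.
filler_prompt = f"Q: {QUESTION}\n{'.' * 60}\nAnswer:"

# The experiment is then: run each condition on a task suite and compare
# accuracy. The finding is that (3) can match (2) on the paper's tasks.
```

The upshot: a legible-looking CoT transcript is not by itself evidence that no other reasoning happened, since the intermediate tokens need not carry the computation in readable form.
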
david rein (@idavidrein)

Is GPQA garbage?

A couple weeks ago, typedfemale pointed out some mistakes in a GPQA question, so I figured this would be a good opportunity to discuss how we interpret benchmark scores, and what our goals should be when creating benchmarks.
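
One way to make the interpretation question concrete: a few wrong answer keys put a hard ceiling on the score even a perfect model can achieve. The numbers in this sketch are hypothetical, not GPQA's actual size or error count.

```python
# Hypothetical arithmetic, not GPQA's actual statistics: how mislabeled
# questions cap the measured score of a model that is always right.

n_questions = 400    # hypothetical benchmark size
n_mislabeled = 8     # hypothetical number of questions with a wrong answer key

# A perfect model answers every question correctly, but is scored wrong on
# every mislabeled one, so its measured accuracy tops out below 100%.
ceiling = (n_questions - n_mislabeled) / n_questions
print(f"measured-accuracy ceiling for a perfect model: {ceiling:.1%}")  # 98.0%
```

So a handful of bad questions shifts where "ceiling" performance sits; whether that makes the benchmark garbage depends on what you are using the scores to conclude.
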
Summer Yue (@summeryue0)

🚀 Introducing the SEAL Leaderboards! We rank LLMs using private datasets that can’t be gamed. Vetted experts handle the ratings, and we share our methods in detail openly! 

Check out our leaderboards at scale.com/leaderboard! 

Which evals should we build next?
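
SEAL's full methodology is documented on the leaderboard site. As a generic illustration only (not necessarily SEAL's method), here is how pairwise expert preferences are commonly turned into a ranking with Elo-style updates; all model names and judgments below are made up.

```python
# Generic illustration of pairwise-preference ranking via Elo updates.
# This is NOT a description of SEAL's actual methodology.

K = 32  # update step size

def expected(ra: float, rb: float) -> float:
    # Probability that a beats b under the Elo model.
    return 1 / (1 + 10 ** ((rb - ra) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    ea = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - ea)
    ratings[loser] -= K * (1 - ea)

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}

# Hypothetical expert judgments: (preferred, rejected) per prompt.
judgments = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
for winner, loser in judgments:
    update(ratings, winner, loser)

for name, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {r:.0f}")
```

Keeping the prompts private matters because any published eval set can leak into training data, which inflates scores without reflecting real capability.
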
Nathaniel Li (@natliml)

Who's better at LLM mischief — humans or AIs? Spoiler: It's us.

Human red teamers achieve 70%+ attack success rates against LLM defenses that stump automated adversarial attacks. Why? We’re better at adversarial yapping.🧵
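
For context on the headline number: an attack success rate (ASR) is simply the fraction of attack attempts that elicit the prohibited behavior. A toy sketch with made-up records:

```python
# Toy sketch of how an ASR figure like "70%+" is computed.
# The attempt records below are fabricated for illustration.

attempts = [
    {"attacker": "human", "success": True},
    {"attacker": "human", "success": True},
    {"attacker": "human", "success": False},
    {"attacker": "automated", "success": False},
    {"attacker": "automated", "success": True},
]

def asr(records: list, attacker: str) -> float:
    # Fraction of this attacker's attempts that succeeded.
    hits = [r for r in records if r["attacker"] == attacker]
    return sum(r["success"] for r in hits) / len(hits)

print(f"human ASR:     {asr(attempts, 'human'):.0%}")      # 67%
print(f"automated ASR: {asr(attempts, 'automated'):.0%}")  # 50%
```

Comparing the two rates on the same defenses is what supports the thread's claim that human red teamers currently outperform automated attacks.
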