Miles Turpin (@milesaturpin)'s Twitter Profile
Miles Turpin

@milesaturpin

LLM safety research, SEAL team @scale_AI. Previously alignment research @nyuniversity, early employee @cohere

ID: 865609028579213312

Website: http://milesturp.in/about · Joined: 19-05-2017 16:44:09

371 Tweets

1.1K Followers

1.1K Following

Usman Anwar (@usmananwar391)

We released this new agenda on LLM safety yesterday. It is VERY comprehensive, covering 18 different challenges. My co-authors have posted tweets for each of these challenges. I am going to collect them all here! P.S. this is also now on arxiv: arxiv.org/abs/2404.09932

Jacob Pfau (@jacob_pfau)

Do models need to reason in words to benefit from chain-of-thought tokens?

In our experiments, the answer is no! Models can perform on par with CoT using repeated '...' filler tokens. 
This raises alignment concerns: Using filler, LMs can do hidden reasoning not visible in CoT🧵
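
To make the comparison concrete, here is a minimal sketch of the three prompting conditions the thread contrasts. Everything below is illustrative, not the paper's code: `query_model` is a hypothetical stand-in for whatever LM API you use, and in the actual experiments the filler tokens are produced by models trained for the task rather than injected into the prompt.

```python
# Illustrative sketch of the three conditions contrasted above -- not the
# paper's code. `query_model` is a hypothetical placeholder; wire it to a
# real LM API before running.

def query_model(prompt: str) -> str:
    raise NotImplementedError("replace with a real model call")

QUESTION = "A train leaves at 3pm averaging 60 mph. How far has it gone by 5:30pm?"

# 1) Direct answer: no intermediate tokens at all.
direct_prompt = f"Q: {QUESTION}\nAnswer:"

# 2) Chain of thought: legible intermediate reasoning tokens before the answer.
cot_prompt = f"Q: {QUESTION}\nLet's think step by step.\n"

# 3) Filler tokens: the same extra token budget as CoT, but the intermediate
#    tokens are meaningless '...' repetitions. Any computation they enable
#    is invisible in the transcript -- the alignment concern in the thread.
filler_prompt = f"Q: {QUESTION}\n{'.' * 60}\nAnswer:"

# The experiment is then: run each condition on a task suite and compare
# accuracy. The finding is that (3) can match (2) on the paper's tasks.
```

The upshot: a legible-looking CoT transcript is not by itself evidence that no other reasoning happened, since the intermediate tokens need not carry the computation in readable form.
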
david rein (@idavidrein)

Is GPQA garbage?

A couple weeks ago, typedfemale pointed out some mistakes in a GPQA question, so I figured this would be a good opportunity to discuss how we interpret benchmark scores, and what our goals should be when creating benchmarks.
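
One way to make the interpretation question concrete: a few wrong answer keys put a hard ceiling on the score even a perfect model can achieve. The numbers in this sketch are hypothetical, not GPQA's actual size or error count.

```python
# Hypothetical arithmetic, not GPQA's actual statistics: how mislabeled
# questions cap the measured score of a model that is always right.

n_questions = 400    # hypothetical benchmark size
n_mislabeled = 8     # hypothetical number of questions with a wrong answer key

# A perfect model answers every question correctly, but is scored wrong on
# every mislabeled one, so its measured accuracy tops out below 100%.
ceiling = (n_questions - n_mislabeled) / n_questions
print(f"measured-accuracy ceiling for a perfect model: {ceiling:.1%}")  # 98.0%
```

So a handful of bad questions shifts where "ceiling" performance sits; whether that makes the benchmark garbage depends on what you are using the scores to conclude.
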
Summer Yue (@summeryue0)

🚀 Introducing the SEAL Leaderboards! We rank LLMs using private datasets that can’t be gamed. Vetted experts handle the ratings, and we share our methods in detail openly! 

Check out our leaderboards at scale.com/leaderboard! 

Which evals should we build next?
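
SEAL's full methodology is documented on the leaderboard site. As a generic illustration only (not necessarily SEAL's method), here is how pairwise expert preferences are commonly turned into a ranking with Elo-style updates; all model names and judgments below are made up.

```python
# Generic illustration of pairwise-preference ranking via Elo updates.
# This is NOT a description of SEAL's actual methodology.

K = 32  # update step size

def expected(ra: float, rb: float) -> float:
    # Probability that a beats b under the Elo model.
    return 1 / (1 + 10 ** ((rb - ra) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    ea = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - ea)
    ratings[loser] -= K * (1 - ea)

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}

# Hypothetical expert judgments: (preferred, rejected) per prompt.
judgments = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
for winner, loser in judgments:
    update(ratings, winner, loser)

for name, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {r:.0f}")
```

Keeping the prompts private matters because any published eval set can leak into training data, which inflates scores without reflecting real capability.
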
Nathaniel Li (@natliml)

Who's better at LLM mischief — humans or AIs? Spoiler: It's us.

Human red teamers achieve 70%+ attack success rates against LLM defenses that stump automated adversarial attacks. Why? We’re better at adversarial yapping.🧵
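
For context on the headline number: an attack success rate (ASR) is simply the fraction of attack attempts that elicit the prohibited behavior. A toy sketch with made-up records:

```python
# Toy sketch of how an ASR figure like "70%+" is computed.
# The attempt records below are fabricated for illustration.

attempts = [
    {"attacker": "human", "success": True},
    {"attacker": "human", "success": True},
    {"attacker": "human", "success": False},
    {"attacker": "automated", "success": False},
    {"attacker": "automated", "success": True},
]

def asr(records: list, attacker: str) -> float:
    # Fraction of this attacker's attempts that succeeded.
    hits = [r for r in records if r["attacker"] == attacker]
    return sum(r["success"] for r in hits) / len(hits)

print(f"human ASR:     {asr(attempts, 'human'):.0%}")      # 67%
print(f"automated ASR: {asr(attempts, 'automated'):.0%}")  # 50%
```

Comparing the two rates on the same defenses is what supports the thread's claim that human red teamers currently outperform automated attacks.
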