@hughbzhang: Data contamination is a huge problem for LLM evals right now. At Scale, we created a new test set for GSM8k *from scratch* to measure overfitting and found evidence that some models (most notably Mistral and Phi) do substantially worse on this new test set compared to GSM8k.

Hugh Zhang

@hughbzhang

research @scale_AI. co-created @gradientpub.

ID: 1075466686709395457

http://hughbzhang.com · 19-12-2018 19:03:34

420 Tweets

3.3K Followers

1.1K Following

Hugh Zhang

@hughbzhang

a year ago

Data contamination is a huge problem for LLM evals right now. At Scale, we created a new test set for GSM8k *from scratch* to measure overfitting and found evidence that some models (most notably Mistral and Phi) do substantially worse on this new test set compared to GSM8k.