Hugh Zhang (@hughbzhang)'s Twitter Profile
Hugh Zhang

@hughbzhang

research @scale_AI. co-created @gradientpub.

ID: 1075466686709395457

Link: http://hughbzhang.com
Joined: 19-12-2018 19:03:34

420 Tweets

3.3K Followers

1.1K Following


Data contamination is a huge problem for LLM evals right now. At Scale, we created a new test set for GSM8k *from scratch* to measure overfitting and found evidence that some models (most notably Mistral and Phi) do substantially worse on this new test set compared to GSM8k.
