
(((ل()(ل() 'yoav))))👾

@yoavgo


ID: 39547749

Link: http://www.cs.biu.ac.il/~yogo/

Joined: 12-05-2009 17:26:34

133.133K Tweets

59.59K Followers

2.2K Following


8 months ago

(a) how did MMLU become the de facto standard benchmark every LLM is trying to beat? (b) an estimated 9% of its questions are ones that human experts think are wrong. do we know if humans and models agree on which ones belong in this 9%?

Likes: 28

Replies: 3

Retweets: 2
