Clayton Thorrez (@cthorrez)'s Twitter Profile
Clayton Thorrez

@cthorrez

Archived account, now active on Bluesky :)

LLM applied scientist by day, esports data scientist for fun.

Evaluating rating systems on EsportsBench using riix

ID: 707043589289693185

Link: https://huggingface.co/datasets/EsportsBench/EsportsBench

Joined: 08-03-2016 03:21:42

2.2K Tweets

1.1K Followers

1.1K Following

Clayton Thorrez (@cthorrez)

Are rating system predictive accuracies stable over time? Pretty much, yeah.

Using hyperparams swept on data up to 2023-03-31, I evaluated on 20 esports datasets in 3-month chunks. Relative ordering is 100% stable: Glicko > TrueSkill > Elo.
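
A minimal sketch of the chunked evaluation described in the tweet above, assuming predictions have already been produced by some rating system. The helper names (`quarter_key`, `accuracy_by_quarter`), the record layout, and the use of calendar quarters as the 3-month chunks are assumptions for illustration; the example rows are made up and are not EsportsBench data.

```python
from collections import defaultdict
from datetime import date


def quarter_key(d):
    """Map a date to a (year, quarter) bucket, i.e. a 3-month chunk."""
    return (d.year, (d.month - 1) // 3 + 1)


def accuracy_by_quarter(records):
    """Compute prediction accuracy per 3-month chunk.

    records: iterable of (match_date, prob_a_wins, a_won) tuples, where
    prob_a_wins is the model's predicted probability and a_won is a bool.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for match_date, prob, a_won in records:
        key = quarter_key(match_date)
        predicted_a = prob > 0.5          # pick the favored side
        hits[key] += int(predicted_a == a_won)
        totals[key] += 1
    return {k: hits[k] / totals[k] for k in sorted(totals)}


# Hypothetical rows, all dated after the 2023-03-31 tuning cutoff.
records = [
    (date(2023, 4, 10), 0.70, True),
    (date(2023, 5, 2), 0.40, True),
    (date(2023, 7, 15), 0.55, True),
    (date(2023, 9, 30), 0.20, False),
]
result = accuracy_by_quarter(records)
```

Running the same function on the predictions of each rating system and comparing the per-chunk accuracies is one way to check whether their relative ordering holds across chunks.
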
Clayton Thorrez (@cthorrez)

Claude agrees with me that it's weird that Glicko only uses the opponent's variance to compute a player's win prob (not the player's own). And thus P(A beats B) != 1 - P(B beats A).

(I compute both, use them in their respective updates, then average.)
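
A small sketch of the asymmetry described above, using the expected-score formula from the Glicko update, which scales the rating difference by g(RD) of the opponent only. The function `symmetric_prob` is one possible reading of "then average" (averaging P(A beats B) with 1 - P(B beats A)); it is an assumption, not necessarily what the author's code does, and the ratings in the example are made up.

```python
import math

Q = math.log(10) / 400  # Glicko scaling constant


def g(rd):
    """Attenuation factor for a rating deviation RD."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q**2 * rd**2 / math.pi**2)


def expected_score(r, r_opp, rd_opp):
    """Expected score of a player rated r, using only the opponent's RD."""
    return 1.0 / (1.0 + 10 ** (-g(rd_opp) * (r - r_opp) / 400))


def symmetric_prob(r_a, rd_a, r_b, rd_b):
    """Assumed averaging scheme: combine both directional estimates."""
    p_ab = expected_score(r_a, r_b, rd_b)  # uses B's RD
    p_ba = expected_score(r_b, r_a, rd_a)  # uses A's RD
    return 0.5 * (p_ab + (1.0 - p_ba))


# Hypothetical players: A is well established, B is very uncertain.
r_a, rd_a = 1600.0, 30.0
r_b, rd_b = 1500.0, 300.0

p_ab = expected_score(r_a, r_b, rd_b)
p_ba = expected_score(r_b, r_a, rd_a)
gap = p_ab + p_ba - 1.0  # nonzero whenever the two RDs differ
```

With equal RDs the two directional probabilities sum to 1; with unequal RDs they do not, while the averaged version is symmetric by construction.
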
n i k o l a e s t h e t i c (@nikolaesthetic)

My gf agreed we can name our dog Jokic if I wake up with 20k followers. She set an impossible condition because she really doesn't wanna name him/her Jokic. I don't ask for much, but I ask for this miracle.

Clayton Thorrez (@cthorrez)

It's really bad practice to use Elo as an evaluation metric for LLMs. I can't believe I'm still seeing papers in good venues doing it. Elo assumes the strengths are time-varying; you need to use a Bradley-Terry or similar model to aggregate paired comparisons in this setting.
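
To illustrate the alternative named above, here is a minimal sketch of fitting a Bradley-Terry model to paired comparisons with the classic iterative (minorization-maximization) update. Unlike online Elo, whose final ratings depend on the order in which comparisons are processed, this fits one static strength per item from the aggregated win counts. The function name `fit_bradley_terry`, the input layout, and the win counts are hypothetical; the sketch also assumes every item has at least one win and one loss so that the fit is well defined.

```python
def fit_bradley_terry(wins, n_items, iters=200):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins: dict mapping (i, j) -> number of times item i beat item j.
    Returns a list of positive strengths p, where the modeled
    P(i beats j) is p[i] / (p[i] + p[j]).
    """
    p = [1.0] * n_items
    total_wins = [0.0] * n_items
    for (i, _j), w in wins.items():
        total_wins[i] += w

    for _ in range(iters):
        denom = [0.0] * n_items
        for (i, j), w in wins.items():
            # Every comparison between i and j counts toward both items.
            share = w / (p[i] + p[j])
            denom[i] += share
            denom[j] += share
        p = [total_wins[k] / denom[k] if denom[k] > 0 else p[k]
             for k in range(n_items)]
        scale = n_items / sum(p)  # fix the arbitrary overall scale
        p = [x * scale for x in p]
    return p


# Hypothetical comparison counts between three models (0, 1, 2).
wins = {(0, 1): 7, (1, 0): 3, (1, 2): 6, (2, 1): 4, (0, 2): 8, (2, 0): 2}
strengths = fit_bradley_terry(wins, n_items=3)
prob_0_beats_2 = strengths[0] / (strengths[0] + strengths[2])
```

Because the fit only sees aggregated counts, reordering the comparisons leaves the estimated strengths unchanged, which is the property wanted when the compared models are fixed.
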