How do you choose which language model to use? Quantitative benchmarks can be uninformative and fall prey to Goodhart's Law, and even Chatbot Arena performance can be optimized for.
In our new preprint, we propose generating qualitative report cards... 🧵