Dataset curation for language models has long relied on brittle, hand-crafted rules. It's time for a more principled, automated approach.
Enter DataRater: a meta-learning framework that learns to value data based on downstream training efficiency. Great summary by Luisa below 👇