Phoebe Thacker
@phoebethacker
Human Data @OpenAI, Ex Data @ Google DeepMind
ID: 95918639
10-12-2009 15:06:48
470 Tweets
426 Followers
780 Following
Introducing SWE-Lancer: our most realistic coding benchmark to date. $1M in real-world, full-stack freelance SWE tasks, each taking freelancers >21 days to complete on avg. Still some limitations, but better than evals we had before. Congrats Samuel Miserendino and Michele Wang!