Zhengyang Qi
@qi_zhengyang
ID: 1649567079320768518
22-04-2023 00:13:40
7 Tweets
26 Followers
171 Following
The coolest trend for AI is shifting from conversation to action—less talking and more doing. This is also a great opportunity for evals: we need benchmarks that measure utility, including in an economic sense. terminalbench is my favorite effort of this type!
🚨 New research from Snorkel AI tackles a critical problem: LLMs are evolving faster than our ability to evaluate them 📊 We develop BeTaL— Benchmark Tuning with an LLM-in-the-loop— a framework that automates benchmark design using reasoning models as optimizers. BeTaL produces
NeurIPS lunch crew → Snorkel researchers + the always-great Tom Walshe. If you’re at #NeurIPS2025, come say hi and see everything else we’re doing this week (papers, workshops, events): snorkel.ai/neurips-event/
Exciting mention of TBench 2.0 in today's model releases - congrats to Mike A. Merrill, Alex Shaw & team + proud of Snorkel AI's contributions! Benchmarks are just one (limited) measurement tool - but critical guideposts of frontier progress. Much more to build here ahead!