Zhengyang Qi (@qi_zhengyang)'s Twitter Profile
Zhengyang Qi

@qi_zhengyang

ID: 1649567079320768518

Joined: 22-04-2023 00:13:40

7 Tweets

26 Followers

171 Following

Zhengyang Qi (@qi_zhengyang)

Just revalidated the saying that the more overwhelmed you feel designing the systems, the smoother your developing experience will be.

Kolby Nottingham (@kolbytn)

Excited to share our work, "Skill Set Optimization", a continual learning method for LLM actors that:
- Automatically extracts modular subgoals to use as skills
- Reinforces skills using environment reward
- Facilitates skill retrieval based on state
allenai.github.io/sso
🧵
Ruiyi Wang 王睿仪 🦋 @ruiyiwang (@ruiyiwang153)

Happy PI🥧 day! Can language agents🤖 learn social skills through imitation and interaction? We are excited to introduce SOTOPIA-π🥧 pi.sotopia.world, an interactive learning method for training language agents to navigate real-world social scenarios while role-playing!

Hao Zhu 朱昊 (@_hao_zhu)

Another update for Sotopia: I coded a simple Colab tutorial for you to try out Sotopia simulation and evaluation without setting it up on your own machine: colab.research.google.com/drive/14hJOfzp… Create your own characters + social scenarios, and use your favorite LLMs!

Ruiyi Wang 王睿仪 🦋 @ruiyiwang (@ruiyiwang153)

Excited to share that SOTOPIA-π was accepted to #ACL2024 main conference 🎉! Check our work on training socially intelligent language agents: arxiv.org/abs/2403.08715

Haofei Yu 🦋 @haofeiyu.bsky.social (@haofeiyu44)

🚀 RL works wonders for math & coding... But what about social intelligence? We introduce Sotopia-RL: Reward Design for Social Intelligence 🤝🧠 📘 rl.sotopia.world 📑 arxiv.org/abs/2508.03905 🔗 github.com/sotopia-lab/so…

Bing Liu (@vbingliu)

New @Scale_AI paper! The culprit behind reward hacking? We trace it to misspecification in high-reward tail. Our fix: rubric-based rewards to tell “excellent” responses apart from “great.” The result: Less hacking, stronger post-training!  arxiv.org/pdf/2509.21500

Fred Sala (@fredsala)

The coolest trend for AI is shifting from conversation to action—less talking and more doing. This is also a great opportunity for evals: we need benchmarks that measure utility, including in an economic sense. terminalbench is my favorite effort of this type!

Zhengyang Qi (@qi_zhengyang)

Excited to be speaking at AGI House’s Self-Evolving Agents Build Day this weekend! Looking forward to connecting with other builders and researchers working at the edge of adaptive and self-evolving agents.

Amanda Dsouza (@amanda_dsouza)

🚨 New research from Snorkel AI tackles a critical problem: LLMs are evolving faster than our ability to evaluate them 📊 We develop BeTaL— Benchmark Tuning with an LLM-in-the-loop— a framework that automates benchmark design using reasoning models as optimizers. BeTaL produces

Armin (@arminpcm)

New blog post: “Snorkeling in RL Environments”

What makes a great RL environment for LLMs? We break down what’s needed to give agentic apps the reward signal they need to develop production-ready accuracy.
Snorkel AI (@snorkelai)

NeurIPS lunch crew → Snorkel researchers + the always-great Tom Walshe

If you’re at #NeurIPS2025, come say hi — and see everything else we’re doing this week (papers, workshops, events):  snorkel.ai/neurips-event/
Justin Bauer (@realjustinbauer)

Our paper “Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes” was accepted to #MLSys 2026! We introduce three procedurally generated, verifiable datasets—Counting, Graph, and Spatial Reasoning—to study RLVR under low-data / low-compute

Alex Ratner (@ajratner)

Exciting mention of TBench 2.0 in today's model releases - congrats to Mike A. Merrill Alex Shaw & team + proud of Snorkel AI 's contributions! Benchmarks are just one (limited) measurement tool - but critical guideposts of frontier progress. Much more to build here ahead!

Zhengyang Qi (@qi_zhengyang)

Snorkel is launching a cool program for funding open benchmark and evaluation research. If you have cool benchmark ideas don’t hesitate to contact us and we can make it happen!