Kobie Crawford (@kobiewon)'s Twitter Profile
Kobie Crawford

@kobiewon

Community @ MosaicML. Opinions are my own.

ID: 81485627

Link: https://github.com/mosaicml/composer
Joined: 11-10-2009 01:28:49

95 Tweets

119 Followers

60 Following

Snorkel AI (@snorkelai)

Static benchmarks can’t keep up with the pace of AI progress.

Our latest research introduces BeTaL—Benchmark Tuning with an LLM-in-the-loop—a framework that uses reasoning models to optimize benchmark design dynamically.

✍️ From the Snorkel Research team: Amanda Dsouza,
Snorkel AI (@snorkelai)

🖥️ Hackathon vibes at AGI House yesterday! Zhengyang Qi joined builders to explore self-evolving AI agents and shared insights from Snorkel’s latest paper, “Automating Benchmark Design.”

Big thanks to Weights & Biases, CoreWeave, and Firecrawl for powering an
Snorkel AI (@snorkelai)

New from Armin: how Snorkel builds reinforcement learning (RL) environments that train and evaluate agents in realistic, enterprise-grade settings.
Kobie Crawford (@kobiewon)

Super excited about this! As if TBench2.0 weren't enough, they also drop Harbor -- easy scaling framework, quick look suggests it's compatible with many infra options. Can't wait to play with it!

Snorkel AI (@snorkelai)

Gratitude to the veterans whose service inspires dedication, courage, and teamwork—values we strive for in our lab every day. 🇺🇸

Mayank Vora (@aiwithmayank)

Chain-of-thought just became the newest safety nightmare in AI, and nobody was ready for this.

A team from Anthropic, Stanford, and Oxford found something brutal: if you wrap a harmful request inside a long, harmless reasoning chain, the model’s guardrails weaken until it stops
Anthropic (@anthropicai)

We disrupted a highly sophisticated AI-led espionage campaign. The attack targeted large tech companies, financial institutions, chemical manufacturing companies, and government agencies. We assess with high confidence that the threat actor was a Chinese state-sponsored group.

Alex Ratner (@ajratner)

AI environment development efforts today fall into one of two main buckets: *app-centric* and *task-centric*. App-centric is anchored on building a clone of an application e.g. common website or tool. This is usually the main or only artifact sold. Task-centric development

Snorkel AI (@snorkelai)

We had a terrific interview with the creators of Terminal Bench 2.0.

They unpack:
• why terminals → more reliable and powerful agents
• key design tradeoffs in TB 2.0
• Creating Harbor to enable eval, RL, and agent workflows at scale
• lessons from building a 100+
Kobie Crawford (@kobiewon)

Been working with Fred Sala since August -- now I get to meet him IRL for the first time! Looking forward to connecting with current and former colleagues this week #NeurIPS2025

Justin Bauer (@realjustinbauer)

Part 5 of the Snorkel rubrics series is live. We dive into the future of agentic, multimodal, tool-using AI and the alignment power of dynamic rubrics. Full blog here: snorkel.ai/blog/part-v-fu…