Chris Glaze (@chris_m_glaze)'s Twitter Profile
Chris Glaze

@chris_m_glaze

Principal Research Scientist at @SnorkelAI. PhD in computational neuroscience. scholar.google.com/citations?user…

ID: 3873101457

Joined: 05-10-2015 17:53:17

8 Tweets

15 Followers

29 Following

Chris Glaze (@chris_m_glaze)'s Twitter Profile Photo

Definitely tracks with our observations. Point (3) especially raises the question of how to most effectively involve experts in the dev process.

Chris Glaze (@chris_m_glaze)'s Twitter Profile Photo


The term “horizon” is used inconsistently in the LM benchmarking lit, and not always super aligned with how the term is used in the RL lit. TL;DR: long horizon ≠ complex. Frontier models may do well on complex tasks but can still fail on basics of long horizon planning.

METR
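One toy way to see why horizon length is distinct from per-step complexity (a sketch under a strong independence assumption, not from the thread above): if each step of a task succeeds independently with probability p, success on an H-step task is p^H, so even trivially simple steps compound into failure at long horizons.

```python
# Toy model: per-step reliability compounds with horizon length.
# Assumes steps succeed independently -- a simplification, for illustration only.
def task_success(p_step: float, horizon: int) -> float:
    """Probability of completing all `horizon` steps of a task."""
    return p_step ** horizon

# A model that is 99% reliable per step still fails most 200-step tasks,
# even though no individual step is complex.
print(task_success(0.99, 10))   # ≈ 0.904
print(task_success(0.99, 200))  # ≈ 0.134
```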
Chris Glaze (@chris_m_glaze)'s Twitter Profile Photo

AI eval : basic software testing :: AI capability : basic software capability. AI evals are difficult to get right (which is probably why we often see <<80% agreement rates). Requires critical thinking about the problem domain, usability, and objectives. Rubrics are vital and their
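The agreement rates mentioned here are straightforward to measure; a minimal sketch of mean pairwise percent agreement between evaluators (the judge data is made up for illustration):

```python
from itertools import combinations

def pairwise_agreement(labels_by_judge):
    """Mean fraction of items on which each pair of judges agrees.

    labels_by_judge: list of equal-length label lists, one per judge.
    """
    rates = []
    for a, b in combinations(labels_by_judge, 2):
        matches = sum(x == y for x, y in zip(a, b))
        rates.append(matches / len(a))
    return sum(rates) / len(rates)

# Three hypothetical judges labeling five model outputs pass/fail.
judges = [
    ["pass", "pass", "fail", "pass", "fail"],
    ["pass", "fail", "fail", "pass", "fail"],
    ["pass", "pass", "fail", "fail", "fail"],
]
print(pairwise_agreement(judges))  # ≈ 0.733 -- well under 80%
```

Chance-corrected statistics such as Cohen's kappa are usually reported alongside raw agreement, since raw agreement overstates consensus on imbalanced labels.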

Fred Sala (@fredsala)'s Twitter Profile Photo


Super excited to present our new work on hybrid architecture models—getting the best of Transformers and SSMs like Mamba—at #COLM2025! Come chat with <a href="/nick11roberts/">Nicholas Roberts</a> at poster session 2 on Tuesday. Thread below! (1)
Chris Glaze (@chris_m_glaze)'s Twitter Profile Photo


Just how good are AI agents at exploring their environments in novel ways to solve real-world enterprise problems? As part of our ongoing experiments around agentic autonomy at <a href="/SnorkelAI/">Snorkel AI</a> we’re making “code-only” versions of environments in which we challenge agents to solve
Percy Liang (@percyliang)'s Twitter Profile Photo

⛵Marin 32B Base (mantis) is done training! It is the best open-source base model (beating OLMo 2 32B Base) and it’s even close to the best comparably-sized open-weight base models, Gemma 3 27B PT and Qwen 2.5 32B Base. Ranking across 19 benchmarks:

Amanda Dsouza (@amanda_dsouza)'s Twitter Profile Photo


🚨 New research from <a href="/SnorkelAI/">Snorkel AI</a> tackles a critical problem: LLMs are evolving faster than our ability to evaluate them 📊

We develop BeTaL— Benchmark Tuning with an LLM-in-the-loop— a framework that automates benchmark design using reasoning models as optimizers.

BeTaL produces
Snorkel AI (@snorkelai)'s Twitter Profile Photo


New from <a href="/ArminPCM/">Armin</a>: how Snorkel builds reinforcement learning (RL) environments that train and evaluate agents in realistic, enterprise-grade settings.
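The kind of interface such an environment typically exposes can be sketched in a few lines. Everything below (class name, actions, reward scheme) is a hypothetical illustration in the style of a gym environment, not Snorkel's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class TicketEnv:
    """Toy 'resolve a support ticket' environment: the agent must
    look up the order and then issue a refund, in that order."""
    steps: list = field(default_factory=list)

    def reset(self) -> str:
        self.steps = []
        return "Customer requests a refund for an order."  # initial observation

    def step(self, action: str):
        self.steps.append(action)
        done = self.steps[-2:] == ["lookup_order", "issue_refund"]
        reward = 1.0 if done else 0.0
        return ("refund issued" if done else "ok"), reward, done

env = TicketEnv()
obs = env.reset()
for action in ["lookup_order", "issue_refund"]:
    obs, reward, done = env.step(action)
print(reward, done)  # 1.0 True
```

Real enterprise environments replace the hard-coded action check with tool APIs, databases, and verifiable end-state checks, but the reset/step/reward loop is the same contract agents train against.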
Mayank Vora (@aiwithmayank)'s Twitter Profile Photo


Chain-of-thought just became the newest safety nightmare in AI, and nobody was ready for this.

A team from Anthropic, Stanford, and Oxford found something brutal: if you wrap a harmful request inside a long, harmless reasoning chain, the model’s guardrails weaken until it stops
Chris Glaze (@chris_m_glaze)'s Twitter Profile Photo

What’s the future-state of AI agents in real-world scenarios? How often will they just solve problems as coders vs interacting with more constrained but complex tool sets? Models like Claude Sonnet 4.5 are indeed impressive at coding and can in theory use these same skills to

Chris Glaze (@chris_m_glaze)'s Twitter Profile Photo

Confirming task solvability is the very first thing to do after developing a benchmark. This is especially challenging in tau-bench style envs, and at least half of our dev work at Snorkel AI is devoted to that when we make these envs for customers. If you make many unique tasks
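A minimal sketch of that kind of solvability check: run a reference solver several times per task and keep only tasks it passes reliably. The solver, trial count, and threshold below are illustrative assumptions, not Snorkel's actual pipeline:

```python
# Solvability filter: a task counts as solvable only if a reference solver
# passes it in at least `min_pass_rate` of `trials` attempts.
def filter_solvable(tasks, solver, trials=5, min_pass_rate=0.8):
    solvable = []
    for task in tasks:
        passes = sum(bool(solver(task)) for _ in range(trials))
        if passes / trials >= min_pass_rate:
            solvable.append(task)
    return solvable

# Toy deterministic solver: succeeds iff the task value is even.
tasks = [2, 3, 4, 7, 8]
print(filter_solvable(tasks, solver=lambda t: t % 2 == 0))  # [2, 4, 8]
```

Repeated trials matter because agent runs are stochastic: a task a strong reference agent passes only once in five attempts is more likely ambiguous or broken than merely hard.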

Chris Glaze (@chris_m_glaze)'s Twitter Profile Photo


Frontier models like Gemini 3 Pro are making impressive strides as code agents, but still show basic errors when applying coding skills to solve enterprise-style problems in real-world tasks.

We took the verified version of Tau^2 Bench made by the AGI team at <a href="/amazon/">Amazon</a> and
Snorkel AI (@snorkelai)'s Twitter Profile Photo

Snorkel contributed:
• An agentic RL eval environment
• FinQA-Reasoning dataset
• Finance Reasoning benchmark

Full technical breakdown from the rLLM team + our enterprise takeaways here: snorkel.ai/blog/how-tool-…