Chris Glaze (@chris_m_glaze)'s Twitter Profile
Chris Glaze

@chris_m_glaze

Principal Research Scientist at @SnorkelAI. PhD in computational neuroscience. scholar.google.com/citations?user…

ID: 3873101457

Joined: 05-10-2015 17:53:17

8 Tweets

15 Followers

29 Following

Chris Glaze (@chris_m_glaze)'s Twitter Profile Photo

Definitely tracks with our observations. Point (3) especially raises the question of how to most effectively involve experts in the dev process.

Chris Glaze (@chris_m_glaze)'s Twitter Profile Photo


The term “horizon” is used inconsistently in the LM benchmarking lit, and not always super aligned with how the term is used in the RL lit. TL;DR: long horizon ≠ complex. Frontier models may do well on complex tasks but can still fail on basics of long horizon planning.

METR
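One toy way to see why horizon length is distinct from per-step complexity (a sketch under a strong independence assumption, not from the thread above): if each step of a task succeeds independently with probability p, success on an H-step task is p^H, so even trivially simple steps compound into failure at long horizons.

```python
# Toy model: per-step reliability compounds with horizon length.
# Assumes steps succeed independently -- a simplification, for illustration only.
def task_success(p_step: float, horizon: int) -> float:
    """Probability of completing all `horizon` steps of a task."""
    return p_step ** horizon

# A model that is 99% reliable per step still fails most 200-step tasks,
# even though no individual step is complex.
print(task_success(0.99, 10))   # ≈ 0.904
print(task_success(0.99, 200))  # ≈ 0.134
```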
Chris Glaze (@chris_m_glaze)'s Twitter Profile Photo

AI eval : basic software testing :: AI capability : basic software capability. AI evals are difficult to get right (which is probably why we often see <<80% agreement rates). Requires critical thinking about the problem domain, usability, and objectives. Rubrics are vital and their
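The agreement rates mentioned here are straightforward to measure; a minimal sketch of mean pairwise percent agreement between evaluators (the judge data is made up for illustration):

```python
from itertools import combinations

def pairwise_agreement(labels_by_judge):
    """Mean fraction of items on which each pair of judges agrees.

    labels_by_judge: list of equal-length label lists, one per judge.
    """
    rates = []
    for a, b in combinations(labels_by_judge, 2):
        matches = sum(x == y for x, y in zip(a, b))
        rates.append(matches / len(a))
    return sum(rates) / len(rates)

# Three hypothetical judges labeling five model outputs pass/fail.
judges = [
    ["pass", "pass", "fail", "pass", "fail"],
    ["pass", "fail", "fail", "pass", "fail"],
    ["pass", "pass", "fail", "fail", "fail"],
]
print(pairwise_agreement(judges))  # ≈ 0.733 -- well under 80%
```

Chance-corrected statistics such as Cohen's kappa are usually reported alongside raw agreement, since raw agreement overstates consensus on imbalanced labels.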

Fred Sala (@fredsala)'s Twitter Profile Photo


Super excited to present our new work on hybrid architecture models—getting the best of Transformers and SSMs like Mamba—at #COLM2025! Come chat with <a href="/nick11roberts/">Nicholas Roberts</a> at poster session 2 on Tuesday. Thread below! (1)
Chris Glaze (@chris_m_glaze)'s Twitter Profile Photo


Just how good are AI agents at exploring their environments in novel ways to solve real-world enterprise problems? As part of our ongoing experiments around agentic autonomy at <a href="/SnorkelAI/">Snorkel AI</a> we’re making “code-only” versions of environments in which we challenge agents to solve
Percy Liang (@percyliang)'s Twitter Profile Photo

⛵Marin 32B Base (mantis) is done training! It is the best open-source base model (beating OLMo 2 32B Base) and it’s even close to the best comparably-sized open-weight base models, Gemma 3 27B PT and Qwen 2.5 32B Base. Ranking across 19 benchmarks:

Amanda Dsouza (@amanda_dsouza)'s Twitter Profile Photo


🚨 New research from <a href="/SnorkelAI/">Snorkel AI</a> tackles a critical problem: LLMs are evolving faster than our ability to evaluate them 📊

We develop BeTaL— Benchmark Tuning with an LLM-in-the-loop— a framework that automates benchmark design using reasoning models as optimizers.

BeTaL produces
Snorkel AI (@snorkelai)'s Twitter Profile Photo


New from <a href="/ArminPCM/">Armin</a>: how Snorkel builds reinforcement learning (RL) environments that train and evaluate agents in realistic, enterprise-grade settings.
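The kind of interface such an environment typically exposes can be sketched in a few lines. Everything below (class name, actions, reward scheme) is a hypothetical illustration in the style of a gym environment, not Snorkel's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class TicketEnv:
    """Toy 'resolve a support ticket' environment: the agent must
    look up the order and then issue a refund, in that order."""
    steps: list = field(default_factory=list)

    def reset(self) -> str:
        self.steps = []
        return "Customer requests a refund for an order."  # initial observation

    def step(self, action: str):
        self.steps.append(action)
        done = self.steps[-2:] == ["lookup_order", "issue_refund"]
        reward = 1.0 if done else 0.0
        return ("refund issued" if done else "ok"), reward, done

env = TicketEnv()
obs = env.reset()
for action in ["lookup_order", "issue_refund"]:
    obs, reward, done = env.step(action)
print(reward, done)  # 1.0 True
```

Real enterprise environments replace the hard-coded action check with tool APIs, databases, and verifiable end-state checks, but the reset/step/reward loop is the same contract agents train against.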
Mayank Vora (@aiwithmayank)'s Twitter Profile Photo


Chain-of-thought just became the newest safety nightmare in AI, and nobody was ready for this.

A team from Anthropic, Stanford, and Oxford found something brutal: if you wrap a harmful request inside a long, harmless reasoning chain, the model’s guardrails weaken until it stops
Chris Glaze (@chris_m_glaze)'s Twitter Profile Photo

What’s the future-state of AI agents in real-world scenarios? How often will they just solve problems as coders vs interacting with more constrained but complex tool sets? Models like Claude Sonnet 4.5 are indeed impressive at coding and can in theory use these same skills to

Chris Glaze (@chris_m_glaze)'s Twitter Profile Photo

Confirming task solvability is the very first thing to do after developing a benchmark. This is especially challenging in tau-bench style envs, and at least half of our dev work at Snorkel AI is devoted to that when we make these envs for customers. If you make many unique tasks
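A minimal sketch of that kind of solvability check: run a reference solver several times per task and keep only tasks it passes reliably. The solver, trial count, and threshold below are illustrative assumptions, not Snorkel's actual pipeline:

```python
# Solvability filter: a task counts as solvable only if a reference solver
# passes it in at least `min_pass_rate` of `trials` attempts.
def filter_solvable(tasks, solver, trials=5, min_pass_rate=0.8):
    solvable = []
    for task in tasks:
        passes = sum(bool(solver(task)) for _ in range(trials))
        if passes / trials >= min_pass_rate:
            solvable.append(task)
    return solvable

# Toy deterministic solver: succeeds iff the task value is even.
tasks = [2, 3, 4, 7, 8]
print(filter_solvable(tasks, solver=lambda t: t % 2 == 0))  # [2, 4, 8]
```

Repeated trials matter because agent runs are stochastic: a task a strong reference agent passes only once in five attempts is more likely ambiguous or broken than merely hard.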

Chris Glaze (@chris_m_glaze)'s Twitter Profile Photo


Frontier models like Gemini 3 Pro are making impressive strides as code agents, but still show basic errors when applying coding skills to solve enterprise-style problems in real-world tasks.

We took the verified version of Tau^2 Bench made by the AGI team at <a href="/amazon/">Amazon</a> and
Snorkel AI (@snorkelai)'s Twitter Profile Photo

Snorkel contributed:
• An agentic RL eval environment
• FinQA-Reasoning dataset
• Finance Reasoning benchmark

Full technical breakdown from the rLLM team + our enterprise takeaways here: snorkel.ai/blog/how-tool-…