Sebfox (@sebfox1)'s Twitter Profile
Sebfox

@sebfox1

Building AI evals | Previously AI at McKinsey & QuantumBlack

ID: 81218132

Link: http://composo.ai · Joined: 09-10-2009 22:26:55

260 Tweets

200 Followers

622 Following

signüll (@signulll)'s Twitter Profile Photo

anthropic’s thinking campaign is just so damn tasteful… it feels like a warm room. the aesthetic is comforting as hell, i kinda get a sense of claude as ai helping you be you from it.

that’s why we’re weaving claude into everything we build. i’ll post more about sonnet 4.5
Amjad Masad (@amasad)'s Twitter Profile Photo

It’s unacceptable and unprofessional for a VC like Josh Wolfe to attack a founder by altering screenshots to put words in my mouth and falsely tying my company and partners to them. If this is what he does publicly, imagine the distortions spread behind closed doors. I remain

Sebfox (@sebfox1)'s Twitter Profile Photo

Watched an interesting standup yesterday at a customer's office.

Engineer: "Pushed the new retrieval logic last night."
PM: "How's it performing?"
Engineer: *pulls up dashboard* "Hallucination rate down 15%, relevance holding steady at 0.84."
PM: "Ship it to the next tier."

Sebfox (@sebfox1)'s Twitter Profile Photo

Shipping AI agents to production just got a lot easier.

Most AI teams are stuck in one of two places:
- Flying blind: shipping and praying, finding out what broke from customer complaints.
- Or stuck in beta: spending weeks on evals that still don't give them confidence to ship.

Sebfox (@sebfox1)'s Twitter Profile Photo

3 lines. That's all it takes to go from opaque AI agents to full observability and evaluation.

Most agent frameworks abstract away what's actually happening under the hood. You deploy to production, something breaks, and you're left debugging in the dark with no idea which agent
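The thread doesn't show the three lines themselves, and Composo's actual API may differ; as a generic, self-contained sketch of what decorator-based agent instrumentation looks like (every name here, `trace`, `TRACES`, `answer`, is an illustrative assumption, not Composo's API):

```python
import functools
import json
import time
import uuid

TRACES = []  # in-memory trace store; a real system would export these

def trace(fn):
    """Decorator that records each agent call's input, output, and
    latency so runs can be inspected and evaluated later."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACES.append({
            "id": str(uuid.uuid4()),
            "agent": fn.__name__,
            "input": repr((args, kwargs)),
            "output": repr(result),
            "latency_ms": round((time.perf_counter() - start) * 1000, 2),
        })
        return result
    return wrapper

@trace  # the "one line per agent" instrumentation step
def answer(question: str) -> str:
    # stand-in for a real agent / LLM call
    return f"stub answer to: {question}"

answer("What is our refund policy?")
print(json.dumps(TRACES[0], indent=2))
```

The point of the pattern: once every agent entry point is wrapped, "which agent broke" becomes a query over `TRACES` rather than guesswork.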
Sebfox (@sebfox1)'s Twitter Profile Photo

Beyond Real-Time Evaluation: Visualize Your AI Performance at Scale

Once you've traced your agents with Composo, the real power comes from visualization. 

While you get instant evaluation results directly in your code, the Composo platform takes it further with comprehensive
Sebfox (@sebfox1)'s Twitter Profile Photo

Sat in on a deployment review. The phrase "seems fine" came up twelve times in thirty minutes.

"The outputs seem fine"
"Performance seems fine"
"Customers seem fine with it"

This is a billion-dollar company's evaluation strategy. Seems. Fine.

The engineer presenting looked

Sebfox (@sebfox1)'s Twitter Profile Photo

Evals don't have to feel like a complex chore; I've seen many teams have huge success with a few simple evals.

The real danger comes from overthinking: not shipping, and spending more time on evals than on the feature itself.
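As a concrete illustration of "a few simple evals": a minimal, self-contained sketch of plain assertion-style checks over a handful of known inputs. `generate` stands in for a real LLM call and is stubbed here; all names are hypothetical.

```python
def generate(prompt: str) -> str:
    """Stub for an LLM call; a real eval would hit the actual model."""
    canned = {
        "What is 2 + 2?": "4",
        "Summarize: The cat sat on the mat.": "A cat sat on a mat.",
    }
    return canned.get(prompt, "I don't know.")

# A few simple evals: (prompt, cheap deterministic check on the output).
EVAL_CASES = [
    ("What is 2 + 2?", lambda out: "4" in out),
    ("Summarize: The cat sat on the mat.",
     lambda out: len(out) < 50 and "cat" in out.lower()),
]

def run_evals():
    """Run every case and return (prompt, passed) pairs."""
    return [(prompt, check(generate(prompt))) for prompt, check in EVAL_CASES]

passed = sum(ok for _, ok in run_evals())
print(f"{passed}/{len(EVAL_CASES)} evals passed")
```

A handful of checks like this, run on every change, is already enough to catch regressions that "does it look okay?" misses.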
Sebfox (@sebfox1)'s Twitter Profile Photo

Team A: "We need another week to validate this change. Sarah from the medical team is on vacation and she's the only one who can check the healthcare responses."

Team B: "Medical accuracy is at 0.91, up from 0.87 last week. The failing cases are all around dosage

Sebfox (@sebfox1)'s Twitter Profile Photo

The metric nobody tracks: time between "I wonder if this will work" and "I know if this worked."

For most AI teams: 2-3 weeks
For the best teams: 2-3 hours

This isn't about moving fast and breaking things. It's about moving fast and knowing exactly what you're doing. I've

Sebfox (@sebfox1)'s Twitter Profile Photo

Listened to a team retro yesterday. They were reflecting on how much had changed in six months.

"Remember when we used to argue for hours about whether the new prompt was 'better'?"

Now their deployment decisions take minutes. Not because they care less about quality, but because

Andrew Ng (@andrewyng)'s Twitter Profile Photo

Readers responded with both surprise and agreement last week when I wrote that the single biggest predictor of how rapidly a team makes progress building an AI agent lay in their ability to drive a disciplined process for evals (measuring the system’s performance) and error

Sebfox (@sebfox1)'s Twitter Profile Photo

Every AI conference: "Evals are critical. You need comprehensive datasets. Build robust testing frameworks."

Every AI team in practice: "Does it look okay? Ship it."

Look, I know evals are critical; I've literally dedicated my life to building them. But they're also boring,

Sebfox (@sebfox1)'s Twitter Profile Photo

Why your engineering team hates evals

Engineers have been shipping software successfully for decades with a simple toolkit:
- Unit tests
- CI/CD pipelines
- Production monitoring

Then AI shows up and suddenly everyone's telling them to build datasets, craft judge prompts, and

Sebfox (@sebfox1)'s Twitter Profile Photo

"We test in production" used to be a joke. Now it's the strategy.

Not because teams are reckless. Because that's where the real signal lives.

Your carefully crafted evaluation dataset will never capture the weird stuff actual users do. Your judge prompts won't predict which

Sebfox (@sebfox1)'s Twitter Profile Photo

There's a bit of a skill-set mismatch going on. We're asking engineers to become data scientists.

"Just analyze your traces"
"Build comprehensive datasets"
"Craft better judge prompts"

But engineers don't live in notebooks and spreadsheets. They live in IDEs and terminals.

Sebfox (@sebfox1)'s Twitter Profile Photo

The vibes-to-metrics evolution:

Stage 1: Ship on vibes
Stage 2: Users complain in Discord
Stage 3: Add basic monitoring
Stage 4: Find the one metric that matters
Stage 5: Optimize that metric relentlessly

Every successful AI product I've studied followed this path. Not