Sebfox (@sebfox1)'s Twitter Profile
Sebfox

@sebfox1

Building AI evals | Previously AI at McKinsey & QuantumBlack

ID: 81218132

Link: http://composo.ai · Joined: 09-10-2009 22:26:55

260 Tweets

200 Followers

622 Following

signüll (@signulll)'s Twitter Profile Photo

anthropic’s thinking campaign is just so damn tasteful… it feels like a warm room. the aesthetic is comforting as hell, i kinda get a sense of claude as ai helping you be you from it.

that’s why we’re weaving claude into everything we build. i’ll post more about sonnet 4.5
Amjad Masad (@amasad)'s Twitter Profile Photo

It’s unacceptable and unprofessional for a VC like Josh Wolfe to attack a founder by altering screenshots to put words in my mouth and falsely tying my company and partners to them. If this is what he does publicly, imagine the distortions spread behind closed doors. I remain

Sebfox (@sebfox1)'s Twitter Profile Photo

Watched an interesting standup yesterday at a customer's office.

Engineer: "Pushed the new retrieval logic last night."
PM: "How's it performing?"
Engineer: *pulls up dashboard* "Hallucination rate down 15%, relevance holding steady at 0.84."
PM: "Ship it to the next tier."

Sebfox (@sebfox1)'s Twitter Profile Photo

Shipping AI agents to production just got a lot easier.

Most AI teams are stuck in one of two places:
- Flying blind: shipping and praying, finding out what broke from customer complaints.
- Or stuck in beta: spending weeks on evals that still don't give them confidence to ship.

Sebfox (@sebfox1)'s Twitter Profile Photo

3 lines. That's all it takes to go from opaque AI agents to full observability and evaluation.

Most agent frameworks abstract away what's actually happening under the hood. You deploy to production, something breaks, and you're left debugging in the dark with no idea which agent
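The thread doesn't show the three lines themselves, and Composo's actual API may differ; as a generic, self-contained sketch of what decorator-based agent instrumentation looks like (every name here, `trace`, `TRACES`, `answer`, is an illustrative assumption, not Composo's API):

```python
import functools
import json
import time
import uuid

TRACES = []  # in-memory trace store; a real system would export these

def trace(fn):
    """Decorator that records each agent call's input, output, and
    latency so runs can be inspected and evaluated later."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACES.append({
            "id": str(uuid.uuid4()),
            "agent": fn.__name__,
            "input": repr((args, kwargs)),
            "output": repr(result),
            "latency_ms": round((time.perf_counter() - start) * 1000, 2),
        })
        return result
    return wrapper

@trace  # the "one line per agent" instrumentation step
def answer(question: str) -> str:
    # stand-in for a real agent / LLM call
    return f"stub answer to: {question}"

answer("What is our refund policy?")
print(json.dumps(TRACES[0], indent=2))
```

The point of the pattern: once every agent entry point is wrapped, "which agent broke" becomes a query over `TRACES` rather than guesswork.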
Sebfox (@sebfox1)'s Twitter Profile Photo

Beyond Real-Time Evaluation: Visualize Your AI Performance at Scale

Once you've traced your agents with Composo, the real power comes from visualization. 

While you get instant evaluation results directly in your code, the Composo platform takes it further with comprehensive
Sebfox (@sebfox1)'s Twitter Profile Photo

Sat in on a deployment review. The phrase "seems fine" came up twelve times in thirty minutes.

"The outputs seem fine"
"Performance seems fine"
"Customers seem fine with it"

This is a billion-dollar company's evaluation strategy. Seems. Fine.

The engineer presenting looked

Sebfox (@sebfox1)'s Twitter Profile Photo

Evals don't have to feel like a complex chore; I've seen many teams have huge success with a few simple evals.

The real danger comes from overthinking: not shipping, and spending more time on evals than on the feature itself.
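As a concrete illustration of "a few simple evals": a minimal, self-contained sketch of plain assertion-style checks over a handful of known inputs. `generate` stands in for a real LLM call and is stubbed here; all names are hypothetical.

```python
def generate(prompt: str) -> str:
    """Stub for an LLM call; a real eval would hit the actual model."""
    canned = {
        "What is 2 + 2?": "4",
        "Summarize: The cat sat on the mat.": "A cat sat on a mat.",
    }
    return canned.get(prompt, "I don't know.")

# A few simple evals: (prompt, cheap deterministic check on the output).
EVAL_CASES = [
    ("What is 2 + 2?", lambda out: "4" in out),
    ("Summarize: The cat sat on the mat.",
     lambda out: len(out) < 50 and "cat" in out.lower()),
]

def run_evals():
    """Run every case and return (prompt, passed) pairs."""
    return [(prompt, check(generate(prompt))) for prompt, check in EVAL_CASES]

passed = sum(ok for _, ok in run_evals())
print(f"{passed}/{len(EVAL_CASES)} evals passed")
```

A handful of checks like this, run on every change, is already enough to catch regressions that "does it look okay?" misses.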
Sebfox (@sebfox1)'s Twitter Profile Photo

Team A: "We need another week to validate this change. Sarah from the medical team is on vacation and she's the only one who can check the healthcare responses."

Team B: "Medical accuracy is at 0.91, up from 0.87 last week. The failing cases are all around dosage

Sebfox (@sebfox1)'s Twitter Profile Photo

The metric nobody tracks: time between "I wonder if this will work" and "I know if this worked."

For most AI teams: 2-3 weeks
For the best teams: 2-3 hours

This isn't about moving fast and breaking things. It's about moving fast and knowing exactly what you're doing. I've

Sebfox (@sebfox1)'s Twitter Profile Photo

Listened to a team retro yesterday. They were reflecting on how much had changed in six months.

"Remember when we used to argue for hours about whether the new prompt was 'better'?"

Now their deployment decisions take minutes. Not because they care less about quality, but because

Andrew Ng (@andrewyng)'s Twitter Profile Photo

Readers responded with both surprise and agreement last week when I wrote that the single biggest predictor of how rapidly a team makes progress building an AI agent lay in their ability to drive a disciplined process for evals (measuring the system’s performance) and error

Sebfox (@sebfox1)'s Twitter Profile Photo

Every AI conference: "Evals are critical. You need comprehensive datasets. Build robust testing frameworks."

Every AI team in practice: "Does it look okay? Ship it."

Look, I know evals are critical; I've literally dedicated my life to building them. But they're also boring,

Sebfox (@sebfox1)'s Twitter Profile Photo

Why your engineering team hates evals

Engineers have been shipping software successfully for decades with a simple toolkit:
- Unit tests
- CI/CD pipelines
- Production monitoring

Then AI shows up and suddenly everyone's telling them to build datasets, craft judge prompts, and

Sebfox (@sebfox1)'s Twitter Profile Photo

"We test in production" used to be a joke. Now it's the strategy.

Not because teams are reckless. Because that's where the real signal lives.

Your carefully crafted evaluation dataset will never capture the weird stuff actual users do. Your judge prompts won't predict which

Sebfox (@sebfox1)'s Twitter Profile Photo

There's a bit of a skill-set mismatch going on. We're asking engineers to become data scientists.

"Just analyze your traces"
"Build comprehensive datasets"
"Craft better judge prompts"

But engineers don't live in notebooks and spreadsheets. They live in IDEs and terminals.

Sebfox (@sebfox1)'s Twitter Profile Photo

The vibes-to-metrics evolution:

Stage 1: Ship on vibes
Stage 2: Users complain in Discord
Stage 3: Add basic monitoring
Stage 4: Find the one metric that matters
Stage 5: Optimize that metric relentlessly

Every successful AI product I've studied followed this path. Not