Robust Eval Design
Evaluation & Quality
A domain-agnostic framework for evaluating AI systems that separates capability from behavior, uses pairwise comparison over absolute scoring, and treats eval design as a prerequisite to system design — not an afterthought.
The Pattern
Build evals across two dimensions:
Capability evals answer "can the system do X?" — factual correctness, task completion, format compliance. These have clear right/wrong answers.
Behavior evals answer "does the system do X appropriately?" — tone, safety, helpfulness, refusal calibration. These require judgment.
For each eval type, follow this pipeline:
Ground Truth Definition → Test Case Design → Blinded Execution → Pairwise Comparison → Statistical Analysis
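The middle three stages of the pipeline can be sketched in one loop: run two systems on each case, shuffle the pair so the judge can't infer the source, and tally pairwise wins for later statistical analysis. The `judge` callable here is an assumption: it returns the index (0 or 1) of the better output.

```python
import random

def run_eval(test_cases, system_a, system_b, judge):
    """Blinded execution + pairwise comparison; returns (wins_a, n)."""
    wins_a = 0
    for case in test_cases:
        out_a, out_b = system_a(case), system_b(case)
        # Blinding: shuffle so the judge sees no system identity or order cue.
        pair = [("A", out_a), ("B", out_b)]
        random.shuffle(pair)
        winner = judge(case, pair[0][1], pair[1][1])  # index of better output
        if pair[winner][0] == "A":
            wins_a += 1
    return wins_a, len(test_cases)
```

The returned win count feeds directly into a significance test in the statistical-analysis stage.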
Ground truth definition. Every test case has a known-correct answer or, at minimum, a known-correct direction. Without this, you're measuring vibes.
Pairwise comparison over absolute scoring. Don't ask "rate this output 1-5." Ask "which of these two outputs is better?" Humans are dramatically more reliable at relative judgment than absolute rating. Run enough pairs to get statistical significance.
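Pairwise win counts lend themselves to a simple sign test: under the null hypothesis of no real difference, each pair is a fair coin flip, so the p-value is the two-sided binomial tail. A stdlib-only sketch:

```python
from math import comb

def sign_test_p(wins: int, n: int) -> float:
    """Two-sided sign test: probability of a split at least this
    lopsided if neither system is actually better."""
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# e.g. 38 wins out of 50 pairs is decisive; 25/50 is pure noise
p_strong = sign_test_p(38, 50)
p_noise = sign_test_p(25, 50)
```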
Blinding. Evaluators don't know which system (model, prompt, configuration) produced which output. This eliminates brand bias and anchoring effects.
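One way to operationalize blinding (a sketch, not the only scheme): assign anonymous IDs to every output, shuffle the presentation order, and keep the de-blinding key with the experimenter rather than the evaluator.

```python
import random

def blind(outputs_by_system: dict[str, list[str]], seed: int = 0):
    """Anonymize and shuffle outputs; return (items, key).
    Evaluators see only `items`; the experimenter keeps `key`."""
    rng = random.Random(seed)
    items, key = [], {}
    for system, outputs in outputs_by_system.items():
        for i, text in enumerate(outputs):
            anon = f"item-{len(key):04d}"
            key[anon] = (system, i)  # de-blinding map, never shown to raters
            items.append((anon, text))
    rng.shuffle(items)
    return items, key
```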
Ablation testing. Remove one component at a time (a prompt section, a retrieval step, a model upgrade) and measure the delta. This tells you what's actually contributing versus what's cargo cult.
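Ablation testing is mechanical enough to automate. A hedged sketch, assuming you can rebuild the system from a component list and score any configuration: remove one component at a time and record the drop from the full system's score.

```python
def ablation_deltas(components, build_system, score):
    """Score the full system, then each leave-one-out variant.
    A large delta means the component is pulling its weight;
    a near-zero delta suggests cargo cult."""
    full = score(build_system(components))
    deltas = {}
    for c in components:
        reduced = [x for x in components if x != c]
        deltas[c] = full - score(build_system(reduced))
    return full, deltas
```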
SUT (System Under Test) abstraction. Evaluate the interface, not the implementation. Define your eval contract once, then swap models, prompts, or architectures underneath without rewriting eval logic.
Key Decisions
Design evals before building the system. This is test-driven development for AI. If you can't define what "good" looks like before you build, you'll rationalize whatever you end up with. Writing evals first forces clarity about requirements.
Minimum sample sizes. Five examples is anecdotal; fifty starts to be meaningful. Compute confidence intervals and don't ship based on a handful of cherry-picked successes.
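To make "set confidence intervals" concrete, here is a stdlib-only sketch of the Wilson score interval for a win rate, which behaves better than the normal approximation at small n. With 35 wins out of 50 pairs, the lower bound clears 50%; the same ratio at n = 5 would not.

```python
from math import sqrt

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a win rate."""
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin
```

If the lower bound of the interval sits above 0.5, the win rate is distinguishable from a coin flip at roughly the 95% level.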
Separate eval authoring from eval execution. The person who writes the test cases should not be the person who tunes the system. Same principle as code review — fresh eyes catch what the builder rationalizes away.
When to Use It
Any time you're comparing AI approaches, validating a prompt change, or deciding between models. Especially critical before expensive migrations ("should we switch from Model A to Model B?") where intuition-based decisions can waste weeks. The upfront cost of proper eval design is 2-4 hours; the cost of shipping the wrong system is months.