Connor Holly

Mini Evals

Evaluation & Quality

evals · testing · verification

What it does

Turns verification into executable code. Instead of manually checking whether AI skills, system health, or workflows behave correctly, you define test suites in bash scripts that run assertions and produce timestamped reports with severity ratings.

The pattern

Split evaluations into two tiers:

Fast checks run in seconds with no agent spawning. They verify file existence, config validity, expected output formats, and system health. Think of them as unit tests for your operational environment.
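A fast-check tier can be a plain bash script of one-line assertions. This is a minimal sketch; the `skills/summarize.md` and `config.json` paths are hypothetical stand-ins (the script creates demo fixtures so it runs standalone; in real use, point the checks at your actual environment):

```shell
#!/usr/bin/env bash
# Fast checks: no agent spawning, runs in seconds.
set -u
pass=0; fail=0

check() {                       # check "description" command [args...]
  local desc=$1; shift
  if "$@" >/dev/null 2>&1; then
    printf 'PASS  %s\n' "$desc"; pass=$((pass+1))
  else
    printf 'FAIL  %s\n' "$desc"; fail=$((fail+1))
  fi
}

# Demo fixtures so the sketch runs standalone; delete in real use.
mkdir -p skills results
printf '{"model": "judge-v1"}\n' > config.json
printf '# summarize skill\n' > skills/summarize.md

check "skill file exists"     test -f skills/summarize.md
check "config is valid JSON"  python3 -m json.tool config.json
check "results dir writable"  test -w results

printf '%d passed, %d failed\n' "$pass" "$fail"
[ "$fail" -eq 0 ]   # nonzero exit signals a broken environment
```

Because each check is just "run a command, inspect its exit code," adding a new assertion is one line.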

Agent evals spawn real AI agents, feed them scenarios, and verify behavior. These test whether skills trigger on the right keywords, whether multi-step workflows complete, and whether outputs meet quality bars. Use LLM-as-judge for subjective assessments (tone, completeness, relevance) where binary assertions fall short.

Each eval script follows a standard shape:

  1. Define test cases as input/expected-output pairs
  2. Run the system under test
  3. Assert on outputs (exact match, regex, or LLM judge)
  4. Categorize results as pass, warn, or fail
  5. Write a timestamped markdown report

Reports land in a results directory with filenames like eval-name-20260215-143022.md. This creates a natural audit trail. You can diff reports across runs to spot regressions.
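The five steps above can be sketched end to end. Here the "system under test" is just `tr` uppercasing text, standing in for a real skill or workflow; the report name follows the `eval-name-YYYYmmdd-HHMMSS.md` convention:

```shell
#!/usr/bin/env bash
# End-to-end eval sketch: cases -> run -> assert -> categorize -> report.
set -u
mkdir -p results
report="results/upcase-eval-$(date +%Y%m%d-%H%M%S).md"

# 1. Test cases as input/expected-output pairs
inputs=("hello" "mixed Case")
expected=("HELLO" "MIXED CASE")

fail=0
{
  echo "# upcase eval ($(date -u +%FT%TZ))"
  for i in "${!inputs[@]}"; do
    # 2. Run the system under test
    actual=$(printf '%s' "${inputs[$i]}" | tr '[:lower:]' '[:upper:]')
    # 3-4. Assert (exact match) and categorize
    if [ "$actual" = "${expected[$i]}" ]; then
      echo "- PASS: '${inputs[$i]}' -> '$actual'"
    else
      echo "- FAIL: '${inputs[$i]}' -> '$actual' (wanted '${expected[$i]}')"
      fail=1
    fi
  done
  echo "Result: $([ "$fail" -eq 0 ] && echo pass || echo fail)"
} > "$report"        # 5. Timestamped markdown report

cat "$report"
```

Since every run writes a new timestamped file, `diff` between two reports shows exactly which cases regressed.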

Key decisions

Bash over framework. Test runners add dependencies and opinions. Bash scripts are portable, readable, and run anywhere. The overhead of a testing framework is not justified when your test cases are "run this command, check this output."

LLM-as-judge for subjective quality. Some checks cannot be reduced to string matching. "Did this summary capture the key points?" needs a judge. Feed the judge the input, the output, and a rubric. Parse its verdict programmatically.
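Parsing the judge's verdict programmatically can look like the sketch below. `call_judge` is a stub standing in for whatever model CLI or API you actually use; the point is a fixed verdict format that bash can parse reliably:

```shell
#!/usr/bin/env bash
# Judge-verdict parsing sketch. call_judge is a placeholder; a real
# implementation would send input, output, and rubric to a model.
set -u

call_judge() {  # call_judge input output rubric -> prints "VERDICT: pass|fail"
  # Stubbed verdict so the sketch is runnable without a model.
  printf 'VERDICT: pass\nREASON: both figures preserved.\n'
}

input="Q3 revenue rose 12%; churn fell to 2%."
output="Revenue up 12%, churn down to 2%."
rubric="Pass only if both the revenue and churn figures are preserved."

verdict=$(call_judge "$input" "$output" "$rubric" \
  | awk -F': ' '/^VERDICT:/ {print $2}')
case "$verdict" in
  pass) echo "PASS (judge)" ;;
  *)    echo "FAIL (judge): $verdict" ;;
esac
```

Forcing the judge to emit a `VERDICT:` line keeps the subjective reasoning in prose while the pass/fail decision stays machine-readable.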

Severity tiers, not binary pass/fail. Warn means degraded but functional. Fail means broken. This distinction prevents alert fatigue and helps prioritize fixes.
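A three-tier categorization is a few lines of threshold logic. The latency numbers here are illustrative, not recommendations:

```shell
#!/usr/bin/env bash
# Severity sketch: categorize one health check into pass/warn/fail.
latency_ms=850
if   [ "$latency_ms" -le 500 ];  then status=pass
elif [ "$latency_ms" -le 2000 ]; then status=warn   # degraded but functional
else                                  status=fail   # broken
fi
echo "latency check: $status (${latency_ms}ms)"
```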

Run fast checks on every session start; run agent evals on demand or on a schedule. The cost and time gap between the two tiers is roughly 100x, so treat them differently.
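Wiring this up is a one-liner in each place. The paths and schedule below are hypothetical examples of a session-start hook plus a nightly cron entry:

```shell
# In ~/.bashrc or your session-start hook (fast tier, runs in seconds):
#   bash ~/evals/fast-checks.sh || echo "fast checks failed"
#
# In crontab (crontab -e), nightly agent evals (slow, expensive tier):
#   0 3 * * * bash ~/evals/agent-evals.sh >> ~/evals/results/cron.log 2>&1
```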

When to use it

Use mini evals when you have AI skills or automated workflows that could silently break. They catch drift early: a renamed file, a changed API response shape, a skill that stopped triggering. Particularly valuable after refactoring, when updating prompts, or when onboarding new automation. If you find yourself manually spot-checking the same thing twice, write an eval instead.