Connor Holly


Production Release Guardrails

Agent Orchestration

deployment · safety · rollback

What It Does

Automates staged production promotion with health checks between each service promotion, error regression detection, scheduled follow-up checks, and automated rollback when things go wrong.

The Pattern

Production deployments are a sequence of promotions, not a single event. Each service gets promoted independently with a guard window between steps. If any guard check fails, the pipeline stops.

The staged promotion flow:

  1. Promote Service A (e.g., manager dashboard)
  2. Wait. Run health checks + error monitoring.
  3. Promote Service B (e.g., backend API)
  4. Wait. Run health checks + error monitoring.
  5. Promote Service C (e.g., user dashboard)
  6. Wait. Run health checks + error monitoring.
  7. Promote Service D (e.g., marketing website)
  8. Final health check across the full stack.
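The flow above can be sketched as a loop with an injectable guard step. The function and parameter names here are illustrative, not taken from the actual scripts:

```python
import time

def staged_release(services, promote, guard_check, wait_seconds=300, sleep=time.sleep):
    """Promote each service in order, pausing for a guard window between steps.

    promote(service)     -- performs the promotion (e.g. merging main -> prod)
    guard_check(service) -- health checks + error monitoring; False stops the pipeline

    Returns (promoted_services, failed_service_or_None).
    """
    promoted = []
    for service in services:
        promote(service)
        promoted.append(service)
        sleep(wait_seconds)            # guard window before checking
        if not guard_check(service):
            return promoted, service   # stop the pipeline at the first failed guard
    return promoted, None              # all services promoted, no guard failure
```

Injecting `sleep` and `guard_check` keeps the sequencing logic trivially testable with `--dry-run`-style stubs.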

Health checks hit actual endpoints and verify responses. Not just "is it up" but "is it returning correct data." The check script supports a --fail-on-fatal flag that distinguishes between warnings (degraded but functional) and fatal errors (broken).
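A minimal sketch of the warning-vs-fatal distinction. The response fields (`data`, `degraded`) and the reading of `--fail-on-fatal` (strict by default, relaxed to fatal-only when the flag is set) are assumptions, not the real script's interface:

```python
def classify(status_code, payload):
    """Classify a health endpoint response as ok / warning / fatal.

    "warning" = degraded but functional; "fatal" = broken.
    Field names are illustrative assumptions.
    """
    if status_code != 200:
        return "fatal"
    if not payload.get("data"):          # up, but not returning correct data
        return "fatal"
    if payload.get("degraded"):          # functional with known degradation
        return "warning"
    return "ok"

def exit_code(results, fail_on_fatal):
    """Assumed semantics: by default any non-ok result fails the check;
    --fail-on-fatal relaxes it so warnings are reported but pass."""
    if "fatal" in results:
        return 1
    if "warning" in results and not fail_on_fatal:
        return 1
    return 0
```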

Error regression detection compares Sentry error rates before and after deployment. A deploy timestamp is recorded at the start. Follow-up checks at +1h, +4h, and +12h look for new error types or spikes in existing ones that appeared after that timestamp.
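The comparison itself reduces to diffing error counts keyed by issue fingerprint. A sketch, assuming counts have already been fetched from Sentry for the windows before and after the recorded deploy timestamp:

```python
def find_regressions(baseline, current, spike_factor=2.0):
    """Compare error counts keyed by issue fingerprint.

    baseline -- {fingerprint: count} before the deploy timestamp
    current  -- {fingerprint: count} since the deploy timestamp

    Flags error types that are new after the deploy, and existing ones
    whose rate spiked by spike_factor or more.
    """
    regressions = []
    for fingerprint, count in current.items():
        before = baseline.get(fingerprint, 0)
        if before == 0:
            regressions.append((fingerprint, "new"))
        elif count >= before * spike_factor:
            regressions.append((fingerprint, "spike"))
    return regressions
```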

Scheduled follow-ups use launchd (macOS) to fire one-shot checks at predetermined intervals. These run independently of whether you're at your machine. Each follow-up runs the full stack health check plus Sentry regression analysis.
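A follow-up job of this shape might look like the plist below. The label, script path, and calendar time are placeholders; the real jobs will differ, and a one-shot job would typically unload itself after firing:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.example.release-followup-1h</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/release-followup.sh</string>
    <string>--window</string>
    <string>1h</string>
  </array>
  <!-- Fire once at the scheduled wall-clock time -->
  <key>StartCalendarInterval</key>
  <dict>
    <key>Hour</key><integer>15</integer>
    <key>Minute</key><integer>30</integer>
  </dict>
  <key>RunAtLoad</key>
  <false/>
</dict>
</plist>
```

Loaded with `launchctl load ~/Library/LaunchAgents/com.example.release-followup-1h.plist`, launchd fires the check whether or not a terminal session is open.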

Automated rollback creates revert PRs when guard checks fail. It's git-based, not platform-native. The rollback script:

  1. Identifies the merge commit from the latest main -> prod promotion
  2. Creates a revert commit
  3. Opens a PR against the production branch
  4. If the failing service can be identified unambiguously, targets just that service
  5. If the failure source is ambiguous, skips auto-rollback and flags for manual triage
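The steps above can be sketched as a command planner that a caller executes (or just prints under `--dry-run`). All names are illustrative, and the real script's interface may differ:

```python
def rollback_plan(merge_commit, service, prod_branch="prod"):
    """Build the git/gh commands for a revert PR against the production branch.

    merge_commit -- the main -> prod merge commit being reverted
    service      -- the failing service, or None if the source is ambiguous

    Returns a list of commands, or None when auto-rollback should be
    skipped and the failure flagged for manual triage.
    """
    if service is None:
        return None  # ambiguous failure source: no auto-rollback
    branch = f"revert/{service}-{merge_commit[:8]}"
    return [
        ["git", "checkout", "-b", branch, prod_branch],
        # -m 1 reverts the merge relative to its first parent (the prod side)
        ["git", "revert", "--no-edit", "-m", "1", merge_commit],
        ["git", "push", "origin", branch],
        ["gh", "pr", "create", "--base", prod_branch, "--head", branch,
         "--title", f"Revert {merge_commit[:8]} ({service})"],
    ]
```

Returning commands rather than running them keeps the plan inspectable in dry runs and auditable in logs.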

A release manifest (JSON file) records which repos were promoted and when, so follow-up checks can correlate failures to specific deployments.
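A manifest of that shape, with a helper a follow-up check might use to scope its analysis. The field names are assumptions for illustration:

```python
import json
import time

def write_manifest(path, promotions):
    """Record which repos were promoted and when.

    promotions -- list of {"repo": ..., "commit": ..., "promoted_at": <unix ts>}
    """
    manifest = {"deploy_started_at": time.time(), "promotions": promotions}
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest

def repos_promoted_since(manifest, since_ts):
    """Repos a follow-up check should correlate failures against."""
    return [p["repo"] for p in manifest["promotions"] if p["promoted_at"] >= since_ts]
```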

Key Decisions

Staged, not simultaneous. Promoting all services at once makes it impossible to identify which change caused a regression. Sequential promotion with guard windows isolates failures.

Git-based rollback, not platform rollback. Revert PRs leave an audit trail, go through CI, and can be reviewed. Platform rollback (like "deploy previous build") is opaque and may not revert database migrations or configuration changes.

Repo-scoped rollback. The system only reverts the specific service that appears to be failing, not the entire stack. This minimizes blast radius.

Dry-run support. Every step supports --dry-run for validation before a live release. Test the scheduling, the health checks, and the rollback logic without actually promoting anything.
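The flag itself is the standard store-true pattern; a minimal sketch with hypothetical behavior behind it:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="release step (illustrative)")
    parser.add_argument("--dry-run", action="store_true",
                        help="print what would happen without promoting anything")
    return parser

def run(argv=None):
    args = build_parser().parse_args(argv)
    if args.dry_run:
        return "DRY RUN: would promote"   # exercise scheduling/checks/rollback logic only
    return "promoted"
```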

When to Use It

  • Multi-service architectures where services are deployed independently
  • Production environments where downtime has real cost
  • Teams that want automated safety nets for overnight or unattended deployments
  • Any deployment pipeline that currently relies on manual post-deploy verification