Monitoring & alerting

Metrics, dashboards, retries, and paging.

What we watch

Per workflow, per node:

Execution count, success rate, p50/p95/p99 latency, retry count, error type breakdown.
Payload size percentiles (catches upstream shape changes early).
Model token usage and cost per run.
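As a rough illustration of how the per-node numbers above could be derived from raw execution records, here is a minimal sketch. The `Execution` record, its field names, and the nearest-rank percentile method are assumptions made for the example, not a description of the actual metrics pipeline.

```python
import math
from dataclasses import dataclass


@dataclass
class Execution:
    """One node run. Field names are hypothetical."""
    latency_ms: float
    succeeded: bool
    retries: int


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(executions):
    """Aggregate one node's executions into the metrics listed above."""
    latencies = [e.latency_ms for e in executions]
    return {
        "count": len(executions),
        "success_rate": sum(e.succeeded for e in executions) / len(executions),
        "retries": sum(e.retries for e in executions),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }
```

The same `percentile` helper would apply to payload sizes. With few samples per window, p95 and p99 collapse onto the maximum, so tail percentiles are only meaningful at reasonable volume.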

Retries

Default policy: 3 attempts with exponential backoff (400ms → 3200ms). Every side effect carries an idempotency key, so a retried call cannot apply the same change twice. Terminal failures (all attempts exhausted) land in a dead-letter queue with the full input.
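A minimal sketch of this retry policy, under stated assumptions: the text gives only the 400ms and 3200ms endpoints, so the doubling factor and the treatment of 3200ms as a cap are guesses. `run_with_retries`, the `side_effect` callable signature, and the list used as a dead-letter queue are all hypothetical names for illustration.

```python
import time
import uuid

BASE_DELAY_MS = 400   # first backoff delay, from the default policy
MAX_DELAY_MS = 3200   # treated here as a cap on any single delay (assumption)
MAX_ATTEMPTS = 3      # from the default policy


def backoff_delays_ms(attempts=MAX_ATTEMPTS, base=BASE_DELAY_MS, cap=MAX_DELAY_MS, factor=2):
    """Delays between consecutive attempts. The factor of 2 is an assumption."""
    return [min(cap, base * factor ** i) for i in range(attempts - 1)]


def run_with_retries(side_effect, payload, dead_letters, sleep=time.sleep, attempts=MAX_ATTEMPTS):
    """Call side_effect up to `attempts` times, reusing one idempotency key throughout."""
    key = str(uuid.uuid4())  # generated once so every retry carries the same key
    delays = backoff_delays_ms(attempts)
    last_error = None
    for attempt in range(attempts):
        try:
            return side_effect(payload, idempotency_key=key)
        except Exception as exc:  # a real policy would retry only transient errors
            last_error = exc
            if attempt < len(delays):
                sleep(delays[attempt] / 1000)
    # Terminal failure: keep the full input so the run can be inspected later.
    dead_letters.append({
        "input": payload,
        "idempotency_key": key,
        "error": repr(last_error),
        "attempts": attempts,
    })
    return None
```

Under these assumptions, 3 attempts wait 400ms and then 800ms; with doubling, the 3200ms cap would only be reached on a run configured for 5 attempts. Jitter is omitted to keep the sketch short.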

Alerting

Sev-1 (production blocker): PagerDuty + Slack + SMS to on-call.
Sev-2 (degraded): Slack + email digest.
Sev-3 (informational): weekly digest.
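The severity tiers above amount to a routing table. One way to express that is sketched below; the `Severity` enum, the channel identifiers, and `channels_for` are hypothetical names, and no real PagerDuty, Slack, or SMS client is called.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "production blocker"
    SEV2 = "degraded"
    SEV3 = "informational"


# Channel names are placeholders for whatever notifier integrations exist.
ROUTES = {
    Severity.SEV1: ("pagerduty", "slack", "sms"),
    Severity.SEV2: ("slack", "email_digest"),
    Severity.SEV3: ("weekly_digest",),
}


def channels_for(severity):
    """Return the notification channels for a severity tier."""
    return ROUTES[severity]
```

Keeping the table in one place means a change to who gets paged is a one-line diff.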

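Alert rules live in code and are versioned with the workflow, so every threshold is visible in the repository and changes alongside the workflow it guards. No silent thresholds.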
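To make "rules in code" concrete, here is a sketch of what a rule definition and its evaluation might look like. The `AlertRule` shape, the rule names, and the threshold values are all illustrative assumptions; the source does not specify any actual thresholds.

```python
import operator
from dataclasses import dataclass

_COMPARATORS = {">": operator.gt, "<": operator.lt}


@dataclass(frozen=True)
class AlertRule:
    """A declarative alert rule. Every field, including the threshold, is explicit."""
    name: str
    metric: str
    comparator: str   # ">" or "<"
    threshold: float
    severity: str


# Example rules with made-up thresholds, for illustration only.
RULES = [
    AlertRule("low-success-rate", "success_rate", "<", 0.99, "sev-1"),
    AlertRule("slow-p95", "p95_ms", ">", 2000, "sev-2"),
]


def firing(rules, metrics):
    """Return the rules whose condition holds for the given metric snapshot."""
    result = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is not None and _COMPARATORS[rule.comparator](value, rule.threshold):
            result.append(rule)
    return result
```

Because the rules are plain data checked into the repository, a threshold change shows up in version history like any other code change.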