Operations
Monitoring & alerting
Metrics, dashboards, retries, and paging.
What we watch
Per workflow, per node:
- Execution count, success rate, p50/p95/p99 latency, retry count, error type breakdown.
- Payload size percentiles (catches upstream shape changes early).
- Model token usage and cost per run.
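As an illustration of the per-node metrics above, here is a minimal sketch of an in-memory aggregator. The `NodeStats` class, its field names, and the nearest-rank percentile method are assumptions made for the example, not the platform's actual implementation; a real deployment would more likely emit these values to a metrics backend.

```python
from collections import Counter
from dataclasses import dataclass, field


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("no samples recorded")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, which avoids float rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


@dataclass
class NodeStats:
    """Hypothetical per-node accumulator for one workflow node."""
    latencies_ms: list = field(default_factory=list)
    payload_bytes: list = field(default_factory=list)
    successes: int = 0
    retries: int = 0
    tokens: int = 0
    cost_usd: float = 0.0
    errors: Counter = field(default_factory=Counter)

    def record(self, latency_ms, payload_size, ok, retries=0,
               error_type=None, tokens=0, cost_usd=0.0):
        """Record one execution of the node."""
        self.latencies_ms.append(latency_ms)
        self.payload_bytes.append(payload_size)
        self.retries += retries
        self.tokens += tokens
        self.cost_usd += cost_usd
        if ok:
            self.successes += 1
        elif error_type is not None:
            self.errors[error_type] += 1

    def summary(self):
        """Return the figures a dashboard panel would show for this node."""
        n = len(self.latencies_ms)
        pcts = (50, 95, 99)
        return {
            "executions": n,
            "success_rate": self.successes / n if n else None,
            "latency_ms": {f"p{p}": percentile(self.latencies_ms, p) for p in pcts} if n else {},
            "payload_bytes": {f"p{p}": percentile(self.payload_bytes, p) for p in pcts} if n else {},
            "retries": self.retries,
            "errors": dict(self.errors),
            "tokens": self.tokens,
            "cost_usd": self.cost_usd,
        }
```

A sudden jump in the `payload_bytes` percentiles between two summaries is the kind of signal that points to an upstream shape change before it shows up as errors.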
Retries
Default policy: 3 attempts with exponential backoff (400 ms → 3200 ms). Every side effect carries an idempotency key, so a retried attempt cannot apply the same effect twice. Terminal failures land in a dead-letter queue with the full input.
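The retry policy can be sketched roughly as follows. The text does not say how the 400 ms → 3200 ms range maps onto three attempts, so this sketch assumes a doubling factor with 400 ms as the first delay and 3200 ms as a ceiling; under that assumption three attempts wait 400 ms and then 800 ms. The names `run_with_retries`, `backoff_delays`, and `TerminalFailure`, and the list standing in for the dead-letter queue, are hypothetical.

```python
import time
import uuid


class TerminalFailure(Exception):
    """Raised once every attempt has failed and the input is dead-lettered."""


def backoff_delays(attempts=3, base_ms=400, cap_ms=3200, factor=2):
    """Delays in ms before each retry: base * factor**i, capped (assumed policy)."""
    return [min(base_ms * factor ** i, cap_ms) for i in range(attempts - 1)]


def run_with_retries(step, payload, dead_letters, attempts=3, sleep=time.sleep):
    """Call step(payload, idempotency_key) up to `attempts` times.

    The same idempotency key is sent on every attempt, so a side effect that
    succeeded but whose response was lost is not applied a second time.
    """
    idempotency_key = str(uuid.uuid4())
    delays = backoff_delays(attempts)
    last_error = None
    for attempt in range(attempts):
        try:
            return step(payload, idempotency_key)
        except Exception as exc:  # a real policy would retry only transient errors
            last_error = exc
            if attempt < len(delays):
                sleep(delays[attempt] / 1000)
    # Terminal failure: keep the full input so the run can be inspected later.
    dead_letters.append({
        "input": payload,
        "idempotency_key": idempotency_key,
        "attempts": attempts,
        "error": repr(last_error),
    })
    raise TerminalFailure(f"gave up after {attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the function testable without real waits; production code would leave the default.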
Alerting
- Sev-1 (production blocker): PagerDuty + Slack + SMS to on-call.
- Sev-2 (degraded): Slack + email digest.
- Sev-3 (informational): weekly digest.
Alert rules live in code, versioned with the workflow. No silent thresholds.
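One way alert rules might look when kept in code next to a workflow is sketched below. The severity-to-channel routing follows the list above; everything else (the `AlertRule` shape, the metric names, and especially the threshold values) is a placeholder invented for the example, not a recommended setting.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = "sev-1"  # production blocker
    SEV2 = "sev-2"  # degraded
    SEV3 = "sev-3"  # informational


# Routing taken from the severity list in the text; channel names are illustrative.
ROUTES = {
    Severity.SEV1: ("pagerduty", "slack", "sms"),
    Severity.SEV2: ("slack", "email_digest"),
    Severity.SEV3: ("weekly_digest",),
}


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str
    threshold: float
    severity: Severity
    above: bool = True  # True: fire when value > threshold; False: fire when value < threshold

    def fires(self, value):
        return value > self.threshold if self.above else value < self.threshold


# Hypothetical rules with placeholder thresholds, versioned alongside the workflow.
RULES = [
    AlertRule("success-rate-floor", "success_rate", 0.95, Severity.SEV1, above=False),
    AlertRule("p95-latency-ceiling", "p95_latency_ms", 2000, Severity.SEV2),
    AlertRule("retry-volume", "retry_count", 50, Severity.SEV3),
]


def evaluate(metrics, rules=RULES):
    """Return (rule name, channels) for every rule whose threshold is crossed."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is not None and rule.fires(value):
            fired.append((rule.name, ROUTES[rule.severity]))
    return fired
```

Because each threshold is an explicit field on a rule in the repository, a change to it shows up in review like any other code change.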