This document is intended to facilitate discussion and promote industry best practices for helping cloud-based services achieve a resilient, highly available posture. Achieving a highly resilient posture means being prepared for all kinds of failures, such as natural disasters, security breaches, network failures, software bugs, high traffic loads, and unexpected user behavior, and having the capacity to handle the unexpected gracefully and quickly. High availability is the product of coordinated efforts in people, processes, and technical strategy, including but not limited to:
- Disaster Recovery strategy (ranging from backups to full active/active multi-cloud deployments)
- Continuous deployment (infrastructure-as-code, automated unit/integration/load testing, staggered deployments to multiple regions with bake-time and rollback alarms)
- Observability (covering KPIs, health indicators, and assum