Please check https://aws.github.io/aws-eks-best-practices/ for more comprehensive EKS best practices!
- Think about multi-tenancy and isolation for different environments or workloads
- Isolation at the account level using AWS Organizations
- Isolation at the network layer, i.e. different VPCs & different clusters
- Use different node groups (node pools) for different purposes/categories, e.g. create dedicated node groups for operational tools such as CI/CD tooling, monitoring tools, and a centralized logging system.
- Separate namespaces for different workloads
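Namespace separation can be paired with a ResourceQuota so one tenant cannot starve the others. A minimal sketch (namespace name and quota values are illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a            # hypothetical tenant namespace
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:                   # aggregate caps across all pods in the namespace
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
```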
Reliability | Principles
- Recommended to use dedicated VPC for EKS
- Modular and Scalable Amazon EKS Architecture
- Plan your VPC & subnet CIDRs; avoid the complexity of using multiple CIDRs in a VPC and CNI custom networking
- Understand and check Service Quota of EKS/Fargate and other related services
- Implement Cluster Autoscaler to automatically adjust the size of an EKS cluster up and down based on scheduling demands.
- Consider the number of worker nodes and service degradation if there is node/AZ failure.
- Mind the RTO.
- Consider having buffer nodes.
- Consider not choosing very large instance types, to reduce the blast radius.
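One common way to keep buffer capacity is to run low-priority "pause" pods that Cluster Autoscaler keeps spare nodes alive for; real workloads preempt them immediately. A sketch of this overprovisioning pattern (replica count and resource sizes are assumptions):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1                        # lower than the default (0), so any real pod preempts these
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2                    # how much buffer capacity to hold
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "1"             # reserve roughly one CPU per replica
            memory: 1Gi
```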
- Enable Horizontal Pod Autoscaler to use CPU utilization or custom metrics to scale out pods.
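For example, an HPA targeting 60% average CPU utilization (deployment name, replica bounds, and threshold are illustrative; `autoscaling/v2` requires a reasonably recent cluster):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                    # hypothetical deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60   # scale out above 60% average CPU
```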
- Use Infrastructure as code (Kubernetes manifest files and templates to provision EKS clusters/nodes etc) to track changes and provide auditability
- Use multiple AZs. Spread out application replicas to different worker node availability zones for redundancy
- Mind your persistent pods that use EBS as a PersistentVolume: EBS volumes are zonal, so the pod must run in the same Availability Zone as its volume. Use the topology label, e.g.
topology.kubernetes.io/zone=us-east-1c
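Since an EBS volume lives in a single AZ, a pod reattaching a pre-provisioned volume must land in that zone. A nodeSelector sketch (pod, PVC, and zone are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ebs-app                  # hypothetical pod
spec:
  nodeSelector:
    topology.kubernetes.io/zone: us-east-1c   # must match the EBS volume's AZ
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - mountPath: /data
      name: data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: ebs-claim       # hypothetical PVC backed by an EBS volume
```

With dynamically provisioned volumes, a StorageClass with `volumeBindingMode: WaitForFirstConsumer` lets the scheduler handle zone placement instead.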
- Highly available and scalable worker nodes using Auto Scaling Groups, use node group
- Consider using Managed Node Groups for easy setup and highly available nodes during updates or termination
- Consider using Fargate so you don't have to manage worker nodes. But please be aware of Fargate's limitations.
- Consider separate node groups for your application and utility functions, e.g. logging database, service mesh control plane
- Deploy aws-node-termination-handler. It detects when a node will become unavailable or be terminated (such as a Spot interruption), ensures no new work is scheduled there, then drains it, removing any existing work. Tutorial | Announcement
- Configure Pod Disruption Budgets (PDBs) to limit the number of Pods of a replicated application that are down simultaneously from voluntary disruptions, e.g. when upgrading, during rolling deployments, and other use cases.
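As a sketch, a PDB that keeps at least two replicas up during voluntary disruptions (the app label is illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2                # alternatively, use maxUnavailable: 1
  selector:
    matchLabels:
      app: web                   # hypothetical app label
```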
- Use AWS Backup to backup EFS and EBS
- Use EFS for the Storage Class: EFS does not require pre-provisioning capacity and enables more efficient pod migrations between worker nodes (no node-attached storage)
- Install Node Problem Detector to provide actionable data to heal clusters.
- Avoid configuration mistakes such as anti-affinity rules that prevent a pod from being rescheduled after a node failure.
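Prefer "soft" (preferred) anti-affinity over "required" so replicas still reschedule when a node or zone is lost. A pod-spec fragment (app label is illustrative):

```yaml
affinity:
  podAntiAffinity:
    # preferred (soft) rule: spread replicas across zones,
    # but still schedule the pod if the rule can't be satisfied
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: web             # hypothetical app label
        topologyKey: topology.kubernetes.io/zone
```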
- Use Liveness and Readiness Probes
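A container-spec fragment showing both probes (paths, port, and timings are illustrative):

```yaml
containers:
- name: app
  image: nginx                   # placeholder image
  livenessProbe:                 # restart the container if this fails repeatedly
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 10
    periodSeconds: 10
  readinessProbe:                # remove the pod from Service endpoints until this passes
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 5
```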
- Practice chaos engineering, use available tools to automate.
- Kill Pods Randomly During Testing
- Implement failure management in microservice level, e.g. Circuit breaker pattern, Control and limit retry calls (exponential backoff), throttling, make services stateless where possible
- Practice how to upgrade the cluster and worker nodes to new version.
- Practice how to drain the worker nodes.
- Use CI/CD tools, automate, and have a process flow (approval/review) for infrastructure changes. Consider implementing GitOps.
- Use a multi-AZ solution for persistent volumes, e.g. Thanos+S3 for Prometheus
Performance Efficiency | Principles
- Inform AWS Support if you need the control plane (master nodes & etcd) pre-scaled ahead of a sudden load increase
- Choose the right EC2 instance type for your worker node.
- Understand the pros & cons of using many small node instances vs. a few large node instances. Consider the OS overhead, the time required to pull images onto a new instance when it scales, kubelet overhead, system pod overhead, etc.
- Understand the pod density limitation (maximum number of pods supported by each instance type)
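With the default AWS VPC CNI, the per-instance pod limit follows from the number of ENIs and IPv4 addresses per ENI. A small sketch of the formula (the ENI counts in the comments are the published values for those instance types; verify against current AWS documentation):

```python
def max_pods(enis: int, ips_per_eni: int) -> int:
    """AWS VPC CNI pod density limit: ENIs * (IPs per ENI - 1) + 2.

    One IP on each ENI is reserved for the ENI itself; the +2 accounts
    for host-networked pods (aws-node, kube-proxy).
    """
    return enis * (ips_per_eni - 1) + 2

# Published ENI/IP limits (double-check against the AWS docs):
# m5.large:  3 ENIs x 10 IPv4 each
# t3.medium: 3 ENIs x  6 IPv4 each
print(max_pods(3, 10))  # m5.large  -> 29
print(max_pods(3, 6))   # t3.medium -> 17
```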
- Use single-AZ node groups if necessary. Typically, a best practice is to run a microservice across multiple AZs for availability, but for some workloads (such as Spark) that need microsecond latency, have high network I/O, and are transient, it makes sense to use a single AZ.
- Understand the performance limitation of Fargate. Do the load test before going to production.
- Ensure your pod requests the resources it needs. Define request and limit values for resources such as CPU and memory
- Detect bottlenecks/latency in a microservice with X-Ray, or other tracing/APM products
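A container-spec fragment with requests and limits (the values shown are illustrative):

```yaml
containers:
- name: app
  image: nginx                   # placeholder image
  resources:
    requests:                    # what the scheduler reserves on a node
      cpu: 250m
      memory: 256Mi
    limits:                      # hard ceiling; exceeding the memory limit gets the container OOM-killed
      cpu: 500m
      memory: 512Mi
```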
- Choose the right storage backend. Use Amazon FSx for Lustre and its CSI Driver if your persistent containers need a high-performance file system
- Monitor pod & node resource consumption and detect bottlenecks. You can use CloudWatch, CloudWatch Container Insights, or other products
- If needed, launch instances (worker nodes) in placement groups to leverage low-latency networking. You can use this CloudFormation template to add new node groups with non-blocking, non-oversubscribed, fully bi-sectional connectivity.
- If needed, set the Kubernetes CPU Manager policy to 'static' for pods that need exclusive CPUs
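The static policy is a kubelet setting; pods then receive exclusive cores only when they are in the Guaranteed QoS class with integer CPU requests (requests equal to limits). A sketch of the kubelet configuration fragment (reserved CPU amounts are assumptions):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static         # pin exclusive CPUs for Guaranteed pods with integer CPU requests
systemReserved:                  # some CPU must be reserved for the OS and kubelet
  cpu: 500m
kubeReserved:
  cpu: 500m
```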
Cost Optimization | Principles
- Minimize the wasted (unused) resources when using EC2 worker nodes.
- Choose the right EC2 instance type and use cluster auto scaling.
- Consider to use Fargate
- Consider using a tool like kube-resource-report to visualize the slack cost and right-size the requests for the containers in a pod.
- Use Spot Instances or mix On-Demand and Spot by utilizing Spot Fleet. Consider using Spot instances for Test/Staging env.
- Use Reserved Instances or Savings Plans
- Use single-AZ node groups for workload with high network I/O operation (e.g. Spark) to reduce cross-AZ communication. But please validate if running Single-AZ wouldn’t compromise availability of your system.
- Consider managed services for supporting tools such as monitoring, service mesh, and centralized logging, to reduce your team's effort & cost
- Tag all AWS resources when possible and use Labels to tag Kubernetes resources so that you can easily analyze the cost.
- Consider self-managed Kubernetes (not EKS) for non-HA clusters. You can set it up using kops for a small k8s cluster.
- Use node affinity via nodeSelector for pods that require a specific EC2 instance type.
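Nodes expose their instance type as a built-in label, so the selector is a one-liner. A sketch (pod name, image, and instance type are illustrative; older clusters use the deprecated `beta.kubernetes.io/instance-type` label instead):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job                  # hypothetical pod
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: p3.2xlarge   # built-in node label
  containers:
  - name: trainer
    image: my-training-image     # placeholder image
```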
Operations | Principles
- Use an IaC tool for provisioning the EKS cluster, such as eksctl, Terraform, or CloudFormation
- Consider using a package manager like Helm to help you install and manage applications.
- Automate cluster management and application deployment using GitOps. You can use tools like Flux or [others](https://github.com/weaveworks/awesome-gitops) | workshop
- Use CI/CD tools
- Practice EKS upgrades (rolling updates) and create a runbook. See hellofresh/eks-rolling-update on GitHub, a utility for updating the launch configuration of worker nodes in an EKS cluster (announced in "Open Sourcing EKS Rolling Update: A Tool for Updating Amazon EKS Clusters").
- Monitoring
- Understand your Workload Health. Define KPI/SLO and metrics/SLI then monitor through your dashboard & setup alerts
- Understand your Operational Health. Define KPI and metrics such as mean time to detect an incident (MTTD), and mean time to recovery (MTTR) from an incident.
- Use detailed monitoring with Container Insights for EKS to drill down into service and pod performance. It also provides diagnostic information; consider viewing additional metrics and additional levels of granularity when a problem occurs.
- Monitor control plane metrics using Prometheus
- Monitoring using Prometheus & Grafana
- Logging
- Consider DaemonSet vs. sidecar mechanisms. A DaemonSet is preferable for EC2 worker nodes, but you need to use the sidecar pattern for Fargate.
- Control Plane Logging
- You can use the EFK stack, or Fluent Bit with Kinesis Data Firehose, S3, and Athena
- Tracing
- Monitor fine-grained transactions using X-Ray (eksworkshop.com). It is good for monitoring blue-green deployments too. Other tools
- Practice chaos engineering; you can automate it using some tools
- Configuration
- Appmesh + EKS demo / lab: GitHub - PaulMaddox/aws-appmesh-helm: AWS App Mesh ❤ K8s
- AWS Cloud Map: Easily create and maintain custom maps of your applications | AWS News Blog
- AWS CloudMap + Consul:
Security | Principles
- Understand the shared responsibility model for different EKS operation modes (self-managed nodes, managed node group, Fargate)
- AWS Security Best Practices for EKS
- Integrating security into your container pipeline | workshop
- Use CNI custom networking if your pods need a different security group from their nodes, or if pods must be placed in private subnets while the nodes are in public subnets.
- CloudTrail EKS API logs
- Consider enabling continuous delivery of CloudTrail events to an Amazon S3 bucket by creating a trail, so you retain a record of events beyond the past 90 days
- Use network policy for East-West traffic: Calico
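With Calico (or any NetworkPolicy-capable CNI) installed, standard Kubernetes NetworkPolicy objects restrict east-west traffic. For example, allowing a backend to receive traffic only from the frontend (namespace, labels, and port are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: backend               # pods this policy protects
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend          # only frontend pods may connect
    ports:
    - protocol: TCP
      port: 8080
```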
- Use security groups for pods, available only for Kubernetes v1.17 and later. See some considerations
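Pod security groups are assigned through the SecurityGroupPolicy custom resource from the VPC resource controller. A sketch (label and group ID are placeholders):

```yaml
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: payments-sg-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: payments              # hypothetical app label
  securityGroups:
    groupIds:
    - sg-0123456789abcdef0      # placeholder security group ID
```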
- Introducing Fine-Grained IAM Roles for Service Accounts | AWS Open Source Blog
- Use AWS Key Management Service (KMS) keys to provide envelope encryption of Kubernetes secrets stored in etcd
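With eksctl, this is the secretsEncryption setting in the cluster config (cluster name, region, and key ARN below are placeholders):

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster               # hypothetical cluster name
  region: us-east-1
secretsEncryption:
  keyARN: arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID   # placeholder KMS key ARN
```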
Packer for AMI builds: Packer configuration for building a custom EKS AMI