- Understand and check the service quotas of ECS/Fargate and other related services
- Cluster
  - Use IaC to set up resources, e.g. CloudFormation, AWS CDK, Terraform
  - Use an EC2 Auto Scaling group and a capacity provider for better scaling
  - Enable the AWS VPC trunking setting at the account level for higher task density on supported EC2 instance types
  - Favor configuring ECS clusters with EC2 instances in at least 3 AZs, and keep the instance counts balanced across the AZs. More on availability best practices
- Use Amazon ECS-optimized AMIs.
  - A different OS is harder to maintain: OS upgrades, patching, Docker updates, ECS agent updates, etc.
  - Subscribe to update notifications.
- Launching EC2 Container Instance
  - Don't use a public IP address (turn off Auto-assign Public IP)
  - Make EC2 instances immutable.
  - Better not to expose SSH for remote login; use AWS Systems Manager Run Command & Session Manager instead.
  - Use Spot Instances whenever possible, e.g. for development environments
    - Find the instance types that are not frequently interrupted
    - Set the Spot price a little higher than the average
    - Use Spot Fleet to deploy the target capacity you request (expressed in terms of instances or a vCPU count)
  - Understand how the EC2 container instance works
    - Don't use reserved ports for your application (Linux TCP: 22, 2375, 2376, 51678, 51679, 51680)
    - Don't store log files or any persistent data in the container; it will fill up the Docker storage
    - Look into the `/data` directory for troubleshooting (it contains information about the cluster and the agent state)
    - Set the container agent config if you harden the OS using SELinux or AppArmor
    - For better performance, tune ECS_IMAGE_PULL_BEHAVIOR and the image/task cleanup parameters based on how often you deploy
  - Optimize ECS task density using ENI trunking
    - Opt in to `awsvpcTrunking` and use `awsvpc` as the network mode
    - Check the supported EC2 instance types and read the considerations
  - {Day2} Set up automated updates of EC2 instances, since doing it manually is hard and error-prone
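The reserved-port and agent-tuning notes above can be checked with a small script. This is only a sketch: the port set is copied from the list above, the helper names are made up for illustration, and the agent values are example settings, not recommendations. Check the ECS agent documentation for the variables and defaults that apply to your agent version.

```python
# Ports listed above as reserved on a Linux ECS container instance (TCP).
RESERVED_PORTS = {22, 2375, 2376, 51678, 51679, 51680}


def find_reserved_port_conflicts(host_ports):
    """Return the requested host ports that collide with reserved ports."""
    return sorted(set(host_ports) & RESERVED_PORTS)


def render_ecs_config(settings):
    """Render KEY=value lines in the style of /etc/ecs/ecs.config."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


# Example values only (assumptions); tune them to how often you deploy.
agent_settings = {
    "ECS_IMAGE_PULL_BEHAVIOR": "prefer-cached",
    "ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION": "1h",
    "ECS_IMAGE_CLEANUP_INTERVAL": "30m",
    "ECS_SELINUX_CAPABLE": "true",
}

print(find_reserved_port_conflicts([80, 2376, 8080, 22]))  # [22, 2376]
print(render_ecs_config(agent_settings))
```

Running a check like this in CI against the host ports of your task definitions catches conflicts before a task fails to start on the instance.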
- Fargate
  - Understand the use case. EC2 or Fargate?
  - Understand its task definition limitations and CPU/memory configurations. If they do not match your workload requirements, use the EC2 launch type.
  - Use EFS for persistent storage, but consider the performance; an immutable task is always better.
  - Use it for general-purpose (burstable) workloads; assume your Fargate task will run on a 't' or 'm' type of instance
  - Don't use it if you need GPU, high network bandwidth (50 Gbps, 100 Gbps), or very high vCPU or RAM
  - Use Fargate Spot to reduce the cost, or use both Fargate and Fargate Spot through capacity providers
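To get a feel for mixing Fargate and Fargate Spot through a capacity provider strategy, the sketch below estimates how a desired task count might be divided. The `capacityProvider`, `base`, and `weight` keys match the strategy items ECS accepts. The `split_tasks` function is a hypothetical, simplified model written for these notes; the real scheduler's rounding and placement order may differ.

```python
def split_tasks(desired_count, strategy):
    """Estimate tasks per capacity provider (simplified model, not ECS's exact algorithm).

    `base` tasks are placed first; the remainder is split in proportion to `weight`.
    """
    counts = {item["capacityProvider"]: 0 for item in strategy}
    remaining = desired_count
    for item in strategy:
        base = min(item.get("base", 0), remaining)
        counts[item["capacityProvider"]] += base
        remaining -= base

    total_weight = sum(item.get("weight", 0) for item in strategy)
    if remaining and total_weight:
        allocated = 0
        for item in strategy:
            share = remaining * item.get("weight", 0) // total_weight
            counts[item["capacityProvider"]] += share
            allocated += share
        # Integer division can leave a few tasks over; give them to the heaviest provider.
        heaviest = max(strategy, key=lambda item: item.get("weight", 0))
        counts[heaviest["capacityProvider"]] += remaining - allocated
    return counts


# Keep 2 tasks on regular Fargate, then favor Fargate Spot 3:1 for the rest.
strategy = [
    {"capacityProvider": "FARGATE", "base": 2, "weight": 1},
    {"capacityProvider": "FARGATE_SPOT", "weight": 3},
]

print(split_tasks(10, strategy))  # {'FARGATE': 4, 'FARGATE_SPOT': 6}
```

A non-zero `base` on regular Fargate keeps a minimum number of tasks off Spot, which helps when Spot capacity is reclaimed.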
- Networking
  - Use a separate VPC; don't mix it with other services, e.g. EC2 instances that do not belong to the cluster.
  - Plan your VPC & subnet CIDRs; avoid the complexity of using multiple CIDRs in a VPC
    - Use IP address tools
    - VPC & subnetting architecture patterns: https://containersonaws.com/architecture/
  - Place the container registry as near as possible to your cluster (for low latency and a faster docker pull).
    - The ECS cluster and ECR are better in the same Region
  - Use network mode = `awsvpc` for greater security using SGs and easier troubleshooting (using VPC Flow Logs)
  - Use network mode = `host` if you want the task to bypass Docker's built-in virtual network and map container ports directly to the EC2 instance's network interface
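For the CIDR planning point, Python's standard `ipaddress` module is enough for a first pass. The sketch below carves one VPC CIDR into equally sized subnets, one per AZ, and estimates the usable addresses in each (AWS reserves 5 addresses per subnet). The CIDR, AZ names, and function names are placeholders chosen for illustration.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses and the last one in every subnet


def plan_subnets(vpc_cidr, availability_zones, new_prefix):
    """Assign one subnet of size /new_prefix to each AZ, in order."""
    subnets = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    return {az: str(next(subnets)) for az in availability_zones}


def usable_ips(subnet_cidr):
    """Addresses left for ENIs (tasks, instances) after AWS reservations."""
    return ipaddress.ip_network(subnet_cidr).num_addresses - AWS_RESERVED_PER_SUBNET


plan = plan_subnets("10.0.0.0/16", ["az-a", "az-b", "az-c"], 20)
for az, cidr in plan.items():
    print(az, cidr, usable_ips(cidr))
# az-a 10.0.0.0/20 4091
# az-b 10.0.16.0/20 4091
# az-c 10.0.32.0/20 4091
```

With `awsvpc` network mode every task consumes an IP address from its subnet, so size the subnets for the peak task count rather than the instance count.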
- Task Definition
  - Don't store env variables in the task definition; use Parameter Store instead (more secure).
  - Always set the `healthCheck` parameter in the container definition for tasks that will be part of an ECS service or use ECS Service Discovery.
    - Adjust the other health check parameters (`interval`, `timeout`, `retries`, `startPeriod`) based on your app characteristics
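As an illustration of the two points above, here is a sketch of a container definition that sets `healthCheck` and pulls its configuration from Parameter Store through the `secrets` field. The field names follow the ECS task definition format; the image, health command, parameter ARN, and timing values are made-up placeholders to adjust for your app.

```python
import json


def build_container_definition(name, image, health_command, secret_arns):
    """Build a container definition dict with a health check and injected secrets."""
    return {
        "name": name,
        "image": image,
        "essential": True,
        "healthCheck": {
            "command": ["CMD-SHELL", health_command],
            "interval": 30,      # seconds between checks
            "timeout": 5,        # seconds before a check counts as failed
            "retries": 3,        # consecutive failures before the container is unhealthy
            "startPeriod": 60,   # grace period for slow-starting apps
        },
        # Values are resolved at task start; nothing sensitive lives in the definition.
        "secrets": [
            {"name": env_name, "valueFrom": arn}
            for env_name, arn in secret_arns.items()
        ],
    }


# Hypothetical names and ARN, for illustration only.
container = build_container_definition(
    name="web",
    image="example/web:1.0",
    health_command="curl -f http://localhost:8080/health || exit 1",
    secret_arns={
        "DB_PASSWORD": "arn:aws:ssm:us-east-1:123456789012:parameter/myapp/db_password",
    },
)
print(json.dumps(container, indent=2))
```

The task execution role needs permission to read the referenced parameters, otherwise the task fails to start.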
- Service
  - Consider using a placement strategy
    - Use "availability-zone" as the spread attribute, to spread the tasks being launched as evenly as possible across AZs
  - Service Discovery
    - Use Amazon ECS Service Discovery/Cloud Map (internal domain resolution) for inter-service communication inside a cluster.
    - Be aware of SRV and A records for service lookup using DNS. An 'A' record is simple; with SRV records you might need to change your app code, since the app has to resolve both the IP address and the port.
  - Highly recommended to use ALB instead of Classic ELB: dynamic port mapping, more detailed monitoring and access logs
  - Use placement strategies and constraints to maximize your resources. CDK example, Terraform example
  - Tune scaling parameters: health check grace period and scaling cooldowns
  - Recommended to use target tracking scaling policies instead of step scaling policies. Common scaling metrics are EC2's CPU utilization or the request count per target of the ALB's target group.
  - Use API Gateway to expose services
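The placement and scaling advice above maps to two small pieces of configuration, sketched here as plain dicts. The placement entries use the ECS `placementStrategy` format, and the scaling dict follows the Application Auto Scaling target tracking configuration. Note one swap: this example tracks the predefined service-level metric `ECSServiceAverageCPUUtilization` rather than EC2 instance CPU. The target value and cooldowns are assumptions to tune.

```python
# Spread across AZs first, then pack tasks by memory to use instances efficiently.
placement_strategy = [
    {"type": "spread", "field": "attribute:ecs.availability-zone"},
    {"type": "binpack", "field": "memory"},
]


def target_tracking_config(target_cpu_percent, scale_out_cooldown, scale_in_cooldown):
    """Build a target tracking policy configuration for an ECS service."""
    return {
        "TargetValue": float(target_cpu_percent),
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
        },
        "ScaleOutCooldown": scale_out_cooldown,  # seconds
        "ScaleInCooldown": scale_in_cooldown,    # seconds
    }


# Example values: scale out quickly, scale in conservatively.
policy = target_tracking_config(60, scale_out_cooldown=60, scale_in_cooldown=300)
print(policy)
```

For request-based scaling, the predefined metric type `ALBRequestCountPerTarget` can be used instead; it additionally needs a resource label that identifies the load balancer and target group.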
- Observability
  - Send application logs to standard output and stream them to centralized logging. Take advantage of the awslogs log driver & CloudWatch
  - Enable CloudWatch Container Insights to collect more detailed monitoring metrics and logs.
  - Use X-Ray for transaction tracing when troubleshooting performance.
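For the logging point, a container definition only needs a `logConfiguration` block to ship stdout/stderr to CloudWatch Logs. A minimal sketch follows; the log group, Region, and prefix are placeholder values, and the helper name is made up for these notes.

```python
def awslogs_configuration(log_group, region, stream_prefix):
    """Build the logConfiguration section of a container definition for awslogs."""
    return {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": log_group,
            "awslogs-region": region,
            "awslogs-stream-prefix": stream_prefix,
        },
    }


# Placeholder names; the log group is assumed to exist already.
log_config = awslogs_configuration("/ecs/my-service", "us-east-1", "web")
print(log_config)
```

Because the app only writes to standard output, the same image runs unchanged with a different log driver or on a developer laptop.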
- Deployment
- Security
  - Set least-privilege port access in the SG of EC2 container instances
  - Set least privilege for the container instance IAM role
  - Consider using end-to-end TLS communication with NLB and evaluate some options to store/manage your certificates
  - Store values securely in AWS Systems Manager Parameter Store or AWS Secrets Manager, then inject them into containers through the container definition of a task definition. Try the lab!
- Cost Optimization
  - Right-size EC2 container instances
  - Tag all container instances
  - Consider using EC2 Spot and Fargate Spot
ECS Best Practices Notes
- Limitation: Fargate does not support all of the task definition parameters. ref
  - Cannot use privileged mode
  - Should use `awsvpc` mode -> the task will have an ENI and a primary private IP address
  - Cannot use GPU
  - No placement constraints
  - Task CPU and memory (min: 0.25 vCPU, 0.5 GB RAM; max: 4 vCPU, 30 GB RAM)
  - Logging: awslogs, splunk, firelens, and fluentd
- Optionally needs the Amazon ECS task execution IAM role to call other AWS services, e.g. ECR
- Fargate platform version releases provide kernel or operating system updates, new features, bug fixes, or security updates
- Task automated scheduled retirement: you will be notified by email
  - The task is stopped or terminated by AWS. If it is part of a service, it will be replaced automatically.
  - Reason:
    - Irreparable failure of the underlying hardware
    - Task has a security vulnerability
- Fargate task recycling
  - When a security or infrastructure update is needed
  - No notification before the recycling process
  - Only affects tasks that are part of a service (not standalone tasks)
- Fargate makes no network throughput guarantees, nor does it guarantee equal CPU performance among tasks
- Expose Fargate using API Gateway, VPC Link & NLB
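The CPU/memory limits noted above come in fixed combinations, so a quick lookup can catch an invalid task size before deployment. The table below is a sketch limited to the 0.25 to 4 vCPU range these notes mention (CPU in units where 1024 = 1 vCPU, memory in MiB). Larger sizes may exist on newer platform versions, so treat the table as an assumption and confirm it against the current Fargate documentation.

```python
# Memory options (MiB) per CPU value (CPU units), within the range noted above.
FARGATE_MEMORY_BY_CPU = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def is_valid_fargate_size(cpu_units, memory_mib):
    """Return True if the CPU/memory pair is in the table above."""
    return memory_mib in FARGATE_MEMORY_BY_CPU.get(cpu_units, [])


print(is_valid_fargate_size(256, 512))     # True  (0.25 vCPU, 0.5 GB)
print(is_valid_fargate_size(4096, 30720))  # True  (4 vCPU, 30 GB)
print(is_valid_fargate_size(256, 4096))    # False (too much memory for 0.25 vCPU)
```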