avoidik/1_ecs_note.md

## 1_ecs_note.md

      
            
          
    ECS Best Practices


Understand and check the service quotas for ECS/Fargate and other related services


Cluster

Use IaC to set up resources, e.g. CloudFormation, AWS CDK, Terraform

Use AWS Copilot for simple setups
CloudFormation reference architecture template

https://github.com/nathanpeck/ecs-cloudformation
https://github.com/awslabs/aws-cloudformation-templates/tree/master/aws/services/ECS


CDK for ECS: blog
Terraform examples


Use an EC2 Auto Scaling group with a Capacity Provider for better scaling
Enable the awsvpcTrunking setting at the account level for higher task density on supported EC2 instance types
Favor configuring ECS clusters with EC2 instances in at least 3 AZs, and keep the instance counts balanced across the AZs. More on availability best practices


Use Amazon ECS-optimized AMIs.

Using a different OS is hard to maintain: OS upgrades, patching, Docker updates, ECS agent updates, etc.
Subscribe to AMI update notifications.


Launching EC2 Container Instance

Don't use public IP addresses (turn off Auto-assign Public IP)
Make EC2 instances immutable.

Avoid exposing SSH for remote login; use AWS Systems Manager Run Command and Session Manager instead.


Use Spot Instances whenever possible, e.g. for development environments

Find instance types that are not frequently interrupted
Set the Spot maximum price a little higher than the average Spot price
Use Spot Fleet to deploy the target capacity you request (expressed as a number of instances or a vCPU count)
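
As a rough illustration of the "a little higher than average" pricing idea above, the sketch below derives a maximum price from a handful of recent Spot prices. The sample prices and the 10% margin are assumptions made up for this example; they are not AWS values or recommendations.

```python
from statistics import mean


def spot_max_price(recent_prices, margin=0.10):
    """Return the average of recent Spot prices plus a safety margin.

    `margin` is a hypothetical knob: 0.10 means "10% above the average".
    """
    if not recent_prices:
        raise ValueError("need at least one price sample")
    return round(mean(recent_prices) * (1 + margin), 4)


# Hypothetical hourly prices (USD) for one instance type in one AZ.
samples = [0.0312, 0.0305, 0.0331, 0.0298]
print(spot_max_price(samples))
```

In practice the price samples would come from the Spot price history for the instance types and AZs you actually use.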


Understand how the EC2 container instance works

Don't use reserved ports for your application (Linux TCP: 22, 2375, 2376, 51678, 51679, 51680)
Don't store log files or any persistent data in the container; it will fill up Docker storage
Look into the /data directory for troubleshooting (it contains information about the cluster and the agent state)
Set the Container Agent config if you harden the OS using SELinux or AppArmor
For better performance, tune ECS_IMAGE_PULL_BEHAVIOR and the image/task cleanup parameters based on how often you deploy
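
To make the agent tuning above more concrete, here is a minimal sketch that renders the contents of an agent config file (typically /etc/ecs/ecs.config on ECS-optimized AMIs). The cluster name and the chosen durations are assumptions for illustration; pick values that match your own deployment frequency.

```python
ALLOWED_PULL_BEHAVIORS = {"default", "always", "once", "prefer-cached"}


def render_ecs_config(cluster, pull_behavior="prefer-cached",
                      task_cleanup_wait="1h", image_cleanup_interval="30m"):
    """Render agent settings as KEY=value lines.

    The default values here are illustrative assumptions, not recommendations.
    """
    if pull_behavior not in ALLOWED_PULL_BEHAVIORS:
        raise ValueError(f"unsupported pull behavior: {pull_behavior}")
    settings = {
        "ECS_CLUSTER": cluster,
        # Controls whether the agent re-pulls images or reuses cached ones.
        "ECS_IMAGE_PULL_BEHAVIOR": pull_behavior,
        # How long stopped task containers are kept before removal.
        "ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION": task_cleanup_wait,
        # How often the automated image cleanup runs.
        "ECS_IMAGE_CLEANUP_INTERVAL": image_cleanup_interval,
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


# "demo-cluster" is a hypothetical cluster name.
print(render_ecs_config("demo-cluster"), end="")
```

Frequent deployments of large images tend to benefit from cached pulls and shorter cleanup intervals, while rarely updated clusters can keep the defaults.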


Optimize ECS task density using ENI trunking

Opt in to awsvpcTrunking and use awsvpc as the network mode
Check the supported EC2 instance types and read the considerations


{Day2} Set up automated updates of EC2 instances, since doing it manually is hard and error-prone


Fargate

Understand the use case. EC2 or Fargate?
Understand its Task Definition limitations and CPU/memory configurations. If they do not match your workload requirements, use the EC2 launch type.
Use EFS for persistent storage, but consider the performance; an immutable task is always better.
Use it for general-purpose (burstable) workloads; assume your Fargate task will run on a 't' or 'm' type of instance
Don't use it if you need GPU, high network bandwidth (50 Gbps, 100 Gbps), or very high vCPU or RAM
Use Fargate Spot to reduce cost, or use both Fargate and Fargate Spot with a Capacity Provider strategy
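
The mix of Fargate and Fargate Spot mentioned above is expressed as a capacity provider strategy with `base` and `weight` values. The sketch below shows such a strategy and a simplified estimate of how a desired task count would be divided. The split function is an approximation written for this note (base first, then a proportional split by weight); the exact rounding ECS applies may differ.

```python
def split_tasks(desired, strategy):
    """Estimate how many tasks each capacity provider would run."""
    counts = {item["capacityProvider"]: 0 for item in strategy}
    remaining = desired
    # The base is satisfied first.
    for item in strategy:
        take = min(item.get("base", 0), remaining)
        counts[item["capacityProvider"]] += take
        remaining -= take
    total_weight = sum(item.get("weight", 0) for item in strategy)
    if remaining and total_weight:
        shares = [
            (remaining * item.get("weight", 0) / total_weight,
             item["capacityProvider"])
            for item in strategy
        ]
        for share, name in shares:
            counts[name] += int(share)
        leftover = remaining - sum(int(share) for share, _ in shares)
        # Hand out any rounding leftover by largest fractional part.
        by_fraction = sorted(shares, key=lambda p: p[0] - int(p[0]), reverse=True)
        for _, name in by_fraction[:leftover]:
            counts[name] += 1
    return counts


# Keep 2 tasks on regular Fargate, then favor Fargate Spot 3:1.
strategy = [
    {"capacityProvider": "FARGATE", "base": 2, "weight": 1},
    {"capacityProvider": "FARGATE_SPOT", "weight": 3},
]
print(split_tasks(10, strategy))  # {'FARGATE': 4, 'FARGATE_SPOT': 6}
```

A non-zero base on regular Fargate keeps a minimum number of tasks away from Spot interruptions.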


Networking

Use a separate VPC; don't mix it with other services, e.g. EC2 instances that do not belong to the cluster.

Plan your VPC and subnet CIDRs, and avoid the complexity of using multiple CIDRs in a VPC
Use IP address tools


VPC & subnetting architecture patterns: https://containersonaws.com/architecture/
Keep the container registry as close as possible to your cluster (for low latency and faster docker pull).

It is better to have the ECS cluster and ECR in the same Region


Use network mode = awsvpc for greater security using SGs and easier troubleshooting (using VPC Flow Logs)
Use network mode = host if you want the task to bypass Docker's built-in virtual network and map container ports directly to the EC2 instance's network interface


Task Definition

Don't store sensitive env variables in the task definition; use Parameter Store instead, which is more secure.
Always set the healthCheck parameter in the Container Definition for tasks that are part of an ECS service or use ECS Service Discovery.

Adjust the other health check parameters (interval, timeout, retries, startPeriod) based on your app's characteristics
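
As a sketch of the healthCheck settings above, the snippet below builds a container definition as a plain dictionary and runs a small sanity check on the health check values. The container name, image, port, and /health endpoint are hypothetical, and the numeric ranges are the ones documented for ECS as far as I recall; verify them against the current task definition reference.

```python
import json

# Documented ranges (seconds, except retries); treat these as an assumption.
HEALTH_CHECK_RANGES = {
    "interval": (5, 300),
    "timeout": (2, 60),
    "retries": (1, 10),
    "startPeriod": (0, 300),
}


def validate_health_check(health_check):
    """Return a list of problems found in a healthCheck block."""
    errors = []
    if not health_check.get("command"):
        errors.append("command is required")
    for key, (low, high) in HEALTH_CHECK_RANGES.items():
        value = health_check.get(key)
        if value is not None and not low <= value <= high:
            errors.append(f"{key}={value} is outside {low}..{high}")
    return errors


container_definition = {
    "name": "web",                  # hypothetical container name
    "image": "example/web:latest",  # hypothetical image
    "essential": True,
    "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
    "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60,  # give a slow-starting app time to boot
    },
}

print(validate_health_check(container_definition["healthCheck"]))  # []
print(json.dumps(container_definition["healthCheck"], indent=2))
```

A longer startPeriod suits apps with slow warm-up, while a short interval with few retries detects failures faster at the cost of more false positives.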


Service

Consider using a placement strategy

Use the availability zone (attribute:ecs.availability-zone) as the spread field to spread launched tasks as evenly as possible across AZs
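
The snippet below is a minimal sketch of the spread idea: a `placementStrategy` list as it would appear in a service definition, plus a toy simulation of placing tasks one by one into the least-loaded zone. The simulation is a simplification for illustration only (the real scheduler also considers instance resources and constraints), and the zone names are examples.

```python
# Spread across AZs first, then pack tasks tightly by memory within each AZ.
placement_strategy = [
    {"type": "spread", "field": "attribute:ecs.availability-zone"},
    {"type": "binpack", "field": "memory"},
]


def spread_across_azs(task_count, zones):
    """Place each task in the zone that currently has the fewest tasks."""
    placed = {zone: 0 for zone in zones}
    for _ in range(task_count):
        target = min(zones, key=lambda zone: placed[zone])
        placed[target] += 1
    return placed


zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
print(spread_across_azs(7, zones))
# {'us-east-1a': 3, 'us-east-1b': 2, 'us-east-1c': 2}
```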


Service Discovery

Use Amazon ECS Service Discovery/Cloud Map (internal domain resolution) for inter-service communication inside a cluster.
Be aware of the difference between SRV and A records for DNS-based service lookup. An A record is simple; with SRV records you might need to change your app code, since the app must resolve both the IP address and the port.
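
To illustrate why SRV records can require app changes, the sketch below contrasts the two cases. With an A record the app gets only an address and must already know the port; with an SRV record the port is part of the answer and has to be parsed out. The record data, host names, and port are made-up examples, and the actual DNS query is left out because the Python standard library has no SRV resolver.

```python
from typing import NamedTuple


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def parse_srv(rdata):
    """Parse SRV record data of the form 'priority weight port target'."""
    priority, weight, port, target = rdata.split()
    return SrvRecord(int(priority), int(weight), int(port), target.rstrip("."))


def endpoint_from_a_record(ip_address, known_port):
    """A record: DNS gives only the IP, the port must be configured in the app."""
    return f"{ip_address}:{known_port}"


def endpoint_from_srv(rdata):
    """SRV record: the port comes from DNS; the target still needs an A lookup."""
    record = parse_srv(rdata)
    return f"{record.target}:{record.port}"


# Hypothetical answers for a service named "orders" in a private namespace.
print(endpoint_from_a_record("10.0.12.34", 8080))
print(endpoint_from_srv("1 1 32768 task-abc.orders.internal.example."))
```

SRV records are mainly useful with dynamic host ports, where the port is not known in advance.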


Using ALB instead of the Classic ELB is highly recommended: dynamic port mapping, more detailed monitoring, and access logs
Use placement strategies and constraints to maximize your resources. CDK example, Terraform example
Tune scaling parameters: health check grace period and scaling cooldowns
Target Tracking Scaling Policies are recommended over Step Scaling Policies. Common scaling metrics are CPU utilization or the request count per target of the ALB's target group.
Use API Gateway to expose services
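
As a sketch of the target tracking recommendation above, the snippet below builds a target tracking configuration in the shape used by Application Auto Scaling and gives a rough feel for how the task count reacts to the metric. The cooldown values and target are illustrative assumptions, and `estimate_desired_count` is a simple proportional approximation written for this note, not the actual scaling algorithm.

```python
import math


def target_tracking_policy(target_value,
                           metric_type="ECSServiceAverageCPUUtilization",
                           scale_out_cooldown=60, scale_in_cooldown=300):
    """Build a target tracking configuration dictionary.

    Note: the ALBRequestCountPerTarget metric additionally needs a
    ResourceLabel identifying the load balancer and target group.
    """
    return {
        "TargetValue": float(target_value),
        "PredefinedMetricSpecification": {"PredefinedMetricType": metric_type},
        "ScaleOutCooldown": scale_out_cooldown,  # seconds
        "ScaleInCooldown": scale_in_cooldown,    # seconds
    }


def estimate_desired_count(current_tasks, metric_value, target_value):
    """Proportional estimate: scale so the metric would land on the target."""
    return max(1, math.ceil(current_tasks * metric_value / target_value))


print(target_tracking_policy(60))
print(estimate_desired_count(4, 90, 60))  # 6
print(estimate_desired_count(4, 30, 60))  # 2
```

A shorter scale-out cooldown than scale-in cooldown lets the service react quickly to load while removing capacity more cautiously.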


Observability

Send application logs to standard output and stream them to centralized logging. Take advantage of the awslogs log driver and CloudWatch
Enable CloudWatch Container Insights to collect more detailed monitoring metrics and logs.
Use X-Ray transaction tracing to troubleshoot performance.
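
The logging advice above can be sketched in two parts: the app writes structured lines to standard output, and the container definition tells ECS to ship that output with the awslogs driver. The log group name, region, and stream prefix are hypothetical values, and the JSON formatter is just one possible way to structure log lines.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each record as a single JSON line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(name="app", stream=None):
    """Create a logger that writes to stdout (or a given stream for tests)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# logConfiguration block for the container definition (values are examples).
log_configuration = {
    "logDriver": "awslogs",
    "options": {
        "awslogs-group": "/ecs/demo-app",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "web",
    },
}

build_logger().info("service started")
```

Because the app never writes log files, nothing accumulates in the container's storage and the same image works with any log driver.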


Deployment

Blue/Green deployment using CodePipeline, CodeBuild, CloudFormation and Lambda


Security

Set least-privilege port access in the SG of EC2 container instances
Set least privilege for the container instance IAM role
Consider using end-to-end TLS communication with NLB and evaluate the options for storing/managing your certificates
Store values securely in AWS Systems Manager Parameter Store or AWS Secrets Manager, then inject them into containers via the Container Definition of a Task Definition. Try the lab!
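
A minimal sketch of the injection described above: the container definition keeps plain settings under `environment` and references sensitive ones under `secrets` with a `valueFrom` ARN. The account ID, region, parameter names, and image are placeholders, and the helper below is a hypothetical convenience that only accepts full ARNs.

```python
ALLOWED_PREFIXES = ("arn:aws:ssm:", "arn:aws:secretsmanager:")


def secret_ref(env_name, value_from):
    """Build one entry for the `secrets` list of a container definition."""
    if not value_from.startswith(ALLOWED_PREFIXES):
        raise ValueError(f"not a Parameter Store or Secrets Manager ARN: {value_from}")
    return {"name": env_name, "valueFrom": value_from}


container_definition = {
    "name": "api",                  # hypothetical container name
    "image": "example/api:latest",  # hypothetical image
    # Non-sensitive settings can stay as plain environment variables.
    "environment": [{"name": "LOG_LEVEL", "value": "info"}],
    # Sensitive values are resolved at task start and exposed as env variables.
    "secrets": [
        secret_ref(
            "DB_PASSWORD",
            "arn:aws:ssm:us-east-1:111122223333:parameter/demo/db_password",
        ),
        secret_ref(
            "API_KEY",
            "arn:aws:secretsmanager:us-east-1:111122223333:secret:demo/api_key",
        ),
    ],
}

print([entry["name"] for entry in container_definition["secrets"]])
```

The task execution role needs permission to read the referenced parameters and secrets, otherwise the task fails to start.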


Cost Optimization

Right-size EC2 container instances
Tag all container instances
Consider using EC2 Spot and Fargate Spot


## 2_ecs_fargate.md

      
            
          
    What you need to know (be aware of) when using ECS on Fargate.


Limitation: Fargate does not support all of the task definition parameters. ref

Cannot use privileged mode
Must use awsvpc network mode -> each task gets an ENI and a primary private IP address
Cannot use GPU
No placement constraints
Task CPU and memory (min: 0.25 vCPU, 0.5 GB RAM; max: 4 vCPU, 30 GB RAM)
Logging: awslogs, splunk, firelens, and fluentd
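
The limitations listed above can be turned into a small pre-flight check for a task definition. The sketch below covers only the points from this list, using the CPU/memory range stated in these notes (0.25 to 4 vCPU, up to 30 GB); the combination table reflects that range and should be checked against the current Fargate documentation. The function name and messages are made up for this example.

```python
# CPU units -> allowed memory values in MiB, for the range stated in the notes.
FARGATE_MEMORY_BY_CPU = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def fargate_problems(task_def):
    """Return a list of reasons why a task definition would not fit Fargate."""
    problems = []
    if task_def.get("networkMode") != "awsvpc":
        problems.append("networkMode must be awsvpc")
    cpu = int(task_def.get("cpu", 0))
    memory = int(task_def.get("memory", 0))
    if memory not in FARGATE_MEMORY_BY_CPU.get(cpu, []):
        problems.append(f"unsupported cpu/memory combination: {cpu}/{memory}")
    for container in task_def.get("containerDefinitions", []):
        if container.get("privileged"):
            problems.append(f"container {container.get('name')} uses privileged mode")
    if task_def.get("placementConstraints"):
        problems.append("placement constraints are not supported")
    return problems


# Hypothetical task definition fragment.
task_def = {
    "networkMode": "awsvpc",
    "cpu": "512",
    "memory": "2048",
    "containerDefinitions": [{"name": "web", "privileged": False}],
}
print(fargate_problems(task_def))  # []
```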


Optionally needs an Amazon ECS task execution IAM role to call other AWS services, e.g. ECR
Fargate platform version releases provide kernel or operating system updates, new features, bug fixes, or security updates
Automated scheduled task retirement: you will be notified by email

The task is stopped or terminated by AWS. If it is part of a service, it will be replaced automatically.
Reasons:

Irreparable failure of the underlying hardware
Task has a security vulnerability


Fargate task recycling

Happens when a security or infrastructure update is needed
No notification before the recycling process
Only affects tasks that are part of a service (not standalone tasks)


Fargate makes no network throughput guarantees, nor does it guarantee equal CPU performance among tasks.
Expose Fargate using API gateway, VPC Link & NLB