Abstract
For observability, Amazon CloudWatch is one option for collecting and tracking metrics and for raising alerts when a metric crosses a configured threshold. It is especially attractive when you don't want to use external monitoring and observability tools such as Datadog or Prometheus, or pay extra for data transfer out.

What we need is an automated way to set up CloudWatch alarms for EC2 instances and to customise the metrics and alerts. In particular, when new EC2 instances are created by auto scaling or on demand, the automation must install the CloudWatch agent on them and set up alarms for metrics such as CPU utilization, disk I/O, and memory usage.

In this blog post, I demonstrate how to automate the setup and configuration of CloudWatch alarms on Amazon EC2 and how to send alert notifications to a Slack channel.
Solution overview
The CloudWatch Auto Alarms and Install CloudWatch Agent AWS Lambda functions quickly and automatically create a standard set of CloudWatch alarms for new Amazon EC2 instances (for an existing instance, restart it to generate a new Running state event). This saves the time spent installing and configuring the CloudWatch agent, deploying alarms, and setting up metric alerts, and it reduces the skills needed to create and manage alarms.
This blog post gives an example of setting the default configuration and creating alarms for Amazon EC2 instances running the Amazon Linux AMI (the Lambda function also supports other operating systems such as Ubuntu, Red Hat, SUSE and Windows):
- CPU Utilization
- Disk Space Used
- Memory Used
CloudWatch agent predefined metric sets - Advanced
- CPU: `cpu_usage_idle`, `cpu_usage_iowait`, `cpu_usage_user`, `cpu_usage_system`
- Disk: `disk_used_percent`, `disk_inodes_free`
- Diskio: `diskio_io_time`, `diskio_write_bytes`, `diskio_read_bytes`, `diskio_writes`, `diskio_reads`
- Mem: `mem_used_percent`
- Netstat: `netstat_tcp_established`, `netstat_tcp_time_wait`
- Swap: `swap_used_percent`
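As a rough illustration, the Advanced metric set listed above could be turned into a CloudWatch agent configuration like the sketch below. It is not the configuration shipped with this project: the helper name `buildAgentConfig`, the 60-second interval, the `resources: ["*"]` entries and `totalcpu: true` are my assumptions. The sketch strips the plugin prefix from each name (for example `disk_used_percent` becomes `used_percent`), because the agent publishes each metric under the prefixed name anyway.

```typescript
// Plugins of the CloudWatch agent "metrics_collected" section used in this post.
type PluginName = "cpu" | "disk" | "diskio" | "mem" | "netstat" | "swap";

interface PluginSection {
  measurement: string[];
  resources?: string[];
  totalcpu?: boolean;
}

// Metric names exactly as listed in the Advanced predefined set.
const ADVANCED_METRICS: Record<PluginName, string[]> = {
  cpu: ["cpu_usage_idle", "cpu_usage_iowait", "cpu_usage_user", "cpu_usage_system"],
  disk: ["disk_used_percent", "disk_inodes_free"],
  diskio: ["diskio_io_time", "diskio_write_bytes", "diskio_read_bytes", "diskio_writes", "diskio_reads"],
  mem: ["mem_used_percent"],
  netstat: ["netstat_tcp_established", "netstat_tcp_time_wait"],
  swap: ["swap_used_percent"],
};

// Hypothetical helper: builds an agent config object from the metric list.
function buildAgentConfig(metrics: Record<PluginName, string[]>, intervalSeconds: number = 60) {
  const collected: Partial<Record<PluginName, PluginSection>> = {};
  for (const plugin of Object.keys(metrics) as PluginName[]) {
    const prefix = `${plugin}_`;
    const section: PluginSection = {
      // Drop the "<plugin>_" prefix to get the field name inside the plugin section.
      measurement: metrics[plugin].map((m) => (m.indexOf(prefix) === 0 ? m.slice(prefix.length) : m)),
    };
    if (plugin === "cpu") section.totalcpu = true; // assumption: aggregate over all cores
    if (plugin === "disk" || plugin === "diskio") section.resources = ["*"]; // assumption: all devices
    collected[plugin] = section;
  }
  return {
    agent: { metrics_collection_interval: intervalSeconds },
    metrics: {
      // The agent resolves this placeholder to the instance ID at runtime.
      append_dimensions: { InstanceId: "${aws:InstanceId}" },
      metrics_collected: collected,
    },
  };
}

// JSON text that could be stored in an SSM parameter for the agent to fetch.
const agentConfigJson = JSON.stringify(buildAgentConfig(ADVANCED_METRICS), null, 2);
```

Generating the JSON from one metric list keeps the agent configuration and the alarm definitions from drifting apart, which is the main reason to build it in code rather than by hand.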
The created alarms notify an Amazon SNS topic. AWS Chatbot, which is associated with the Slack channel, subscribes to that topic and sends the alert messages directly to Slack.
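To make the alarm-to-SNS link concrete, here is a minimal sketch of the request an alarm-creating function might pass to the CloudWatch `PutMetricAlarm` API. The field names follow that API, but the `AlarmSpec` type, the helper `toPutMetricAlarmInput`, and the example values (instance ID, threshold, topic ARN) are hypothetical and not taken from the project code.

```typescript
// Hypothetical description of one alarm to create.
interface AlarmSpec {
  alarmName: string;
  instanceId: string;
  namespace: string;
  metricName: string;
  comparisonOperator:
    | "GreaterThanThreshold"
    | "GreaterThanOrEqualToThreshold"
    | "LessThanThreshold"
    | "LessThanOrEqualToThreshold";
  threshold: number;
  periodSeconds: number;
  evaluationPeriods: number;
  statistic: "Average" | "Minimum" | "Maximum" | "Sum" | "SampleCount";
}

// Builds a PutMetricAlarm-style input whose alarm action is the SNS topic.
function toPutMetricAlarmInput(spec: AlarmSpec, snsTopicArn: string) {
  return {
    AlarmName: spec.alarmName,
    Namespace: spec.namespace,
    MetricName: spec.metricName,
    // Assumption: the metric is keyed by InstanceId only. CloudWatch agent
    // metrics such as disk usage carry extra dimensions (path, device, ...).
    Dimensions: [{ Name: "InstanceId", Value: spec.instanceId }],
    ComparisonOperator: spec.comparisonOperator,
    Threshold: spec.threshold,
    Period: spec.periodSeconds,
    EvaluationPeriods: spec.evaluationPeriods,
    Statistic: spec.statistic,
    // When the alarm enters the ALARM state, CloudWatch publishes to this topic.
    AlarmActions: [snsTopicArn],
  };
}

// Example values are placeholders, not real resources.
const exampleAlarmInput = toPutMetricAlarmInput(
  {
    alarmName: "example-cpu-alarm",
    instanceId: "i-0123456789abcdef0",
    namespace: "AWS/EC2",
    metricName: "CPUUtilization",
    comparisonOperator: "GreaterThanThreshold",
    threshold: 80,
    periodSeconds: 300,
    evaluationPeriods: 1,
    statistic: "Average",
  },
  "arn:aws:sns:us-east-1:123456789012:example-alarm-topic"
);
```

Because the topic ARN is the only alarm action, changing where alerts go (another Slack channel, email, and so on) only requires changing the topic's subscriptions, not the alarms.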
Flow overview
- Prerequisites: the EC2 instances use AMI versions that install the SSM agent automatically at startup.
- The flow performs the following steps:
  - For any EC2 instance launched or restarted, the EventBridge rules `install-cw-agent-install-cw-agent` and `cw-auto-alarm` catch the new Running state event from the instance and trigger their targets, which are Lambda functions.
  - The Lambda function `install-cw-agent-install-cw-agent` does the following:
    - Gets the instance tags and checks whether they contain the tag key `Create_Auto_Alarms` (taken from the `ALARM_TAG` environment variable of the Lambda). If so, it proceeds; otherwise, it ignores the instance.
    - Runs the SSM document `AWS-ConfigureAWSPackage` to install the CloudWatch agent on the target instance, then runs the SSM document `AWS-RunShellScript` to load the CloudWatch agent config from SSM Parameter Store and start the CloudWatch agent service.
  - The Lambda function `cw-auto-alarm` creates CloudWatch alarms based on the EC2 instance tags, named with the format `AutoAlarm-<InstanceID>-<cw-namespace>-<MetricName>-<ComparisonOperator>-<Period>-<EvaluationPeriods>-<Statistic>-<CloudWatchAutoAlarms>`. These alarms send alerts to the SNS topic defined in the `DEFAULT_ALARM_SNS_TOPIC_ARN` environment variable.
  - When the SNS topic receives a message, it forwards it to the AWS Chatbot webhook, and the chatbot then sends an alert message to the registered Slack channel.
  - If any instance is terminated, the EventBridge rule `cw-auto-alarm` catches the event and triggers the Lambda function to delete the alarms of the terminated instance.
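The steps above can be sketched in plain TypeScript. This is only an illustration of the logic, not the project's Lambda code: `decideAction`, `buildSsmCommands` and `buildAlarmName` are hypothetical helpers, `configParameterName` stands for whatever SSM parameter holds the agent config, and I assume that termination events are handled without a tag check. The event shape and the SSM document parameters follow the AWS formats as I understand them.

```typescript
// Fields of the EventBridge "EC2 Instance State-change Notification" event used here.
interface Ec2StateChangeEvent {
  source: string;
  "detail-type": string;
  detail: { "instance-id": string; state: string };
}

type FlowAction = "install-and-create-alarms" | "delete-alarms" | "ignore";

// Decides what the Lambda functions would do for one event and one set of instance tags.
function decideAction(
  ev: Ec2StateChangeEvent,
  instanceTags: Record<string, string>,
  alarmTagKey: string = "Create_Auto_Alarms" // value of ALARM_TAG in the post
): FlowAction {
  if (ev.source !== "aws.ec2" || ev["detail-type"] !== "EC2 Instance State-change Notification") {
    return "ignore";
  }
  if (ev.detail.state === "terminated") return "delete-alarms"; // assumption: no tag check
  const tagged = Object.prototype.hasOwnProperty.call(instanceTags, alarmTagKey);
  return ev.detail.state === "running" && tagged ? "install-and-create-alarms" : "ignore";
}

interface SsmCommandInput {
  DocumentName: string;
  InstanceIds: string[];
  Parameters: Record<string, string[]>;
}

// Inputs for the two SSM SendCommand calls: install the agent, then load its config.
function buildSsmCommands(instanceId: string, configParameterName: string): SsmCommandInput[] {
  const ctl = "/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl"; // Linux path
  return [
    {
      DocumentName: "AWS-ConfigureAWSPackage",
      InstanceIds: [instanceId],
      Parameters: { action: ["Install"], name: ["AmazonCloudWatchAgent"] },
    },
    {
      DocumentName: "AWS-RunShellScript",
      InstanceIds: [instanceId],
      Parameters: { commands: [`${ctl} -a fetch-config -m ec2 -s -c ssm:${configParameterName}`] },
    },
  ];
}

// Joins the alarm name segments in the order given by the format in the post.
function buildAlarmName(p: {
  instanceId: string;
  namespace: string;
  metricName: string;
  comparisonOperator: string;
  period: string;
  evaluationPeriods: number;
  statistic: string;
  suffix: string; // the trailing <CloudWatchAutoAlarms> segment
}): string {
  return [
    "AutoAlarm", p.instanceId, p.namespace, p.metricName, p.comparisonOperator,
    p.period, String(p.evaluationPeriods), p.statistic, p.suffix,
  ].join("-");
}

const exampleAlarmName = buildAlarmName({
  instanceId: "i-0123456789abcdef0",
  namespace: "CWAgent",
  metricName: "mem_used_percent",
  comparisonOperator: "GreaterThanThreshold",
  period: "5m",
  evaluationPeriods: 1,
  statistic: "Average",
  suffix: "CloudWatchAutoAlarms",
});
// → "AutoAlarm-i-0123456789abcdef0-CWAgent-mem_used_percent-GreaterThanThreshold-5m-1-Average-CloudWatchAutoAlarms"
```

Embedding the instance ID in the alarm name is what makes the cleanup step cheap: on termination, the alarms to delete can be found by name prefix alone.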
Deploying the solution
- For infrastructure as code, this blog post uses AWS CDK with TypeScript.
- Stack visualization chart
- Prerequisites:
  - Add the `AWS Chatbot` app to the Slack channel.
  - Provide the Slack workspace ID and Slack channel ID to the CDK code.
- Deploy the CDK stacks: `cdk deploy --all`
Test alarms
- The `cdk deploy --all` command above also creates an EC2 instance, but the EventBridge rule might not yet be in place to catch that instance's Running state change event. To be sure, restart the EC2 instance.
- Create one more instance through the `test-ec2` stack to test alarm creation for a newly launched instance.
- For EC2 instances with the proper tags, the corresponding alarms are created.
- Now connect to an EC2 instance using SSM and run the `cpu-dump.py` and `test-mem-alert.py` test scripts. The alerts then appear:
  - In-alarm threshold
  - Slack alert
Cleanup
- Destroy all the stacks in this project by running `cdk destroy --all`.
- The CloudWatch log groups created by the Lambda functions are not part of the project stacks, so they are not deleted. Although the log groups have a retention period, you might want to delete them to clean up completely.
Conclusion
- In this post, I use serverless services such as Lambda functions, EventBridge rules, Systems Manager, and SNS to automate the creation of CloudWatch alarms and alerts for Amazon EC2 instances in an AWS account.
- Using the SSM agent from Systems Manager, the Lambda function can remotely install the CloudWatch agent on the EC2 instances to collect system logs and metrics, and the CloudWatch alarms are then created based on the EC2 tags.
- The solution is deployed using AWS CDK with TypeScript. For production, I encourage creating a CDK pipeline so that the IaC is deployed entirely through CodePipeline.
References: