Recently, due to updates from our security team, we had to install an agent on every host in our AWS account. We run 1,100+ EC2 instances, with a wide variety of operating systems (Amazon Linux, Fedora, CentOS, RHEL, Ubuntu, Windows, FreeBSD, etc.). These servers power various workloads such as EMR (multiple versions), EKS, ECS, Airflow, Tableau, and Alation. Many of them are vendor-configured servers with their own AMIs, so baking a new AMI for each type with the agent pre-installed would have taken enormous time and effort. On top of that, some servers cannot tolerate a restart (which user data scripts would require).
AWS has a service named Systems Manager (SSM), which allows us to run remote commands without the need for SSH. The basic requirements for running an SSM document are that the host has the amazon-ssm-agent running and that the host has an IAM role with access to SSM (the AmazonSSMManagedInstanceCore managed policy). The flow below summarizes how SSM Run Command works.

Low-level API calls in an SSM Run Command:
- The SSM agent updates instance details to the SSM service endpoint. This happens every 5 minutes (ssm:UpdateInstanceInformation). This update is what powers the instance details shown in Run Command.
- The agent keeps long-polling the Message Delivery Service for messages (ec2messages:GetMessages).
- Once it receives a message intended for the instance, it sends an acknowledgement (ec2messages:AcknowledgeMessage).
- It downloads the document from SSM (ssm:GetDocument).
- It executes the document.
- If it receives a cancellation message in between, it cancels the document execution.
- It sends the response back to the Message Delivery Service (ec2messages:SendReply).
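Because the agent heartbeats its details to SSM, you can query that same instance information to see which hosts are actually reporting in. A minimal boto3 sketch (the fields printed here are just the ones useful for an inventory check):

import boto3

ssm = boto3.client("ssm")

# Page through every instance currently registered with SSM.
paginator = ssm.get_paginator("describe_instance_information")
for page in paginator.paginate():
    for info in page["InstanceInformationList"]:
        # PingStatus is "Online" when the agent heartbeat is healthy.
        print(info["InstanceId"], info["PingStatus"], info.get("PlatformType"))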
Luckily, we already had SSM agents installed on these servers. We created a managed policy for SSM and other needed access and attached it to the hosts' IAM roles. We used boto3 to find all IAM roles that have the EC2 service as a trusted entity and attached the managed policy to each of them, as sketched below.
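A rough sketch of that role scan with boto3. We used our own managed policy; the AWS-managed AmazonSSMManagedInstanceCore ARN is shown here as a stand-in:

import boto3

iam = boto3.client("iam")
POLICY_ARN = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"

def ec2_trusted(role):
    # A role counts as EC2-trusted if ec2.amazonaws.com appears in its trust policy.
    for stmt in role["AssumeRolePolicyDocument"].get("Statement", []):
        service = stmt.get("Principal", {}).get("Service", [])
        if isinstance(service, str):
            service = [service]
        if "ec2.amazonaws.com" in service:
            return True
    return False

paginator = iam.get_paginator("list_roles")
for page in paginator.paginate():
    for role in page["Roles"]:
        if ec2_trusted(role):
            iam.attach_role_policy(RoleName=role["RoleName"], PolicyArn=POLICY_ARN)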
Once the prerequisites were satisfied, we took advantage of Systems Manager's Document and Run Command features to install the agent. For the running instances, we took a tag-based approach and ran the document against that tag, as shown below.
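A minimal sketch of the tag-based run with boto3 (the document name and tag are hypothetical placeholders):

import boto3

ssm = boto3.client("ssm")

response = ssm.send_command(
    DocumentName="Install-Security-Agent",                      # hypothetical document
    Targets=[{"Key": "tag:agent-install", "Values": ["pending"]}],  # hypothetical tag
    MaxConcurrency="10%",   # batch the rollout to limit blast radius
    MaxErrors="5",          # stop early if too many installs fail
    TimeoutSeconds=300,
)
print(response["Command"]["CommandId"])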

Since our platform uses a lot of autoscaling, we also needed to get the agent onto new servers spun up by AWS-managed autoscaling. We could have used the launch template's user data, but EMR managed scaling groups (instance groups) do not have user data, and we did not want to interfere with the existing bootstrap scripts. Instead, we wired together CloudWatch, Lambda, SQS, and SSM.

- An EC2 instance comes up; it generates events and sends them to CloudWatch.
- CloudWatch taps the events based on instance state change (running) and sends them to an SQS queue.
- The SQS queue adds a 3-minute delay to each message and then batches the messages in a 3-minute window.

- AWS Lambda receives the messages in a batch. It filters for instances that are managed by SSM; instances not managed by SSM are sent to another database for follow-up. It groups the SSM-managed instances by platform (Linux/Windows) and invokes the corresponding document(s) in SSM (see the Lambda sketch after this list).

- The SSM document sends the command to all the target instances and installs the agent on the servers.
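A simplified sketch of what that Lambda could look like. The document names are assumptions, and the real function also records unmanaged instances elsewhere (omitted here):

import json
import boto3

ssm = boto3.client("ssm")

# Hypothetical document names, one per platform.
DOCS = {"Linux": "Install-Security-Agent-Linux", "Windows": "Install-Security-Agent-Windows"}

def handler(event, context):
    # Each SQS record wraps an EC2 "state change: running" event.
    instance_ids = [json.loads(r["body"])["detail"]["instance-id"] for r in event["Records"]]

    # Keep only instances already registered with SSM, grouped by platform.
    grouped = {"Linux": [], "Windows": []}
    info = ssm.describe_instance_information(
        Filters=[{"Key": "InstanceIds", "Values": instance_ids}]
    )["InstanceInformationList"]
    for i in info:
        grouped.setdefault(i["PlatformType"], []).append(i["InstanceId"])

    # Unmanaged instances would be sent to the separate database here (omitted).

    for platform, ids in grouped.items():
        if ids and platform in DOCS:
            ssm.send_command(
                DocumentName=DOCS[platform],
                Targets=[{"Key": "InstanceIds", "Values": ids}],
                TimeoutSeconds=300,
            )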


This approach allowed us to run the installation commands on all our boxes in 3 days (we did it in batches to reduce the blast radius). All new boxes are automatically patched via the Lambda. We can also send the commands using the CLI:
aws ssm send-command \
    --document-name "<Document_name>" \
    --targets "Key=InstanceIds,Values=i-1234567890abcdef0" \
    --timeout-seconds 300
Although we did this to install a particular agent, the solution can be extended to various use cases (automation being one). One that immediately comes to mind is a replacement for airflow.contrib.operators.ssh_operator: with SSM we don't have to create an SSH connection; SSM takes care of all of it. The only catch in the solution is the 15-minute timeout of SSM, but if you follow a fire-and-forget approach it is a good option to explore.
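As a sketch of that idea, an Airflow task could fire the command and return immediately instead of holding an SSH connection; the DAG, document, tag, and script below are all hypothetical:

import boto3
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def run_via_ssm(**context):
    ssm = boto3.client("ssm")
    # Fire and forget: send the command and return the CommandId
    # without polling for the result, sidestepping the 15-minute wait.
    response = ssm.send_command(
        DocumentName="AWS-RunShellScript",                       # stock AWS document
        Targets=[{"Key": "tag:role", "Values": ["etl-worker"]}],  # hypothetical tag
        Parameters={"commands": ["/opt/scripts/nightly_job.sh"]},  # hypothetical script
    )
    return response["Command"]["CommandId"]

dag = DAG("ssm_fire_and_forget", start_date=datetime(2020, 1, 1), schedule_interval=None)
run_task = PythonOperator(task_id="run_via_ssm", python_callable=run_via_ssm,
                          provide_context=True, dag=dag)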
Happy Cloud Computing !!