
AWS Systems Manager – Manage Servers Remotely

Run SSM commands on multiple hosts in parallel, without SSH

Photo by Daniel Gregoire on Unsplash

Recently, due to updates from our security team, we had to install an agent on all hosts in our AWS account. We run 1,100+ EC2 instances, spanning a variety of operating systems (Amazon Linux, Fedora, CentOS, RHEL, Ubuntu, Windows, FreeBSD, etc.). These servers power various workloads such as EMR (multiple versions), EKS, ECS, Airflow, Tableau, and Alation. Many are vendor-configured servers with their own AMIs. Baking the agent into a new AMI for each type would have taken a long time and a huge effort, and some servers can't tolerate a restart (which user data scripts would require).


AWS has a service named Systems Manager (SSM), which lets us run remote commands without the need for SSH. The basic requirements for running an SSM document are that the host has amazon-ssm-agent running and that it has an IAM role with access to SSM (e.g., the AmazonSSMManagedInstanceCore managed policy). The diagram below summarizes how the SSM Run Command works.

High-level flow of SSM Run Command | Image by author

Low-level API calls in the SSM Run Command flow:

  1. The SSM agent updates instance details to the SSM service endpoint. This happens every 5 minutes (ssm:UpdateInstanceInformation) and is what powers the Run Command instance list.
  2. The agent long-polls the Message Delivery Service for messages (ec2messages:GetMessages).
  3. Once it receives a message intended for the instance, it sends an acknowledgement (ec2messages:AcknowledgeMessage).
  4. It downloads the document from SSM (ssm:GetDocument).
  5. It executes the document.
  6. If it receives a cancellation message mid-run, it cancels the document execution.
  7. It sends the response back to the Message Delivery Service (ec2messages:SendReply).
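
The AmazonSSMManagedInstanceCore managed policy covers these actions. As a rough, hand-rolled illustration (not our exact policy; the real managed policy grants a few more ec2messages and ssmmessages actions), an equivalent customer-managed policy could be created like this, with a placeholder name:

import json
import boto3

iam = boto3.client("iam")

# Only the actions from the flow above; AmazonSSMManagedInstanceCore
# grants a few more (e.g. ec2messages:GetEndpoint, ssmmessages:*).
policy_doc = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:UpdateInstanceInformation",
                "ssm:GetDocument",
                "ec2messages:GetMessages",
                "ec2messages:AcknowledgeMessage",
                "ec2messages:SendReply",
            ],
            "Resource": "*",
        }
    ],
}

iam.create_policy(
    PolicyName="ssm-agent-core-access",  # placeholder name
    PolicyDocument=json.dumps(policy_doc),
)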

Luckily, the SSM agent was already installed on these servers. We created a managed policy for SSM and the other needed access and attached it to the hosts' IAM roles. We used boto3 to find all IAM roles that have the EC2 service as a trusted entity and attached the managed policy to each, as sketched below.
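
A minimal sketch of that boto3 pass, using the AWS-managed AmazonSSMManagedInstanceCore policy as a stand-in for our custom managed policy:

import boto3

iam = boto3.client("iam")

# Stand-in for our custom managed policy
POLICY_ARN = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"

for page in iam.get_paginator("list_roles").paginate():
    for role in page["Roles"]:
        stmts = role["AssumeRolePolicyDocument"].get("Statement", [])
        if isinstance(stmts, dict):
            stmts = [stmts]
        for stmt in stmts:
            services = stmt.get("Principal", {}).get("Service", [])
            if isinstance(services, str):
                services = [services]
            # Attach the policy to every role that EC2 can assume
            if "ec2.amazonaws.com" in services:
                iam.attach_role_policy(
                    RoleName=role["RoleName"], PolicyArn=POLICY_ARN
                )
                break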

Once the prerequisites were satisfied, we used Systems Manager's Document and Run Command features to install the agent. For the running instances, we took a tag-based approach and ran the document against that tag, as sketched below.
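
A sketch of the tag-based invocation with boto3; the document and tag names are placeholders:

import boto3

ssm = boto3.client("ssm")

response = ssm.send_command(
    DocumentName="InstallSecurityAgent",  # placeholder document name
    Targets=[{"Key": "tag:PatchGroup", "Values": ["wave-1"]}],  # placeholder tag
    MaxConcurrency="10%",  # throttle to limit the blast radius
    MaxErrors="5",         # abort the run if too many hosts fail
    TimeoutSeconds=300,
)
print(response["Command"]["CommandId"])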

Existing server patching solution design | Image by author

Since our platform relies heavily on autoscaling, we also needed to install the agent on new servers spun up by AWS-managed autoscaling. We could have used the launch template's user data, but EMR managed scaling groups (instance groups) do not support user data, and we did not want to interfere with the existing bootstrap scripts. Instead, we wired together CloudWatch, Lambda, SQS, and SSM, as sketched below.
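
A sketch of the event wiring with boto3 (the rule name, queue name, and ARN are placeholders; the queue also needs a resource policy that allows events.amazonaws.com to send to it):

import json
import boto3

events = boto3.client("events")

# Match EC2 state-change events for the "running" state
events.put_rule(
    Name="ec2-running-to-sqs",  # placeholder rule name
    EventPattern=json.dumps({
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance State-change Notification"],
        "detail": {"state": ["running"]},
    }),
)

# Deliver matched events to the (placeholder) SQS queue
events.put_targets(
    Rule="ec2-running-to-sqs",
    Targets=[{
        "Id": "agent-install-queue",
        "Arn": "arn:aws:sqs:us-east-1:123456789012:agent-install-queue",
    }],
)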

Live installation solution design | Image by author
  • An EC2 instance comes up and emits a state-change event to CloudWatch.
  • CloudWatch taps the events on instance state change (running) and sends each event to an SQS queue.

    The SQS queue applies a 3-minute delivery delay to each message, and the messages are then processed in 3-minute batches.

SQS Queue config for delivery delay | Image by author
  • AWS Lambda receives the messages in batches. It filters for instances that are managed by SSM; instances that are not managed by SSM are recorded in a separate database. It then groups the SSM-managed instances by platform (Linux/Windows) and invokes the corresponding SSM document(s). A minimal handler sketch follows the screenshots below.
SQS Event Trigger for Lambda | Image by author
  • The SSM document sends the command to all target instances and installs the agent on each server.
Sample SSM Doc
SSM document run history for a command | Image by author
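
A minimal handler sketch, assuming hypothetical per-platform document names and SQS records that carry the raw EC2 state-change events:

import json
import boto3

ssm = boto3.client("ssm")

# Hypothetical per-platform SSM documents
DOCS = {"Linux": "InstallAgentLinux", "Windows": "InstallAgentWindows"}

def handler(event, context):
    instance_ids = [
        json.loads(record["body"])["detail"]["instance-id"]
        for record in event["Records"]
    ]

    # Keep only instances SSM actually manages, grouped by platform
    managed = {}
    paginator = ssm.get_paginator("describe_instance_information")
    for page in paginator.paginate(
        Filters=[{"Key": "InstanceIds", "Values": instance_ids}]
    ):
        for info in page["InstanceInformationList"]:
            managed.setdefault(info["PlatformType"], []).append(info["InstanceId"])

    # Unmanaged instances would be written to the other database (omitted)
    for platform, ids in managed.items():
        if platform in DOCS:
            ssm.send_command(DocumentName=DOCS[platform], InstanceIds=ids)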

This approach allowed us to run the installation command on all our boxes within 3 days (we ran it in batches to reduce the blast radius). All new boxes are patched automatically via the Lambda. We can also send commands using the CLI:

aws ssm send-command \
    --document-name "<Document_name>" \
    --targets "Key=InstanceIds,Values=i-1234567890abcdef0" \
    --timeout-seconds 300

Although we did this to install a particular agent, the solution can be extended to various other use cases (automation being one). One that comes to mind immediately is a replacement for airflow.contrib.operators.ssh_operator: with SSM we don't have to create an SSH connection; SSM takes care of it all. The only catch in the solution is SSM's 15-minute timeout, but if you follow a fire-and-forget approach it is a good option to explore.
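
As a sketch of that fire-and-forget pattern, here is what a task could look like with boto3 and the stock AWS-RunShellScript document; the status check at the end is optional:

import boto3

ssm = boto3.client("ssm")
instance_id = "i-1234567890abcdef0"

cmd = ssm.send_command(
    DocumentName="AWS-RunShellScript",  # AWS-owned document
    InstanceIds=[instance_id],
    Parameters={"commands": ["echo hello from SSM"]},
)
command_id = cmd["Command"]["CommandId"]

# Fire-and-forget would stop here. To poll instead, wait for completion
# (raises WaiterError if the command fails):
ssm.get_waiter("command_executed").wait(
    CommandId=command_id, InstanceId=instance_id
)
result = ssm.get_command_invocation(
    CommandId=command_id, InstanceId=instance_id
)
print(result["Status"], result["StandardOutputContent"])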

Happy Cloud Computing!!

