DevOps – Serverless OCR-NLP Pipeline using Amazon EKS, ECS and Docker

How we were able to auto-scale an Optical Character Recognition pipeline to convert thousands of PDF documents into text per day using an event-driven microservices architecture built on Docker and Kubernetes

Image by mohamed Hassan from Pixabay

On a recent project we were called in to create a pipeline that converts PDF documents to text. The incoming PDF documents were typically 100 pages long and could contain both typewritten and handwritten text. Users uploaded these documents to an SFTP server. On average there were 30–40 documents per hour, rising to as many as 100 during peak periods. Since their business was growing, the client wanted the ability to OCR up to a thousand documents per day. The extracted text was then fed into an NLP pipeline for further analysis.

Let’s do a Proof of Concept – Our Findings

Time to convert a 100-page document – 10 minutes

The Python process performing the OCR consumed around 6 GB of RAM and 4 CPUs.
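The article names only Python and the Tesseract library, so the sketch below is an assumption of what the per-page OCR step might look like, using the pdf2image and pytesseract libraries as stand-ins; all function names are illustrative.

```python
def page_record(page_num: int, text: str) -> dict:
    """Assemble one page's OCR output as a JSON-ready dict."""
    return {"page": page_num, "text": text.strip()}

def ocr_pdf(path: str) -> dict:
    """Render each PDF page to an image and OCR it with Tesseract."""
    from pdf2image import convert_from_path  # renders PDF pages to PIL images
    import pytesseract                       # Python wrapper around Tesseract

    images = convert_from_path(path, dpi=300)
    return {
        "source": path,
        "pages": [page_record(i + 1, pytesseract.image_to_string(img))
                  for i, img in enumerate(images)],
    }
```

Running Tesseract at 300 dpi on a 100-page document is exactly the kind of workload that produces the ~10-minute, 6 GB footprint measured above.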

We needed to come up with a pipeline that not only keeps up with regular demand but can also auto-scale during peak periods.

Final Implementation

We decided to architect a serverless pipeline using event-driven microservices. The entire process was broken down as follows:

  • Document uploaded in PDF format – handled using AWS Transfer for SFTP
  • An S3 event notification fires when a new PDF document is uploaded – triggers a Lambda function
  • The Lambda function adds an OCR event to Kinesis Streams
  • The OCR microservice is triggered – converts the PDF to text using the Tesseract library, one page at a time; the text output is saved as a JSON document in MongoDB
  • An NLP event is added to Kinesis Streams
  • The NLP microservice reads the JSON from MongoDB; the final NLP results are saved back to MongoDB
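The S3-to-Lambda-to-Kinesis hop in the steps above can be sketched as follows; the stream name `ocr-events`, the event payload shape, and the helper names are illustrative assumptions, not the project's actual code.

```python
import json

def ocr_event(bucket: str, key: str) -> dict:
    """Build a Kinesis record announcing a newly uploaded PDF."""
    return {
        "Data": json.dumps({"type": "OCR", "bucket": bucket, "key": key}),
        "PartitionKey": key,  # keeps events for one document on one shard
    }

def lambda_handler(event, context):
    """Triggered by the S3 event notification for each new PDF upload."""
    import boto3
    kinesis = boto3.client("kinesis")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        kinesis.put_record(StreamName="ocr-events", **ocr_event(bucket, key))
```

Partitioning by object key is one reasonable choice here; any key that spreads uploads evenly across shards would do.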
Image by Author

Technology Stack

Data Ingestion – AWS SFTP Service

Microservices – Docker Images stored in Amazon Elastic Container Registry (ECR)

Container Orchestration – Amazon Elastic Kubernetes Service (Amazon EKS) over EC2 Nodes

Serverless Compute Engine for Containers – AWS Fargate

Infrastructure Provisioning – Terraform

Messaging – Amazon Kinesis Streams

Network Architecture

Image by Author

Cluster Autoscaling

Cluster autoscaling was achieved using a combination of Horizontal & Vertical Scaling as below:

Vertical Scaling – Containers

Based on our calculations, a single EC2 node could support 25 running containers. We started the OCR microservice with 3 replica containers (minReplicas=3) and set the maximum to 25 (maxReplicas=25). We also set targetAverageUtilization=15: if a container's CPU utilization goes above 15% (i.e. the container is busy processing a document), a new container is spun up, to a maximum of 25 on a given EKS node.
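The replica settings above map onto a Kubernetes HorizontalPodAutoscaler manifest roughly like this; the resource names are assumptions, and the targetAverageUtilization field is shown in the autoscaling/v2beta1 API form that matches the field name used in the text:

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: ocr-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ocr-service
  minReplicas: 3
  maxReplicas: 25
  metrics:
    - type: Resource
      resource:
        name: cpu
        targetAverageUtilization: 15
```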

Image by Author

Horizontal Scaling – EKS Nodes

If the current EKS node is full, i.e. 25 containers are running concurrently, a new EKS node is automatically provisioned by EKS. The vertical scaling described above then takes over on the new node and spins up new containers.

This way the infrastructure is able to support hundreds of concurrent OCR and NLP processes. Once peak demand has passed, a wait period kicks in; after it expires, the newly provisioned EKS nodes and containers are scaled back down to the optimal resource allocation.

I hope this article was helpful in kick-starting your DevOps knowledge. Topics like these are covered as part of the DevOps course offered by Datafence Cloud Academy.
