Spark in Docker in Kubernetes: A Practical Approach for Scalable NLP

Natural Language Processing using the Google Cloud Platform’s Kubernetes Engine

Jürgen Schmidl
Towards Data Science


Image by Free-Photos from Pixabay

This article is part of a larger project. If you are also interested in scalable web scraping or building highly scalable dashboards, you will find corresponding links at the end of the article.

1. Prerequisites to the reader

The project was developed on the Google Cloud Platform, and it is recommended to follow the tutorial there as well. You can run it on a local machine instead, but then you will need to alter the code and replace some of the resources used.

This article is aimed at readers who already have some experience with the Google Cloud Platform and the Linux shell. To help new readers get started, links to additional resources can be found within this article. If you haven’t worked with the Google Cloud Platform yet, you can use Google’s free trial program.

2. Introduction

2.1 Purpose of this Project

The goal of this article is to show how entities (e.g. Docker, Hadoop, etc.) can be extracted from articles (based on the structure of Towards Data Science articles) in a scalable way using NLP. We will also look at how other NLP methods, such as POS tagging, can be used.

2.2 Introduction to scalable NLP

Natural Language Processing (NLP)
Natural Language Processing enables machines to understand the structure and meaning of natural language, and allows them to recognize patterns and relationships in text.

Why should it be scalable?
The processing of written language can be very complex and can take a long time without a scalable architecture. Of course you can upgrade any single system and use faster processors, but in doing so the costs increase disproportionately to the efficiency gains achieved. It is better to choose an architecture that can distribute the computing load over several machines.

Apache Spark
Spark is a great way to make data processing and machine learning scalable. It can be run locally or on a cluster and uses distributed datasets as well as processing pipelines. Further information about Spark can be found here:
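To make this concrete, here is a minimal PySpark sketch (not part of the repository or the linked resources) showing a distributed dataset and a small processing pipeline; the choice of stages is only illustrative:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF

# Start a local Spark session; on a cluster the master URL would point to the cluster instead.
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

# A DataFrame is a distributed dataset that Spark can split across several executors.
df = spark.createDataFrame([(1, "Spark makes NLP scalable")], ["id", "text"])

# A pipeline chains processing stages that are applied to the distributed data.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="tokens"),
    HashingTF(inputCol="tokens", outputCol="features"),
])
pipeline.fit(df).transform(df).show(truncate=False)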

Spark-NLP
Spark-NLP is a library for Python and Scala that allows you to process written language with Spark. It will be presented in the following chapters. More information can be found here:

Redis
Redis is a key-value store that we will use to build a task queue.

Docker and Kubernetes
A Docker container can be imagined as a complete system in a box. If the code runs in a container, it is independent of the host’s operating system. A single container, however, limits the scalability of Spark; this can be compensated for by running the containers on a Kubernetes cluster. Such a cluster scales very quickly and easily via the number of containers. If you want to know more, please see:

3. Architecture

To start with, this is what our architecture will look like:

Architecture for scalable text processing

As you can see, this is a batch architecture. A Python script reads the text stored in the Google Datastore and creates a job queue. This queue is processed by the Kubernetes pods, and the results are written to BigQuery.

To keep the tutorial as short as possible, only the more computationally intensive language processing part scales. The task queue creator is a simple Python script (which could also run in Docker).

4. Setup

Start Google Cloud Shell

We will work with Google’s Cloud Console. To open it, you need to create a project and activate billing.

Then you should see the Cloud Shell button in the upper right corner.

After clicking, the shell should open in the lower part of the window (if you run into trouble, use the Chrome browser). To work comfortably with the shell, I recommend starting the editor:

Get the Repository

You can download the repository using:

git clone https://github.com/Juergen-Schmidl/TWD-01-2020.git

and the necessary model template:

$cd TWD-01-2020/5_NLP
$bash get_model.sh &
(Your Cloud Shell may freeze; try reconnecting after a few minutes.)

Set your Project ID as Environment Variable

Since we need the project ID on many occasions, it is easier to set it as an environment variable. Please note that if you interrupt the shell during this tutorial, you have to set it again. You can list your projects and their IDs using:

$gcloud projects list

Please run:

$export Project="yourprojectID"

Get your Service Account

You will need the key for a service account quite often. To learn how to create one, you can read this. I prefer the method using the IAM web interface, but there are many ways. For simplicity’s sake, the account should be given the role of “Editor”, but a finer-grained setup is recommended in the long run. After that, carry out the following steps:

  • Download the key as JSON file
  • Rename your key to sa.json
  • Put one copy in each directory (4_Setup, 5_NLP, 6_Scheduler)

Your directory should now look like this:

Setup Input-data:
(If you have completed the first part of the project, you can skip this.)

We use Google Cloud Datastore in Datastore mode to provide the source data. In order to prepare it, you have to put the database into Datastore mode. To do this, simply search for “Datastore” in the Cloud Platform and click on “Select Datastore Mode” (choose a location as well if needed).

After that change directory to 4_Setup and run:

$cd .. ; cd 4_Setup
(You may have to enter this command manually)
$python3 Create_Samples.py

If you see “Samples generated”, you now have 20 sample entries in your Cloud Datastore.
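For reference, writing such a sample entry with the Python Datastore client looks roughly like this; it is only a sketch, and the kind and property names used by Create_Samples.py may differ:

from google.cloud import datastore

# The client picks up the sa.json key via GOOGLE_APPLICATION_CREDENTIALS.
client = datastore.Client()

# "Article" and its properties are assumed names, used here for illustration only.
entity = datastore.Entity(key=client.key("Article"))
entity.update({
    "title": "Sample article",
    "text": "Docker and Hadoop are mentioned in this sample text.",
})
client.put(entity)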

Setup Output-Tables
To store the processed data, we create several BigQuery tables. We do this using the following bash script:

$bash Create_BQ_Tables.sh

Once all tables have been created, we have successfully set up all required resources.
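If you are curious what such a table definition looks like in code, here is a hedged Python sketch using the BigQuery client; the dataset name, table name, and schema are assumptions, and the authoritative definitions are in Create_BQ_Tables.sh:

from google.cloud import bigquery

client = bigquery.Client()

# Dataset name, table name, and schema are placeholders for illustration.
dataset = client.create_dataset("nlp_results", exists_ok=True)
schema = [
    bigquery.SchemaField("article_id", "STRING"),
    bigquery.SchemaField("entity", "STRING"),
]
table = bigquery.Table(dataset.table("Entity_raw"), schema=schema)
client.create_table(table, exists_ok=True)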

5. How to use Spark-NLP

5.1 Overview

The NLP module is located in the repository folder “5_NLP”. Please move to this directory (using the shell). The following files should be in the folder:

  • Explainer.py
    The main script. Here Spark is started, and the pipeline is created and filled with the pipeline model. The text processing also takes place here.
  • Model_Template.zip
    An example model that extracts entities, i.e. proper names, from the texts.
  • sa.json
    Your Google Cloud service account. If you run into 404 or 403 errors, please check the permission granted in the IAM for this service account.
  • Dockerfile
    This file contains the setup for the environment of the script. The exact setup is explained below.
  • requirements.txt
    This file contains required Python libraries, which are installed during the creation of the Docker image.
  • Explainer.yaml
    This file contains information on how Kubernetes should handle the Docker image.

5.2 Spark-NLP in Python

As mentioned above, Spark-NLP is a library that allows us to process texts in Spark. To do so, we use the library in the Python script “Explainer.py”. I have commented the code extensively, so I will only cover a few parts here:

First of all, you may need to specify the name of your service account file (if you haven’t stuck to sa.json):
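A minimal sketch of what this typically looks like (the exact line may differ in Explainer.py; adjust the file name if you did not use sa.json):

import os

# Point the Google client libraries at the service account key.
# The key file is expected to sit next to the script.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "sa.json"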

The script’s entry point uses Peter Hoffmann’s Redis class to query the Redis instance regularly for new entries in the task queue. We haven’t set up the instance yet, so the script will not work at this point.
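Conceptually, that class is a small FIFO queue built on top of a Redis list. A minimal sketch of the idea, assuming the queue name "explainer_jobs" and the host "redis-master" (the exact code in Explainer.py may differ):

import redis

class RedisQueue(object):
    """Simple FIFO queue backed by a Redis list."""
    def __init__(self, name, host="redis-master", port=6379):
        self.key = name
        self.db = redis.Redis(host=host, port=port)

    def put(self, item):
        # Append the job to the tail of the list.
        self.db.rpush(self.key, item)

    def get(self, timeout=5):
        # Block until a job is available or the timeout expires.
        item = self.db.blpop(self.key, timeout=timeout)
        return item[1] if item else None

queue = RedisQueue("explainer_jobs")
while True:
    job = queue.get()
    if job:
        print("received job:", job)  # Explainer.py hands the job to its Explain function here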

As soon as a task arrives in the task queue, the “Explain” function is called, where the processing takes place.

As you can see, the actual logic is located in the model that is stored in (self.Model). This model contains all the important NLP steps, such as tokenization, lemmatization, and entity tagging, and is unpacked from a ZIP file by the function Load_Model(). To build a model yourself, please refer to this notebook:
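To give an idea of what happens inside (this is a hedged sketch, not the notebook code; the column names, the extraction path, and the output column "entities" are assumptions, and the exact code is in Explainer.py):

import zipfile
import sparknlp
from pyspark.ml import PipelineModel

# Start a Spark session with the Spark-NLP jars attached.
spark = sparknlp.start()

# Unpack the pre-trained pipeline model shipped as a ZIP archive.
with zipfile.ZipFile("Model_Template.zip") as archive:
    archive.extractall("model")

# Adjust the path if the archive contains a subfolder.
model = PipelineModel.load("model")

# Run the whole pipeline (tokenizer, lemmatizer, NER, ...) on a text column.
df = spark.createDataFrame([("Docker and Hadoop are popular tools.",)], ["text"])
result = model.transform(df)
result.select("entities.result").show(truncate=False)  # output column name is an assumption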

5.3 Build Docker

The Python file needs a working Spark environment. In order to provide this environment, a Docker container is created using the Dockerfile. Our Dockerfile looks like this:

A Dockerfile allows us to define a complete system in a single file. The most important commands are:

FROM: Sets the base image. A base image can be a bare operating system, but it may also come with other programs already installed.

ENV: Spark needs some environment variables to work. The ENV command sets them for the Docker container.

COPY and WORKDIR: COPY copies the entire parent directory of the Dockerfile into the container, and WORKDIR sets this directory as the working directory.

RUN: Executes commands in the Docker container’s shell. Usually used to install applications.

CMD: A Dockerfile can have only one CMD; here the actual Python script is called. The -u flag is important to get logs from the container.

To build the Docker image, please change the directory to “5_NLP” and execute the following command:

$docker build --tag explainer_img .
(You may have to enter this command manually)

This command builds the Docker image from the Dockerfile in this directory. We cannot start it yet because the Redis instance is not running, but we have successfully created the image.

To run it later on a Kubernetes cluster, we have to push the image to the Container Registry. To do so, activate the API: in the Google Cloud Platform, search for “Container Registry”. Afterwards, run the following commands:

$docker tag explainer_img gcr.io/$Project/nlp_explainer:latest
$docker push gcr.io/$Project/nlp_explainer:latest

You should now be able to see your file in the Container Registry:

If this worked so far, we can now move on to the Kubernetes cluster and get this project to work.

6. Deployment to Kubernetes

6.1 Set up Kubernetes Cluster

This part is quite simple, since Google allows us to create a Kubernetes cluster from the command line. You can run the following command to create a very small cluster:

$bash Create_Cluster.sh

The creation may take a few minutes. If you want to create a bigger cluster, check out the Kubernetes Engine on the Google Cloud Platform. If you created the cluster using the web interface of the Kubernetes Engine, you first need to connect your console to the cluster. You can get the command by clicking on “Connect”:

6.2 Set up Redis as a Kubernetes Service

First we have to make Redis available on the Kubernetes cluster and register it as a service. This is necessary because the containers run isolated from each other. If a container is registered as a service, all other containers can connect to it.

To do this, the Redis deployment and the service that exposes it are each created from a .yaml file located in the folder “6_Scheduler”. Run:

$kubectl create -f redis-master-service.yaml

followed by (from the second .yaml file):

$kubectl create -f redis-master.yaml

If you take a closer look at the .yaml files, you will see that you can specify all the settings needed there. The line “replicas:” is of particular importance, because its value defines the number of parallel instances and therefore the capacity for processing data (limited, of course, by the underlying machine).

We work on a quite small machine, so we shouldn’t create more than one replica.

If the creation was successful, you should see the following output:

$kubectl get pods
This is the pod containing Redis
$kubectl get services
And here you can see the service which provides connectivity to the other pods
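Once the service is running, any pod in the cluster can reach Redis through the service name via Kubernetes DNS. A small check, assuming the service is called redis-master:

import redis

# Inside the cluster, the service name resolves to the Redis pod(s).
r = redis.Redis(host="redis-master", port=6379)
print(r.ping())  # True if the service is reachable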

6.3 Fill Redis queue with tasks

Now that we have set up the Redis instance, we can start filling it with tasks. First we need to establish a local connection to the Redis service. The following command is needed for this:

$kubectl port-forward deployment/redis-master 6379 &

Then we have to install the Redis client library for Python on the local machine:

$sudo pip3 install redis
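With the port forwarded, you can quickly verify the connection from your local shell; this small check is not part of the repository:

import redis

# kubectl port-forward exposes the cluster-internal Redis on localhost.
r = redis.Redis(host="localhost", port=6379)
print(r.ping())  # should print True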

Afterwards, the Python script can be called. It retrieves the data from the Cloud Datastore, preprocesses it and puts it into the Redis task queue.
You can start the script with:

$python3 Scheduler.py

This script is commented in detail, so I will only mention a few points:

The “Process Batch” method contains the actual logic. Here, the articles are read from the Cloud Datastore in batches and passed to the “Send_Job” method.

Since Redis does not handle special characters well, these are removed to ensure smooth processing.

Then the created jobs are stored in the Redis database with the .put command.
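Put together, the scheduling step looks roughly like this; the entity kind, the batch size, and the job fields are assumptions, and Scheduler.py uses its queue class’s .put instead of calling Redis directly:

import json
import redis
from google.cloud import datastore

client = datastore.Client()
r = redis.Redis(host="localhost", port=6379)  # reachable via the kubectl port-forward

# Fetch articles in batches and turn each one into a job on the Redis list.
query = client.query(kind="Article")       # kind name is an assumption
for article in query.fetch(limit=20):
    job = json.dumps({
        "id": article.key.id_or_name,
        "text": article.get("text", ""),
    })
    r.rpush("explainer_jobs", job)          # queue name is an assumption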

Note: Checking whether a regex replacement is needed at all is about 10 times faster than performing the replacement itself. If special characters are already taken into account when filling the Datastore, the scheduler can work much faster.

Comparison of execution times
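The pattern behind that note is simply to test before substituting; a minimal sketch, with the exact character set being an assumption:

import re

SPECIAL = re.compile(r"[^A-Za-z0-9 .,;:!?'\-]")

def clean(text):
    # Searching first is much cheaper than always substituting,
    # because most texts contain no special characters at all.
    if SPECIAL.search(text):
        return SPECIAL.sub(" ", text)
    return text

print(clean("Entities like Kubernetes™ & Spark…"))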

You should see the following output:

The task queue is filled

Don’t forget to terminate the port forwarding afterwards:

$pkill kubectl -9

6.4 Deploy a Docker Container

You have already created the Docker container for the explainer in 5.3 and pushed it into the Cloud Container Registry as “gcr.io/[your Project]/nlp_explainer:latest”. We will now deploy it on the Kubernetes cluster.

For this purpose a .yaml file containing all relevant information for Kubernetes is used again.

Please note that you have to insert your own container image from the registry (in the image: field)!

To push the .yaml file, you just need to execute the following command in the “5_NLP” folder:

$kubectl create -f Explainer.yaml

If you use the small cluster, I recommend deploying just one replica!
Otherwise you will run into problems caused by a lack of resources.

After that you can see the pods using:

$kubectl get pods
The number of pods may differ, depending on your Explainer.yaml file

After the creation of the container, you should get the following result with this command (it can take up to 2 minutes):

$kubectl logs [pod-name]
for example:
$kubectl logs explainer-544d123125-7nzbg

You can also see errors in the log, if something went wrong.

If everything has been processed, the pod stays idle or will be evicted and recreated by Kubernetes (because it uses a loop and does not finish).

To delete all pods without recreation use:

$kubectl delete deployments [deployment]
for example:
$kubectl delete deployments explainer

If you don’t want to use the Kubernetes cluster anymore, you should delete it, either via the web interface or by using the following commands; otherwise Google will continue billing you.

$gcloud config set project $Project
$gcloud container clusters delete [your-clustername] --zone [Zone]
for example:
$gcloud config set project $Project
$gcloud container clusters delete "your-first-cluster-1" --zone "us-central1-a"

6.5 Check Results

To check the result, please go to the Google Cloud Platform and search for “BigQuery”.

You should be able to see the following tables at the bottom left:

The tables “Article_masterdata” and “Article_tags” have been created by the Scheduler to serve the needs of the follow-up project. But we want to see the content of the “Entitiy_raw” table.

To access it, click on the table and switch to “Preview”:

You should see the results of the entity recognition with the respective article ID.

And that’s it, you are done!

If you are interested in seeing how to create a highly scalable dashboard based on this data, we would be happy if you read the following tutorial: Build a highly scalable dashboard that runs on Kubernetes by Arnold Lutsch

You could also replace the sample data with real articles using this tutorial: Build a scalable webcrawler for towards data science with Selenium and python by Philipp Postels
