Running Spark NLP in Docker Container for Named Entity Recognition and Other NLP Features

Using Spark NLP with Jupyter notebook for natural language processing in Docker environment

Yuefeng Zhang, PhD
Towards Data Science



As described in [1], natural language processing (NLP) is a research subfield shared by a number of fields such as linguistics, computer science, information engineering, and artificial intelligence. NLP is concerned with the interactions between computers and human natural languages in general, and in particular with how to use computers to process and analyze natural language data (e.g., text and voice). Some of the major challenges in NLP include speech recognition, natural language understanding (e.g., text understanding), and natural language generation.

One of the early applications of machine learning to text understanding is email and message spam detection [1]. With the advancement of deep learning, many more powerful language understanding methods have been published, such as BERT (see [2] for an example of using MobileBERT for question answering).

Another popular NLP task is Named Entity Recognition (NER). The main purpose of NER is to extract named entities (e.g., personal names, organization names, location names, product names, etc.) from unstructured text. There are many open source NLP libraries/tools with NER support, such as NLTK and SpaCy [3]. Recently, Spark NLP [4] has been attracting more and more attention due to its more complete list of supported NLP features [5][6].

It seems to me that Spark NLP [4] is developed on Ubuntu Linux with OpenJDK. Thus it is straightforward to set up an environment for Spark NLP in Colab (see instructions and code examples), since Colab runs on the Ubuntu operating system. However, I noticed that it is difficult to set up a local environment for Spark NLP on Mac due to the following known exception:

Exception: Java gateway process exited before sending its port number

To avoid this problem, this article demonstrates how to set up a Docker environment [7] and run Spark NLP for NER and other NLP features in a Docker container. Such a Docker environment can also serve as a basis for establishing a Spark NLP microservices platform.

1. Introduction to Docker

As described in [7], Docker is a tool that allows us to easily deploy applications (e.g., Spark NLP) in a sandbox (called a container) that can run on any Docker-supported host operating system (e.g., Mac).

The basic concepts of Docker are:

  • Dockerfile
  • Docker image
  • Docker container

1.1 Dockerfile

A Dockerfile [7] is a simple text file that contains a list of commands (similar to Linux commands) for creating a Docker image. It’s a way to automate the Docker image creation process.
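For illustration only, a trivial Dockerfile might look like the sketch below (the base image and commands are arbitrary examples, not the Dockerfile used later in this article):

# Minimal illustrative Dockerfile (arbitrary example)
FROM ubuntu:18.04
# Install a package into the image
RUN apt-get update && apt-get install -y python3
# Default command executed when a container is started from this image
CMD ["python3", "--version"]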

1.2 Docker Image

A Docker image [7] is a read-only template that contains a set of instructions for creating a Docker container that can run on the Docker platform. It provides a convenient way to package up applications and preconfigured server environments.

A Docker image is built from a Dockerfile.

1.3 Docker Container

A container is a standard package of software that includes code and all its dependencies so the application can run quickly and reliably from one computing environment to another. A Docker container [7] is a lightweight, standalone, executable package of software that includes everything for an application such as code, runtime, system tools, system libraries and settings.

A Docker container is built from a Docker image.

2. Setup Docker Environment for Spark NLP with Jupyter Notebook

The procedure of setting up a Docker environment for running Spark NLP with Jupyter notebook consists of the following steps:

  • installing Docker
  • signing up in Docker Hub
  • creating Dockerfile
  • building Docker image
  • starting Docker container
  • pushing Docker image
  • pulling Docker image

2.1 Installing Docker

The instructions for installing Docker on different platforms are available online: Mac, Linux and Windows.

Once the Docker installation is finished, we can use the following docker command and corresponding output to verify the installation:

docker --version
Docker version 19.03.8, build afacb8b

2.2 Signing Up in Docker Hub

Similar to GitHub for sharing source code files, Docker Hub is used for sharing Docker images. In order to share a Docker image that resides on a local machine, the image needs to be pushed to the Docker Hub server so that other people can pull it from Docker Hub.

You need to go to Docker Hub and sign up first before the Docker Hub services can be used.
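Once an account has been created, the local Docker client can be authenticated against Docker Hub with the standard docker login command, which prompts for the Docker Hub username and password:

docker login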

2.3 Creating Dockerfile

In order to build a new Docker image, a Dockerfile needs to be created first.

To simplify the process of running the Spark NLP workshop, John Snow Labs provides a Spark NLP workshop Dockerfile for running the workshop examples in a Docker container.

In order to build a new Docker image for running Spark NLP with Jupyter notebook in a Docker container, I created a new Dockerfile [8] based on the Spark NLP workshop Dockerfile with the following modifications (a rough sketch of what such a Dockerfile looks like is given after the list):

  • removed tutorials and related notebooks and data files
  • replaced Spark NLP 2.4.5 with Spark NLP 2.5.1
  • adjusted the Docker Hub username
  • adjusted the home directory name in the Docker container
  • added the command-line volume option to map the current working directory on the host machine to the home directory in the Docker container
  • removed Jupyter notebook configuration
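For reference, the sketch below shows the general shape of such a Dockerfile. It is not the exact file from [8]: the base image, package versions, user name, and start-up command are assumptions, and the actual values should be taken from the Dockerfile in the repository.

# Sketch of a Spark NLP + Jupyter Dockerfile (assumed values; see [8] for the real file)
FROM openjdk:8-jdk-slim
# Install Python plus the pyspark, spark-nlp, and jupyter packages (versions are assumptions)
RUN apt-get update && apt-get install -y python3 python3-pip && \
    pip3 install pyspark==2.4.4 spark-nlp==2.5.1 jupyter
# Create a non-root user whose home directory is mapped from the host at run time
RUN useradd -m yuefeng
USER yuefeng
WORKDIR /home/yuefeng
# Jupyter notebook port and Spark UI port
EXPOSE 8888 4040
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser"]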

2.4 Building Docker Image

With the new Dockerfile [8], a new Docker image can be built as follows:

docker build -t zhangyuefeng123/sparknlp:1.0 .

Once the build is done, the new Docker image and its tag should be available on the local machine.
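For example, listing local images filtered by the repository name should show an entry like the following (the image ID, creation time, and size are placeholders whose actual values depend on the build):

docker images zhangyuefeng123/sparknlp
REPOSITORY                 TAG    IMAGE ID      CREATED       SIZE
zhangyuefeng123/sparknlp   1.0    <image id>    <created>     <size>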

2.5 Starting Docker Container

Once the new Docker image is ready, the following command can be used to start a new Docker container to run Spark NLP with Jupyter notebook:

docker run -it --volume $PWD:/home/yuefeng -p 8888:8888 -p 4040:4040 zhangyuefeng123/sparknlp:1.0

If everything goes smoothly, the Jupyter notebook server starts inside the container and prints output that contains a URL with an access token (see Section 3 for how to use it).

2.6 Pushing Docker Image

In order to share a Docker image (e.g., zhangyuefeng123/sparknlp:1.0) on the local host with others, the image needs to be pushed to Docker Hub as follows:

docker push zhangyuefeng123/sparknlp:1.0

After the push completes, the image zhangyuefeng123/sparknlp:1.0 shows up in the corresponding repository on Docker Hub.

2.7 Pulling Docker Image

Rather than building a new Docker image from a Dockerfile, if a Docker image (e.g., zhangyuefeng123/sparknlp:1.0) with the expected functionality already exists in Docker Hub, then it can be pulled onto the local host for reuse as follows:

docker pull zhangyuefeng123/sparknlp:1.0

3. Running Spark NLP with Jupyter Notebook in Docker Container

Once a new Docker container is running (see Section 2.5 for details), we can copy the generated URL (like the one below) from the container output and paste it into a Web browser to start the Jupyter notebook Web interface:

http://127.0.0.1:8888/?token=9785e71530db2288bc4edcc70a6133136a39c3f706779554

Once the Jupyter notebook starts, we can use it as usual (see next section for details).

4. Using Spark NLP for NER and Other NLP Features

To verify that the Jupyter notebook running in the Docker container provides the expected functionality, I created a new Jupyter notebook spark-nlp-docker-demo.ipynb and used it to execute the major pieces of code in [6], which applies Spark NLP to NER and other NLP features.

As a start, the following code imports the required pyspark and Spark NLP libraries and then starts a Spark session for running Spark NLP on Spark:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
import sparknlp
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
spark = sparknlp.start()
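As a quick sanity check that the container runs the intended versions (a small sketch; the printed values depend on the image), the Spark NLP and Apache Spark versions can be printed:

# Print the library versions available inside the container
print("Spark NLP version:", sparknlp.version())
print("Apache Spark version:", spark.version)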

The training and test files of the official CoNLL2003 dataset are downloaded for demonstration purposes:

from urllib.request import urlretrieve

urlretrieve('https://github.com/JohnSnowLabs/spark-nlp/raw/master/src/test/resources/conll2003/eng.train',
            'eng.train')
urlretrieve('https://github.com/JohnSnowLabs/spark-nlp/raw/master/src/test/resources/conll2003/eng.testa',
            'eng.testa')

The following code reads the training dataset and prints its first 500 characters. The training dataset follows the standard CoNLL2003 annotation format used to train NER models: each line holds a token followed by its part-of-speech tag, syntactic chunk tag, and NER tag, and sentences are separated by blank lines.

with open("eng.train") as f:
c=f.read()
print (c[:500])

The training dataset can be loaded into a Spark DataFrame in a more readable format using Spark NLP's CoNLL reader:

from sparknlp.training import CoNLL

training_data = CoNLL().readDataset(spark, './eng.train')
training_data.show(3)
training_data.count()

The following code loads a pre-trained BERT embeddings model and uses it to transform the testing dataset into BERT embeddings format (i.e., encoding each word as a 768-dimensional vector).

bert_annotator = BertEmbeddings.pretrained('bert_base_cased', 'en') \
    .setInputCols(["sentence", 'token']) \
    .setOutputCol("bert") \
    .setCaseSensitive(False) \
    .setPoolingLayer(0)

test_data = CoNLL().readDataset(spark, './eng.testa')
test_data = bert_annotator.transform(test_data)
test_data.show(3)

The code below shows the tokens of sentences, the corresponding BERT embeddings, and the corresponding labeled NER tags.

test_data.select("bert.result","bert.embeddings",'label.result').show()

The following code first saves 1,000 records of the testing dataset (with their BERT embeddings) into a Parquet file, then creates NerDLApproach, a TensorFlow-based character-level CNN + BiLSTM + CRF NER model, forms a pipeline from the pre-trained BERT embeddings annotator bert_annotator and the NerDLApproach model, and finally fits the pipeline on 1,000 records of the training dataset, using the saved Parquet file as the test dataset for evaluation during training.

test_data.limit(1000).write.parquet("test_withEmbeds.parquet")

nerTagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "bert"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(1) \
    .setLr(0.001) \
    .setPo(0.005) \
    .setBatchSize(8) \
    .setRandomSeed(0) \
    .setVerbose(1) \
    .setValidationSplit(0.2) \
    .setEvaluationLogExtended(True) \
    .setEnableOutputLogs(True) \
    .setIncludeConfidence(True) \
    .setTestDataset("test_withEmbeds.parquet")

pipeline = Pipeline(
    stages=[
        bert_annotator,
        nerTagger
    ])

ner_model = pipeline.fit(training_data.limit(1000))

The trained pipeline can then be used to predict the NER tags of the testing dataset (see the first 20 rows of results below):

predictions = ner_model.transform(test_data)
predictions.select('token.result','label.result','ner.result').show(truncate=40)

The first 20 rows of tokens, labeled NER tags, and the corresponding predicted NER tags can be shown in a more readable format:

import pyspark.sql.functions as F

predictions.select(F.explode(F.arrays_zip('token.result', 'label.result', 'ner.result')).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ground_truth"),
            F.expr("cols['2']").alias("prediction")).show(truncate=False)

The following code shows how to use a pre-trained pipeline to generate NER tags for a given sentence.

from sparknlp.pretrained import PretrainedPipeline

pretrained_pipeline = PretrainedPipeline('recognize_entities_dl', lang='en')
text = "The Mona Lisa is a 16th century oil painting created by Leonardo. It's held at the Louvre in Paris."
result = pretrained_pipeline.annotate(text)
list(zip(result['token'], result['ner']))

Different pre-trained models can be used to form a new pipeline:

import json
import os
from pyspark.ml import Pipeline
from sparknlp.base import *
from sparknlp.annotator import *
import sparknlp
def get_ann_pipeline():

    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol('document')

    sentence = SentenceDetector() \
        .setInputCols(['document']) \
        .setOutputCol('sentence') \
        .setCustomBounds(['\n'])

    tokenizer = Tokenizer() \
        .setInputCols(["sentence"]) \
        .setOutputCol("token")

    pos = PerceptronModel.pretrained() \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol("pos")

    embeddings = WordEmbeddingsModel.pretrained() \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol("embeddings")

    ner_model = NerDLModel.pretrained() \
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setOutputCol("ner")

    ner_converter = NerConverter() \
        .setInputCols(["sentence", "token", "ner"]) \
        .setOutputCol("ner_chunk")

    ner_pipeline = Pipeline(
        stages=[
            document_assembler,
            sentence,
            tokenizer,
            pos,
            embeddings,
            ner_model,
            ner_converter
        ]
    )

    empty_data = spark.createDataFrame([[""]]).toDF("text")
    ner_pipelineFit = ner_pipeline.fit(empty_data)
    ner_lp_pipeline = LightPipeline(ner_pipelineFit)

    return ner_lp_pipeline

The following code uses the above function to create a new pipeline and then uses it to generate various annotations/tags for a given sentence:

conll_pipeline = get_ann_pipeline()
parsed = conll_pipeline.annotate("Peter Parker is a nice guy and lives in New York.")
parsed
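The call returns a Python dictionary keyed by the output columns of the pipeline (e.g., 'token', 'pos', 'ner', 'ner_chunk'). For example, each token can be paired with its predicted NER tag, following the same pattern used with the pre-trained pipeline earlier:

# Pair each token with its predicted NER tag
list(zip(parsed['token'], parsed['ner']))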

5. Summary

Spark NLP [4] has been getting more and more popular since it supports many NLP capabilities in one system. Spark NLP is developed on Ubuntu Linux with OpenJDK. From my experience, it is difficult to set up a local environment for Spark NLP on Mac due to a known exception: “Exception: Java gateway process exited before sending its port number”.

To avoid this installation problem, in this article, I demonstrated how to set up a Docker environment to run Spark NLP with Jupyter notebook for NER and other NLP capabilities in a Docker container.

I verified the Docker environment for Spark NLP on Mac using the code examples in [6].

Such a Docker environment for Spark NLP has the potential of being used as a basis for establishing a Spark NLP microservices platform.

Both the Dockerfile and the Jupyter notebook for Docker are available on GitHub [8].

References

  1. Y. Zhang, Deep Learning for Natural Language Processing Using word2vec-keras
  2. Y. Zhang, Deep Learning for Natural Language Processing on Mobile Devices
  3. S. Li, Named Entity Recognition with NLTK and SpaCy
  4. Spark NLP
  5. V. Kocaman, Introduction to Spark NLP: Foundations and Basic Components
  6. V. Kocaman, Named Entity Recognition (NER) with BERT in Spark NLP
  7. P. Srivastav, Docker for Beginners
  8. Y. Zhang, Dockerfile and Jupyter notebooks on GitHub
