Docker image for NLP

A ready-to-use solution

Vitaly Davydov
Towards Data Science
3 min read · Jan 14, 2019

NLP is one of the main directions of our work at Poteha Labs. We do text analysis, chatbot development and information retrieval, so we regularly use Flair, Natasha, TensorFlow, PyTorch and NLTK, sometimes working with languages other than English. Existing solutions are not always suitable for every problem we face: some are difficult to launch, others are too complicated and heavy.

Consequently, we’ve compiled our own Docker image with all the NLP frameworks we find convenient, including deep learning ones. It covers almost 80% of our tasks and saves the time otherwise spent installing the frameworks.

Current approaches

Generally, data science solutions are now deployed in one of two ways:

  1. On bare metal. You have to spend a lot of time setting up CUDA and cuDNN and installing drivers for Ubuntu, and then, having succeeded, make plenty of attempts to get it all running together. Something will definitely go wrong.
  2. Using Docker (a simpler approach). However, a regular Docker image won’t do: you need a specific one in which everything related to the GPU is preconfigured. Docker Hub already hosts several ready-made images built with different versions of CUDA, cuDNN and other modules.

How do you deploy a GPU setup in Docker? First, choose a base image that matches your graphics card (search by tag at hub.docker.com/nvidia). When using a GPU, everything is ultimately inherited from an Nvidia image of the required version (for CPU-only work, by contrast, any convenient image will do). Then inherit from that base image, build your own on top of it, and run it. The whole image will weigh around 3 GB, but everything will work fine.
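
Once the container is running, a quick sanity check from Python helps confirm that the GPU wiring actually works. A minimal sketch, assuming PyTorch is installed in the image (as it is in the one described below):

    # GPU sanity check from inside the container
    import torch

    print(torch.cuda.is_available())   # True if the container sees a GPU
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))  # name of the first visible device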

The solution

Having gone through all these difficulties ourselves, we have created a production Docker image for NLP (source code), which is free and available to everyone. A ready-made image is on our Docker Hub. It contains a handful of modern NLP frameworks, including ones for deep learning. Out of the box it includes torch, flair, spacy, dateparser, pymorphy2, yargy, natasha, nltk and yake (versions). In more detail:

  • flair is a state-of-the-art NLP library which provides convenient NER, PoS tagging, word sense disambiguation and classification.
  • natasha is a module for NER in Russian.
  • yargy is a Russian language parser.
  • yake is an automatic keyword extractor. Its main features are an unsupervised approach and corpus, domain and language independence (flair and yake are both shown in the sketch after this list).
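
To give a flavour of the bundled libraries, here is a minimal usage sketch; the sample sentences, the "ner" model name and the keyword settings are illustrative, and the exact APIs may differ slightly between the versions pinned in the image:

    # Named entity recognition with flair
    from flair.data import Sentence
    from flair.models import SequenceTagger

    tagger = SequenceTagger.load("ner")  # loads a pretrained English NER model
    sentence = Sentence("George Washington went to Washington.")
    tagger.predict(sentence)
    for entity in sentence.get_spans("ner"):
        print(entity)  # tagged spans such as PER and LOC

    # Unsupervised keyword extraction with yake
    import yake

    extractor = yake.KeywordExtractor(lan="en", n=2, top=5)  # up to 2-word keywords
    text = "Docker images make it easy to ship NLP models to production."
    for kw in extractor.extract_keywords(text):
        print(kw)  # (keyword, score) pairs; lower score means more relevant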

Four simple steps to install and run the image:

  1. Clone the repo.
  2. Build the image: docker build -t nlp-cpu -f ./Dockerfile.cpu . (or, for the GPU version, docker build -t nlp-gpu -f ./Dockerfile.gpu .)
  3. Run it: docker run -it --runtime=nvidia iwitaly/nlp:gpu nvidia-smi
  4. You can also pass the CUDA_VISIBLE_DEVICES environment variable to limit which GPUs the container sees, e.g. docker run -it --runtime=nvidia -e CUDA_VISIBLE_DEVICES=0 iwitaly/nlp:gpu nvidia-smi (see the sketch after this list).
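
To double-check that the variable is respected, you can count the devices visible to PyTorch inside the container. A minimal sketch (the docker invocation in the comment assumes the GPU image from step 3):

    # Start the container with a restricted device list, e.g.:
    #   docker run -it --runtime=nvidia -e CUDA_VISIBLE_DEVICES=0 iwitaly/nlp:gpu python
    import torch

    # With CUDA_VISIBLE_DEVICES=0 this prints 1, no matter how many
    # GPUs the host machine actually has.
    print(torch.cuda.device_count())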

With the help of this Docker image you can save time on deployment: launch it quickly and get straight to the NLP itself. We hope it will simplify your processes at least a little.

If you have any questions about the installation, please leave a comment here or contact me directly. Also, feel free to fork the repo and modify the original files.

Thank you for reading! Please ask your questions, leave comments and stay tuned!

