Let me tell you the use cases of Docker in Data Science!

“Always walk through life as if you have something to learn, and you will.” - Vernon Howard

Shobhit Srivastava
Towards Data Science


Source: Pexels, photo by Kaique Rocha

We always strive to learn new things in life, whether it is learning from experience or picking up new technology to stay up to date. With that in mind, today we will learn something new. Don’t worry, I am not going off track. Many of you may have wondered: what comes after data analysis and model building? So, enough of data analysis and machine learning model building; today we will learn something new, yet closely related to both.

In this article, we will learn about the use of Docker in data science and its applications in the field.

Many of you may be working with Docker for the first time. Don’t worry, I will start from the very beginning.

Main points to discuss:

  1. What is Docker?
  2. Why do we need docker?
  3. Its use cases in Data Science.
  4. Conclusion.

What is Docker?

Docker is an open-source technology that provides operating-system-level virtualization. It consists of a set of tools that let us pack any software or application, together with its dependencies, into a unit called a container, so that it can run on any host system.

It is not limited to the software industry; it is useful anywhere code has to be moved from one machine to another.

It is sometimes seen as an alternative to virtual machines. We can say that if a virtual machine is a hotel, then a Docker container is a room in that hotel.

Docker consists of three parts: the command-line client, the Docker server (daemon), and Docker Hub.

It is from the command line that one gives instructions to build, run, and ship images. The Docker server manages the images, containers, and their resources, while Docker Hub is a registry platform where one hosts and shares images.
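The three parts work together in a simple loop: the CLI sends commands to the daemon, which builds and runs images locally and ships them to Docker Hub. As a rough sketch of that flow (the image name `myname/myapp` is a hypothetical example):

```shell
# Build an image from the Dockerfile in the current directory
docker build -t myname/myapp .

# Run a container from that image (the daemon does the actual work)
docker run myname/myapp

# Ship the image to Docker Hub so others can pull it
docker login
docker push myname/myapp
```

These commands need a Docker installation and a Docker Hub account to actually run; we will look at each command in more detail below.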


Why do we need Docker?

You have probably all experienced this: if a piece of software runs on one platform (by platform I mean operating system), it is not guaranteed to run on every other. This happens because of differences in low-level hardware architecture or software incompatibilities. To deal with this issue, the concept of the virtual machine was introduced. A virtual machine is a discrete, isolated environment that contains everything necessary to run an application. Virtual machines sit on top of the operating system, which controls them through a software layer called a hypervisor. The problem is that the memory and computing power allotted to one virtual machine cannot be shared with others, and a single virtual machine demands a lot of both. As a result, virtual machines consume a large share of the computer’s resources and leave the host machine struggling.

Comparison between Docker and a virtual machine. Source: my gallery

To cope with this, the concept of Docker was introduced. Docker can be thought of as a lightweight virtual machine, though it is quite different under the hood. It occupies less space, and memory not in use by one container can be shared with another. This takes a lot of load off the PC and keeps everything running smoothly.

Now you may be wondering: what is a container?

Here is an example to make it easy to understand. Suppose you bought a new house and need to move your furniture and belongings from the old house to the new one. What will you do? You can’t carry one item at a time; that would take a lot of time, and there is a good chance of forgetting something along the way. So you pack everything into a box and shift it all at once, with no chance of leaving anything behind. Now think of that box as a container that holds all the code and dependencies, and of the houses as operating systems. That should make everything clear.

Use cases of Docker in Data Science

Let me give you an analogy. Suppose you and your friend are working on a data science project. Your friend has developed some models and wants you to have a look. What would you do? I am pretty sure you would first check whether the project’s dependencies (libraries) are installed on your system. If not, you would install all of them and then run the code.

Now here comes the magic of Docker. Using Docker, your friend can share an image of the project, which you can run directly on your PC without installing anything.

Further, if you are well versed with Flask, you can first create a web app with all its features and then create an image of it. Running that image will host the app on your PC with a single command, without any other setup.
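As a sketch, assuming the Flask app listens on port 5000 inside the container and the image is named `my-flask-app` (both hypothetical names):

```shell
# Run the image and map the container's port 5000 to the host's port 5000
docker run -p 5000:5000 my-flask-app

# The app is then reachable in the browser at http://localhost:5000
```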

In most cases, after the model training and development cycle, the data scientist hands the model to a software engineer for deployment. Without Docker, the engineer has to install every dependency required to test the model in the new environment, which makes the task quite tedious. Here is where Docker comes to the rescue: the data scientist just shares an image in which the predictive model and its dependencies already reside, and the engineer simply runs it in a testing environment with Docker installed.
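That handoff can be sketched as two commands on each side (the image name `teamname/churn-model` is a hypothetical example):

```shell
# Data scientist: build the image and push it to Docker Hub
docker build -t teamname/churn-model .
docker push teamname/churn-model

# Software engineer: pull the image and run it, nothing else to install
docker pull teamname/churn-model
docker run teamname/churn-model
```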

Container<->Images, what’s the difference?

For now take containers as a running instance of a docker image.
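You can see the distinction on your own machine: `docker images` lists the images stored locally, while `docker ps` lists the containers running from them.

```shell
# List the images available on this machine
docker images

# List running containers (add -a to include stopped ones)
docker ps -a
```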

There are a few terms associated with Docker.

1. Dockerfile: This file contains the step-by-step commands for creating a Docker image.

2. requirements.txt: This file lists all the dependencies required to create and run the project.

3. image: This is an isolated snapshot of your project that contains all the dependencies and code. This is the artifact that you share.

4. container: This is a running instance of your Docker image.
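To make these terms concrete, here is a minimal sketch of a Dockerfile for a Python project; the file name app.py and the base image tag are assumptions for illustration:

```dockerfile
# Start from an official Python base image
FROM python:3.9-slim

# Copy the dependency list and install the libraries it names
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the project code into the image
COPY app.py .

# Command to run when a container starts from this image
CMD ["python", "app.py"]
```

A matching requirements.txt would simply list the project’s libraries, one per line (for example `flask` and `scikit-learn`).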

To install Docker, visit here.

Basic and most used commands for Docker

1. start: Starts a stopped container.

$ docker start <container_id>

2. stop: Stops a running container.

$ docker stop <container_id>

3. build: Builds an image from a Dockerfile.

$ docker build -t <image_name> .  # . is the directory containing the Dockerfile

4. pull: Pulls an image or a repository from a registry.

$ docker pull ubuntu:14.04  # ubuntu is an image on Docker Hub

5. push: Pushes an image to Docker Hub.

$ docker push user_name/image_name

6. commit: Creates a new image from a container’s changes.

# docker commit [OPTIONS] CONTAINER [REPOSITORY[:TAG]]
$ docker commit c3f279d17e0a user_id/repo_name

7. images: Displays a list of all the images on the system.

$ docker images  # add -q to show only image IDs, -a to include intermediate images

Conclusion

In this article we discussed Docker, why it is needed in the data science field, how to use it, and how it differs from a virtual machine. Along the way, we also learned some of the most basic and frequently used Docker commands.
We will work through the implementation of Docker on a data science project in the very next article. So follow me and stay tuned!

Visit my LinkedIn profile if you want to connect with me.

Thank you for reading.

