Understand, Build & Use Docker Images and Containers for Data Science
One of the first steps you take when you embark on your data science journey is installing different software such as Python, Jupyter Notebook, various IDEs and countless libraries. Once you successfully get through this, you often encounter situations where your code seems to work fine on your computer, but when you share it with others, it collapses for seemingly no reason. Well, you are not alone!
The good news is that there are some impressive solutions available, which vary in convenience, accessibility and ease of use. Google Colab is one of them. It is ready to use, comes loaded with lots of useful libraries and has GPU support. It has its limitations too, but it’s not the topic of today’s article. You can learn about and experience Google Colab here.
In this article, we are going to take a different approach. We will do the hands-on part first, and the explanation will come later. This approach will demonstrate how easy it is and why you should learn more about it to become a well-versed data scientist.
Below is a list of key concepts & tasks we will be going through today:
1️⃣ What is Docker?
2️⃣ Install Docker Desktop
3️⃣ Run Data Science loaded Jupyter Notebook
4️⃣ Understanding Containers
5️⃣ What is an Image & Dockerfile?
6️⃣ Create a customized Docker Image
7️⃣ Save & share your Image
➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖
Let’s start discussing them one by one.
What is Docker?

Docker is a company that provides a solution (the solution is named Docker as well!) to the problems we described above. In fact, it does more than that and is an excellent tool in the developer’s toolbox, but we will stick to its relevance to our data science-related issues. It is software that helps us build "images" and run "containers" to support and better deliver data science projects. We will explain the concepts of "images" & "containers" later in this article. As mentioned above, we will do the hands-on part first, and the explanation will come later. You can read more about Docker here.
Before we begin, I would also recommend creating an account on Docker (the free version is fine) so you can save your projects on Docker Hub.
Install Docker Desktop
Visit this link and install Docker Desktop. It is no different from any other software installation. You can choose the Windows version or the Mac version. For our demonstration, we will use the Windows version; the procedure for Mac is almost identical. Once downloaded, click the downloaded file to install and follow the instructions. The process is straightforward and should go smoothly. If you want to see the process in action before installing, I found a nice 11-minute video on YouTube that walks you through the installation.
Once the installation is complete, you can perform a little check to see if everything is working. Open up your command prompt/terminal and write this code and hit enter:
docker -v
This should confirm that the installation was successful by printing the version of Docker you installed.
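If you want an extra sanity check (optional, not required for this tutorial), Docker ships a tiny official test image called hello-world that simply prints a greeting and exits:
# optional check: pulls and runs Docker's official test image
docker run hello-world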

Run Data Science loaded Jupyter Notebook
It’s time for the magic. Open up your command prompt/terminal and write down the following code:
docker run -p 8888:8888 jupyter/datascience-notebook
I will explain this command later; for now, just follow these steps. Since we are doing this for the first time, it will take some time to get everything ready. The next time you rerun this command, it will be much faster, as all the necessary files are already downloaded. You also won’t need any internet connectivity (unlike Google’s Colab). Once the process is over, this is what the output should look like:

Copy any of the last three lines (these are the addresses) and paste one into any browser (save the token, which appears in the address after the word "token", for later use). That will open up a Jupyter Notebook in your browser.
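A handy side note: if you misplace the token, you usually don’t need to restart anything. The Jupyter server typically prints the full address (token included) in the container’s logs, so a rough sketch would be (the container ID below is a placeholder; copy the real one from the first command’s output):
# list running containers and note the container ID or name
docker ps
# print that container's logs, which include the address with the token
docker logs <container_id_or_name>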
You will notice that the notebook comes loaded with Python3, Julia & R. Not only that, but many popular data science libraries are already installed and ready to be imported!


The 😯 😲 😯 part is that NONE of these programs are actually installed on your machine!
So if you try to find Python/Julia/R or Jupyter Notebook on your computer, good luck! Even if you had these programs installed previously on your computer, these "dockerized" installations are stand-alone installations that have nothing to do with the applications already installed (or not installed) on your computer.
This implies that if you create a program or write code using the Jupyter Notebook we just launched, test it, and share it with a friend or colleague (all of these steps are coming later in the article), it is guaranteed to work, as long as your friend or colleague fires up their Jupyter Notebook the same way we did.
Understanding Containers
The story starts with the idea of containers. The idea is not new, though; you might already be familiar with the concept of an "environment". It’s almost the same thing. Over time, a data scientist will create and develop many models that depend upon many libraries or on other data scientists’ work (we call these dependencies). This will inevitably lead to conflicts between all these models as the dependencies evolve and grow.
Let’s start with a very generalized example. Imagine you have a lot of 💰💰💰 and decide to build a plant to produce something for commercial use. You build a factory (which, for now, is just one big hall), install all sorts of machines and workstations, and hire skilled workers to do the job. When your business is new and not at scale, things are not complicated and they go smoothly. But as your business grows and competition increases, you decide to add more sophisticated technologies and advanced processes. Here comes the problem: as you adopt these new technologies, you realize that some of these processes simply cannot work under one roof, e.g. one technology requires a dry environment while another works better in a more humid one. The same goes for the people: some need a quiet environment, whereas others need to work with loud and noisy machines (or they differ for many other random reasons).

So what is the most intuitive solution that comes to your mind?
Obviously, you will simply build separate rooms and halls (as opposed to constructing separate buildings) to make sure that every process and department gets the environment it needs. This solution is akin to the concept of a "Container".
Continuing with our example, now you want to build the same facility in another country or geographical area with exactly the same setup. Imagine a technology that can somehow clone your existing production setup so that you can simply port it to the required location. That is akin to "sharing the Container".
A more data science-related example would be a model you created using version x of sklearn for use-case A. Five months later, sklearn releases a new, improved version x+1. You update your library and create another model using the new version for another use-case B. Everyone is happy until you run your model for use-case A, which collapses and doesn’t run: it turns out the old model is no longer supported by the newer sklearn version. As a result, you can only run the old model (if you go back and install the older sklearn version) or the new model, not both at the same time.
Again, try to think of a possible solution. A straightforward one would be to get another computer and install the new sklearn version on it. This will, for sure, solve the problem. ❌Please don’t suggest this to anyone!❌ For how long can you keep this up? Buying a new computer every time you update your sklearn library, not to mention hundreds of other libraries, is definitely not a practical solution.
What if we install all the software and libraries required for a specific project, "quarantined" in their own space, within one computer or operating system? This way, your project will stay within its boundaries and will not interfere with whatever is outside of it.

This is what we call a container: a project and all its dependencies containerized/isolated/quarantined within one operating system. We can go one step further and share that container with others, who can then run it on their machine and execute the same experiment without having to worry about any of the dependencies involved.
👉👉👉 This is precisely what we did earlier through Docker, by running a one-line command in the command prompt. The developers at Jupyter put Python, Julia, R and other libraries in a container, shared it in the cloud (on Docker Hub), and we, through Docker, pulled and ran it. 👈👈👈
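For reference, "docker run" pulls the image automatically if it is not already on your machine, but you can also pull it explicitly beforehand. A minimal sketch, assuming the same image we ran above:
# download the image from Docker Hub without running it
docker pull jupyter/datascience-notebook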
At this point, we are ready to introduce two more concepts to get a complete picture; Image & Dockerfile.
What is an Image & a Dockerfile?
The Image is the seed for a container. It is a "snapshot" of a project at some point in time. It only becomes a container when we "run" it. Think of it as a picture you took of your friend’s beautiful room so that later on you can arrange your room the same way. The beauty of it all is that we can alter the Image before running it and turning it into a container. That alteration comes through a Dockerfile. A Dockerfile is a plain, extension-less file (no .csv, .txt etc.) that contains instructions for Docker.
In our demonstration, Jupyter took a snapshot of its project, made an image, and published it on Docker Hub with the name "jupyter/datascience-notebook". We pulled and ran the Image, turning it into a container. We did not alter the Image, though; we just ran it as is. That’s why there was no Dockerfile involved.
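If you want to see this image/container distinction on your own machine (purely illustrative; your output will differ), Docker can list both:
# images stored locally (the "snapshots")
docker images
# containers currently running (images you have "run")
docker ps
# include containers that have already exited
docker ps -a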
In the next section, we will show an example where we alter the notebook and add TensorFlow and a data pre-processing library called preprocess1 before running the container. This time, to keep things lighter and faster, I will use another Jupyter notebook image (called jupyter/minimal-notebook). You can keep the same Image we used previously if you want to.
Create a customized Docker Image
Let’s move ahead with our previous example, in which we "pulled" the data science loaded Jupyter Notebook from Docker Hub. Only this time, we will add the TensorFlow & preprocess1 libraries to it through a Dockerfile. But first, let’s see what the process will look like.

📔 If you have MS Visual Studio Code and know how to operate it, the process becomes much easier. For now, we will go through the steps assuming we don’t have VS Code.
- 1️⃣ Create Project Folder: On your computer, create an empty folder where you want to save your project’s files.
- 2️⃣ Create Dockerfile: Inside the folder, create a text file. Open the file and write down the following code (lines starting with # are mere comments)
# import the pre-built minimal notebook image
FROM jupyter/minimal-notebook
# Install required libraries
RUN pip install tensorflow
RUN pip install preprocess1
As you can see, the code so far is easy to read and understand. We did nothing more than import the Jupyter Notebook image and install the libraries we wanted using the "FROM" & "RUN" instructions.
- 3️⃣ Save the Dockerfile: Close the file, right-click to rename it "Dockerfile" (make sure to match the exact spelling and case), remove the extension (.txt) and save it. Your Dockerfile is ready!
➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖
➕ Bonus: Usually, we do all these library installations through a text file called "requirements" (although the name doesn’t matter). You just list the library names, along with any specific versions you want, and save the file in the same folder as your Dockerfile. Library requirements in the file look like this:
tensorflow==2.2.0
preprocess1==0.0.38
Here is the code to "add" this text file to your Jupyter Image.
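A minimal sketch of those Dockerfile lines, assuming the file is saved as "requirements.txt" right next to the Dockerfile:
# copy the requirements file into the image
COPY requirements.txt .
# install every library listed in it
RUN pip install -r requirements.txt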
➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖➖
- 4️⃣ Fire up your command prompt/terminal and ‘cd’ to the folder where we created the Dockerfile. Once there, run the following command to build our Image:
docker build -t notebook_demo .
Again, the command is pretty intuitive: it is a docker command to build an image named notebook_demo. "-t" is a flag used to name/tag the image. The "dot" at the end means that the Dockerfile we are using is in the current (active) directory. Hit enter to execute the command. It should look like this:


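A small optional variation (just a common convention, not required here): you can give the image an explicit version tag at build time instead of the default "latest":
# same build, but tagged with a version
docker build -t notebook_demo:v1 .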
- 5️⃣ Run the Image: We are now ready to run the Image we just created. Run the following command in the command prompt/terminal:
docker run -p 8888:8888 notebook_demo
This is a docker command to run the Image named notebook_demo. The "-p" flag maps port 8888 of the container to port 8888 of the host.
Once done, copy any of the three addresses provided and paste it into your browser to open and access your Jupyter Notebook (save the token for later use). Import the tensorflow & preprocess1 libraries to make sure they were installed correctly.


We successfully loaded a minimal Jupyter Notebook, pre-installed with custom libraries like TensorFlow and preprocess1.
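When you are done working, you can shut the notebook down from another command prompt/terminal. A quick sketch (get the real ID or name from docker ps, as shown earlier):
# stop the running notebook container
docker stop <container_id_or_name>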
Save & Share Your Image
What good is our work if we can’t port it to other computers or share it with other users? That is our last step! It is worth noting that once you build an image on your computer, it is automatically saved locally. You can use the run command to access it at any time, without needing an internet connection.
▶️ One option is to push it to Docker Hub. That makes it public, and anybody can access it just the way we accessed the minimal and data science Jupyter Notebooks from Docker Hub. One issue with this approach is image size, which grows rather quickly and makes uploads to Docker Hub slow.
We will need to run the following commands in the command prompt/terminal to push the image to Docker Hub:
#log into Docker Hub
docker login --username=your_Docker_username
#Enter your password when prompted
# Tag the Image with your username, then push it; in our case the image name is "notebook_demo"
docker tag notebook_demo your_Docker_username/notebook_demo
docker push your_Docker_username/notebook_demo
That’s it! Your Image is now publicly available. Anyone who has Docker installed can easily access the same image, with the exact dependencies that you added (tensorflow & preprocess1), using the command we are already familiar with (note the username prefix, since the image now comes from Docker Hub):
docker run -p 8888:8888 your_Docker_username/notebook_demo
▶️ The other option is to save it as a tar file. For that, run the following command in the command prompt/terminal. You need to be in the directory where you want to save the tar file.
# save it under the active directory
docker save notebook_demo > notebook_demo.tar
# you can load it back this way
docker load --input notebook_demo.tar
Congratulations! If you have made it this far, you deserve a big round of applause!
👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏
Summary
In this session, we learnt and applied the following:
✅ What is Docker? ✅ Installed Docker Desktop ✅ Ran a Data Science loaded Jupyter Notebook through Docker ✅ Understood the concept of Containers ✅ Learnt about Docker Images & Dockerfiles ✅ Created a customized Docker Image ✅ Saved & shared a Docker Image
Remember, we have just scratched the surface. There is so much more to learn and explore. Images are not only for Jupyter Notebooks; they can contain entire operating systems, programming languages and much more. I highly encourage you to take a look at the use-cases and references on Docker’s website.
Please feel free to give me feedback. After all, that’s how we learn and improve!