When learning data science, environment management has been a consistent thorn in my side. Whether it’s getting code that previously ran to work again, trying to share an application I built with the rest of my team, or wondering why none of my command prompts work anymore, I have only ever learned about environment management whilst cursing under my breath. To save you some late nights and headaches, here is a breakdown of what a program environment is and how data scientists can make environments work for them instead of against them.
Definition
A program’s environment is the collection of software and hardware in which the program runs. A sentence like that can be a little daunting, but the concept is quite simple. If you tried running a computationally intensive video game on an old computer, the hardware of the computer likely won’t be able to handle the video game. It might be very slow to run or even crash every time you try to play the game. The video game was designed for a program environment with more sophisticated hardware. Alternatively, any Python or R programmer knows you must install a package before you can load it. When you install the package, you add it to the Python / R interpreter’s software environment, so that it can load the code when you want to use it. Issues with a software environment are far more common than hardware issues for a data scientist.
A program’s software environment is just a collection of files that the program can "see." When you install a package, you download its files from a public location and save them to a place on your computer where your Python / R code knows to look. When code has a "dependency," it means it relies on another package’s files being present in the environment, so that they can be loaded and used.
Common Issues And Their Solutions
One of the most frustrating pieces of a software environment is the PATH variable. If you want to run Python from the command line, the directory containing the version of Python you wish to run needs to be listed in your operating system’s PATH variable. When you add Python to your PATH, you are simply telling your computer where the Python interpreter executable is located. If you haven’t added that location to the PATH variable, your computer won’t find the interpreter when you tell it to run, and you get the dreaded "'python' is not recognized as an internal or external command" error. Alternatively, if multiple Python interpreters are on your PATH (say, Python 3.8 and Python 3.9), the same one will be picked every time, based on a set of rules that depends on your operating system, and it may not be the version you were expecting.
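You can peek at this mechanism from Python itself. A minimal sketch: `os.environ["PATH"]` exposes the PATH string, and `shutil.which` performs the same lookup your shell does when you type `python`.

```python
import os
import shutil

# PATH is a single string of directories separated by os.pathsep
# (":" on macOS/Linux, ";" on Windows); the OS searches them in order.
path_dirs = os.environ.get("PATH", "").split(os.pathsep)
for directory in path_dirs:
    print(directory)

# shutil.which() returns the path of the first matching executable
# found on PATH, or None if there is no match -- exactly the lookup
# that produces the "not recognized" error when it fails.
print(shutil.which("python"))
```

If `shutil.which("python")` prints a path you did not expect, an earlier PATH entry is shadowing the interpreter you meant to use.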
The second most frustrating issue in data science environment management comes back to package dependencies. When you share your code with somebody else, they likely have different packages installed in their environment than you have in yours. This is a problem if your code depends on a specific package that they don’t have. One solution is to distribute your code through a "package manager" (like CRAN, pip, or Conda), which makes sure that anybody who downloads your package also gets the exact packages they need in their environment to run your code.
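With pip, for instance, a package declares its dependencies in a `pyproject.toml` file, and pip resolves and installs everything listed when someone installs the package. A minimal sketch (the package name and version pins here are hypothetical, not from this article):

```toml
# Hypothetical pyproject.toml for a package named "my_analysis";
# the dependency pins are illustrative.
[project]
name = "my_analysis"
version = "0.1.0"
dependencies = [
    "pandas>=1.3",
    "scikit-learn>=1.0",
]
```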
To take this problem a step further, what happens if you have an old data analysis that needed one version of a package and a newer analysis that uses a later version of the same package? To keep a data analysis reproducible, you need to know the code will do the same thing every time you run it. As packages are continually developed, their behavior sometimes changes, which can produce unexpected results when you run an old analysis against the latest version of a package it uses.
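A practical first step toward reproducibility is simply recording the exact version of every package an analysis was run with. One way to do this from Python, using the standard library’s `importlib.metadata`:

```python
import importlib.metadata

# Snapshot the exact version of every package installed in the
# current environment -- the same information "pip freeze" reports.
installed = {
    dist.metadata["Name"]: dist.version
    for dist in importlib.metadata.distributions()
    if dist.metadata["Name"] is not None  # skip broken installs
}
for name, version in sorted(installed.items()):
    print(f"{name}=={version}")
```

Saving this output alongside an analysis means you can later rebuild an environment with the same versions, rather than hoping the latest releases still behave the same way.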
To get around this problem, you can use an "environment manager": a tool that maintains separate programming environments, each with its own set of installed packages. You can then specify which environment an analysis should run in. One of the most popular package/environment managers is Conda (maintained by the company Anaconda); before it came along, environment management for data science in Python was rather frustrating.
What Is The Secret To The Environment Manager’s Magic?
To clarify again, a programming environment is just the collection of files that a program has access to. If you need one environment with version 1 of a package installed and another with version 2, the environment manager makes sure both versions are installed somewhere on your computer, and then lets only the correct version be visible to code running in each environment. When your code asks its environment to load a package, the files that get imported are the ones for the version visible to that specific environment. Each environment should have exactly one version of any given package visible to it.
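You can see this visibility trick in the interpreter itself. Every conda environment (or venv) has its own root directory and its own `site-packages` folder, and `import` only searches the directories listed for the active environment:

```python
import sys

# sys.prefix is the root directory of the active environment; each
# conda env or venv gets its own prefix, and therefore its own
# site-packages directory where its packages live.
print(sys.prefix)

# sys.path is the list of directories the "import" statement searches.
# Activating a different environment swaps the site-packages entry
# here, which is what changes which package versions are visible.
for entry in sys.path:
    print(entry)
```

Run this in two different environments and you will see two different prefixes, which is why the same `import pandas` can load different files in each.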
Virtual Machines
This idea of controlling which files a program can see is taken to the extreme by virtual machines. A virtual machine (VM) can be thought of as a computer inside your computer (Inception style). As with everything mentioned here, a virtual machine is ultimately a method of file management: the VM is a program that isolates itself from all the other files on the computer, and it may even run a different operating system. (For example, a macOS computer might host a virtual machine running Windows.)
Virtual machines are very effective when you want to recreate a programming environment on another computer, because you don’t need to know anything about that computer’s existing setup. You can configure a VM on the new machine with exactly the same files visible to it as the VM on your own machine. For this reason, data science leans heavily on virtual machines: they ensure reproducibility of results. Cloud-based applications are another common use of VMs; the developer counts on the fact that their VM will be isolated from any other code living on the server hosting their application.
To specify how an environment should be configured, you can use a configuration file (a .yml file when using Conda) that lists all the packages to install when the environment is set up. It is now common practice, when you share a data analysis, to share the "config" file that sets up the environment alongside the code itself.
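A Conda config file of this kind might look like the following sketch (the environment name and version pins are hypothetical):

```yaml
# Hypothetical environment.yml for a Conda environment named "analysis"
name: analysis
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pandas=1.5.3
  - numpy=1.24.2
```

Anyone who receives this file can recreate the environment with `conda env create -f environment.yml` and then run the analysis inside it.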
Containers / Docker
When deploying production applications to the cloud, a lot of time can be spent creating environment configurations and then setting them up across a variety of servers. Containers were born to simplify this. Docker, the company that popularized containers, is still the most popular option for running them today, so you will often hear the terms Docker and container used interchangeably (kind of like Kleenex and tissue).
Containers are a lot like virtual machines, but they go one layer less deep. Every VM has its own operating system and works like a truly isolated computer. Containers are isolated from everything except the operating system, which makes them more "lightweight" (smaller in size, because each container holds less code than a full VM). Containers can also be coordinated by one program running on top of the operating system (the Docker Engine), which lets developers build an app as discrete chunks that are coordinated together, instead of one large app that must handle lots of different tasks. It’s like when each of the Power Rangers’ Dinozords sync up into one Megazord, the ultimate fighting machine. Docker’s website has a lot of great material that further outlines the distinctions and benefits of containers.
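A container’s environment is described in a Dockerfile, much as a Conda environment is described in a .yml file. A minimal sketch for a small Python app (the base image tag, file names, and entry point are illustrative, not from this article):

```dockerfile
# Hypothetical Dockerfile for a small Python application
FROM python:3.10-slim

WORKDIR /app

# Install the exact dependency versions recorded in requirements.txt
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code and define how the container starts
COPY . .
CMD ["python", "app.py"]
```

Building this with `docker build` produces an image whose environment is identical on every machine that runs it, which is precisely the reproducibility property the article is after.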
So what does this mean for you?
To put all of this information to use as a data scientist, you should:
- Clean up your PATH variable. Google how to do that, but be careful what you delete from it; removing the wrong entry will come back to haunt you.
- Fully utilize an environment manager like Anaconda. You and your team should all be using an identical environment when collaborating. Have a standard config file that you all share.
- Turn your code into packages with declared dependencies. Even if you only share them on GitHub and never submit them to a package manager, packaging makes life infinitely easier when starting a new analysis or sharing code. Even dashboards can be turned into packages.
- If you run your code in the cloud, appreciate the good work that VMs (and maybe even containers) are doing for you.
Conclusion
Understanding data science environments forces data analysts to get out of their comfort zones and deeper into software engineering principles than many wish to venture. However, it’s a necessary evil to confront, as doing so will greatly improve your ability to share your code and save years of your life spent debugging. Fortunately, the concepts are actually quite familiar for data scientists; it’s all just file management.