Unlimited scientific libraries and applications in Kubernetes, instantly!

Set up a shared library of packages and applications that you can dynamically load into your Kubernetes Data Science environments.

Guillaume Moutier
Towards Data Science


Photo by Michael D Beckwith on Unsplash

TL;DR:
Providing Jupyter-as-a-Service on Kubernetes can be tricky at scale, when you need to provide many different libraries or applications, with different versions, to satisfy every user’s needs.

Instead of creating overweight container images, or having to manage (and choose from!) hundreds of them, or having people pip/conda reinstall everything from scratch every time they open a notebook, you can build a shared repository of libraries and applications, and let users load them dynamically and instantly.

Full code and deployment instructions provided for Open Data Hub!

(Screenshot by the author)

Disclaimer: I work at Red Hat, helping organizations with their Data Science/Data Engineering platforms and solutions. Therefore, the implementation described here uses different Red Hat products, like Red Hat OpenShift (our enterprise Kubernetes distribution) or Red Hat OpenShift Data Foundation (our OpenShift-native storage solution based on Ceph). However, the overall architecture and recipes could be adapted to other platforms.

And now, let’s go for the full (and lengthy) story!

Chapter 1: How the trouble began

Kubernetes is a popular choice for Data Science and Data Engineering platforms. Agility, versatility, resource scaling… all make it a platform of choice for those types of workloads, especially in big organizations with shared environments. Packaging your apps, libraries, and dependencies into lightweight container images, making them reproducible, consistent, and secure, seems like the perfect answer to all your problems!

Thrilled by everything you’ve read about this recently, you package a Jupyter container image in this way:

(Graphic by the author)

Because you’re smart and want to anticipate needs (translate: you don’t want to come back to this again and again), you even create a few different “flavours” to cover the use cases you can think of: a basic notebook with SciPy only, one with TensorFlow, one with PyTorch… Perfect, those container images are ready to use.

So you set up JupyterHub or Kubeflow in your Kubernetes environment, and create a Jupyter-as-a-Service environment for all your users! Mission accomplished, problem solved, you can go back to other (always urgent) matters.

A few weeks pass…

Then come Alice, Bob, and the rest of the Data Science team. They even brought along some BI people to show their goodwill during the digital transformation the organization is undergoing, and a few Data Engineers (one never knows, there could be a leak to fix in a data pipeline somewhere). But no statisticians, you all know why…

(Image by the author from imgflip.com)

Alice started, “Listen, this Jupyter-as-a-Service thing is fantastic! We’ve been playing with it for the past few weeks and we see how this could replace our personal environments: no more things to install and manage on our laptops, which lack processing power and memory anyway. And everything is accessible from anywhere!”.

“Yes, that is definitely a great improvement,” continued Bob. “We are now able to exchange our notebooks easily, as they’re all based on the same container image and use the central data lake. No more of this library or that application missing!”

Somewhat puzzled, as this had never happened before, you asked, “OK then, perfect. But why are you here? Not only to thank me, I presume?”

“Oh, almost nothing really, for you it will be easy-peasy!”, exclaimed Charlie. “You know, the different flavours of environments you have created for us are great. It just lacks this library I’m using a lot, you know myFavLib? At version 3.2.4 or higher of course, because up until 3.2.3 there were some broken things. But not from the new 4.x branch, I don’t really like what they’ve done with it.”

“Yes, and if you could also add LibA, LibB and LibC, that would be great. Oh, by the way, for LibC I will need the last 3 versions, to be able to compare some stuff”, Debbie added.

In his usual condescending way, Bob jumped in, “Since we are on this subject… Of course all the older versions of the images and the libraries will always be available, right? Because it’s part of our audit process to be able to redo any calculation at any time, you know, for liability purposes. So of course it has to be always done with the exact same tools. Plus I don’t want to have to modify my code to adapt it to newer versions anyway.”

Seeing your face fall, Alice cut in, “Well, I guess you got the idea, so we’ll leave you to it then, thanks again!”

As you are a seasoned professional, you manage to hide the panic rising and quickly switch to solution mode…

Chapter 2: Several options, no real solution…

After a few days, you contemplate the results of your hard head-scratching sessions…

Option A: the bloated image.

OK, let’s put everything they asked for into the same container image! That’s easy: just add some lines to requirements.txt, maybe a few more things to the Dockerfile, docker build and… done! Now let’s upload the image to the repo. What, ETA 2:34:12?! Oh… this image is 38.2GB… Definitely not manageable or workable: “Starting container. Pulling image, please come back tomorrow…”. And anyway, I can’t fit multiple versions of the same library into a single image, as requested.

Option B: the never-ending stream of images.

Well, on top of the existing images, I can always create custom ones to accommodate the various demands. But let’s see… With only 15 libraries and 2 possible versions for each one, pinning a version of every library in an image already gives a theoretical 2¹⁵ = 32,768 combinations! Even if they only ever ask for a hundredth of those, that’s still more than 300 container images to choose from. I’m pretty sure the dropdown selector for which image to launch won’t even allow that many. And even if the users cope with it because they have no choice, the whole thing will be unmanageable in the long term.
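The arithmetic is easy to sanity-check in a couple of lines (pure counting, nothing container-specific):

```python
# Each of 15 libraries can ship in one of 2 versions inside an image:
# every additional library doubles the number of possible images.
libraries = 15
versions_per_library = 2

combinations = versions_per_library ** libraries
print(combinations)  # 32768

# Even if users only ever request a hundredth of these...
requested = combinations // 100
print(requested)  # 327 images: still far too many to manage
```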

Option C: let them deal with it.

After all, it’s not that difficult to do a pip install! With some clever mechanism, I could even have a process that checks the notebook’s folder for a requirements.txt and installs everything listed there in the background. Of course, that would happen every time the notebook is launched, which will cause delays and complaints… And it still won’t solve the problem of having many images to maintain with different sets/versions of base libraries. Unless I want the users to reinstall the whole stack from scratch every time…
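Such a background-install hook could be sketched like this. This is a hypothetical helper (build_install_command is not part of JupyterHub or Jupyter), shown only to illustrate the idea:

```python
import os
import tempfile

def build_install_command(notebook_dir):
    """Return the pip command a startup hook could run in the background,
    or None when the folder has no requirements.txt.
    (Hypothetical helper, purely illustrative.)"""
    req = os.path.join(notebook_dir, "requirements.txt")
    if not os.path.isfile(req):
        return None
    # --user keeps installs inside the home volume of a non-root container
    return ["pip", "install", "--user", "-r", req]

# Demo on a throwaway folder
with tempfile.TemporaryDirectory() as d:
    print(build_install_command(d))      # None: nothing to install
    with open(os.path.join(d, "requirements.txt"), "w") as f:
        f.write("scipy==1.7.1\n")
    print(build_install_command(d)[:2])  # ['pip', 'install']
```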

This approach also does not solve the problem of applications that you cannot install using pip or conda. And in a secured enterprise environment, containers don’t run as root (which is a best practice), so people can’t install anything system-wide anyway.

So what to do…?

Your not-so-good options… (Graphic by the author)

Chapter 3: The depths of despair

Note from the author: this chapter has been voluntarily filtered out as 1) There is no need for everyone to live this, even by proxy. 2) It has no link whatsoever with IT or Data Science.

Chapter 4: This is the way

In the past, I had the chance to learn what Compute Canada and many people from the HPC world (at least in academia), especially the fantastic EasyBuild community, were doing to address this very same problem: how do you bring applications and libraries to your users in a shared environment, in an effective and reproducible way, so that data scientists can just do their jobs without fumbling with installations, packages, or compilation?

Their answer was to use Environment Modules to dynamically “load” pre-packaged libraries or applications. Simplifying to the extreme: say you have access to some shared folders mounted somewhere in your filesystem. The module engine simply modifies your $PATH, $PYTHONPATH, or other such environment variables to make those folders “visible” to your environment. No more Python complaining that it cannot find a library when you do an “import torch”, for example, because now it knows where to find it!

Of course, all of this is much more refined, as it can keep track of dependencies, of what’s loaded or not, and of what to unload (and when) in the case of cross-dependencies…
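The underlying trick is easy to demonstrate in plain Python. The toy snippet below is not Lmod; it only simulates the effect of a module load: once the shared folder is on the search path (which is what modifying $PYTHONPATH achieves), the import suddenly works. The myfavlib package is made up for the demo:

```python
import os
import sys
import tempfile

# Fake "shared library" folder containing a package, as if it were
# a module's payload mounted into the pod.
shared = tempfile.mkdtemp()
pkg = os.path.join(shared, "myfavlib")
os.makedirs(pkg)
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write("VERSION = '3.2.4'\n")

# Before the "load": the folder is not on the search path, import fails.
try:
    import myfavlib
except ModuleNotFoundError:
    print("not found, as expected")

# The "load": prepend the shared folder to the search path,
# which is what a module engine does through $PYTHONPATH.
sys.path.insert(0, shared)
import myfavlib
print(myfavlib.VERSION)  # 3.2.4
```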

But basically, the solution to our problem consists of two things:

  1. Provide access to a library of “modules” that can be Python libraries, Linux libraries, or full-blown applications. This library can be mounted at spawn time inside your environment.

  2. Have a way to easily load those modules inside the environment.

Implementation concept in Kubernetes (Graphic by the author)

So this is the solution I have implemented as a proof of concept within Open Data Hub (ODH, an open source project based on Kubeflow that provides open source AI tools for running large and distributed AI workloads on OpenShift Container Platform):

  • The shared library is installed on an RWX volume provided by OpenShift Data Foundation (with CephFS behind the scenes).
  • ODH provides Jupyter-as-a-Service environments, in this case with a slightly customized image that is able to load modules, plus a JupyterLab extension to make the loading/unloading part easy.
  • When Jupyter is spawned, the shared library is mounted read-only inside the pod.
  • You then only have to load what you need, when you need it!
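In Kubernetes terms, the mounting part boils down to a volume definition and a read-only volume mount in the notebook pod. Here is a minimal illustrative fragment; the claim name and mount path are made up, not the actual Open Data Hub manifests:

```yaml
# Illustrative fragment of a notebook pod spec: the shared module
# library sits on an RWX PVC and is mounted read-only.
volumes:
  - name: shared-modules
    persistentVolumeClaim:
      claimName: modules-library   # RWX PVC, backed by CephFS here
containers:
  - name: jupyterlab
    volumeMounts:
      - name: shared-modules
        mountPath: /opt/modules
        readOnly: true
```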

And this is what it looks like:

Quick demo of the environment (Screenshot by the author)

This is what happens in this quick overview:

  • I launch a basic notebook example that uses Torchvision to infer the breed of a dog from a picture (and not just any dog, mine!).
  • As I try to run the notebook, a ModuleNotFoundError occurs because the torchvision library is not available in my environment.
  • So in the left panel, in my “Softwares” extension menu, I look for this package by entering a few letters, then I just click on Load.
  • As the notebook was already launched, I have to restart my kernel to take the change into account, and bam! I can run the whole notebook.

At this point you may say: “That’s interesting and all, but you could have just done a pip install”.

Granted, especially with this basic example, and that is what we’re used to doing, especially in local environments (as opposed to shared ones). But this alternative method brings interesting capabilities, all without creating specific container images:

  • Provide applications that people would not be able to install otherwise, as they may not be available as a conda or pip package. And in this kind of shared environment, people should not be able to yum/apt-get install anything, as they’re not supposed to have root access. I hope that’s the case in your environment…
  • It gets really easy to test different versions of packages/applications. Just load/unload the relevant module(s): no need for long or complicated installs/uninstalls, or juggling with dependencies.
  • You can provide many different versions of packages or applications, and users can choose what they want according to their needs. Even years later, those modules can still be available: no need to remove them just because the environment gets upgraded. Conversely, you can provide bleeding-edge alpha versions of modules without putting the whole environment in jeopardy.
  • Users’ environments stay clean and don’t get cluttered with installed libraries. Plus, they don’t have to redo all those installations every time they want to run a specific notebook. Module loading/unloading is much faster, as all the content is in fact already mounted into the pod.
  • For system administrators, life gets much easier: there is only one container image to manage. Which is why this project has been dubbed Open Data Hub-Highlander, “There can be only one!”.

But wait, there is more!

In the first basic example, I loaded a Python package. But as I said before, a module can also be a full application, even one with its own web UI, like RStudio Server. To make this happen, I just use jupyter-server-proxy with some special configuration, so that the application appears in the launcher when the module is loaded.
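For reference, a jupyter-server-proxy entry for such an application looks roughly like this, for example in a jupyter_notebook_config.py. This is a sketch based on jupyter-server-proxy’s configuration format; the actual configuration shipped in the repo may differ:

```python
# Sketch of a jupyter-server-proxy entry for RStudio Server.
# jupyter-server-proxy substitutes {port} with a free local port
# and adds a launcher tile based on launcher_entry.
c.ServerProxy.servers = {
    "rstudio": {
        "command": ["rserver", "--www-port={port}"],
        "timeout": 30,
        "launcher_entry": {
            "title": "RStudio",
            "enabled": True,
        },
    }
}
```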

Load the module:

Load the module (Screenshot by the author)

A new tile appears for RStudio:

New RStudio tile (Screenshot by the author)

When you click on the tile, a new tab opens in your browser and RStudio launches in a few seconds.

RStudio (Screenshot by the author)

And this is the same container environment as before, meaning you have access to the same files, the same tools… Which can be useful when you want to mix R and Python scripts!

Some more details on how to use this

In the following repo you will find full deployment instructions and code, including a pre-compiled library of modules that is easy to deploy in your Open Data Hub environment: https://github.com/guimou/odh-highlander

And this is what you will be able to do once you have deployed everything…

In your JupyterLab instance you have access to a new extension, “Softwares”. The list of available modules is split in two sections: the “featured” modules, and the full list (these are only the modules I chose to provide in this demo environment):

Module list (Screenshot by the author)

You can use the filter box to search for a specific module (just enter a few letters). Filtering happens simultaneously on both lists:

Filtered list (Screenshot by the author)

If you click on a module name, a pop-up will give you more information: description, dependencies…

Module info (Screenshot by the author)

To load a module, hover on it and click on the “Load” button:

Module loading (Screenshot by the author)

The module and all its dependencies are automatically loaded (torchvision in this example):

Module loaded (Screenshot by the author)

To unload a module, hover over it in the “loaded” list, and click “Unload”:

Unload Module (Screenshot by the author)

The module and its dependencies will be automatically unloaded.

Note: Lmod, the module manager engine, keeps track of the loaded dependencies of each module. If two different modules share dependencies, unloading one module won’t affect the other: its dependencies will still be there. They are only unloaded when no loaded module needs them anymore!
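This reference-counting behaviour can be illustrated with a small simulation. The snippet below is a toy model, not Lmod’s actual implementation, and the module names and dependencies are made up:

```python
from collections import Counter

# Toy dependency graph: which dependencies each module pulls in.
deps = {
    "torchvision": ["pytorch", "numpy"],
    "pandas": ["numpy"],
}
refcount = Counter()  # how many loaded modules need each dependency

def load(module):
    for d in deps.get(module, []):
        refcount[d] += 1

def unload(module):
    """Return the dependencies actually removed: only those
    that no other loaded module still needs."""
    removed = []
    for d in deps.get(module, []):
        refcount[d] -= 1
        if refcount[d] == 0:
            removed.append(d)
    return removed

load("torchvision")
load("pandas")
print(unload("torchvision"))  # ['pytorch']: numpy stays, pandas needs it
print(unload("pandas"))       # ['numpy']: now nothing needs it
```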

Advanced functions

Collections

If you want to create a specific environment with a set of different modules, no need to recreate it from scratch every time! You can simply load those modules, then create a “Collection”. Next time, just restore this collection in two clicks.

To create a collection, load the modules you want, click on the icon and give a name to the collection.

Create Collection (Screenshot by the author)
Name and Save Collection (Screenshot by the author)

When you want to bring this environment back, just click on the Restore icon, then select and load your collection.

Restore Collection (Screenshot by the author)
Restore Collection (Screenshot by the author)

Imports

You can also load the modules you need directly from your notebooks and your scripts. To know what code to use, you can simply export the relevant Python statements!

Click on the “Generate Python code” icon:

Generate Code (Screenshot by the author)

You can then copy-paste the full code into your first notebook cell or into your script:

Python Code (Screenshot by the author)

Note: of course, for this to work in your notebook or your script, the container image or environment you are using must be “Lmod-enabled”, and the library with the relevant modules must be accessible/mounted in this environment.

What’s next?

As already mentioned, you will find in this repo all the resources and instructions to deploy this solution. And in a follow-up post, I’ll go into more detail on the technical implementation, especially:

  • How to create an environment (well, to be honest, a glorified container image…) to create your own modules.
  • Instructions and examples on how to use EasyBuild to create modules.
  • How to create an Lmod-enabled JupyterLab container image to use the modules library.
  • How to create hardware-specific modules (like CUDA-enabled ones that need a GPU), and make them available only on the relevant infrastructure.

Many Thanks

There are a number of people and organizations I have to thank, as I did not invent anything here. I simply mashed up some projects, ported some solutions from one world to another, and voilà!

  • The Open Data Hub team, for creating this fantastic Data Science environment that is so easy to deploy on OpenShift.
  • The EasyBuild Community, of course for the great tool they created, but also for their warm welcome (thanks Kenneth!), and their willingness to answer my dumb newbie questions (thanks Maxime!).
  • Compute Canada, who introduced me to this kind of solution. Their setup is sick, especially the way they are distributing modules across the whole organization, kudos!
  • Félix-Antoine Fortin (CMD-NTRF), for his fantastic Jupyter-Lmod extension, which brings the solution to another level with a slick UI in Jupyter.


Hi! I am a Senior Principal Technical Evangelist working @ Red Hat. Containers, Storage, Data Science, AI/ML, that’s what it’s all about!