Running Deep Learning Models is complicated

On the way to frustration

shafu.eth
5 min read · Jul 10, 2020

You discover a new paper that just came out. You look at it and it's brilliant. Some new architecture made it possible to improve the SOTA on the dataset you are interested in by 4%. That is amazing, you think, and you start to scroll down in search of the magical link to GitHub so you can try it out yourself. You find it, click it, and hope there is a pre-trained model as well.

You don't want to train the model from scratch. It would take an eternity on your budget GPU from 2013. Even if you had 4 x V100s available and could train the model in a few hours, you would probably not get the same results as the ones mentioned in the paper. Reproducibility is a very big problem in Deep Learning (DL).

But it is your lucky day! There is a pre-trained model that you can download. It is even on Google Drive, not on Baidu or somewhere else where you have trouble finding the download button (which has happened to me more often than I want to admit). Everything is perfect so far. You will just install some pip packages and run it.

This is where everything falls apart…

Python

In another blog post, I argued that Python is not an ideal programming language for deep learning, or for numerical computing in general. First of all, it is very slow compared to something like C or even JavaScript. Secondly, the Global Interpreter Lock (GIL) makes parallelization on the CPU really difficult. And if you want to write code for the GPU, there is no way around CUDA. So one way or another you will leave the Python world.
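A toy example makes the GIL point concrete: splitting a pure-Python, CPU-bound loop across two threads buys you essentially nothing (exact timings will vary, of course):

```python
import threading
import time

def busy(n):
    # Pure-Python, CPU-bound loop: the GIL lets only one thread execute it at a time
    while n:
        n -= 1

N = 50_000_000

start = time.time()
busy(N)
print(f"one thread:  {time.time() - start:.1f}s")

start = time.time()
threads = [threading.Thread(target=busy, args=(N // 2,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"two threads: {time.time() - start:.1f}s  # about the same, thanks to the GIL")
```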

“If you want to make Python faster, remove the Python part.” I think it was Jeremy Howard who said that, and unfortunately, he is right.

This is a big annoyance if you are doing DL, because speed and parallelization on the CPU and the GPU are exactly the things you care about.

So you are forced to write CUDA/C++ extensions for your fancy new neural network layer. You cannot easily integrate them into PyTorch or TensorFlow, and you will need to write extra code to compile and install them. Have you ever seen the setup.py file in a DL repository? That is why it is there.
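A typical one looks roughly like this (a minimal sketch; the extension name and source files are placeholders, not taken from any particular repo):

```python
# setup.py -- builds a custom CUDA op so PyTorch can call it from Python.
# "fancy_layer" and the source files are placeholder names for illustration.
from setuptools import setup
from torch.utils.cpp_extension import CUDAExtension, BuildExtension

setup(
    name="fancy_layer",
    ext_modules=[
        CUDAExtension(
            name="fancy_layer",
            sources=["fancy_layer.cpp", "fancy_layer_kernel.cu"],
        ),
    ],
    # BuildExtension drives nvcc and your system C++ compiler -- this is
    # exactly where compiler and CUDA version mismatches blow up.
    cmdclass={"build_ext": BuildExtension},
)
```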

But, you think, this is not a problem for me. The authors of the paper already did all the work. I just want to run it. Except that because you have to compile C code, you will get errors like this: Segmentation fault (core dumped). You know Python. You are not a C developer. So you have no clue what caused the error. Maybe your CUDA version is wrong, or your GCC compiler is not working correctly. Who knows…

And remember, we are just trying to run it on a typical Ubuntu machine with an NVIDIA GPU. Don't get me started on running it on a CPU, a mobile platform, or an embedded device.

So the question becomes, how do we make it easier? How can we abstract these difficulties away?

ONNX

The new open standard for ML?

ONNX is a great project that is trying to fix this. The only problem is that the custom layer (the one the authors had to write custom CUDA code for) is probably not implemented in ONNX yet, so you would need to write custom code for ONNX to export the model, or you are out of luck. Dammit, this is getting frustrating.
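For a model made entirely of standard ops, the export is pleasantly boring (a sketch, using a stock torchvision ResNet as a stand-in for the paper's model):

```python
import torch
import torchvision

# A stock model exports fine because every op has an ONNX mapping.
model = torchvision.models.resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)  # example input with the expected shape

torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)

# With a custom CUDA layer, this call is where it fails, typically with
# an error along the lines of "Exporting the operator ... is not supported."
```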

Docker to the rescue?

The other thing you can try is running it inside a Docker container. There is often a Dockerfile provided in the repository. You can build the image, start the container, and open a bash terminal inside it. This is why I still use Visual Studio Code from time to time: Remote Containers are great. But this is a lot of overhead. Building the Docker image for Detectron2, for example, takes about 25 minutes and a lot of space on your machine. Better, but not great.

TensorFlow/PyTorch Hub?

Same same, but different

I have used PyTorch Hub many times in the last few months. Since I like PyTorch more than TensorFlow, I have to admit I have never used TensorFlow Hub, but I think they are trying to solve the same problems.

They are both hubs for DL models that abstract away all the inference code you would need to run them correctly. Things like normalizing the input image have already been taken care of. I think PyTorch Hub is awesome! If you installed PyTorch correctly, it should always work on the CPU, or on the GPU if one is available.

In the short example below, I run the new DETR model on my GPU in only a few lines of code.
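It looks roughly like this (a sketch: the entry-point name, image path, and preprocessing values follow the DETR examples, so treat them as assumptions):

```python
import torch
import torchvision.transforms as T
from PIL import Image

# Load DETR with a ResNet-50 backbone straight from PyTorch Hub.
model = torch.hub.load("facebookresearch/detr", "detr_resnet50", pretrained=True)
model = model.eval().to("cuda")

# Standard ImageNet-style preprocessing, as used in the DETR examples.
transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = transform(Image.open("street.jpg").convert("RGB")).unsqueeze(0).to("cuda")
with torch.no_grad():
    outputs = model(img)  # dict with "pred_logits" and "pred_boxes"
```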

If you are looking at a PyTorch model, search for a PyTorch Hub integration. It could save you a lot of work.

Visualizing

OK, after a while you get this fancy new object detection model to run. Maybe you did it with Docker, maybe some other way. Congratulations! But wait: all you get is an N-dimensional tensor. How can you visualize it? How can you look at the actual results?

If there is no script for visualizing the results, you will need to write it yourself. If you are lucky and the model is based on Detectron2 or mmdetection, you can use those libraries' visualization utilities. Maybe this is something one could integrate into PyTorch or TensorFlow itself?
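Until that happens, writing it yourself usually looks something like this (a minimal sketch; it assumes boxes come as pixel-coordinate (xmin, ymin, xmax, ymax) tuples, so adapt it to whatever your model actually returns):

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image

def show_detections(image_path, boxes, labels, scores):
    # boxes: iterable of (xmin, ymin, xmax, ymax) in pixel coordinates
    img = Image.open(image_path)
    fig, ax = plt.subplots(figsize=(10, 7))
    ax.imshow(img)
    for (xmin, ymin, xmax, ymax), label, score in zip(boxes, labels, scores):
        ax.add_patch(patches.Rectangle(
            (xmin, ymin), xmax - xmin, ymax - ymin,
            fill=False, linewidth=2, edgecolor="red"))
        ax.text(xmin, ymin, f"{label}: {score:.2f}",
                color="white", backgroundcolor="red", fontsize=9)
    ax.axis("off")
    plt.show()
```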

Conclusion

As long as we are using Python to write Deep Learning models, we will always have these problems. Do not get me wrong, I love Python. I use it every day and I think it is awesome. But because it is so slow, we will always fall back to something like C to make it faster.

This is Software 2.0, and we have only been doing it for a few years. We have a lot to figure out. How do we package a DL model? How does CI/CD work in this context? How do I make it easy for others to try out my model? If you are interested in these topics, I wrote a blog post a while ago on MLOps that you can find here.

A challenge (just for fun)

If you are bored on the weekend, I have a challenge for you. Go to Papers With Code (great site, by the way) and choose a popular Computer Vision task like Object Detection. Take any of the top three models and get it to run in under 30 minutes. Extra points for running it without Docker or on the CPU. You automatically win the challenge if you can compile it with TensorRT (no matter how long it takes you). Post your results in the comments below!

— UPDATE —

I created Easy-Model-Zoo, a library that tries to solve some of the problems mentioned in this post.
