
Plotting decision trees in-memory

Avoid unnecessary slow disk operations in <10 lines

Introduction

Decision trees are inherently interpretable classifiers, i.e. they can be plotted to understand how and why they make certain decisions. This is a very important property for ML use cases involving non-technical domain experts. For example, doctors performing disease detection with ML can read the exact if-else decisions the classifier makes directly from the plot.

The most widely used library for plotting decision trees is Graphviz. It offers command-line tools and a Python interface with seamless Scikit-learn integration. With it we can customize plots, and they look very good. The problem is that Graphviz mostly supports writing to files, and most tutorials simply save the image to a file and then load it back.

What’s more, Graphviz performs two-stage plotting. First it describes the tree structure in the DOT format (https://en.wikipedia.org/wiki/DOT_(graph_description_language)), then it plots the tree based on the DOT file. This means even more disk operations.

Why avoid disk operations?

In practice, ML models aren’t run on the client’s computer. Instead, servers or cloud-based computation engines serve the models via an API. Typically, JSON with the results, e.g. a Base64-encoded image, is sent back.
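As a minimal illustration of that pattern (the tree_plot field name is hypothetical), the rendered image bytes could be packaged like this:

```python
import base64
import json


def build_response(png_bytes: bytes) -> str:
    # Encode the raw PNG bytes as Base64 and wrap them in a JSON payload
    return json.dumps({"tree_plot": base64.b64encode(png_bytes).decode("ascii")})
```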

Handling many client requests at a time can put a heavy load on disk I/O, and the disks themselves may not be fast SSDs but older hard drives, either to reduce cloud costs or to get more storage space.

Typically we plot the tree, send it to the client, and delete the image. This means that writing the image to disk is pure waste: we have to perform six operations per image (save the DOT file, load the DOT file, save the image, load the image, delete the DOT file, delete the image). How can we improve this?

Plotting in-memory

While Graphviz always produces file-like bytes, they don’t need to be written to disk. Instead, we can perform the whole operation in RAM, which means fast memory access and no slow disk operations. The space taken is not a problem, since the images are not very large and are deleted almost immediately. We also don’t need to remember to delete files, since the garbage collector does this for us.

It’s easy to avoid the DOT file with Graphviz: with the appropriate argument, it returns the DOT description as a variable instead of writing it to a file. However, creating an image from it in memory is a bit harder, since the plotting result is returned as a byte buffer. Fortunately, Numpy can create an array from bytes and OpenCV can decode that array as an image. This way we avoid disk operations entirely.

Code and time comparison

Code for both functions (in-memory and with disk operations) is presented below. As you can see, an additional benefit is that the in-memory version is shorter: under 10 lines (excluding whitespace and formatting).

In-memory version:
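A minimal sketch of how such a function could look, assuming a fitted Scikit-learn classifier and the graphviz, numpy, and opencv-python packages (the plot_tree_in_memory name and its parameters are illustrative):

```python
import cv2
import numpy as np
from graphviz import Source
from sklearn.tree import export_graphviz


def plot_tree_in_memory(clf, feature_names, class_names):
    # Export the fitted tree to DOT format as a string (out_file=None -> nothing written to disk)
    dot_data = export_graphviz(clf, out_file=None, feature_names=feature_names,
                               class_names=class_names, filled=True, rounded=True)
    # Render the DOT source to PNG bytes entirely in memory
    png_bytes = Source(dot_data).pipe(format="png")
    # Decode the byte buffer into an image array with NumPy + OpenCV
    return cv2.imdecode(np.frombuffer(png_bytes, dtype=np.uint8), cv2.IMREAD_COLOR)
```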

Disk operation version:
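A corresponding sketch of the disk-based variant, assuming the same fitted classifier; the plot_tree_with_disk name and the temporary file names are illustrative:

```python
import os

import cv2
import graphviz
from sklearn.tree import export_graphviz


def plot_tree_with_disk(clf, feature_names, class_names):
    # Write the DOT description of the tree to disk
    export_graphviz(clf, out_file="tree.dot", feature_names=feature_names,
                    class_names=class_names, filled=True, rounded=True)
    # Render the DOT file to a PNG file on disk ("tree.dot.png")
    graphviz.render("dot", "png", "tree.dot")
    # Load the rendered image back and remove the temporary files
    image = cv2.imread("tree.dot.png")
    os.remove("tree.dot")
    os.remove("tree.dot.png")
    return image
```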

Decision tree plotted by both versions (image by author)

For the speed comparison, I used the Breast Cancer dataset included in Scikit-learn and ran each code version 100 times. Comparing mean times, the in-memory version is 15% faster! And this is on a fast SSD with almost no other I/O load. When run on a server under heavy I/O load, possibly with an HDD, the difference will be much larger.
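A sketch of how such a timing comparison could be set up with timeit, using the two illustrative functions defined above:

```python
from timeit import timeit

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
clf = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Run each plotting function 100 times and compare the mean time per plot
t_memory = timeit(lambda: plot_tree_in_memory(clf, data.feature_names, data.target_names), number=100)
t_disk = timeit(lambda: plot_tree_with_disk(clf, data.feature_names, data.target_names), number=100)
print(f"in-memory: {t_memory / 100:.4f} s/plot, disk: {t_disk / 100:.4f} s/plot")
```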

Summary

In this article you’ve learned how to plot decision trees with Graphviz without any disk operations, using only basic libraries and achieving shorter code: under 10 lines. Considerations like this can be important when deploying ML in real-life situations that require scalability.

