Introduction
I have come across many articles on decision tree machine learning algorithms in Python across various mediums, but they have always left me wanting more.
They either seem to leap in part-way through the process, the code does not work when I apply it to my own data, or they omit important parts of the process.
As I could not find anything that completely fit the bill, I thought I would have a go myself, which spawned the idea for this article.
Background
If anyone needs an introduction or a refresher on how decision trees can be used to predict classifications within data, then this excellent article by R2D3 provides a step-by-step visualisation of how it works, and it is absolutely brilliant!
http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
If you would like to deepen your understanding to include some of the maths and how entropy works, then this book by Foster Provost and Tom Fawcett is a staple; it explains the more complex topics in a way that is easy and enjoyable to follow –
https://www.amazon.co.uk/Data-Science-Business-data-analytic-thinking/dp/1449361323
Beyond these resources there is a vast amount of further reading on https://towardsdatascience.com/ and https://medium.com/.
Getting Started
There are quite a lot of external libraries involved in developing and visualising decision tree models, so let’s get started by importing them and setting a couple of configuration parameters …
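The exact imports depend on which libraries appear later in the article; a minimal sketch covering the steps that follow might look like this (the pandas display options are my own illustrative choices) …

import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text

import graphviz
from dtreeviz.trees import dtreeviz

# A couple of configuration parameters so wide DataFrames display nicely
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)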
Importing and Shaping the Data
The data I have chosen to work with are the red and white wine datasets from the UCI Machine Learning Repository. They make an excellent example because we can look at the data representing the wines and then train a decision tree model to predict whether a given wine is white or red.
The dataset can be found here … https://archive.ics.uci.edu/ml/datasets/wine+quality
… and here is the reference in line with the UCI citation policy https://archive.ics.uci.edu/ml/citation_policy.html …
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science
To begin with we are going to read the red and white wine data, which comes in two different comma-separated files, set a label or target variable of 1=red and 0=white, and then produce a single DataFrame from those two sources and take a peek at the data …
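A sketch of that step; the local file names are assumptions, and note that the UCI wine quality files are actually semicolon-delimited despite the .csv extension …

# Read the two source files from the UCI Machine Learning Repository
red = pd.read_csv('winequality-red.csv', sep=';')
white = pd.read_csv('winequality-white.csv', sep=';')

# Set the target variable: 1 = red, 0 = white
red['label'] = 1
white['label'] = 0

# Produce a single DataFrame from the two sources and take a peek at the data
df = pd.concat([red, white], ignore_index=True)
df.head()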

Next we are going to take a look at the relative numbers of white and red wines in our dataset …
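One line of pandas does the job, assuming the combined DataFrame is called df …

# Relative proportion of white (0) and red (1) wines
df['label'].value_counts(normalize=True)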
0 0.753886
1 0.246114
Name: label, dtype: float64
So our wines are 75.4% white and 24.6% red. That sounds about right based on a glance at the supermarket shelves, but it needs reshaping for the purposes of our decision tree machine learning algorithm, which is going to predict the wine colour (red or white).
For these types of algorithms to work optimally the proportion of the target variable needs to be balanced, and ours is 75.4% / 24.6%.
A common technique is "down-sampling", where the smaller class is used in its entirety and the same number of data points is randomly sampled from the larger one.
In the code below we are simply taking all of the red wines and then randomly sampling the white wines to match the number of red …
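A sketch of the down-sampling, assuming the combined DataFrame from above; the random_state is an arbitrary choice for reproducibility …

red_wines = df[df['label'] == 1]
white_wines = df[df['label'] == 0]

# Randomly sample the white wines so both classes have the same number of rows
white_sample = white_wines.sample(n=len(red_wines), random_state=42)

# Recombine into a single balanced DataFrame
df_balanced = pd.concat([red_wines, white_sample], ignore_index=True)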

Let’s just double-check the proportions to make sure we have achieved a balance …
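The same value_counts check as before, this time against the balanced DataFrame …

df_balanced['label'].value_counts(normalize=True)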
1 0.5
0 0.5
Name: label, dtype: float64
Splitting the Data into Training and Testing Datasets
This next code block copies all of the "features" (acidity, chlorides etc.) into X and the "target" variable (label red / white) into y.
It then splits X and y into training and testing data. The training data will be used to train the model and the test data will be used to evaluate it.
The reason for this separation is to keep the test data out of the training so that the model avoids the "over-fitting trap", where the model performs well against the training data but poorly against new data it has never seen before.
Here goes …
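A sketch of the split, assuming the balanced DataFrame from the previous step; the test size and random_state are illustrative assumptions …

# Copy the features into X and the target variable into y
X = df_balanced.drop('label', axis=1)
y = df_balanced['label']

# Hold back a portion of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)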
Tuning the Model
Before we go ahead and create our decision tree model we need to tune the "hyper-parameters" based on the values that best match our red and white wine data, i.e. we need to select parameters for the model that give it the highest accuracy in terms of making its predictions.
If we did this manually we could just keep on tweaking the settings and re-running the model and the score but that is very time consuming and not very scientific.
Instead we can use the handy sklearn.model_selection.GridSearchCV function, which will try out all of the combinations of hyper-parameters we pass to it and let us know which are the most effective.
I have encapsulated this in the dtree_grid_search function, which accepts an X parameter containing all of the features, a y parameter containing all of the labels, and nfolds, which says how many splits of the X and y data should be used to test the hyper-parameters.
Here it is …
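A sketch of the function; the parameter grid below is an illustrative assumption rather than the exact ranges searched, and the final call with nfolds=10 is likewise an assumption …

def dtree_grid_search(X, y, nfolds):
    # Candidate hyper-parameters to try (illustrative ranges; 'auto' only
    # exists as a max_features option in older versions of scikit-learn)
    param_grid = {'criterion': ['gini', 'entropy'],
                  'max_depth': list(range(2, 15)),
                  'max_features': ['auto', 'sqrt', 'log2'],
                  'min_samples_split': [2, 5, 10]}

    # GridSearchCV tries every combination using nfolds-fold cross-validation
    dtree = DecisionTreeClassifier()
    grid_search = GridSearchCV(dtree, param_grid, cv=nfolds)
    grid_search.fit(X, y)

    return grid_search.best_params_

dtree_grid_search(X_train, y_train, 10)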

Creating the Decision Tree Model
We can see the parameters the grid search has identified as "best" or "most accurate", including criterion='entropy', max_features='auto' etc.
Please note that I opted for max_depth=3 rather than the optimised value of 9 to reduce the number of levels in the tree so that it is easier to visualise for the purposes of this article.
I consciously chose to trade accuracy (it turns out to be about ~93% accurate instead of ~95% at 9 levels) for simplicity and readability, which is a perfectly valid thing to do; it turns out that model designers perform these trade-offs all the time when building their models in the real world.
The next block of code builds our decision tree model using the optimised hyper-parameters, trains the model using the training data and then scores its effectiveness, concluding that our decision tree model can predict the target wine colour based on the values of the wine features with an accuracy of 92.8% …
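A sketch of that step using the hyper-parameters mentioned above; the random_state is my own addition for reproducibility …

model = DecisionTreeClassifier(criterion='entropy',
                               max_depth=3,          # deliberately reduced from the optimised 9
                               max_features='auto',  # 'auto' was removed in newer scikit-learn; use 'sqrt' there
                               random_state=42)

# Train on the training data, then score the model's predictions
model.fit(X_train, y_train)
model.score(X_test, y_test)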
0.9275
Feature Importance
It is worth expending two lines of code to take a quick look at the relative importance of the features …
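Something like the following, where the ascending sort and the plot styling are my own choices …

# Relative importance of each feature, plotted as a horizontal bar chart
feature_importance = pd.Series(model.feature_importances_, index=X.columns).sort_values()
feature_importance.plot.barh(figsize=(8, 6))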

It turns out that total sulfur dioxide is the most important feature with an importance of 0.6; volatile acidity and chlorides follow at 0.19 and 0.17 respectively. Below that the rest of the features have a low score for importance.
This shows us which features are influencing the model, and there might be a case for stripping out the features below chlorides and re-running the model. If the loss in accuracy is negligible but the decrease in complexity is significant then removing them might be a good trade-off.
Also, even if we reject our decision tree and explore a different machine learning algorithm like logistic regression, the information about which features are influential will still come in useful.
Visualising the Decision Tree
Turning the decision tree into an intuitive visual representation is a key step in helping us to understand how it works and what it is doing.
There are lots of libraries out there for visualising decision trees and here I have presented some of my favourites.
I begin with dtreeviz which can be installed by running the following command line –
pip install dtreeviz
… and you can review the full documentation here – https://github.com/parrt/dtreeviz
I have found many examples online that plot the Penguin dataset that comes with the Seaborn package but I struggled to find any examples that helped me to use it on my own DataFrames.
After a bit of playing around I came up with the code below, which will work on any X features DataFrame and y target Series.
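A minimal sketch, assuming the classic (pre-2.0) dtreeviz API …

from dtreeviz.trees import dtreeviz

def visualise_tree(model, X, y):
    # Works on any X features DataFrame and y target Series
    return dtreeviz(model, X, y,
                    target_name=y.name,
                    feature_names=list(X.columns),
                    class_names=['white', 'red'])  # 0 = white, 1 = red

visualise_tree(model, X_train, y_train)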
In my opinion dtreeviz provides visually stunning representations not only of the model but also of the distributions of the features and how the algorithm arrived at its decisions to fork the tree and partition the data –

The dtreeviz really is beautiful, isn’t it! Reading from the top, the decision tree machine learning algorithm chose to make the first data split based on total sulfur dioxide less than 74.50.
That caused the split of data we see in the second row, and we can easily see and understand the remaining splits until the algorithm finishes at a depth of three with 3 groups classified as one colour and 5 as the other.
However, although dtreeviz is visually stunning, there is some information missing that is included in some of the tree visualisation libraries that have been around for a while.
For example graphviz, whilst not as visually attractive, does include additional information about each node's entropy (i.e. impurity) and the number of each target variable (red and white wine) that occurs in each node.
Therefore my preference is to use both libraries, with two versions of the tree visualisation, which I find helps me to completely understand what the model is doing.
Please note that I have added a few extra lines of code below to write the tree image out to a .png file and then read it back in. Whereas it is perfectly possible to show the tree without saving it, and in fewer lines of code, it is not possible to control the scale unless you save it, and the default scale is too large to view easily; hence my choice to write it to a file …
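A sketch of that step using sklearn's export_graphviz together with the graphviz package; the file name and figure size are my own choices …

from sklearn.tree import export_graphviz
import graphviz
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Export the fitted tree to DOT format, including entropy and class counts per node
dot_data = export_graphviz(model, out_file=None,
                           feature_names=list(X.columns),
                           class_names=['white', 'red'],
                           filled=True, rounded=True)

# Write the tree image out to a .png file so the scale can be controlled ...
graphviz.Source(dot_data).render('wine_tree', format='png')

# ... then read it back in and display it at a sensible size
img = mpimg.imread('wine_tree.png')
plt.figure(figsize=(16, 10))
plt.imshow(img)
plt.axis('off')
plt.show()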

Using the Model to Make Predictions
When I first learnt about decision trees I was left wanting to know how I would use the model to make a single prediction against a new set of data, and I struggled to find an answer online.
What if I have just one bottle of wine whose colour I want to predict? How would I do it?
It turns out to be fairly straightforward and the function below will take one set of wine data, make a prediction and then return the result …
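A minimal sketch; the function name and the string return values are my own choices …

def predict_wine_colour(model, features):
    # model.predict expects a 2-D structure, so wrap the single row of
    # feature values (a pandas Series) in a one-row DataFrame
    prediction = model.predict(pd.DataFrame([features]))[0]
    return 'red' if prediction == 1 else 'white'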
The two lines of code below are just picking the top and bottom rows of data from the full dataset so we can run a single prediction and see if it matches the actual label …
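Assuming the balanced DataFrame from earlier, something like …

first_wine = df_balanced.iloc[0]    # a red wine (label = 1)
last_wine = df_balanced.iloc[-1]    # a white wine (label = 0)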


Here we go then. We are calling the single prediction function with the values from the first and last rows, and we can see that in each case the prediction matches the actual label (1="red", 0="white"); this is all we need to know to use our model to make a single prediction.
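The calls might look like this, dropping the label column so that only the features are passed in …

(predict_wine_colour(model, first_wine.drop('label')),
 predict_wine_colour(model, last_wine.drop('label')))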
('red', 'white')
Another thing we might want to do is to make a prediction for an entire file or an entire DataFrame, and that turns out to be very easy too.
The code below puts the X_test and y_test data back together to show the set of data with the actual labels and then calls the model to make predictions for each row.
It then displays the data where we can see the actual label and the predicted label side-by-side.
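A sketch of that step …

# Put the test features and their actual labels back together ...
results = X_test.copy()
results['label'] = y_test

# ... then add a prediction for every row so actual and predicted sit side-by-side
results['predicted_label'] = model.predict(X_test)
results.head()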

We can also verify the accuracy of our model using two slightly different methods.
The first is expressing those rows where the label matches the predicted_label as a percentage of the number of rows, which gives 92.75% accuracy.
The model.score method is then run, which just goes to show exactly what this library function does and how it works. It is just counting the percentage of rows where the actual and predicted values match …
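A sketch of the two checks side-by-side …

# Method 1: the proportion of rows where the actual and predicted labels match
manual_accuracy = (results['label'] == results['predicted_label']).mean()

# Method 2: the library's own scoring function
(manual_accuracy, model.score(X_test, y_test))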
(0.9275, 0.9275)
Deploying and Implementing the Decision Tree in a Production Environment
Once all the Data Science work is over the need inevitably arises to implement it in a production environment, and that environment may not be Python and may not have access to all the specialist modelling libraries we have used in this example.
The tree visualisations have given us an understanding of what the algorithm is doing, but fortunately the simplest visualisation technique – a text representation – spits out the tree in a format that can be directly translated into a series of if / else statements that can be easily encoded in any programming language.
In my environment the production systems are written in C# or VB Script so it is critical to be able to carry out this last step to "operationalise" all that data science work.
The sklearn.tree.export_text method can be used to give us a super-useful representation of our model …
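The call itself is a one-liner …

from sklearn.tree import export_text

print(export_text(model, feature_names=list(X.columns)))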

And this output can be easily converted into a function which could just as easily have been written in C#, VB Script, Java or any other programming language …
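A sketch of what such a function looks like; the first split (total sulfur dioxide <= 74.50) is taken from the tree above, but the nested thresholds are illustrative placeholders rather than the real export_text output …

def predict_wine_colour_rules(wine):
    # Hand-translated decision tree as plain if / else statements
    if wine['total sulfur dioxide'] <= 74.50:
        # Wines with low total sulfur dioxide are mostly red in this dataset
        if wine['chlorides'] <= 0.04:            # placeholder threshold
            return 'white'
        else:
            return 'red'
    else:
        # Wines with high total sulfur dioxide are mostly white
        if wine['volatile acidity'] <= 0.5:      # placeholder threshold
            return 'white'
        else:
            return 'red'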
Just to finish off, our operationalised function is tested below to verify that the labels are predicted in exactly the same way as they were using the single-row and whole-DataFrame methods above …
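The same first and last wines from earlier can be passed straight in …

(predict_wine_colour_rules(first_wine),
 predict_wine_colour_rules(last_wine))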
('red', 'white')
Conclusion
We started with some online resources that explain how decision trees work from first principles before extracting and shaping a public dataset to use as an example.
We then split the data into training and testing sets, created an optimised decision tree model, fitted it to our data and evaluated the accuracy of the model.
We then visualised our decision tree using two different powerful libraries and made predictions for individual cases and for an entire data file.
We finished by showing how the decision tree model can be converted into an if-then-else algorithm that can be easily deployed into a production environment, even if that environment uses different programming languages, for example C#.NET, VBScript, Java etc.
The full source code can be found here: