Machine Learning using Julia and its Ecosystem

Using a Decision Tree — Part 2

MLJ is a meta-package for machine learning in Julia. Its first version was released only in 2019, so its creators could build on more than a decade of experience with existing packages in this field.

Roland Schätzle
Towards Data Science
12 min read · Feb 22, 2022


Photo by Jan Huber on Unsplash

Overview of the Tutorials

This is the second part of a tutorial that shows how Julia’s specific language features and a variety of high-quality packages from its ecosystem can be easily combined for use within a typical ML workflow.

  • Part I “Analyzing the Glass dataset” concentrates on how data can be preprocessed, analyzed and visualized using packages like ScientificTypes, DataFrames, StatsBase and StatsPlots.
  • Part II “Using a Decision Tree” focuses on the core of the ML workflow: How to choose a model and how to use it for training, predicting and evaluating. This part relies mainly on the package MLJ (= Machine Learning in Julia).
  • Part III “If things are not ‘ready to use’” explains how easy it is to create your own solution with a few lines of code, if the packages available don’t offer all the functionality you need.

Introduction

In this second part of the tutorial, a machine learning algorithm will be applied to the ‘glass’ dataset which we analyzed in part I.

MLJ — Machine Learning in Julia

For this purpose, we use the MLJ package. It is a so-called meta-package which provides a common interface and common utility mechanisms for selecting, tuning, evaluating, composing and comparing a vast number of ML models. At the time of this writing, it offers 186 models.

MLJ is a project at the Alan Turing Institute in London, in collaboration with the University of Auckland and other partners from science and industry. It started in 2019.

The next steps

Using an ML algorithm typically involves the following steps:

  • Preparing the data, so that it has an adequate structure for training and testing.
  • Selecting an appropriate model which fits the data given and which produces the desired result (e.g. a classification or a clustering).
  • Training the model chosen with a training dataset.
  • Evaluating the trained model using a test dataset.

We will apply this (basic) ML workflow in the following section. We assume here that our dataset exists in the glass DataFrame created in part I.
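
As a side note, the following lines sketch the setup assumed for the rest of this part. This is an assumption on my side, not shown in the original code: in recent MLJ versions the interface package MLJDecisionTreeInterface must be installed for the DecisionTree models we load later, and exact package requirements may vary with the MLJ version.

# Packages assumed for this part (install once with e.g.
# `using Pkg; Pkg.add(["MLJ", "DataFrames", "MLJDecisionTreeInterface"])`):
using MLJ          # the meta-package: models, machines, measures
using DataFrames   # `glass` is the DataFrame created in part I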

The ML Workflow

Preparing the data

Almost all ML models expect the training dataset to be in tabular form. So the DataFrame data type which we used in part I is a good starting point.

In our case we have a so-called classification problem: We want the model to predict the correct glass type when given a set of values (namely our features RI, Na, …, Fe; the independent attributes). The model will be trained using data for the independent attributes as well as the corresponding glass type which results in each case. This is called supervised learning.

For this sort of training we need the data of the independent attributes in tabular form and the resulting types (in column Type) in a separate list. The function unpack does this job: It splits the glass DataFrame into these two components (XGlass and yGlass). Apart from this, it also shuffles the rows using a random number generator which gets initialized with the seed rng = 123.
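
A plausible form of this call is sketched below (the exact keyword handling may vary slightly between MLJ versions; the target column is Type and the seed rng = 123, as described above):

# Select column :Type as the target vector yGlass; the second predicate
# accepts all remaining columns, which form the feature table XGlass.
# The rows are shuffled reproducibly via the seed rng = 123.
yGlass, XGlass = unpack(glass, ==(:Type), colname -> true; rng = 123)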

Checking the scientific type of the resulting data structures with scitype, we can see that we have indeed got a vector and a table:

scitype(yGlass) --> AbstractVector{Multiclass{7}}
scitype(XGlass) --> Table{AbstractVector{Continuous}}

Choosing a model

Now we have to choose an appropriate model. The models function gives us a list of all models MLJ offers. There are currently 186 models in this list, which might be a bit overwhelming.

all_models = models() --> 186-element Vector{ ... }:
(name = ABODDetector, package_name = OutlierDetectionNeighbors, … )
(name = ABODDetector, package_name = OutlierDetectionPython, … )
(name = AEDetector, package_name = OutlierDetectionNetworks, … )
(name = ARDRegressor, package_name = ScikitLearn, … )
(name = AdaBoostClassifier, package_name = ScikitLearn, … )
(name = AdaBoostRegressor, package_name = ScikitLearn, … )
(name = AdaBoostStumpClassifier, package_name = DecisionTree, … )
(name = AffinityPropagation, package_name = ScikitLearn, … )
(name = AgglomerativeClustering, package_name = ScikitLearn, … )
(name = BM25Transformer, package_name = MLJText, … )
(name = BaggingClassifier, package_name = ScikitLearn, … )
(name = BaggingRegressor, package_name = ScikitLearn, … )
(name = BayesianLDA, package_name = MultivariateStats, … )
(name = BayesianLDA, package_name = ScikitLearn, … )

(name = Standardizer, package_name = MLJModels, … )
(name = SubspaceLDA, package_name = MultivariateStats, … )
(name = TSVDTransformer, package_name = TSVD, … )
(name = TfidfTransformer, package_name = MLJText, … )
(name = TheilSenRegressor, package_name = ScikitLearn, … )
(name = UnivariateBoxCoxTransformer, package_name = MLJModels, … )
(name = UnivariateDiscretizer, package_name = MLJModels, … )
(name = UnivariateFillImputer, package_name = MLJModels, … )
(name = UnivariateStandardizer, package_name = MLJModels, … )
(name = UnivariateTimeTypeToContinuous, package_name = MLJModels,…)
(name = XGBoostClassifier, package_name = XGBoost, … )
(name = XGBoostCount, package_name = XGBoost, … )
(name = XGBoostRegressor, package_name = XGBoost, … )

So we should narrow down the list a bit: First we want to see which models are basically ‘compatible’ with the data types of our dataset. All of our features have continuous numerical values, and the resulting class (the glass type) is a nominal value.

Using matching(XGlass, yGlass) with models filters exactly those models which fulfill this requirement. This reduces the number considerably to 47:

compatible_models = models(matching(XGlass, yGlass)) --> 47-element Vector{ … }:
(name = AdaBoostClassifier, package_name = ScikitLearn, … )
(name = AdaBoostStumpClassifier, package_name = DecisionTree, … )
(name = BaggingClassifier, package_name = ScikitLearn, … )
(name = BayesianLDA, package_name = MultivariateStats, … )
(name = BayesianLDA, package_name = ScikitLearn, … )
(name = BayesianQDA, package_name = ScikitLearn, … )
(name = BayesianSubspaceLDA, package_name = MultivariateStats, … )
(name = ConstantClassifier, package_name = MLJModels, … )
(name = DecisionTreeClassifier, package_name = BetaML, … )
(name = DecisionTreeClassifier, package_name = DecisionTree, … )
(name = DeterministicConstantClassifier, package_name = MLJModels, … )
(name = DummyClassifier, package_name = ScikitLearn, … )
(name = EvoTreeClassifier, package_name = EvoTrees, … )
(name = ExtraTreesClassifier, package_name = ScikitLearn, … )

(name = ProbabilisticSGDClassifier, package_name = ScikitLearn, … )
(name = RandomForestClassifier, package_name = BetaML, … )
(name = RandomForestClassifier, package_name = DecisionTree, … )
(name = RandomForestClassifier, package_name = ScikitLearn, … )
(name = RidgeCVClassifier, package_name = ScikitLearn, … )
(name = RidgeClassifier, package_name = ScikitLearn, … )
(name = SGDClassifier, package_name = ScikitLearn, … )
(name = SVC, package_name = LIBSVM, … )
(name = SVMClassifier, package_name = ScikitLearn, … )
(name = SVMLinearClassifier, package_name = ScikitLearn, … )
(name = SVMNuClassifier, package_name = ScikitLearn, … )
(name = SubspaceLDA, package_name = MultivariateStats, … )
(name = XGBoostClassifier, package_name = XGBoost, … )

Well, apart from this rather technical ‘compatibility’, we have a few other expectations of our model: For this tutorial we want a model with high explainability, i.e. it should be easy to understand how it works when classifying some data. And ideally it should be implemented in pure Julia, as this is a tutorial about Julia and ML.

Decision tree classifiers have a high level of explainability. So let’s see whether we can find some in this list which are implemented in Julia.

All models in MLJ come with comprehensive meta-information (as can be seen in the following example). E.g. the docstring contains a short textual description, is_pure_julia tells us whether the model is implemented in pure Julia (or is just a Julia wrapper around an implementation in some other programming language), and finally each model has a name. We can use this meta-information to filter our list of compatible models (using Julia’s filter function, which accepts an arbitrary anonymous function as a filter expression):

filter(
    m -> m.is_pure_julia &&
         contains(m.docstring * m.name, "DecisionTree"),
    compatible_models
)
4-element Vector{ ... }:
(name = AdaBoostStumpClassifier, package_name = DecisionTree, … )
(name = DecisionTreeClassifier, package_name = BetaML, … )
(name = DecisionTreeClassifier, package_name = DecisionTree, … )
(name = RandomForestClassifier, package_name = DecisionTree, … )

So we could reduce our list to four candidates. The first and the fourth entry are advanced decision tree models. But we prefer a ‘classic’ one, which is easy to understand for the purpose of this tutorial. That’s why we focus on candidates no. 2 and 3. The docstring of these models contains a URL to their GitHub repository, where we can also find the related documentation: No. 2 is part of the BetaML package and no. 3 can be found in the DecisionTree package.

Both implement the well-known CART algorithm and are quite similar, but having a closer look at the documentation, the latter seems to be more mature and offers a bit more functionality. So we choose this one.

For those interested, the full meta-data of candidate no. 3 can be displayed as well; such a description is available for each model in MLJ.
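
One way to display it is via MLJ’s info function from the model registry (a minimal sketch):

# Retrieve the full metadata entry of the DecisionTree
# implementation of DecisionTreeClassifier:
info("DecisionTreeClassifier", pkg = "DecisionTree")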

Using a decision tree classifier

A model can be loaded in MLJ using the @load macro. It returns a type definition (which we call here MyDecisionTree).

MyDecisionTree = @load DecisionTreeClassifier pkg = "DecisionTree"

In the next step we create an instance dc of this type. The instance contains all parameters with their default values. So we can see e.g. that the depth of the tree is not limited by default (max_depth = -1) and that it won't be pruned after creation (post_prune = false).

dc = MyDecisionTree() --> DecisionTreeClassifier(
max_depth = -1,
min_samples_leaf = 1,
min_samples_split = 2,
min_purity_increase = 0.0,
n_subfeatures = 0,
post_prune = false,
merge_purity_threshold = 1.0,
pdf_smoothing = 0.0,
display_depth = 5,
rng = Random._GLOBAL_RNG())

Training the model

We now have everything to train our model. First we split our data into a training dataset and a test dataset using a simple holdout strategy (in this case a 70:30 split), i.e. we lay 30% of our data aside for testing and use only 70% for training.

partition does this job and shuffles the rows using a random number generator (we initialize its seed with rng = 123) before applying the split. As we already did a vertical split of the data into XGlass and yGlass, these components get split separately into Xtrain/Xtest and ytrain/ytest respectively. We inform partition about this special situation by setting multi = true.

(Xtrain, Xtest), (ytrain, ytest) = 
partition((XGlass, yGlass), 0.7, multi = true, rng = 123)

So Xtrain, ytrain now contain 70% of the ‘glass’ dataset and Xtest, ytest the remaining 30%. The following figure shows how we’ve split up the dataset into the four components mentioned above:

Glass dataset split up horizontally and vertically [image by author]

Next, we apply the following steps for training and testing:

  • First, the model and the training data get connected using a so-called machine.
  • Then the training takes place by calling fit! on the machine.
  • Afterwards the trained machine can be used to do predictions on new data (using predict).

For the sake of simplicity, we restrict the depth of the resulting decision tree to 3 beforehand.

dc.max_depth = 3
dc_mach = machine(dc, Xtrain, ytrain)
fit!(dc_mach)

Predict and evaluate

So let’s see how well that model predicts on the test dataset Xtest (using MLJ’s predict function): yhat = predict(dc_mach, Xtest).

yhat now contains the predictions of our model for the test dataset Xtest. The first three elements of yhat are as follows:

First three elements of `yhat` [image by author]

Perhaps the predictions in yhat don’t look the way you might have expected: yhat isn’t a list of predicted glass types, but a list of distributions stating the predicted probabilities for each possible glass type. This stems from the fact that the selected model (a DecisionTreeClassifier) is a probabilistic model (as can be seen from its meta-information: prediction_type = :probabilistic). All probabilistic models in MLJ predict distributions (using Distributions.jl).
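
These distributions support the usual pdf interface, so individual class probabilities can be extracted directly. A small sketch (assuming the class labels from part I, e.g. “headlamps”; pdf and classes are part of MLJ’s interface for such predictions):

# Probability assigned to class "headlamps" for every test instance
# (pdf is broadcast with the dot over the vector of distributions):
pdf.(yhat, "headlamps")

# Probabilities for all classes at once, as a matrix with one row
# per instance and one column per class:
pdf(yhat, classes(yhat[1]))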

But it’s easy to get a list of predicted classes (those with the highest probability within each distribution), just by calling predict_mode (instead of predict):

yhat_classes = predict_mode(dc_mach, Xtest) --> 64-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
“headlamps”
“build wind float”
“containers”
“build wind float”
“vehic wind float”
“build wind non-float”
“build wind non-float”
“build wind float”
“headlamps”

“build wind non-float”
“build wind float”
“headlamps”
“build wind non-float”
“containers”
“build wind float”
“build wind float”
“build wind float”

On this basis we can check how many classes have been predicted correctly, by comparing ytest (the correct classes from the test set) to yhat_classes (the predicted classes from our classifier).

Note that we want to compare the elements of these two arrays with each other (not the arrays themselves). This can be achieved in Julia using the so-called broadcast mechanism, which is denoted by a dot in front of the function applied (in this case a comparison for equality using ==). So ytest .== yhat_classes does an element-wise comparison resulting in a list of Boolean values. Finally, count counts the number of true values within this list.

correct_classes = count(ytest .== yhat_classes) --> 44
accuracy = correct_classes / length(ytest) --> 0.6875

Using the result of the count we can see that 44 of the 64 instances in the test set have been predicted correctly. This is a rate of 68.75%.
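
By the way, MLJ also ships a large collection of ready-made measures, so the manual count isn’t strictly necessary. A minimal sketch (note that the variable named accuracy above would shadow MLJ’s accuracy measure of the same name, so in practice you would pick a different variable name there):

# MLJ's built-in accuracy measure reproduces the manual result:
accuracy(yhat_classes, ytest)          # --> 0.6875

# A confusion matrix additionally shows a per-class breakdown:
confusion_matrix(yhat_classes, ytest)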

Print the decision tree

Now it would be nice if we could see what the decision tree constructed by our DecisionTreeClassifier looks like. Then we could e.g. check which attributes have been used for branching and at which thresholds this took place.

Fortunately, there is a built-in function for this purpose: report is a generic function that can be called on a trained machine and delivers (depending on the model used) all relevant information about it. In the case of a decision tree, this ‘relevant information’ includes (among other things) a TreePrinter object, which can be called using print_tree().

report(dc_mach).print_tree() --> Feature 3, Threshold 2.745
L-> Feature 2, Threshold 13.77
L-> Feature 4, Threshold 1.38
L-> 2 : 6/7
R-> 5 : 9/12
R-> Feature 8, Threshold 0.2
L-> 6 : 8/10
R-> 7 : 18/19
R-> Feature 4, Threshold 1.42
L-> Feature 1, Threshold 1.51707
L-> 3 : 5/11
R-> 1 : 40/55
R-> Feature 3, Threshold 3.42
L-> 2 : 5/10
R-> 2 : 23/26

So we can e.g. see that the first branch at the root of the tree uses feature no. 3 (which is Mg = magnesium) to make a first split at threshold 2.745. I.e. all instances with an Mg value ≤ 2.745 will be further classified using the rules in the left subbranch, and instances with an Mg value > 2.745 go to the right branch.
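
To make this concrete, here is one path through the printed tree for a hypothetical instance (assuming the features are numbered in column order, 1 = RI, 2 = Na, 3 = Mg, 4 = Al, 5 = Si, 6 = K, 7 = Ca, 8 = Ba, 9 = Fe, and that class 7 corresponds to “headlamps”, as in the type coding used in part I):

# Hypothetical instance: Mg = 1.9, Na = 14.0, Ba = 0.5
# Mg = 1.9  ≤ 2.745  --> left branch at the root   (Feature 3)
# Na = 14.0 > 13.77  --> right branch              (Feature 2)
# Ba = 0.5  > 0.2    --> leaf "7 : 18/19"          (Feature 8)
# Prediction: class 7 ("headlamps"), the majority class of 18 of the
# 19 training instances that ended up in this leaf.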

The output of print_tree() above gives an impression of what the decision tree looks like … but to be honest, it’s a bit rudimentary: It’s just ASCII text, and the nodes show only the number of the attribute used for branching, not its name. Moreover, the leaves don’t show the class names that get predicted at that point. Wouldn’t it be nice to have a graphical depiction with more information?

As there is no ready-to-use function for this purpose, we have to look at how this can be achieved using other means in the Julia ecosystem. That’s what we will do in part III of the tutorial … so follow me on that path :-).

Conclusions

The example above showed how the comprehensive meta-data in MLJ can be used to choose an appropriate model and how the typical steps of an ML workflow can be applied using its common interface. This interface is the same for all models supplied, which considerably reduces the learning curve as well as the potential for application errors when using a variety of models.

In this tutorial only a small part of MLJ’s functionality could be presented. Apart from this rather bare-bones excerpt, the package offers substantial functionality in the following areas:

  • Evaluation: A vast set of measures is available to be used for evaluating models. A variety of hold-out and cross-validation strategies as well as user-defined sampling-strategies can be applied within this process.
  • Tuning: Several strategies for (semi-)automated tuning of hyperparameters are provided (and user-defined strategies can be added).
  • Composing models: MLJ offers different variations of model composition.
    Pipelines can be used for chaining (and thus automating) different processing steps of an ML workflow (a short sketch follows this list).
    Homogeneous Ensembles allow the composition of models of the same type (but with different inner workings) e.g. to reduce the effects of overfitting.
    Model Stacking is a method to create a new model out of models of different types.
    All new models created using these composition mechanisms are ‘first class citizens’. I.e. they can be used in the same way as the base models.
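
As a taste of the pipeline mechanism mentioned above, a minimal sketch (assuming the Standardizer transformer bundled with MLJ and the MyDecisionTree type loaded earlier):

# Chain a feature standardizer and the decision tree classifier
# into a single composite model using MLJ's |> pipeline syntax:
pipe = Standardizer() |> MyDecisionTree(max_depth = 3)

# The composite is a 'first class citizen': train it like any model.
pipe_mach = machine(pipe, Xtrain, ytrain)
fit!(pipe_mach)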

The advantages of scientific types (which are a building block of MLJ) have already been shown in part I of the tutorial.

In contrast to e.g. Python, where performance-critical code has to be written in C, we can apply one language (namely Julia) to the whole ML stack. Apart from the reduced complexity and maintenance effort, this approach has further advantages:

  • Hyper-parameters can be tuned using gradient-descent algorithms (which in turn use automatic differentiation based on a one-language environment).
  • Performant hardware architectures (GPUs, parallel computing) can be used without major code refactoring.

Further information

  • MLJ project home page at the Alan Turing Institute.
  • The documentation for MLJ as well as a lot of introductory material can be found here.
  • Anthony Blaom has created a series of interactive tutorials (recently with a little help from my side) based on Jupyter- and Pluto-notebooks covering all important aspects of MLJ. They originate from a workshop he held at JuliaCon 2020.
