Detection of vertebral column pathologies using decision trees

Building decision trees with Scikit-learn library

Amanda Iglesias Moreno
Towards Data Science


Photo by Joyce McCown on Unsplash

AI has tremendous potential in healthcare and has been growing continuously in this area over the last few years. The medical industry is using artificial intelligence to make smarter and more accurate decisions. The applications of machine learning in healthcare are wide-ranging, from disease diagnosis and identification to robotic surgery, in many cases providing results that go beyond human capabilities.

In this article, you will learn how to build binary classification models to detect patients with spine pathologies using decision trees. The models are built in Python with the Scikit-learn library; however, it is important to stress that many other programming languages (e.g. R and Java) also have libraries available to easily build classification models.

To make the article easy to follow, every section contains a theoretical introduction as well as explanatory illustrations. These make it easier to understand the code used for building and evaluating the classification models. ❤️

Data set

The data set used in this article is available in the UC Irvine Machine Learning Repository and contains six biomechanical features used to classify patients into 2 groups: (1) normal, and (2) abnormal. The class NO (normal) includes 100 patients with no spinal diseases, while the class AB (abnormal) contains 210 patients suffering from vertebral column problems.

https://archive.ics.uci.edu/ml/datasets/Vertebral+Column

Data reading and cleaning

The first step of the analysis consists of reading and storing the data in a Pandas data frame using the pandas.read_csv function. The CSV file available in the UCI repository does not contain a header with the column names, so they have to be specified manually with the names argument.

The data set contains 6 independent variables indicating biomechanical attributes of the spine: (1) pelvic_incidence, (2) pelvic_tilt, (3) lumbar_lordosis_angle, (4) sacral_slope, (5) pelvic_radius, (6) degree_spondylolisthesis. The column class indicates whether the patient has a spine disorder or not.
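A minimal sketch of this step. The file name column_2C.dat and the whitespace separator are assumptions about a local copy of the data set downloaded from the UCI repository; adjust them to your own download. The data frame name df is also just a choice for the examples that follow.

```python
import pandas as pd

# the raw file has no header, so the column names are passed explicitly
column_names = ['pelvic_incidence', 'pelvic_tilt', 'lumbar_lordosis_angle',
                'sacral_slope', 'pelvic_radius', 'degree_spondylolisthesis',
                'class']

# 'column_2C.dat' is an assumed local copy of the two-class data set,
# with values separated by whitespace
df = pd.read_csv('column_2C.dat', sep=r'\s+', header=None, names=column_names)

print(df.shape)  # expected: (310, 7)
```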

The objective is to build a model using all biomechanical features to predict the presence of spine pathology. In this particular case, we are going to use only decision tree algorithms to build the models; however, a more thorough analysis would include many more supervised learning algorithms with the aim of finding the model with the greatest predictive power.

Exploratory data analysis

Exploratory data analysis consists of analyzing the main characteristics of a data set usually by means of visualization methods and summary statistics. The objective is to understand the data, discover patterns and anomalies, and check assumptions before performing further evaluations.

At the beginning of EDA, we want to know as much information as possible about the data, this is when the pandas.DataFrame.info method comes in handy. This method prints a concise summary of the data frame, including the column names and their data types, the number of non-null values, and the amount of memory used by the data frame.
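For instance, assuming the data frame from the previous step is stored in df:

```python
# concise summary: column names, data types, non-null counts and memory usage
df.info()
```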

As shown before, the data set does not contain null values, and all data types are correct; therefore, the data is ready to be used in the construction of the classification model without additional modifications.

During the EDA, it is also interesting to visualize the distribution of the variables as well as the correlation between them. Pair plots are an easy way to rapidly explore both things. They provide histograms in the main diagonal of the plot to examine the distribution of the numeric variables. In addition, the positions out of the main diagonal provide scatter plots of the numeric variables to easily analyze the relationship between them.
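A minimal sketch with seaborn, assuming the data frame df from the previous steps; the points are colored by the class column so that healthy and diseased patients can be compared:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# histograms on the diagonal, scatter plots off the diagonal,
# colored by the diagnosis (class column)
sns.pairplot(df, hue='class', diag_kind='hist')
plt.show()
```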

Image generated by the author with seaborn

Looking at the pair plot, we can conclude that the attribute degree_spondylolisthesis presents high values when the patient has a spine problem, and according to the histograms, this attribute is expected to have great importance in classifying the patients.

With respect to the rest of the attributes, patients with a low pelvic_radius tend to suffer spine anomalies more frequently; the remaining attributes, in contrast, present lower values when the patient is healthy and does not suffer any spine pathology.

Data splitting: training and testing sets

The first step when building a model is to split the data into two groups, which are typically referred to as training and testing sets. The training set is used by the machine learning algorithm to build the model. The test set contains samples that are not part of the learning process and is used to evaluate the model’s performance. It is important to assess the quality of the model using unseen data to guarantee an objective evaluation.

Training and testing sets (Image created by the author)

First, we create a variable X to store the independent attributes (biomechanical features). Similarly, we create a variable y, containing only the target variable (class).

Then, we can use the train_test_split function from the sklearn.model_selection package to create both the training and testing sets.

The default split assigns 75% of the observations to the training set and 25% to the testing set; however, we can indicate a different distribution using the arguments train_size and test_size. In this case, we use the default 75/25 split.

The argument shuffle determines whether or not the data should be shuffled before splitting. It is recommended to shuffle the data so that the distribution of the target variable is similar in both sets (training and testing), meaning both sets are representative. The argument shuffle takes a default value of True; however, it can also be specified explicitly by indicating shuffle=True.

It is important to bear in mind that the results obtained each time we run the code will be the same because the parameter random_state is set to 40 rather than None (the default value). This parameter ensures that the shuffling applied to the data before splitting is always the same.
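A sketch of the splitting step, assuming the data frame df from the earlier sections and using the default 75/25 proportion together with the random_state value mentioned above:

```python
from sklearn.model_selection import train_test_split

# independent attributes (biomechanical features) and target variable
X = df.drop('class', axis=1)
y = df['class']

# default 75/25 split; shuffling with a fixed random_state makes the
# results reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, shuffle=True, random_state=40)
```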

After splitting the data, it is recommended to check that the target variable is equally distributed in both sets. As shown below, the percentage of patients with spine pathologies is similar in the training and testing sets (around 65%).
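One way to verify this, for instance, is to compare the normalized class counts of both sets:

```python
# proportion of each class (AB / NO) in the training and testing sets
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```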

The decision tree algorithm — theory

Elements

Decision trees are a supervised machine learning algorithm based on a sequence of hierarchical questions. These questions successively divide the feature space into multiple regions that are used to predict the outcome. The response can be discrete (a class) or continuous (a real number).

Decision trees are made up of three elements: the nodes, the branches, and the leaves. The nodes describe a test condition on an attribute and the branches represent the outcome based on this test condition. The leaves, also known as terminal nodes, represent the end of the decision path (the response), and unlike the internal nodes, they do not split any further.

Components of a decision tree (Image created by the author)

Lastly, it is important to mention that the decision tree algorithm is recursive, meaning the data is split into smaller and smaller chunks until all leaves are pure (if no pre-pruning technique is applied).

Advantages and disadvantages

The main advantage of decision trees is that the models obtained are highly interpretable, in contrast with other machine learning algorithms such as neural networks. Neural networks often achieve better accuracy than decision trees; however, they do not produce an easily understandable model, since it is very hard to explain how decisions are made from the weights of the network. In contrast, decision trees can be easily visualized and understood, even by non-technical people.

Another advantage is that decision trees do not require intensive pre-processing tasks such as, for instance, normalization of the data. There are machine learning algorithms that are highly sensitive to the range of the variables, for example, K-means. K-means is affected by the scale of the attributes because it is a distance-based algorithm; in this case, normalization is pretty much a requirement. If it is not applied, the features with higher values will dominate the learning process and the clusters obtained will be distorted. Decision trees, however, are based on rules rather than on distances. For this reason, they are unaffected by the range of the variables and normalization is not necessary. In addition, when working with categorical features, decision trees do not require the creation of dummy variables.

Decision trees are a powerful, easy-to-use machine learning algorithm; however, they also present disadvantages. The main problem of decision trees is overfitting: they are over-sensitive to the training set and learn the noise present in the data. If the training data contains a lot of noise, the trees obtained will be very complex and difficult to follow. Fortunately, there are two mechanisms, already implemented in most machine learning libraries, to avoid overfitting: pre-pruning and post-pruning techniques.

Advantages and disadvantages of decision trees (Image created by the author)

Splitting criteria

There are different splitting criteria for building decision trees. The Scikit-learn library has two criteria implemented to measure impurity: (1) the Gini index (default option), and (2) entropy.

The Gini index criterion is based on minimizing the probability of misclassification. An index of 0 would represent a perfect split, meaning all observations belong to the same class. On the contrary, an index of 0.5 represents the highest disorder possible (in a binary classification). This happens when the probability of all classes is the same.

Gini index formula: Gini = 1 − Σᵢ pᵢ², where pᵢ is the proportion of observations of class i in the node.

In binary decision trees, the Gini index of a split is calculated by weighting the Gini indexes of the child nodes (left and right) by the proportion of cases in each node and adding them up. The attribute with the maximum Gini gain (Gini index of the parent minus Gini index of the split) is selected as the splitting attribute.

The entropy criterion is also commonly used to split the data when building a decision tree. As shown below, this criterion ranges from 0 to 1, where 0 represents a perfect split (pure node), as the Gini criterion does. However, the maximum value of entropy is 1, in contrast with the Gini index that takes a value of 0.5.

Entropy formula: Entropy = − Σᵢ pᵢ · log₂(pᵢ), where pᵢ is the proportion of observations of class i in the node.
Entropy (image created by the author)

At each stage, the data is split on the attribute that provides the greatest information gain (entropy of the parent minus the weighted average entropy of the child nodes).

It is important to point out that there are no rules for deciding which of the splitting methods is more appropriate, Gini index or entropy. In fact, in practice, both of them produce similar results.
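The toy example below is only an illustration of both criteria; the node sizes and class counts are made up. It computes the impurity of a hypothetical parent node and its two children and derives the corresponding Gini gain and information gain:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum of p * log2(p) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# hypothetical parent node with 6 'AB' and 4 'NO' patients, split into
# a left child (5 'AB', 1 'NO') and a right child (1 'AB', 3 'NO')
parent = ['AB'] * 6 + ['NO'] * 4
left = ['AB'] * 5 + ['NO']
right = ['AB'] + ['NO'] * 3

w_left = len(left) / len(parent)
w_right = len(right) / len(parent)

# impurity of the split = weighted average impurity of the children
gini_gain = gini(parent) - (w_left * gini(left) + w_right * gini(right))
info_gain = entropy(parent) - (w_left * entropy(left) + w_right * entropy(right))

print(f"Gini gain: {gini_gain:.3f}")
print(f"Information gain: {info_gain:.3f}")
```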

Building a decision tree with Scikit-learn

After this brief theoretical explanation, it is time to build the first decision tree using the Scikit-learn library.

To do so, we will use the DecisionTreeClassifier class from the tree module. This class takes several parameters as input (all of them with a default value). In this particular case, we employ entropy as the splitting method to build the tree; therefore, we must set the parameter criterion='entropy', since the default splitting method is the Gini index. Once the decision tree is defined, we call the fit method to train the algorithm.
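A minimal sketch of these two steps, assuming the training set created earlier:

```python
from sklearn.tree import DecisionTreeClassifier

# decision tree using entropy as the splitting criterion (the default is 'gini')
clf = DecisionTreeClassifier(criterion='entropy')

# train the algorithm on the training data
clf.fit(X_train, y_train)
```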

We can visualize the decision tree obtained using the export_graphviz function from the tree module. This function exports a decision tree in dot format, which can be used by the graphviz library to generate a png image.
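The sketch below assumes that both the graphviz Python package and the Graphviz system binaries are installed; the output file name decision_tree is arbitrary:

```python
import graphviz
from sklearn.tree import export_graphviz

# export the fitted tree in dot format
dot_data = export_graphviz(clf, out_file=None,
                           feature_names=list(X.columns),
                           class_names=list(clf.classes_),
                           filled=True)

# render the dot source to a png image with the graphviz library
graphviz.Source(dot_data).render('decision_tree', format='png')
```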

A decision tree with default parameters

Without pre-pruning, the tree grows until all leaves are pure. As you can observe, all leaves have an entropy equal to 0. This yields a tree in which all the instances of the training set are correctly classified. If we do not restrict the depth of the tree, we run the risk of obtaining a complex model that is not able to generalize properly to new data. In the following section, we will apply pre-pruning methods to reduce the complexity of the tree, obtaining much simpler decision trees without loss of performance.

Feature importance

The feature importance represents how relevant a feature is in distinguishing between the different classes. In Scikit-learn, we can obtain the relative importance of each feature with the feature_importances_ attribute. The importance of a feature is computed as the (normalized) total reduction of impurity brought by that feature.
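A quick way to inspect the importances is to pair them with the feature names, for instance:

```python
import pandas as pd

# relative importance of each biomechanical feature, sorted in descending order
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```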

The higher the number, the greater the importance of the feature. As shown above, the attribute degree_spondylolisthesis presents the highest importance when predicting whether a patient has a spine problem.

Confusion matrix

The confusion matrix, also known as the error matrix, is used to evaluate the performance of a machine learning model by examining the number of observations that are correctly and incorrectly classified. Each column of the matrix contains the predicted classes while each row represents the actual classes, or vice versa. In a perfect classification, the confusion matrix is all zeros except for the diagonal; all the elements outside the main diagonal represent misclassifications. It is important to bear in mind that the confusion matrix allows us to observe patterns of misclassification (which classes were incorrectly classified and to what extent).

In binary classification problems, the confusion matrix is a 2-by-2 matrix composed of 4 elements:

  • TP (True Positive): number of patients with spine problems that are correctly classified as sick.
  • TN (True Negative): number of patients without pathologies who are correctly classified as healthy.
  • FP (False Positive): number of healthy patients that are wrongly classified as sick.
  • FN (False Negative): number of patients with spine diseases that are misclassified as healthy.
Image created by the author

Now that the model is trained, it is time to evaluate its performance using the testing set. First, we use the previous model (decision tree) to predict the class labels of the testing data (with the predict method). Then, we construct the confusion matrix using the confusion_matrix function from the sklearn.metrics package to check which observations were properly classified. The output is a NumPy array where the rows represent the true values and the columns the predicted classes.
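A minimal sketch of both steps; the labels argument is added here only to fix the row/column order so that the AB class comes first:

```python
from sklearn.metrics import confusion_matrix

# predict the class labels of the unseen testing data
y_pred = clf.predict(X_test)

# rows: true classes, columns: predicted classes (order: AB, NO)
print(confusion_matrix(y_test, y_pred, labels=['AB', 'NO']))
```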

As shown above, 65 observations of the testing data were correctly classified by the model (45 true positives and 20 true negatives). On the contrary, we can observe 13 misclassifications (7 false negatives and 6 false positives).

Evaluation metrics

Evaluating the quality of the model is a fundamental part of the machine learning process. The most used performance evaluation metrics are calculated based on the elements of the confusion matrix.

  • Accuracy: It represents the proportion of predictions that were correctly classified. Accuracy is the most commonly used evaluation metric; however, it is important to bear in mind that accuracy can be misleading when working with imbalanced datasets.
  • Sensitivity: It represents the proportion of positive samples (diseased patients) that are identified as such.
  • Specificity: It represents the proportion of negative samples (healthy patients) that are identified as such.
  • Precision: It represents the proportion of positive predictions that are actually correct.

We can calculate the evaluation metrics manually using the numbers of the confusion matrix. Alternatively, Scikit-learn has already implemented the function classification_report that provides a summary of the key evaluation metrics. The classification report contains the precision, sensitivity, f1-score, and support (number of samples) achieved for each class.
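The sketch below does both: it recomputes the sensitivity and specificity by hand from the confusion matrix counts reported above and then prints the summary generated by classification_report:

```python
from sklearn.metrics import classification_report

# counts taken from the confusion matrix above
TP, TN, FP, FN = 45, 20, 6, 7

sensitivity = TP / (TP + FN)  # proportion of sick patients detected as sick
specificity = TN / (TN + FP)  # proportion of healthy patients detected as healthy
print(f"sensitivity: {sensitivity:.2f}, specificity: {specificity:.2f}")

# precision, recall (sensitivity), f1-score and support for each class
print(classification_report(y_test, y_pred))
```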

As shown above, we obtain a sensitivity of 0.87 (45/(45+7)) and a specificity of 0.77 (20/(20+6)). The model predicts patients with spine pathologies more accurately. This should not surprise us, since decision trees are usually biased toward the classes with more observations (in this case, AB).

As you may have noticed, the previous summary does not contain the accuracy of the classification. However, this can be easily calculated using the function accuracy_score from the metrics module.
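For instance:

```python
from sklearn.metrics import accuracy_score

# proportion of testing observations correctly classified by the model
print(accuracy_score(y_test, y_pred))
```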

The proportion of correct predictions made by the model is 83.33%.

Pre-pruned decision tree

Decision trees are prone to overfitting; for this reason, it is common to apply pruning techniques to prevent the tree from fitting the noise in the data. These techniques can be classified into two groups:

  • Pre-pruning: this approach prevents the tree from continuing to grow once a given condition is met.
  • Post-pruning: the tree grows until all leaves are pure; then, the branches that are not significant are converted into leaves.

The following pre-pruning parameters can be passed to the DecisionTreeClassifier constructor to prevent the tree from continuing to grow:

  • The limitation of the maximum depth of the tree with the parameter max_depth. This parameter takes a default value of None, meaning the tree will grow until all the observations of the training data are correctly classified. It is important to be careful when limiting this parameter, since low values tend to create under-fitted models (too general).
  • The limitation of the maximum number of leaves with the parameter max_leaf_nodes. This parameter sets the maximum number of final nodes that the decision tree can have. By default, it takes a value of None.
  • The determination of the minimum number of observations in a node to continue splitting it with the parameter min_samples_split. This parameter takes a default value of 2 since a node with two samples could be split into two nodes (each with one sample).

The code below creates a decision tree model with pre-pruning. In this case, the maximum number of levels of the decision tree is set to 8, so the tree stops growing after 8 rounds of splitting. For this reason, a much simpler tree than before is expected, since without pre-pruning the number of levels of the decision tree exceeds 8.
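A sketch of the pre-pruned tree; apart from max_depth, the hyperparameters are assumed to be the same as in the previous model:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# pre-pruning: the tree stops growing after 8 levels
clf_pruned = DecisionTreeClassifier(criterion='entropy', max_depth=8)
clf_pruned.fit(X_train, y_train)

# evaluate the pruned tree on the testing set
y_pred_pruned = clf_pruned.predict(X_test)
print(classification_report(y_test, y_pred_pruned))
print(accuracy_score(y_test, y_pred_pruned))
```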

As shown below, the evaluation metrics with pre-pruning are pretty similar to those previously obtained. The proportion of correct predictions made by the model is 82.05%. And, as before, the decision tree predicts more accurately patients with spine pathologies (the sensitivity is higher than the specificity).

However, the decision tree with pre-pruning is simpler and easier to understand, as shown in the graph below.

A decision tree with pre-pruning

In this case, the decision tree does not perfectly fit the training data. There are some leaves with an entropy higher than 0. As a result, the tree obtained is much easier to interpret than the previous one.

Pre-pruning methods are more efficient (less time complexity) than post-pruning techniques, as it is not necessary to generate the whole tree and then delete the branches that do not provide greater accuracy to the model. However, as you may have thought, pre-pruning methods have a disadvantage. It is usually difficult to precisely estimate when to stop growing the tree.

Weighted decision tree

In some classification problems, there are specific errors that are more severe than others. In medical diagnosis problems, it is generally much worse to commit a false negative misclassification, since that represents diagnosing a patient as healthy, when indeed she/he has a disease (in this case a spine problem).

The cost matrix allows us to overcome this problem by penalizing costly misclassifications (in this case false negatives) more severely; it basically makes some misclassifications more expensive than others. The cost matrix is an n×n matrix (n being the number of classes) in which all the elements of the main diagonal are equal to zero, as they represent correct classifications (true positives and true negatives). The remaining elements are greater than zero, with one of the off-diagonal weights larger than the other in order to punish one type of misclassification more severely (in this case, false negatives).

We can implement cost-sensitive decision trees in Scikit-learn via the parameter class_weight. This parameter is a dictionary where the keys are the class labels and the values are the weights of the cost matrix.

When building the decision tree, the algorithm takes these error costs into account by weighting the samples of each class in the splitting criterion. In the following example, the misclassification cost of a false negative is 4 times larger than the misclassification cost of a false positive.
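A sketch of this step; the class_weight dictionary encodes the 4-to-1 cost described above, while the remaining hyperparameters are assumptions carried over from the earlier models:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# misclassifying an 'AB' (sick) patient, i.e. a false negative, is 4 times
# more costly than misclassifying a 'NO' (healthy) patient
clf_weighted = DecisionTreeClassifier(criterion='entropy',
                                      class_weight={'AB': 4, 'NO': 1})
clf_weighted.fit(X_train, y_train)

y_pred_weighted = clf_weighted.predict(X_test)
print(classification_report(y_test, y_pred_weighted))
```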

As a consequence, the weighted decision tree presents higher sensitivity (90%) than the previous models. In this case, we detect ill patients more effectively, getting only 5 false negatives. The downside is the reduction in specificity: healthy patients are now detected less reliably.

This model is even more biased toward the class with more observations (AB, patients with spine pathologies), which matches the class that we want to predict with higher accuracy. It is important to keep in mind that in healthcare false negatives are much worse than false positives: telling a sick patient that they are healthy may become life-threatening, as they fail to receive the treatment they require.

Choosing hyperparameters — grid search

The choice of hyperparameters can highly influence the performance of a machine learning model. Grid search is a common practice to find the optimal hyperparameters. With this technique, we analyze all combinations of hyperparameters and we finally choose the best-performing ones. We can easily implement grid search in Scikit-learn using the GridSearchCV class from the sklearn.model_selection package.

First of all, we specify the set of parameter values using a dictionary (grid_parameters) where the keys are the hyperparameters and the values are the set of options we want to evaluate. Then, we define the GridSearchCV object for trying different parameter combinations. This class takes as input the following parameters (among other default values):

  • estimator: the machine learning algorithm employed for hyperparameter tuning.
  • param_grid: dictionary with hyperparameter names and values.
  • scoring: the performance metric that we want to maximize.

After fitting the grid object, we can obtain the best parameters using the best_params_ attribute. In this particular case, we want to maximize the sensitivity (recall score), since we want to focus on correctly predicting patients with pathologies; we do not want to send any sick patients back home.
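The sketch below follows this procedure. The candidate values in grid_parameters are assumptions chosen to match the 2*9*3 = 54 combinations mentioned later in the article, and the scorer is built with make_scorer so that the recall of the AB (sick) class is maximized:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, recall_score
from sklearn.tree import DecisionTreeClassifier

# candidate hyperparameter values (2 * 9 * 3 = 54 combinations);
# the specific values are illustrative assumptions
grid_parameters = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
    'class_weight': [None, {'AB': 2, 'NO': 1}, {'AB': 4, 'NO': 1}],
}

# maximize the sensitivity (recall) of the AB class
sensitivity_scorer = make_scorer(recall_score, pos_label='AB')

grid_search = GridSearchCV(estimator=DecisionTreeClassifier(random_state=40),
                           param_grid=grid_parameters,
                           scoring=sensitivity_scorer)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)

# predictions with the best estimator found by the grid search
y_pred_grid = grid_search.predict(X_test)
```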

Finally, we use the GridSearchCV object to make predictions using the best parameters ({'class_weight': {'AB': 4, 'NO': 1}, 'criterion': 'gini', 'max_depth': 3}). As shown above, the model presents a sensitivity of 96% (much higher than the previously obtained sensitivities). However, this model predicts patients without spine pathologies with lower accuracy (specificity of 54%).

Grid search is a widely used methodology for hyperparameter tuning; however, this technique has a major drawback — time complexity. A model (decision tree) is built for each combination of hyperparameters (in this case 2*9*3=54 models). If the number of hyperparameters and values is really high, the number of models to assess will escalate accordingly.

Important consideration — Randomness

In Scikit-learn, the DecisionTreeClassifier will randomly select a feature when two splits are tied, meaning their improvement criterion is identical. That is why different runs of the code may produce different outcomes, so do not worry if you rerun the code and obtain results different from those provided in this article. To guarantee the same results every time the code is run, we should set a random seed using the random_state parameter when initializing the DecisionTreeClassifier constructor.

Building classification models with Scikit-learn is an easy and straightforward procedure, thanks to the large number of algorithms and evaluation metrics already implemented, which makes the process a really smooth task. Decision trees do not usually reach the accuracy of deep learning algorithms; however, they are easily interpretable, even by non-technical users. The results obtained can help doctors and researchers to understand which features contribute the most to spine pathologies, thereby improving their detection capabilities.

Thanks for reading ❤️

Amanda Iglesias
