One Step to Make Decision Trees Produce Better Results

Background, implementation, and model improvement

Gabe Verzino
Towards Data Science


Among the trees (photo by author)

Decision trees (DT) get ditched much too soon.

It happens like this:

The DT is trained. The usual overfitting appears. Hyperparameters get tuned (unsatisfactorily). Finally, the tree is swapped out for a Random Forest.

While that may be a quick win for performance, the swap prioritizes a “black box” algorithm. That’s not ideal. Only a DT can produce intuitive results, let business leaders compare trade-offs, and give them a critical role in process improvement.

What you can’t understand or explain doesn’t make it to production. This is especially true in industries where even small failures carry extreme risk, like healthcare.

(Side note: people often ask, “Random Forests produce feature importances, so doesn’t that explain which features are important?” Not really. Feature importances almost immediately get interpreted as causal drivers (i.e., features the target depends on), but they are nothing more than model drivers. Helpful to the technician in that narrow respect, feature importances are otherwise generally (1) useless in weak models, (2) inflated for high-cardinality features, and (3) biased toward correlated features. That is another line of inquiry altogether, but that’s basically the rub.)

How Decision Trees make decisions

Sticking with DTs will preserve your ability to communicate results well, but how do you make them performant? Hyperparameter tuning only gets you so far. And thoughtful feature engineering should be done regardless.

As it turns out, the way your feature data is structured can suit the underlying DT algorithm more or less well, and restructuring it can help the DT produce better decisions.

Under the hood, DTs separate classes by drawing orthogonal decision boundaries (perpendicular splits) through the data you supply. They do this greedily: splitting first on the feature that separates the classes best, then moving on to less optimal splits on other features.
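
To see this first-hand, you can fit a shallow tree and read off the split it chose; every split is a single feature compared against a single threshold, which is exactly a perpendicular boundary. A minimal sketch, using the same breast cancer dataset that appears later in this article:

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# A depth-1 "stump" makes exactly one (greedy) split, so we can inspect it directly
data = load_breast_cancer()
stump = DecisionTreeClassifier(max_depth=1, random_state=42)
stump.fit(data.data, data.target)

# The root node's chosen feature and threshold define one orthogonal decision boundary
root_feature = data.feature_names[stump.tree_.feature[0]]
root_threshold = stump.tree_.threshold[0]
print(f"Root split: {root_feature} <= {root_threshold:.2f}")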

We can visually inspect our features for clean orthogonal decision boundaries. Let’s view features from the publicly available breast cancer dataset below. In the top plot, “Worst Perimeter” against “Mean Perimeter” produces a good orthogonal decision boundary that separates the Malignant and Benign classes well, so these would be great features to include in a DT model.

Picture by author

The bottom plot above shows “Mean Area” and “Mean Perimeter”. The DT still drew orthogonal decision boundaries (as it inherently does), but they are unnecessarily convoluted; a diagonal boundary would probably have separated these classes better, but that’s not how DT classifiers split. DTs are also very sensitive to even small variations in the training data, such as outliers, which can produce vastly different trees.
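
The exact plotting code isn’t shown in this article, but a comparison like the one above can be reproduced with a sketch along these lines, fitting a two-feature DT for each pair discussed above and shading its prediction regions (the plotting details are my own assumption):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()

def plot_dt_boundary(f1, f2):
    # Fit a two-feature tree and shade its axis-aligned decision regions
    i, j = list(data.feature_names).index(f1), list(data.feature_names).index(f2)
    X2 = data.data[:, [i, j]]
    tree = DecisionTreeClassifier(random_state=42).fit(X2, data.target)

    xx, yy = np.meshgrid(
        np.linspace(X2[:, 0].min(), X2[:, 0].max(), 300),
        np.linspace(X2[:, 1].min(), X2[:, 1].max(), 300),
    )
    zz = tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    plt.contourf(xx, yy, zz, alpha=0.2, cmap="coolwarm")
    plt.scatter(X2[:, 0], X2[:, 1], c=data.target, cmap="coolwarm", s=10)
    plt.xlabel(f1)
    plt.ylabel(f2)
    plt.show()

plot_dt_boundary("worst perimeter", "mean perimeter")  # clean orthogonal boundary
plot_dt_boundary("mean area", "mean perimeter")        # convoluted, stair-step boundary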

To accommodate this underlying mechanism of DTs, and ultimately improve performance and generalizability, Principal Component Analysis (PCA) can be applied.

PCA improves DT performance in two important ways:

(1) it reorients the data along the directions that explain the most variance

(2) it reduces the feature space

In fact, the PCA + DT process naturally surfaced the “Worst Perimeter” and “Mean Perimeter” features that you see above in the top plot. These are two of the most predictive variables and not surprisingly have an excellent orthogonal decision boundary.
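
As a quick illustration of both points, consider running PCA on just the correlated “Mean Area” / “Mean Perimeter” pair from the convoluted bottom plot. The first principal component captures nearly all of their joint variance (helped by the fact that area sits on a much larger scale), so a DT working in PCA space can split along a single new axis rather than carving a staircase through the diagonal. A small sketch, separate from the main pipeline below:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

data = load_breast_cancer()
names = list(data.feature_names)
pair = data.data[:, [names.index("mean area"), names.index("mean perimeter")]]

# Rotate the correlated pair onto its principal axes
pca_pair = PCA(n_components=2).fit(pair)
print("Explained variance ratio:", np.round(pca_pair.explained_variance_ratio_, 4))
# The first component carries virtually all of the variance for this pair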

Implementing the Process

Recall that PCA is designed for continuous data; the breast cancer dataset is composed entirely of continuous variables. (Another side note: I see PCA being used on categorical variables, which is not recommended. Nominal levels have no implied distances, ordinal levels aren’t always equidistant, and enforcing a distance representation on discrete features generally turns the variable into something meaningless. Another tangent for another time.)

Let’s begin by importing the needed packages and loading the breast cancer dataset into features X and target variable y.

import numpy as np
import pandas as pd  # used below to inspect the dataframe and PCA components
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

For inspection, the head of this dataset’s dataframe can be displayed.

cancer = load_breast_cancer()
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                  columns=np.append(cancer['feature_names'], ['target']))
df.head()
Picture by author

First, train the DecisionTreeClassifier without PCA and collect its predictions (original_predictions).

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a Decision Tree Classifier on the non-PCA-transformed dataset
original_tree = DecisionTreeClassifier(random_state=42)
original_tree.fit(X_train, y_train)

# Predictions on the original dataset
original_predictions = original_tree.predict(X_test)

Now, apply PCA to select the smallest number of dimensions that still explain nearly all of the variance in the training set. Instead of choosing that number arbitrarily, the “elbow method” can be used together with a cumulative-variance cutoff (hard-coded to 99.999% below) to identify it.

# Finding the optimal number of PCA components using the elbow method
pca = PCA()
pca.fit(X_train)

explained_variance = pca.explained_variance_ratio_
cumulative_explained_variance = np.cumsum(explained_variance)

# Plot explained variance
plt.plot(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA Explained Variance')
plt.grid()
plt.show()

# Determine the number of components needed to reach the cumulative variance cutoff
optimal_num_components = np.where(cumulative_explained_variance >= 0.99999)[0][0] + 1

print(f"Optimal number of PCA components: {optimal_num_components}")

Based on where the curve makes an “elbow” (and on the 99.999% cutoff hard-coded above), 6 principal components are enough to explain essentially all of the training set variance.

Picture by author

Now apply PCA to capture 6 principal components on the training set. scikit-learn does this via Singular Value Decomposition (SVD), a standard matrix factorization technique (beyond the scope of this article). As before, train the DecisionTreeClassifier, this time on the PCA-transformed training set, and collect those predictions (pca_predictions).

# Apply PCA with the optimal number of components
pca = PCA(n_components=optimal_num_components, svd_solver="full")
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Fit a Decision Tree Classifier on the PCA-transformed dataset
pca_tree = DecisionTreeClassifier(random_state=42)
pca_tree.fit(X_train_pca, y_train)

# Predictions on the PCA-transformed dataset
pca_predictions = pca_tree.predict(X_test_pca)

# Confusion matrices for both models
original_cm = confusion_matrix(y_test, original_predictions)
pca_cm = confusion_matrix(y_test, pca_predictions)

# Precision and Recall scores for the original dataset
original_precision = precision_score(y_test, original_predictions, average='weighted')
original_recall = recall_score(y_test, original_predictions, average='weighted')
original_accuracy = accuracy_score(y_test, original_predictions)

# Precision and Recall scores for the PCA-transformed dataset (same weighted averaging for a fair comparison)
pca_precision = precision_score(y_test, pca_predictions, average='weighted')
pca_recall = recall_score(y_test, pca_predictions, average='weighted')
pca_accuracy = accuracy_score(y_test, pca_predictions)

# Output precision and recall scores
print(f"Original Dataset - Precision: {original_precision:.4f}, Recall: {original_recall:.4f}, Accuracy: {original_accuracy:.4f}")
print(f"PCA-Transformed Dataset - Precision: {pca_precision:.4f}, Recall: {pca_recall:.4f}, Accuracy: {pca_accuracy:.4f}")

Now we can compare our original_predictions (non-PCA-transformed) to the pca_predictions (PCA-transformed) to observe any relative improvement in our evaluation metrics of accuracy, precision and recall.

Compared with the DT trained on the original features, training the DT on the PCA-transformed dataset first yields improvement across the board in accuracy, precision, and recall.

We can plot the confusion matrices to show the relative improvement in classifying Malignant and Benign tumors between the two DTs.

# Plot the confusion matrices
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.heatmap(original_cm, annot=True, fmt="d", cmap="Blues", xticklabels=data.target_names, yticklabels=data.target_names)
plt.title("Original Decision Tree Confusion Matrix\nPrecision: {:.2f}, Recall: {:.2f}".format(original_precision, original_recall))
plt.xlabel("Predicted")
plt.ylabel("True")

plt.subplot(1, 2, 2)
sns.heatmap(pca_cm, annot=True, fmt="d", cmap="Blues", xticklabels=data.target_names, yticklabels=data.target_names)
plt.title("Decision Tree with PCA Confusion Matrix\nPrecision: {:.2f}, Recall: {:.2f}".format(pca_precision, pca_recall))
plt.xlabel("Predicted")
plt.ylabel("True")

plt.tight_layout()
plt.show()
Picture by author

Finally, it’s valuable to identify which of our original features contribute to the 6 principal components. Technically, PCA generates new features that are linear combinations of the original ones; these new features are orthogonal to each other and are ranked by the variance they explain. Inspecting the components_ attribute reveals which original features contribute most to each component.

# Get the explained variance ratio of each principal component
explained_variance_ratio = pca.explained_variance_ratio_

# Get the PCA components
pca_components = pca.components_

# Create a DataFrame to display the contributions of original features to each principal component
df_components = pd.DataFrame(data=pca_components, columns=data.feature_names)

# Print the top features contributing to each principal component
for i in range(optimal_num_components):
    print(f"Top features for PC{i+1}:")
    sorted_features = df_components.iloc[i].abs().sort_values(ascending=False)
    print(sorted_features.head())
    print("\nExplained Variance Ratio:", explained_variance_ratio[i])
    print("=" * 50)

And so, for the 6 principal components we selected, the original features contributing most to them were the following five:

Picture by author

Conclusion

Decision Trees get ditched too soon in favor of more performant algorithms. While top performance matters, it may not be what matters most; that decision ultimately depends on your stakeholders’ needs and on being able to explain why a model is suggesting a particular outcome (see “explainable AI”).

Instead of reaching for the most technically advanced algorithm, work to optimally prep your data through thoughtful feature engineering and Principal Component Analysis to give decision trees their best chance of revealing their intuitive decision-making capabilities.
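
If you want to package this prep step with the model, one convenient option is a scikit-learn Pipeline, which keeps the PCA fit restricted to the training data within each cross-validation fold. A minimal sketch mirroring the code above (the 0.99999 variance target matches the cutoff hard-coded earlier; the pipeline wrapper itself is my suggestion, not part of the walkthrough):

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# PCA with a float n_components keeps the fewest components explaining that share of variance
pca_dt = Pipeline([
    ("pca", PCA(n_components=0.99999, svd_solver="full")),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

scores = cross_val_score(pca_dt, X, y, cv=5, scoring="accuracy")
print(f"Cross-validated accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")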

Thanks for reading. Happy to connect with anyone on LinkedIn! If you would like to share any interesting data science challenges that you are facing currently, please leave a comment or DM and I’ll be happy to try and explore/write about it.

My most recent articles:

Debugging Logistic Regression Errors — What they mean, how to do

Using Bayesian Networks to Forecast Ancillary Service Volume in Hospitals

Why Balancing Classes is Over-Hyped
