
Lookahead Decision Tree Algorithms

Lookahead mechanisms in decision trees can produce better predictions

Image by Steve Buissinne from Pixabay

TL;DR: I show that decision trees with a single-step lookahead mechanism can outperform standard, greedy decision trees (no lookahead). No overfitting or lookahead pathology is observed in the sample dataset.


Why lookahead?

Suppose we are trying to predict whether a potential job candidate will be successful in the role.

Image by Pexels from Pixabay

We have a historical dataset that enables us to build a predictive model.

Image by Author

"Success" is our target variable and we have 3 categorical features to use in our predictive model:

  • Role Seniority
  • Candidate Seniority
  • Academic Qualifications

Let’s keep our example simple and assume that these features take one of two values: "High" or "Low".
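The dataset itself appears in the image above. To follow along in code, here is one hypothetical 8-row reconstruction — an assumption on my part, not the original table — that is consistent with every split statistic quoted later in the article.

import pandas as pd

# Hypothetical reconstruction of the historical dataset (assumed, not the
# original table). "Success" follows the matching rule explained later in
# the article: a candidate succeeds exactly when candidate seniority
# matches role seniority.
df = pd.DataFrame({
    "Role Seniority":          ["High", "Low", "High", "Low", "Low", "High", "Low", "High"],
    "Candidate Seniority":     ["High", "Low", "High", "Low", "High", "Low", "High", "Low"],
    "Academic Qualifications": ["High", "High", "High", "Low", "High", "High", "Low", "Low"],
})
df["Success"] = (df["Role Seniority"] == df["Candidate Seniority"]).astype(int)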

Suppose we build a 2-level decision tree with scikit-learn’s otherwise-default parameters. Below is the tree that is generated.

Image by author

Note that the decision tree generates 4 final leaves/buckets into which a candidate can fall. The predicted success of the candidate is 67%, 50%, 0%, or 50% depending on which leaf/bucket they fall into.

We do not get a clear-cut set of leaves in which candidates are either 100% or 0% successful across all buckets, even though the goal of the decision tree is to discriminate potentially successful candidates from unsuccessful ones.
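To reproduce this greedy tree on the hypothetical dataset above, a quick scikit-learn sketch (binary-encoding "High" as 1) should recover the same 67%/50%/0%/50% leaves, although tie-breaking between the two seniority features may order the second-level splits differently.

from sklearn.tree import DecisionTreeClassifier, export_text

# Encode High/Low as 1/0 so the tree can split on the features
X = (df[["Role Seniority", "Candidate Seniority",
         "Academic Qualifications"]] == "High").astype(int)
y = df["Success"]

greedy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
greedy_tree.fit(X, y)
print(export_text(greedy_tree, feature_names=list(X.columns)))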

There is actually a better split that could’ve been generated…

In our example, if the candidate seniority and the role seniority are both high or both low, the candidate is successful, because there is a good match between the candidate’s seniority and the seniority required for the role. This happens 100% of the time. The opposite is also true: if the candidate seniority is high but the role seniority is low, or vice versa, the candidate is successful 0% of the time.

If we construct the tree as below, we can place the candidates into leaves that completely separate the successful candidates from the unsuccessful ones.

Image by author

The reason standard implementations of decision trees construct the former, sub-optimal tree is that decision trees are greedy and assess only locally optimal solutions.

The first split occurs on the "Academic Qualifications" feature because it is the local optimum at which the decision tree can best discriminate between successful and unsuccessful candidates: 60% (3/5) of candidates with high "Academic Qualifications" were successful in the data, whereas only 33% (1/3) of candidates with low "Academic Qualifications" were.

If we split on "Candidate Seniority" or "Role Seniority" in the first step, we end up with a 50% success rate for candidates with both high and low "Candidate Seniority" or "Role Seniority"; the model cannot make any meaningful prediction.

Therefore, the decision tree prefers to split on the local optimum, "Academic Qualifications", rather than performing a deeper search.

But if the decision tree were able to look ahead one step, it would find a better split based on "Candidate Seniority" in the first step and "Role Seniority" in the second step, or the other way around. See the data.


Cost of looking ahead

A decision tree without a lookahead mechanism chooses the split that maximizes information gain. Typically, the Gini impurity (scikit-learn’s default option) or the entropy is measured after each possible split in order to choose the optimal one. While the chosen split maximizes the immediate information gain, it does not consider the information gain of subsequent splits.

In our example above, splitting on "Academic Qualifications" first produces the best Gini impurity score after the first split. But it does not give us a path to a second-level split with a perfect Gini impurity score of 0, i.e. the ability to segregate candidates into buckets that are 100% successful or unsuccessful.

Splitting by either "Role Seniority" or "Candidate Seniority" does.

Gini impurity: $GI = \sum_{i=1}^{C} p(i)\,(1 - p(i))$

When adding a one-step lookahead, at every split we optimize by considering the subsequent split as well. In the above example, we would find that splitting by "Role Seniority" is better than "Academic Qualifications", because if we follow-up with a split based on "Candidate Seniority", we can achieve a perfect Gini-impurity score of 0.
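To make the numbers concrete, here is a quick check using the hypothetical dataset from earlier (1 = High, 0 = Low): the greedy criterion prefers "Academic Qualifications" (weighted impurity ≈ 0.467 versus 0.5 for the seniority splits), even though only the seniority splits admit a perfect follow-up.

import numpy as np

def gini(y):
    # GI = 1 - sum_i p(i)^2, equivalent to sum_i p(i) * (1 - p(i))
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def weighted_gini(y, mask):
    # Impurity after a binary split, weighted by branch size
    n = len(y)
    return mask.sum() / n * gini(y[mask]) + (~mask).sum() / n * gini(y[~mask])

role = np.array([1, 0, 1, 0, 0, 1, 0, 1])
cand = np.array([1, 0, 1, 0, 1, 0, 1, 0])
acad = np.array([1, 1, 1, 0, 1, 1, 0, 0])
success = (role == cand).astype(int)

print(weighted_gini(success, acad == 1))  # ~0.467 -> greedy winner
print(weighted_gini(success, role == 1))  # 0.5
print(weighted_gini(success, cand == 1))  # 0.5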

Even with a single step of lookahead, we need to assess all possible combinations of current and subsequent splits, which is a very expensive operation that requires a lot of computational resources. With, say, 20 features and 10 candidate thresholds each, a greedy tree evaluates roughly 200 candidate splits per node, whereas a one-step lookahead evaluates on the order of 200² = 40,000 split pairs per node.

Furthermore, there is a phenomenon known as lookahead pathology, in which looking ahead actually leads to a worse outcome and larger decision trees. Lookahead pathology is also observed in robotics and in games such as chess, where simple heuristics with no or limited lookahead tend to outperform more extensive searches that look further ahead.

Image by PIRO4D from Pixabay

In our job candidate example this wasn’t the case, but it does happen. There is no guarantee that a decision tree’s accuracy will improve by looking ahead.


Setting up our experiment

We will run a simple experiment to test whether lookahead adds any accuracy to a decision tree classifier. There are readily available implementations of decision trees with lookahead mechanisms (e.g., J48 in Weka), but we will build a custom classifier from scratch in Python.

We first prepare a sample classification dataset with 2 classes and split it into a training and a testing dataset.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

RANDOM_STATE = 1
clf_data = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_redundant=2,
    n_classes=2,
    shuffle=True,
    random_state=RANDOM_STATE,
)
X_train, X_test, y_train, y_test = train_test_split(
    clf_data[0], clf_data[1], test_size=0.2, random_state=RANDOM_STATE)

The code below is a custom implementation of decision trees that is simple and slow, but it does the job. It includes a single-step lookahead and allows modifying the max depth along with the number of split points/bins (based on percentiles) to consider for each feature. When the number of split points is greater than or equal to the sample size, we simply assess splits at every value; a lower number of split points is a form of regularization, where we assess potential split values for each feature based on percentiles.
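The original CustomDT listing does not appear in this text, so below is a minimal sketch of what such a classifier could look like. The interface (max_depth, ppoints, train, predict, lookahead_depth) is inferred from how CustomDT is used in the snippets that follow, and the exact semantics of lookahead_depth — here, the depth of the greedy subtree grown to score each candidate split — are an assumption, not the original code.

import numpy as np

class CustomDT:
    """Minimal sketch of a decision tree with an optional greedy lookahead.

    NOT the original implementation; the interface is inferred from
    how CustomDT is used in this article. Slow: for illustration only.
    """

    def __init__(self, max_depth=3, ppoints=10):
        self.max_depth = max_depth
        self.ppoints = ppoints  # number of percentile-based split candidates
        self.tree_ = None

    @staticmethod
    def _gini(y):
        # Gini impurity of a binary label vector
        p = np.bincount(y.astype(int), minlength=2) / len(y)
        return 1.0 - np.sum(p ** 2)

    def _thresholds(self, col):
        # Candidate thresholds at evenly spaced percentiles of the feature
        qs = np.linspace(0, 100, self.ppoints + 2)[1:-1]
        return np.unique(np.percentile(col, qs))

    def _score_subtree(self, X, y, depth):
        # Impurity reachable by growing `depth` further greedy levels
        if depth <= 0 or self._gini(y) == 0.0:
            return self._gini(y)
        split = self._best_split(X, y, lookahead=0)
        if split is None:
            return self._gini(y)
        j, t = split
        m = X[:, j] <= t
        return (m.sum() * self._score_subtree(X[m], y[m], depth - 1)
                + (~m).sum() * self._score_subtree(X[~m], y[~m], depth - 1)) / len(y)

    def _best_split(self, X, y, lookahead):
        best, best_score = None, self._gini(y)
        for j in range(X.shape[1]):
            for t in self._thresholds(X[:, j]):
                m = X[:, j] <= t
                if m.all() or not m.any():
                    continue
                if lookahead > 0:
                    # Lookahead: judge the split by the impurity a greedy
                    # subtree of `lookahead` more levels reaches below it
                    score = (m.sum() * self._score_subtree(X[m], y[m], lookahead)
                             + (~m).sum() * self._score_subtree(X[~m], y[~m], lookahead)) / len(y)
                else:
                    # Greedy: judge the split by its immediate child impurity
                    score = (m.sum() * self._gini(y[m])
                             + (~m).sum() * self._gini(y[~m])) / len(y)
                if score < best_score:
                    best, best_score = (j, t), score
        return best

    def _grow(self, X, y, depth, lookahead):
        if depth == 0 or self._gini(y) == 0.0:
            return y.mean()  # leaf: probability of the positive class
        split = self._best_split(X, y, lookahead)
        if split is None:
            return y.mean()
        j, t = split
        m = X[:, j] <= t
        return (j, t, self._grow(X[m], y[m], depth - 1, lookahead),
                self._grow(X[~m], y[~m], depth - 1, lookahead))

    def train(self, X, y, lookahead_depth=0):
        # lookahead_depth - 1 further greedy levels are explored below each
        # candidate split (an assumed reading of the original's parameter)
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.tree_ = self._grow(X, y, self.max_depth,
                                max(lookahead_depth - 1, 0))
        return self

    def predict(self, X, prob=False):
        def leaf(x):
            node = self.tree_
            while isinstance(node, tuple):
                j, t, left, right = node
                node = left if x[j] <= t else right
            return node
        probs = np.array([leaf(x) for x in np.asarray(X, dtype=float)])
        return probs if prob else (probs >= 0.5).astype(int)

With lookahead_depth left at 0 and a large ppoints, this sketch reduces to a plain greedy CART-style tree, which is what makes the sanity check against scikit-learn below meaningful.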

When we don’t include any lookahead and choose a large number of split points, the predictions produced by the decision tree should be almost identical (not necessarily exact, due to an element of randomness) to those of scikit-learn’s DecisionTreeClassifier with default parameters.

We run a quick check…

from sklearn.tree import DecisionTreeClassifier

# Sklearn decision tree
sk_dt = DecisionTreeClassifier(max_depth=3,
                               random_state=RANDOM_STATE)
# Custom decision tree
custom_dt = CustomDT(max_depth=3, 
                     ppoints=1000) # percentile/split points
sk_dt.fit(X_train, y_train)
custom_dt.train(X_train, y_train)
sk_preds = sk_dt.predict(X_test)
custom_preds = custom_dt.predict(X_test)
# Check if predictions are the same
print("Predictions match rate: "
    f"{round(100*(sk_preds == custom_preds).mean())}%")
Image by author

Predictions match in 100% of cases, which makes our custom decision tree (without lookahead and with a large number of split points) comparable to the scikit-learn implementation.

We can re-run our custom tree with lookahead using the code below.

custom_dt = CustomDT(max_depth=3, 
                     ppoints=10)
custom_dt.train(X_train, 
                y_train,
                # Depth of tree constructed in single step lookahead
                lookahead_depth=3)
custom_preds = custom_dt.predict(X_test)

For our experiment, we will check how a single-step lookahead compares to no lookahead under different combinations of tree depth and split points (reducing split points reduces overfitting).

To simplify our experiment, we will track only two accuracy metrics: the F1 score and ROC AUC.


Experiment results

The experiment was run with the parameters below.

params = {
    "depth": [3, 5, 7, 10],
    "ppoints": [3, 5, 10, 20, 100]
}

Below is the code that was used.

from sklearn.metrics import f1_score, roc_auc_score

results = {
    "depth": [],
    "ppoints": [],
    "lookahead": [],
    "dataset": [],
    "f1": [],
    "roc_auc": []
}
for depth in params["depth"]:
    for ppoints in params["ppoints"]:
        custom_dt = CustomDT(
            max_depth=depth, 
            ppoints=ppoints
        )
        # Decision tree with lookahead
        custom_dt_lk = CustomDT(
            max_depth=depth, 
            ppoints=ppoints
        )

        custom_dt.train(X_train, 
                        y_train
                       )
        custom_dt_lk.train(X_train, 
                           y_train, 
                           lookahead_depth=depth
                          )

        y_pred_train = custom_dt.predict(X_train)
        y_prob_train = custom_dt.predict(X_train, prob=True)
        y_pred_train_lk = custom_dt_lk.predict(X_train)
        y_prob_train_lk = custom_dt_lk.predict(X_train, prob=True)
        y_pred = custom_dt.predict(X_test)
        y_prob = custom_dt.predict(X_test, prob=True)
        y_pred_lk = custom_dt_lk.predict(X_test)
        y_prob_lk = custom_dt_lk.predict(X_test, prob=True)

        # Each (depth, ppoints) combination yields 4 result rows
        for _ in range(4):
            results["depth"].append(depth)
            results["ppoints"].append(ppoints)

        results["dataset"].append("train")
        results["lookahead"].append(False)
        results["f1"].append(
            f1_score(y_train, y_pred_train))
        results["roc_auc"].append(
            roc_auc_score(y_train, y_prob_train))
results["dataset"].append("train")
        results["lookahead"].append(True)
        results["f1"].append(
            f1_score(y_train, y_pred_train_lk))
        results["roc_auc"].append(
            roc_auc_score(y_train, y_prob_train_lk))

        results["dataset"].append("test")
        results["lookahead"].append(False)
        results["f1"].append(
            f1_score(y_test, y_pred))
        results["roc_auc"].append(
            roc_auc_score(y_test, y_prob))

        results["dataset"].append("test")
        results["lookahead"].append(True)
        results["f1"].append(
            f1_score(y_test, y_pred_lk))
        results["roc_auc"].append(
            roc_auc_score(y_test, y_prob_lk))

Finally, let’s take a look at the results. The accuracy metrics below are for predictions on the training dataset.

import pandas as pd

df_results = pd.DataFrame(results)
df_results[df_results.dataset=="train"].sort_values(
    by="roc_auc", ascending=False)
Image by author

When we run the predictions on the training dataset, we naturally get better results with larger max depth. No surprise there, since we are fitting the training data more closely.

There is no strong indication of lookahead pathology. The best set-up, which scores 100% on both F1 and ROC AUC, has max_depth = 10 and 20 split points/bins, with lookahead enabled. Lookahead is present in the top-performing set-up.

Splitting features into 20 percentile bins yields better ROC AUC results than the more granular 100-bin split at all max depths. This could be driven by lookahead pathology, but it is more likely just an artifact of a more favorable split happening at the first few levels of the decision trees, which carries through to better results at larger max depths as well.

More importantly, let’s check the predictions against the test dataset…

df_results = pd.DataFrame(results)
df_results[df_results.dataset=="test"].sort_values(
    by="roc_auc", ascending=False)
Image by author

The best-performing set-up on the ROC AUC score uses lookahead, and 7 of the top 8 set-ups include a lookahead mechanism.

We get similar insights if we sort the decision trees by F1 score.

df_results = pd.DataFrame(results)
df_results[df_results.dataset=="test"].sort_values(
    by="f1", ascending=False)

The top 4 set-ups all have lookahead enabled. The top-performing set-up has a max depth of 7 with 10 bins, rather than the 5 bins of the top set-up by ROC AUC.

The difference between the top set-up with lookahead and the top one without is 2.3 points on the F1 score and 1.5 on ROC AUC, which is quite a significant difference.

The improvement in accuracy from a single-step lookahead seems to increase with deeper trees, which suggests that looking ahead is not causing overfitting.

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

mpl.style.use("fivethirtyeight")
pivoted_results = pd.pivot_table(data=df_results[(
    df_results.dataset == "test")], 
    index=["depth", "ppoints"], 
    columns="lookahead", 
    values="roc_auc")
pivoted_results["Percent Diff"] = (pivoted_results[True] 
    / pivoted_results[False] - 1) * 100
sns.catplot(data=pivoted_results["Percent Diff"].reset_index(), 
            x="depth", y="Percent Diff", aspect=16/9, height=7)
plt.title("Percent Difference in ROC AUC With Lookahead vs. Without")
Image by author

Conclusion

On the typical classification dataset we generated with scikit-learn’s make_classification, decision trees with a single-step lookahead outperform standard decision trees that don’t look ahead. We don’t observe any clear signs of lookahead pathology or overfitting. Of course, this wouldn’t necessarily be the case on all datasets.

Nonetheless, we show that at least for some datasets, lookahead mechanisms can consistently produce significantly more accurate decision trees.

Given the wide popularity of XGBoost and other boosted-tree algorithms, it would be interesting to explore whether adding lookahead mechanisms can yield better results there. That is not feasible to test with my custom implementation, as such a test would require significant computational resources.

