Getting More From Regression Models with RuleFit

Interpretable Machine Learning That Isn’t OLS

Casey Whorton
Towards Data Science



Contents

  • Introduction
  • California Housing Dataset Example
  • Conclusion
  • References

Introduction

The interpretable side of machine learning has always been interesting to me. I think it is important to be able to plainly state (to some degree) what a model is doing. Some of the most explainable machine learning models are also the weakest in terms of accuracy, so we are often forced to trade some accuracy for explainability.

This article focuses on the RuleFit algorithm, used from Python, to predict a continuous target variable. This topic has been touched on by other authors, and while you can find methods to explain predictions made by more black-box algorithms, those examples usually cover classification problems. I feel the regression side deserves much more attention.

Here, you will see how this algorithm can transform the n-dimensional space of input features into smaller subsets of the space that have an explainable effect on the target variable. The subsets and their effect are called rules. These discovered rules can look like what you find during exploratory data analysis, so the rules themselves are business insights in some cases. You will also see how easy it is to use this algorithm in Python.

(If you want to use the Python package for the RuleFit algorithm in your own project, install it using git; use the link in the references section. Or fork the notebook on Kaggle.)

Example (Sine Function)

A function with one feature (X) whose mapping to the target, or dependent, variable (y) looks like the one shown here will not be explained well by an OLS linear model. The RuleFit algorithm searches X for subsets and returns rules for what the target will be in each. In this case, I set the algorithm to find a maximum of 250 rules. You can see that for some subsets the algorithm returns a flat value, and in others a sloped line. We'll see more of that shortly.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from rulefit import RuleFit

X = np.arange(100).reshape(-1, 1)

# Sine function as the target
y = [np.sin(x/8) for x in np.arange(100)]

# Fit a linear model using X as the independent variable
lreg = LinearRegression().fit(X, y)

# Record r2 score from the linear model
lreg_score = np.round(lreg.score(X, y), 2)

# Fit a RuleFit model using X as the independent variable and 250 rules max
rf = RuleFit(max_rules=250, random_state=1)
rf.fit(X, y, feature_names=['x'])
rules = rf.get_rules()

# Record r2 score from the RuleFit model (true values first, predictions second)
rf_score = np.round(r2_score(y, rf.predict(X)), 2)
Sine function over X with fitted linear regression model and fitted RuleFit Regressor model. Source: https://github.com/caseywhorton/interpretable-regression-example
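The rules DataFrame returned by get_rules() is the main interpretive output. Here is a minimal sketch of how to inspect it, keeping only rules that survived the model's LASSO selection step (non-zero coefficient) and ranking them by importance:

# Keep rules with a non-zero coefficient and rank them by importance
active_rules = rules[rules.coef != 0].sort_values('importance', ascending=False)
print(active_rules.head(10))

# Compare the two models' fit on the training data
print('Linear regression r2:', lreg_score)
print('RuleFit r2:', rf_score)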

Easily Explainable Function

If you work as a Data Scientist in a business setting, you know that sometimes you come across easily explainable linear relationships. Sometimes. What happens more often is that there is some caveat your partners are aware of that slightly changes the relationship. One type of caveat I find is an expectation for a dramatic increase, or 'step', in a dependent variable due to a change in an independent variable. Think about price and demand: once the price decreases past a certain threshold, you expect demand to jump up to another bracket rather than increase gradually.

Here is an example of a function that is explained well by an OLS linear model. But what the linear model will miss is the fact that there is a step at about X = 50. The RuleFit model uses 4 rules to explain what is happening to y over X. Below is a graphical representation and the table of discovered rules.

Left: How individual rule components make up the total RuleFit Model. Right: The fitted RuleFit model over the original function. Source: https://github.com/caseywhorton/interpretable-regression-example
   rule                     type     coef        support   importance
0  X                        linear    0.426891   1.00      12.259688
2  X > 49.5                 rule     14.317238   0.44       7.106890
1  X <= 49.5 & X > 15.0     rule     -0.000000   0.34       0.000000
3  X <= 49.5 & X <= 15.0    rule     -5.262510   0.22       2.179975

Above are the fitted rules from the model. Overall, we expect a linear model, and the first rule suggests that there is a linear relationship over the entire domain of X (support = 1.00) with a high importance. The overall coefficient is about 0.43, but this rule alone does not explain everything. The remaining rules show that for values above 49.5 there is a step increase of 14.3, and for values at or below 15 there is a step decrease of 5.26. This is in addition to the linear rule for all X.

This is a fitted model with only the 4 rules shown; the algorithm can be parameterized for better accuracy.
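For reference, here is a minimal sketch of how a step-like relationship such as this could be generated and fit. The exact data-generating function behind the figure is not reproduced here, so the trend slope and step size below are illustrative assumptions:

import numpy as np
from rulefit import RuleFit

# Illustrative data: a linear trend with a jump ('step') after X = 50
X = np.arange(100).reshape(-1, 1)
y = 0.5 * np.arange(100) + 15.0 * (np.arange(100) > 49.5)

# Fit RuleFit with a small rule budget and show the retained rules
step_model = RuleFit(max_rules=10, random_state=1)
step_model.fit(X, y, feature_names=['X'])
step_rules = step_model.get_rules()
print(step_rules[step_rules.coef != 0].sort_values('importance', ascending=False))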

California Housing Dataset Example

In the original paper for the algorithm (see the link in the references) you can see how it works on the Boston Housing dataset, a classic 'toy' dataset. The California housing dataset is similar, so we will see how the algorithm works on the other side of the country. Both datasets can be found in scikit-learn.

The target for the dataset is the median value of homes in a census block, with several features about the homes such as the average number of rooms, latitude, longitude, and average house occupancy. This is a regression problem, so we will see how we can tackle it using the RuleFit package in Python.

Model Fitting & Evaluation

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from rulefit import RuleFit

X, y = fetch_california_housing(return_X_y=True)
features = fetch_california_housing(return_X_y=False)['feature_names']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

model_dict = {
    'Gradient Boosted Regressor': GradientBoostingRegressor(),
    'Random Forest Regressor': RandomForestRegressor(),
    'Decision Tree Regressor': DecisionTreeRegressor(),
    'Linear Regression': LinearRegression(),
    'RuleFit Regressor': RuleFit(random_state=101, max_rules=500)
}

test_results = {}

for model in model_dict.keys():
    # RuleFit accepts feature names so the discovered rules are readable
    if model == 'RuleFit Regressor':
        model_dict[model].fit(X_train, y_train, feature_names=features)
    else:
        model_dict[model].fit(X_train, y_train)
    # r2_score expects the true values first, then the predictions
    r2 = r2_score(y_test, model_dict[model].predict(X_test))
    test_results.update({model: np.round(r2, 2)})

The results show that the RuleFit with 500 rules has a similar accuracy to other common out-of-the-box machine learning algorithms on this dataset. (We’ll see later that this max_rules parameter should probably be decreased.)

Results:

Model                         r2_score
Random Forest Regressor       0.74
Gradient Boosted Regressor    0.70
RuleFit Regressor             0.65
Decision Tree Regressor       0.63
Linear Regression             0.31

Discovered Rules over Subsets

Possibly what is most insightful about the algorithm is that we can examine how discovered rules apply over subsets of the data. We can filter our samples of data in any way we want and see how the algorithm ranks the importance of our rules. Similar to the original paper, and important to the problem, let’s take a look at some rules that apply to the top and bottom 10% of valued homes in California.
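The figures below come from the linked notebook. One rough way to reproduce the idea is sketched here: take the homes in the top (or bottom) 10% of the target and check what fraction of that subset each discovered rule applies to. The rule_mask helper and the support_top_10 column are my own illustrative names, and the sketch assumes the rule strings are simple conjunctions of comparisons on the same feature names passed to fit:

import numpy as np
import pandas as pd

# Put the features into a DataFrame with the names used at fit time
X_df = pd.DataFrame(X, columns=features)

def rule_mask(df, rule):
    # Evaluate a conjunctive rule string such as
    # 'medinc <= 6.02 & latitude <= 37.81' one condition at a time
    mask = pd.Series(True, index=df.index)
    for condition in rule.split(' & '):
        mask &= df.eval(condition)
    return mask

# Homes in the top 10% of median value
top_10 = X_df[y >= np.quantile(y, 0.9)]

rules = model_dict['RuleFit Regressor'].get_rules()
rules = rules[(rules.type == 'rule') & (rules.coef != 0)].copy()

# Fraction of the top-10% subset that satisfies each rule
rules['support_top_10'] = [rule_mask(top_10, r).mean() for r in rules['rule']]
print(rules.sort_values('importance', ascending=False).head(10))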

Top 10%

For California homes in the top 10%, it looks like location and income dominate, mostly as linear rules.

Examples of discovered rules from the RuleFit algorithm on California Dataset that apply to the top 10% of valued homes. Source: https://github.com/caseywhorton/interpretable-regression-example

Here is an interesting rule that incorporates more than just location:

'medinc <= 6.016050100326538 & aveoccup <= 3.0983457565307617 & latitude <= 37.81500053405762 & longitude <= -122.28499984741211'

An increase of 0.47 is added to the prediction for this very small subset of the top 10% of homes in California. A rule as specific as this would take much longer to discover using traditional EDA, visualization, and dashboarding. (This took a few seconds to fit and return.)

Bottom 10%

For the bottom 10%, location seems to be everything. A few more rules involving median income show up at the bottom of the rule set. There are definitely some differences in which rules apply to these two sets of home values in California.

Examples of discovered rules from the RuleFit algorithm on California Dataset that apply to the bottom 10% of valued homes. Source: https://github.com/caseywhorton/interpretable-regression-example

Another interesting rule that shows up in the bottom 10% of valued homes has more to do with income and a high number of rooms. It would be interesting to hear a California realtor's perspective on why this was the case when this data was recorded.

'medinc <= 4.5358500480651855 & medinc <= 2.985849976539612 & averooms > 4.3484015464782715'

Feature Importance

We all like to see what features are important in our model. Part of the attractiveness of algorithms like the Random Forest regression algorithm is the ability to quickly show people what features come up often as important in the model. It’s not as direct as showing a causal relationship, but it does go a long way to add interpretation to something that is otherwise a black-box.

Feature importance on entire dataset for California housing values data. Source: https://github.com/caseywhorton/interpretable-regression-example

Above are the important features from the RuleFit algorithm over the entire dataset. Like the discovered rules, feature importance can be computed on subsets of the data to see which features have the most impact on whichever slice you want to examine.
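The chart above comes from the linked notebook. As a rough approximation of the input-variable importance defined in the original paper, each rule's importance can be split among the features it mentions, with linear terms attributed directly to their feature. This is only an illustrative sketch, not the package's own calculation:

from collections import defaultdict

import pandas as pd

rules = model_dict['RuleFit Regressor'].get_rules()
rules = rules[rules.coef != 0]

feature_importance = defaultdict(float)
for _, row in rules.iterrows():
    if row['type'] == 'linear':
        # A linear term's importance belongs entirely to its feature
        feature_importance[row['rule']] += row['importance']
    else:
        # Split a rule's importance evenly across the features it mentions
        mentioned = [f for f in features if f in row['rule']]
        if mentioned:
            for f in mentioned:
                feature_importance[f] += row['importance'] / len(mentioned)

print(pd.Series(feature_importance).sort_values(ascending=False))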

Risk of Overfitting

Having models that are easily interpreted usually means a decrease in accuracy compared to other algorithms. We saw that the RuleFit algorithm can return accuracy similar to other algorithms, but machine learning best practices still need to be followed. The number of rules allowed in the model can serve as a measure of model complexity, and it is controlled by the max_rules parameter. Increasing the maximum number of rules and comparing train and test accuracy (r2) shows a common case of overfitting.

Source: https://github.com/caseywhorton/interpretable-regression-example
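Here is a minimal sketch of the experiment behind the figure: refit RuleFit with an increasing max_rules budget and compare train and test r2. The specific grid of values is an assumption; the notebook may use a different one.

import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from rulefit import RuleFit

overfit_results = []
for n_rules in [50, 100, 250, 500, 1000, 2000]:
    rf_model = RuleFit(max_rules=n_rules, random_state=101)
    rf_model.fit(X_train, y_train, feature_names=features)
    overfit_results.append({
        'max_rules': n_rules,
        'train_r2': np.round(r2_score(y_train, rf_model.predict(X_train)), 2),
        'test_r2': np.round(r2_score(y_test, rf_model.predict(X_test)), 2),
    })

print(pd.DataFrame(overfit_results))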

Conclusion

The RuleFit algorithm has the potential to be applied to many business problems. The discovered rules can be treated as insights similar to those returned from exploratory data analysis, and we saw how to interpret them in the model. Seeing how important rules are to different subsets of the data is, by itself, a valuable tool for understanding your data. Machine learning best practices still need to be followed: while the RuleFit algorithm shows good accuracy on the examples in this article, it wasn't the best, and there is still a risk of overfitting if you aren't careful.

Thank you for reading! Please see the links in the references section to learn more about the algorithm, the Python package, and the Jupyter notebook I used to create this article. Feel free to clap, comment, or add me on LinkedIn.
