Let’s say you need to work on a regression machine learning project. You analyze your data, do some data cleaning, create a few dummy variables, and now it’s time to run a machine learning regression model. What are the top ten models that come to mind? Most of you probably can’t even name ten regression models. Don’t worry if you can’t, because by the end of this article you will be able to run not only ten machine learning regression models but over 40!
A few weeks ago, I wrote the How to Run 30 Machine Learning Models with a Few Lines of Code blog, and the reception was very positive. In fact, it’s my most popular blog so far. In that blog, I created a classification project to try Lazy Predict. Today, I will test Lazy Predict on a regression project. To do so, I will use the classic Seattle house price dataset, which you can find on Kaggle.
What is Lazy Predict?
Lazy Predict helps you build dozens of models with very little code and understand which models work better without any parameter tuning. The best way to show how it works is with a short project, so let’s start.
Regression Project with Lazy Predict
First of all, to install Lazy Predict, you can copy and paste pip install lazypredict into your terminal. It’s that simple. Now, let’s import some libraries we will use in this project. You can find the complete notebook here.
# Importing important libraries
import pyforest
from lazypredict.Supervised import LazyRegressor
from pandas.plotting import scatter_matrix
# Scikit-learn packages
# train_test_split is used later; importing it explicitly in case pyforest doesn't cover it
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn import metrics
from sklearn.metrics import mean_squared_error
# Hide warnings
import warnings
warnings.filterwarnings("ignore")
# Setting up max columns displayed to 100
pd.options.display.max_columns = 100
You can see that I imported pyforest instead of Pandas and NumPy. pyforest imports all the important data science libraries to the notebook with a single line. I wrote a blog about it, and you can find it here. Now, let’s import the dataset.
# Import dataset
df = pd.read_csv('../data/kc_house_data_train.csv', index_col=0)
Let’s see what the dataset looks like.
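In the notebook, this is presumably just a quick df.head() call (a minimal sketch; the original output isn’t reproduced here):
# Peek at the first five rows of the dataset
df.head()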
Ok, now let’s check the data types.
# Checking datatypes and null values
df.info()
Now, a few things caught my attention. The first one is that the id column doesn’t have any relevance to this short project. However, if you want to dive deeper into the project, you should check it for duplicates. Also, the date column is an object type; we should change it to DateTime. The zipcode, lat, and long columns probably have little or no correlation with the price as they are. However, since the objective of this project is to demonstrate lazy predict, I will keep them.
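If you do want those two tweaks, a minimal sketch would look like this (assuming the id and date column names shown by df.info()):
# Check for duplicate listings by id
print(df['id'].duplicated().sum())
# Convert the date column from object to DateTime
df['date'] = pd.to_datetime(df['date'])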
Let’s now check some statistics and see if we can find anything that we should change before running our first models.
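In the notebook, this is a one-liner (presumably df.describe(); the output table isn’t reproduced here):
# Summary statistics for the numeric columns
df.describe()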
All right, I can see some interesting stuff. First, there is a house with 33 bedrooms. That can’t be right, so I looked the house up online using its id, and it actually has three bedrooms. You can find the house here. Also, it looks like there are houses with 0 bathrooms. I will give those houses at least one bathroom, and then we should be done with the data cleaning.
# Fixing the house with 33 bedrooms (it actually has 3)
df.loc[df['bedrooms'] == 33, 'bedrooms'] = 3
# This will add 1 bathroom to houses without any bathroom
df['bathrooms'] = df.bathrooms.apply(lambda x: 1 if x < 1 else x)
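A quick sanity check that both fixes took effect (my addition, not from the original notebook):
# Both counts should now be 0
print((df['bedrooms'] == 33).sum())
print((df['bathrooms'] < 1).sum())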
Train Test Split
Now we are ready for the train test split, but first, let’s make sure we don’t have nan or infinite values with this code:
# Removing nan and infinite values
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace=True)
Let’s split the dataset into X and y variables. I will assign 75% of the dataset to the train set and 25% to the test set.
# Creating train test split
X = df.drop(columns=['price'])
y = df.price
# Call train_test_split on the data and capture the results
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3, test_size=0.25)
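If you want to confirm the 75/25 split, a quick optional check:
# Train should have roughly three times as many rows as test
print(X_train.shape, X_test.shape)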
Time for some fun! The following code will run over 40 models and show the R-Squared and RMSE for each model. Ready, set, go…
reg = LazyRegressor(ignore_warnings=False, custom_metric=None)
models, predictions = reg.fit(X_train, X_test, y_train, y_test)
print(models)
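Since models comes back as a pandas DataFrame sorted by performance, you can also peek at just the best performers (a follow-up of my own, not required by Lazy Predict):
# Show only the top five models
print(models.head())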
Wow! These results are great for the amount of work they required. Those are great R-squared and RMSE values for vanilla models. As we can see, we ran 41 vanilla models, got the metrics we needed, and we can see the time spent on each model. Not bad at all. Now, how can we be sure that these results are correct? We can run one model ourselves and check whether its results are anywhere close to what we got. Shall we test the Histogram-based Gradient Boosting Regression Tree? If you’ve never heard of this algorithm, don’t worry, because I had never heard of it either. You can find an article about it here.
Double-Checking Results
First, let’s import this model using scikit-learn.
# Explicitly require this experimental feature
# (only needed on older scikit-learn versions; the estimator is stable in recent releases)
from sklearn.experimental import enable_hist_gradient_boosting
# Now you can import normally from ensemble
from sklearn.ensemble import HistGradientBoostingRegressor
Also, let’s create two helper functions: one to check the model metrics and one to plot the residuals.
# Evaluation function
def rmse(model, X_test, y_test, y_pred):
    r_squared = model.score(X_test, y_test)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    print('R-squared: ' + str(r_squared))
    print('Root Mean Squared Error: ' + str(rmse))

# Create a residual scatter plot
def scatter_plot(y_test, y_pred, model_name):
    plt.figure(figsize=(10, 6))
    sns.residplot(x=y_test, y=y_pred, lowess=True, color='#4682b4',
                  line_kws={'lw': 2, 'color': 'r'})
    plt.title('Price vs Residuals for ' + model_name)
    plt.xlabel('Price', fontsize=16)
    plt.xticks(fontsize=13)
    plt.yticks(fontsize=13)
    plt.show()
Finally, let’s run the model and check the results.
# Histogram-based Gradient Boosting Regression Tree
hist = HistGradientBoostingRegressor()
hist.fit(X_train, y_train)
y_pred = hist.predict(X_test)
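To compare the numbers, we can call the two helper functions we just defined (the original post shows the resulting metrics rather than this call):
# Check the metrics and plot the residuals
rmse(hist, X_test, y_test, y_pred)
scatter_plot(y_test, y_pred, 'Histogram-based Gradient Boosting Regression Tree')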
Voilà! The results were very close to what we got using Lazy Predict. It seems like it really works.
Final Thoughts
Lazy Predict is a fantastic library: easy to use, fast, and it runs vanilla models with very few lines of code. Instead of manually setting up multiple vanilla models, you can run them all with two or three lines of code. Please keep in mind that you should not treat the results as final models, and you should always double-check them to make sure that the library is working correctly. As I mentioned in other blogs, Data Science is a complex field, and Lazy Predict cannot substitute for the expertise of a professional who will optimize the models. Please let me know how it worked for you and whether you have any questions.