
How to improve the accuracy of a Regression Model

Tips and Tricks to improve model precision

Photo by Marc A on Unsplash

In this post, we will see how to approach a regression problem and how we can increase the accuracy of a machine learning model using concepts such as feature transformation, feature engineering, clustering, boosting algorithms, and so on.

Data Science is an iterative process and only after repeated experiments can we get the best model/solution for our requirement.

Data Science Process Flow – Image by Author

Let’s focus on each of the above phases through an example. I have a health insurance dataset (CSV file) with customer information on insurance charges, age, sex, BMI, etc. We have to predict insurance charges based on these parameters. This is a regression problem, as our target variable – charges/insurance cost – is numeric.

Let’s begin by loading the dataset and exploring its attributes (EDA – exploratory data analysis).

#Imports
import numpy as np
import pandas as pd

#Load csv into a dataframe
df = pd.read_csv('insurance_data.csv')
df.head(3)

#Get the number of rows and columns
print(f'Dataset size: {df.shape}')
>>Dataset size: (1338, 7)
Health Insurance Dataframe – Image by Author

The dataset has 1338 records and 6 features plus the target (charges). Smoker, sex, and region are categorical variables, while age, BMI, and children are numeric.

Handling Null/Missing Values

Let’s examine the proportion of missing values in the dataset:

df.isnull().sum().sort_values(ascending=False)/df.shape[0]
Percentage of Null Values in the columns – Image by Author

Age and BMI have some null values – very few, though. We will handle this missing data and then begin our data analysis. Sklearn’s SimpleImputer allows you to replace missing values with the mean, median, or most frequent value of the respective column. In this example, I am using the median to fill the null values.

#Instantiate SimpleImputer with the median strategy
from sklearn.impute import SimpleImputer

si = SimpleImputer(missing_values=np.nan, strategy='median')
si.fit(df[['age', 'bmi']])

#Fill missing data with the column medians
df[['age', 'bmi']] = si.transform(df[['age', 'bmi']])

Data Visualization

Now that our data is clean, we will analyze it through visualizations and plots. A simple seaborn pairplot can give us a lot of insights!

import seaborn as sns
sns.pairplot(data=df, diag_kind='kde')
Seaborn Pairplot – Image by Author

What do we see?

  1. Charges and children are skewed.
  2. Age shows a positive correlation with Charges.
  3. BMI follows a normal distribution! 😎

Seaborn’s boxplot and countplot can be used to bring out the impact of categorical variables on charges.
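
A minimal sketch of how such plots can be produced (the plot layout is assumed; the exact figures in my notebook may differ):

import matplotlib.pyplot as plt
import seaborn as sns

#Countplots for the categorical features and a boxplot of charges by smoker status
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.countplot(x='sex', data=df, ax=axes[0])
sns.countplot(x='smoker', data=df, ax=axes[1])
sns.boxplot(x='smoker', y='charges', data=df, ax=axes[2])
plt.show()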

Seaborn countplots for categorical variables – Image by Author
Image by Author

Observations based on the above plots:

  1. Males and females are almost equal in number, and the median charges for males and females are about the same, but males have a wider range of charges.
  2. Insurance charges are relatively higher for smokers.
  3. Charges are highest for people with 2–3 children.
  4. Customers are almost equally distributed across the 4 regions, and all of them have almost the same charges.
  5. The percentage of female smokers is lower than the percentage of male smokers.

Thus, we can conclude that ‘smoker’ has a considerable impact on the insurance charges, while gender has the least impact.

Let’s create a heatmap to understand the strength of the correlation between charges and numeric features — age, BMI, and children.

import matplotlib.pyplot as plt

sns.heatmap(df[['age', 'bmi', 'children', 'charges']].corr(), cmap='Blues', annot=True)
plt.show()
Correlation Map – Image by Author

We see that age and BMI have a moderate positive correlation with charges.

We will now go over the steps of model preparation and model development one by one.

  1. Feature Encoding

In this step, we convert the categorical variables – smoker, sex, and region – to a numeric format (0, 1, 2, 3, etc.), as most algorithms cannot handle non-numeric data. This process is called encoding, and there are several ways to do it:

  1. LabelEncoding – Represents categorical values as numbers (for example, a feature such as Region with values Italy, India, USA, UK can be represented as 1, 2, 3, 4).
  2. OrdinalEncoding – Used for representing rank-based categorical values as numbers (for example, representing high, medium, low as 1, 2, 3).
  3. One-hot Encoding – Represents categorical data as binary values – 0s and 1s only. I prefer one-hot encoding over label encoding when the categorical feature has only a few unique values. Here, I have used pandas’ one-hot encoding function (get_dummies) on Region and split it into 4 columns – location_northeast, location_northwest, location_southeast, and location_southwest. One could also use label encoding for this column; however, one-hot encoding gave me a better result. (A short sketch of the first two encoders appears after the code below.)
#One-hot encode the region column
region = pd.get_dummies(df.region, prefix='location')
df = pd.concat([df, region], axis=1)
df.drop(columns='region', inplace=True)

#Binary-encode sex and smoker
df.sex.replace(to_replace=['male', 'female'], value=[1, 0], inplace=True)
df.smoker.replace(to_replace=['yes', 'no'], value=[1, 0], inplace=True)
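
For completeness, here is a tiny illustrative sketch of the first two encoders on a toy dataframe (not the insurance data – in my notebook I used the one-hot route above):

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

toy = pd.DataFrame({'region': ['Italy', 'India', 'USA', 'UK'],
                    'risk': ['low', 'high', 'medium', 'low']})

#Label encoding: each unique value gets an arbitrary integer
toy['region_encoded'] = LabelEncoder().fit_transform(toy['region'])

#Ordinal encoding: we control the order low < medium < high
oe = OrdinalEncoder(categories=[['low', 'medium', 'high']])
toy['risk_encoded'] = oe.fit_transform(toy[['risk']]).ravel()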

2. Feature Selection and Scaling

Next, we will select the features that affect ‘charges’ the most. I have selected all the features except gender, as its effect on ‘charges’ is minimal (concluded from the charts above). These features will form our ‘X’ variable, while charges will be our ‘y’ variable. If there are many features, I suggest using scikit-learn’s SelectKBest to narrow them down to the top ones, as sketched below.
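
A minimal SelectKBest sketch, purely illustrative (I did not need it for this small dataset; f_regression scoring is assumed):

from sklearn.feature_selection import SelectKBest, f_regression

#Score every feature against the target and keep the 5 strongest
selector = SelectKBest(score_func=f_regression, k=5)
selector.fit(df.drop(columns='charges'), df['charges'])
print(df.drop(columns='charges').columns[selector.get_support()])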

#Feature selection
from sklearn.model_selection import train_test_split

y = df.charges.values
X = df[['age', 'bmi', 'smoker', 'children', 'location_northeast', 'location_northwest', 'location_southeast', 'location_southwest']]

#Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Once we have selected our features, we need to ‘standardize’ the numeric ones — age, BMI, and children. Standardization rescales each feature to zero mean and unit variance so that all features lie on a comparable scale and none overpowers the others. I have used StandardScaler here.

#Scale numeric features using sklearn's StandardScaler
from sklearn.preprocessing import StandardScaler

numeric = ['age', 'bmi', 'children']
sc = StandardScaler()
X_train[numeric] = sc.fit_transform(X_train[numeric])
X_test[numeric] = sc.transform(X_test[numeric])

Now, we are all set to create our first baseline models 😀. We will try Linear Regression and a Decision Tree to predict insurance charges; a sketch of both is shown below.
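
A sketch of these two baseline models (the exact code in my notebook may differ slightly):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

for name, model in [('LinearRegression', LinearRegression()),
                    ('DecisionTree', DecisionTreeRegressor(random_state=42))]:
    model.fit(X_train, y_train)
    yhat = model.predict(X_test)
    print(name,
          round(r2_score(y_test, yhat), 3),
          round(mean_absolute_error(y_test, yhat), 2),
          round(np.sqrt(mean_squared_error(y_test, yhat)), 2))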

Model scores – Image by Author

Mean absolute error (MAE) and root-mean-square error (RMSE) are the metrics used to evaluate regression models. You can read more about them here. Our baseline models give an R² score of more than 76%. Of the two, the Decision Tree gives a better MAE of 2780. Not bad!

Let’s see how can we make our model better.

3A. Feature Engineering

We can improve our model score by manipulating some of the features in the dataset. After a couple of trials, I found that the following items improve accuracy:

  1. Grouping similar customers into clusters using KMeans.
  2. Clubbing the northeast and northwest regions into ‘north’ and the southeast and southwest regions into ‘south’ in the Region column.
  3. Transforming ‘children’ into a categorical feature called ‘more_than_1_child’, which is 1 if the number of children is greater than 1.
from sklearn.cluster import KMeans

#Group similar customers into clusters with KMeans
features = ['age', 'bmi', 'smoker', 'children', 'location_northeast', 'location_northwest', 'location_southeast', 'location_southwest']
kmeans = KMeans(n_clusters=2)
kmeans.fit(df[features])
df['cust_type'] = kmeans.predict(df[features])

#Helpers to club the region dummies into north/south flags (assumed implementation)
def get_north(ne, nw): return 1 if (ne == 1 or nw == 1) else 0
def get_south(sw, se): return 1 if (sw == 1 or se == 1) else 0

df['location_north'] = df.apply(lambda x: get_north(x['location_northeast'], x['location_northwest']), axis=1)
df['location_south'] = df.apply(lambda x: get_south(x['location_southwest'], x['location_southeast']), axis=1)
df['more_than_1_child'] = df.children.apply(lambda x: 1 if x > 1 else 0)
All features – Image by Author

3B. Feature Transformation

From our EDA, we know that the distribution of ‘charges’ (y) is highly skewed, so we will wrap the model in scikit-learn’s TransformedTargetRegressor with a QuantileTransformer to normalize the target.

from sklearn.tree import DecisionTreeRegressor
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import QuantileTransformer
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

X = df[['age', 'bmi', 'smoker', 'more_than_1_child', 'cust_type', 'location_north', 'location_south']]

#Split test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Decision tree with a quantile-transformed target
model = DecisionTreeRegressor()
regr_trans = TransformedTargetRegressor(regressor=model, transformer=QuantileTransformer(output_distribution='normal'))
regr_trans.fit(X_train, y_train)
yhat = regr_trans.predict(X_test)
round(r2_score(y_test, yhat), 3), round(mean_absolute_error(y_test, yhat), 2), round(np.sqrt(mean_squared_error(y_test, yhat)), 2)
>>0.843, 2189.28, 4931.96

Woah… a whopping 84%… and the MAE has dropped to 2189!

4. Use of Ensemble and Boosting Algorithms

Now we will use these features on ensemble-based RandomForest, GradientBoosting, LightGBM, and XGBoost. If you are a beginner and not aware of boosting and bagging methods, you can read more about them here.

from sklearn.ensemble import RandomForestRegressor

X = df[['age', 'bmi', 'smoker', 'more_than_1_child', 'cust_type', 'location_north', 'location_south']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor()
#Transform the target variable through a quantile transformer
ttr = TransformedTargetRegressor(regressor=model, transformer=QuantileTransformer(output_distribution='normal'))
ttr.fit(X_train, y_train)
yhat = ttr.predict(X_test)
r2_score(y_test, yhat), mean_absolute_error(y_test, yhat), np.sqrt(mean_squared_error(y_test, yhat))
>>0.8802, 2078, 4312

Yes! Our RandomForest model performs well – an MAE of 2078 👍. Now, we will try some boosting algorithms: Gradient Boosting, LightGBM, and XGBoost (a sketch follows below).
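
A hedged sketch of how these boosting models can be run with the same target-transformer wrapper, reusing the imports and the train/test split from the cells above (the lightgbm and xgboost packages are assumed to be installed):

from sklearn.ensemble import GradientBoostingRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

for name, model in [('GradientBoosting', GradientBoostingRegressor()),
                    ('LightGBM', LGBMRegressor()),
                    ('XGBoost', XGBRegressor())]:
    ttr = TransformedTargetRegressor(regressor=model, transformer=QuantileTransformer(output_distribution='normal'))
    ttr.fit(X_train, y_train)
    yhat = ttr.predict(X_test)
    print(name,
          round(r2_score(y_test, yhat), 3),
          round(mean_absolute_error(y_test, yhat), 2),
          round(np.sqrt(mean_squared_error(y_test, yhat)), 2))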

Model Score – Image by Author

All of them seem to perform well:)

5. Hyperparameter Tuning

Let’s tweak some of the algorithm parameters, such as tree depth, number of estimators, and learning rate, and check the model accuracy. Manually trying out different combinations of parameter values is very time-consuming. Scikit-learn’s GridSearchCV automates this process by searching over a grid of candidate values and picking the best combination. I have applied GridSearch to the above 3 algorithms. Below is a sketch for XGBoost:
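
A minimal GridSearchCV sketch for XGBoost – the grid values below are illustrative, not the exact ones from my notebook:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {'n_estimators': [100, 300, 500],
              'max_depth': [2, 3, 4],
              'learning_rate': [0.01, 0.1, 0.2]}

grid = GridSearchCV(estimator=XGBRegressor(), param_grid=param_grid, scoring='neg_mean_absolute_error', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)

The same pattern works for RandomForest, Gradient Boosting, and LightGBM by swapping the estimator and the parameter names.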

Best values for parameters in GridSearchCV – Image by Author

Once we have optimum values for our parameters, we will run all 3 models again with these values.

Model Score – Image by Author

This looks much better! We have been able to improve our accuracy – XGBoost gives a score of 88.6% with relatively fewer errors 👏👏

1. Distribution plot of predicted and actual values of charges; 2. Residual plot – Image by Author

The distribution and residual plots confirm that there is a good overlap between the predicted and actual charges. However, a handful of predictions are far off from the actual values, and these inflate our RMSE. This can be reduced by increasing our data points, i.e., by collecting more data.
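
A sketch of how these two diagnostic plots can be produced (reusing y_test and yhat from the cells above; the exact styling in my notebook may differ):

#Distribution of actual vs predicted charges
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.kdeplot(y_test, label='actual', ax=axes[0])
sns.kdeplot(yhat, label='predicted', ax=axes[0])
axes[0].legend()

#Residual plot: prediction error vs predicted value
axes[1].scatter(yhat, y_test - yhat, alpha=0.5)
axes[1].axhline(0, color='red')
axes[1].set_xlabel('predicted charges')
axes[1].set_ylabel('residual')
plt.show()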

We are now ready to deploy this model into production and test it on unknown data. Well done👍

In a nutshell, the points that improved the accuracy of my model:

  1. Creating simple new features
  2. Transforming target variable
  3. Clustering common data points
  4. Use of boosting algorithms
  5. Hyperparameter tuning

You can access my notebook here. Not all of these techniques will work for your model every time – pick and choose the ones that work best for your scenario :)

