
In this post, we will see how to approach a regression problem and how to increase the accuracy of a machine learning model using concepts such as feature transformation, feature engineering, clustering, and boosting algorithms.
Data science is an iterative process, and only after repeated experiments do we arrive at the best model/solution for our requirement.

Let’s walk through these steps with an example. I have a health insurance dataset (CSV file) with customer information on insurance charges, age, sex, BMI, etc. We have to predict insurance charges based on these parameters in the dataset. This is a regression problem, as our target variable – charges/insurance cost – is numeric.
Let’s begin by loading the dataset and exploring the attributes (EDA – Exploratory Data Analysis).
#Load csv into a dataframe
import pandas as pd
df = pd.read_csv('insurance_data.csv')
df.head(3)
#Get the number of rows and columns
print(f'Dataset size: {df.shape}')
>> Dataset size: (1338, 7)

The dataset has 1338 records and 7 columns: 6 features plus the target, charges. Smoker, sex, and region are categorical variables, while age, BMI, and children are numeric.
Handling Null/Missing Values
Let’s examine the proportion of missing values in the dataset:
df.isnull().sum().sort_values(ascending=False)/df.shape[0]

Age and BMI have some null values – very few though. We will handle this missing data and then begin our data analysis. Sklearn’s SimpleImputer allows you to replace missing values based on mean/median/most frequent values in the respective columns. In this example, I am using the median to fill null values.
#Instantiate SimpleImputer
import numpy as np
from sklearn.impute import SimpleImputer
si = SimpleImputer(missing_values=np.nan, strategy='median')
si.fit(df[['age', 'bmi']])
#Fill missing values with the column medians
df[['age', 'bmi']] = si.transform(df[['age', 'bmi']])
Data Visualization
Now that our data is clean, let's analyze it through visualizations and plots. A simple seaborn pairplot can give us a lot of insights!
import seaborn as sns
sns.pairplot(data=df, diag_kind='kde')

What do we see?
- Charges and children are skewed.
- Age shows a positive correlation with Charges.
- BMI follows a normal distribution! 😎
Seaborn’s boxplot and countplot can be used to bring out the impact of categorical variables on charges.
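For instance, a minimal sketch of such plots (using the column names in this dataset) might look like this:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
#Spread of charges for smokers vs non-smokers
sns.boxplot(x='smoker', y='charges', data=df, ax=axes[0])
#Spread of charges by sex
sns.boxplot(x='sex', y='charges', data=df, ax=axes[1])
#Number of customers in each region
sns.countplot(x='region', data=df, ax=axes[2])
plt.show()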


Observations based on the above plots:
- Males and females are almost equal in number, and the median charges for both are about the same, but males have a wider range of charges.
- Insurance charges are relatively higher for smokers.
- Charges are highest for people with 2–3 children
- Customers are almost equally distributed across the 4 regions, and all of them have almost the same charges.
- Percentage of female smokers is less than the percentage of male smokers.
Thus, we can conclude that ‘smoker’ has a considerable impact on the insurance charges, while gender has the least impact.
Let’s create a heatmap to understand the strength of the correlation between charges and numeric features — age, BMI, and children.
sns.heatmap(df[['age', 'bmi', 'children', 'charges']].corr(), cmap='Blues', annot=True)
plt.show()

We see that age and BMI have a moderate positive correlation with charges.
We will now go over the steps of model preparation and model development one by one.
1. Feature Encoding
In this step, we convert categorical variables – smoker, sex, and region – to a numeric format (0, 1, 2, 3, etc.), as most algorithms cannot handle non-numeric data. This process is called encoding, and there are many ways to do it:
- LabelEncoding – Represent categorical values as numbers (For example, a feature such as Region with values Italy, India, USA, UK can be represented as 1, 2, 3, 4)
- OrdinalEncoding – Used for representing rank-based categorical data values as numbers. (For example representing high, medium, low as 1,2,3)
- One-hot Encoding – Represent categorical data as binary values (0s and 1s only). I prefer one-hot encoding over label encoding when the categorical feature has few unique values. Here, I have used pandas’ one-hot encoding function (get_dummies) on Region and split it into 4 columns – location_northeast, location_northwest, location_southeast, and location_southwest. One could also use label encoding for this column; however, one-hot encoding gave me a better result.
#One hot encoding of region
region = pd.get_dummies(df.region, prefix='location')
df = pd.concat([df, region], axis=1)
df.drop(columns='region', inplace=True)
#Binary encoding of sex and smoker
df.sex.replace(to_replace=['male', 'female'], value=[1, 0], inplace=True)
df.smoker.replace(to_replace=['yes', 'no'], value=[1, 0], inplace=True)
2. Feature Selection and Scaling
Next, we will select the features that affect ‘charges’ the most. I have selected all the features except gender, as its effect on ‘charges’ is minimal (concluded from the charts above). These features will form our ‘X’ variable, while charges will be our ‘y’ variable. If there are many features, I suggest using scikit-learn’s SelectKBest for feature selection to arrive at the top features.
#Feature Selection
y=df.charges.values
X=df[['age', 'bmi', 'smoker', 'children', 'location_northeast', 'location_northwest', 'location_southeast', 'location_southwest']]
#Split data into test and train
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Once we have selected our features, we need to ‘standardize’ the numeric ones — age, BMI, and children. Standardization rescales each feature to zero mean and unit variance so that all features lie on a comparable scale and no single one overpowers the others. I have used StandardScaler here.
#Scaling numeric features using sklearn StandardScaler
from sklearn.preprocessing import StandardScaler
numeric = ['age', 'bmi', 'children']
sc = StandardScaler()
X_train[numeric] = sc.fit_transform(X_train[numeric])
X_test[numeric] = sc.transform(X_test[numeric])
Now, we are all set to create our first basic model 😀. We will try Linear Regression and Decision Trees to predict insurance charges.
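A minimal sketch of what these baseline models can look like (the exact settings in my notebook may differ; random_state here is just an assumption):
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np
for name, model in [('LinearRegression', LinearRegression()), ('DecisionTree', DecisionTreeRegressor(random_state=42))]:
    model.fit(X_train, y_train)
    yhat = model.predict(X_test)
    #Report R2, MAE and RMSE for each baseline model
    print(name, round(r2_score(y_test, yhat), 3), round(mean_absolute_error(y_test, yhat), 2), round(np.sqrt(mean_squared_error(y_test, yhat)), 2))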

Mean absolute error (MAE) and root-mean-square error (RMSE) are the metrics used to evaluate regression models. You can read more about them here. Our baseline models give an R² score of more than 76%. Between the two, the Decision Tree gives a better MAE of 2780. Not bad!
Let’s see how we can make our model better.
3. Feature Engineering
We can improve our model score by manipulating some of the features in the dataset. After a couple of trials, I found that the following items improve accuracy:
- Grouping similar customers into clusters using KMeans.
- Clubbing northeast and northwest regions into ‘north’ and southeast and southwest into ‘south’ in Region column.
- Transforming ‘children’ into a categorical feature called ‘more_than_1_child’, which is 1 if the number of children is greater than 1.
from sklearn.cluster import KMeans
#Group similar customers into 2 clusters with KMeans
features=['age', 'bmi', 'smoker', 'children', 'location_northeast', 'location_northwest', 'location_southeast', 'location_southwest']
kmeans = KMeans(n_clusters=2)
kmeans.fit(df[features])
df['cust_type'] = kmeans.predict(df[features])
#Helper functions (assumed definitions): a region flag is 1 if either of its one-hot columns is 1
def get_north(ne, nw):
    return 1 if (ne == 1 or nw == 1) else 0
def get_south(sw, se):
    return 1 if (sw == 1 or se == 1) else 0
df['location_north'] = df.apply(lambda x: get_north(x['location_northeast'], x['location_northwest']), axis=1)
df['location_south'] = df.apply(lambda x: get_south(x['location_southwest'], x['location_southeast']), axis=1)
#Flag customers with more than one child
df['more_than_1_child'] = df.children.apply(lambda x: 1 if x > 1 else 0)

From our EDA, we know that the distribution of ‘charges’ (y) is highly skewed, so we will wrap our model in scikit-learn’s TransformedTargetRegressor with a QuantileTransformer to normalize the target.
X = df[['age', 'bmi', 'smoker', 'more_than_1_child', 'cust_type', 'location_north', 'location_south']]
#Split test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import QuantileTransformer
model = DecisionTreeRegressor()
regr_trans = TransformedTargetRegressor(regressor=model, transformer=QuantileTransformer(output_distribution='normal'))
regr_trans.fit(X_train, y_train)
yhat = regr_trans.predict(X_test)
round(r2_score(y_test, yhat), 3), round(mean_absolute_error(y_test, yhat), 2), round(np.sqrt(mean_squared_error(y_test, yhat)), 2)
>>0.843, 2189.28, 4931.96
Woahh… a whopping 84%…and MAE has reduced to 2189!
4. Use of Ensemble and Boosting Algorithms
Now we will use these features on ensemble-based RandomForest, GradientBoosting, LightGBM, and XGBoost. If you are a beginner and not aware of boosting and bagging methods, you can read more about them here.
X = df[['age', 'bmi', 'smoker', 'more_than_1_child', 'cust_type', 'location_north', 'location_south']]
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
#Transforming target variable through quantile transformer
ttr = TransformedTargetRegressor(regressor=model, transformer=QuantileTransformer(output_distribution='normal'))
ttr.fit(X_train, y_train)
yhat = ttr.predict(X_test)
r2_score(y_test, yhat), mean_absolute_error(y_test, yhat), np.sqrt(mean_squared_error(y_test, yhat))
>>0.8802, 2078, 4312
Yes! Our RandomForest model does perform well – MAE of 2078👍 . Now, we will try with some boosting algorithms such as Gradient Boosting, LightGBM, and XGBoost.
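A sketch of how these three boosting models can be tried with the same target transformation (default hyperparameters assumed here; my notebook may use different settings):
from sklearn.ensemble import GradientBoostingRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
for name, model in [('GradientBoosting', GradientBoostingRegressor()), ('LightGBM', LGBMRegressor()), ('XGBoost', XGBRegressor())]:
    ttr = TransformedTargetRegressor(regressor=model, transformer=QuantileTransformer(output_distribution='normal'))
    ttr.fit(X_train, y_train)
    yhat = ttr.predict(X_test)
    #R2, MAE and RMSE for each boosting model
    print(name, round(r2_score(y_test, yhat), 3), round(mean_absolute_error(y_test, yhat), 2), round(np.sqrt(mean_squared_error(y_test, yhat)), 2))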

All of them seem to perform well:)
5. Hyperparameter Tuning
Let’s tweak some of the algorithm parameters such as tree depth, number of estimators, learning rate, etc., and check for model accuracy. Manually trying out different combinations of parameter values is very time-consuming. Scikit-learn’s GridSearchCV automates this process by searching over a grid of parameter values and returning the best combination. I have applied GridSearch to the above 3 algorithms. Below is the one for XGBoost:
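A minimal sketch of such a grid search, assuming an illustrative parameter grid (the values actually searched in my notebook may differ):
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor
#Illustrative parameter grid – actual values searched may differ
param_grid = {'n_estimators': [100, 500, 900], 'max_depth': [2, 3, 5], 'learning_rate': [0.05, 0.1, 0.2]}
grid = GridSearchCV(estimator=XGBRegressor(), param_grid=param_grid, scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_)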

Once we have optimum values for our parameters, we will run all 3 models again with these values.
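For example, the tuned XGBoost model could be refit with the best parameters found above, again wrapped in the target transformer (a sketch, not the exact notebook code):
#Refit XGBoost with the best parameters from the grid search
best_model = XGBRegressor(**grid.best_params_)
ttr = TransformedTargetRegressor(regressor=best_model, transformer=QuantileTransformer(output_distribution='normal'))
ttr.fit(X_train, y_train)
yhat = ttr.predict(X_test)
round(r2_score(y_test, yhat), 3), round(mean_absolute_error(y_test, yhat), 2), round(np.sqrt(mean_squared_error(y_test, yhat)), 2)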

This looks much better! We have been able to improve our accuracy – XGBoost gives a score of 88.6% with relatively lower errors 👏👏
Distribution and residual plots confirm that there is a good overlap between predicted and actual charges. However, a handful of predictions deviate far from the actual values, and this inflates our RMSE. It can be reduced by increasing our data points, i.e., collecting more data.
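A sketch of how such distribution and residual plots can be drawn from the final model’s predictions:
import matplotlib.pyplot as plt
import seaborn as sns
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
#Distribution plot: overlap between actual and predicted charges
sns.kdeplot(y_test, label='Actual', ax=axes[0])
sns.kdeplot(yhat, label='Predicted', ax=axes[0])
axes[0].legend()
#Residual plot: errors vs predicted values
sns.scatterplot(x=yhat, y=y_test - yhat, ax=axes[1])
axes[1].axhline(0, color='red')
plt.show()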
We are now ready to deploy this model into production and test it on unknown data. Well done👍
In a nutshell, here are the points that improved the accuracy of my model:
- Creating simple new features
- Transforming target variable
- Clustering common data points
- Use of boosting algorithms
- Hyperparameter tuning
You can access my notebook here. Not all of them may work for your model every time. Pick and choose the ones that work best for your scenario:)