Motivation
Back in August 2020, IBM launched the event "Maratona Behind The Code" for Latin America, in which participants solved business challenges using IBM's services, built predictive models, and more. One of the challenges came from "Digital House", a coding school that wanted a Machine Learning model to predict how long their students need to find a job after finishing one of their courses. I managed to rank 48th.
Link to the Digital House challenge on GitHub

Introduction
In this article, I go through the different steps needed for a complete ML project, using a dataset from a real competition.
This article will be divided into 3 parts:
- Exploratory Data Analysis
- Preprocessing the data
- Model exploration and evaluation
First of all, we download the dataset:
import pandas as pd
import numpy as np
!wget --no-check-certificate --content-disposition https://raw.githubusercontent.com/vanderlei-test/654986294958/master/train_dataset_digitalhouse.csv
df_training_dataset = pd.read_csv('train_dataset_digitalhouse.csv')
df_training_dataset.head()

Removing the useless id column "Unnamed: 0":
columns=['Unnamed: 0']
df_training_dataset_1 = df_training_dataset.drop(columns=columns, inplace=False)
df_training_dataset_1.shape
#Output: (8995, 10)
We have to predict the continuous variable "DIAS_EMP".
Some quick observations before exploring the data:
- There are many categorical variables.
- The data appears to have many missing values.
Exploratory Data Analysis
Let’s get some general info on the data:
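The notebook screenshot here shows the standard pandas summary; a minimal sketch to reproduce the same overview (column dtypes, non-null counts, and basic statistics):
df_training_dataset_1.info()
df_training_dataset_1.describe()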

Some things to consider when doing exploratory data analysis:
- Try to understand the different features and their relationship with each other.
- Look for variables that may help predict the target.
Visualize numerical data
Now let’s look at the correlation between numerical variables:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
num_features = ["EDAD", "AVG_DH", "MINUTES_DH", "EXPERIENCIA", "DIAS_EMP"]
cat_features = ["GENERO", "NV_ESTUDIO", "ESTUDIO_PREV", "TRACK_DH", "RESIDENCIA"]
target = "DIAS_EMP"
df_training_dataset_1[num_features].corr()
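Optionally, the correlation matrix is easier to scan as a heatmap than as a raw table; a small sketch using the seaborn import above:
sns.heatmap(df_training_dataset_1[num_features].corr(), annot=True, cmap="coolwarm")
plt.show()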

Now let’s plot the numerical features against each other so we can see their relationships more clearly:
sns.pairplot(df_training_dataset_1)

Some observations:
- Most features already look normally distributed, so we may not need to deal with that (a quick check is shown after this list).
- There’s a clear relationship between the column pairs [EDAD, EXPERIENCIA] and [MINUTES_DH, AVG_DH], which suggests we may be able to find a non-linear function that describes one feature given the other one.
- Although MINUTES_DH and AVG_DH are correlated, they both have a low correlation with the target variable, so dropping them may be worth considering.
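As a quick, optional sanity check on the normality observation above, we can look at the skewness of each numeric column (values near zero suggest roughly symmetric distributions):
df_training_dataset_1[num_features].skew()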
Visualize categorical data
We start by using count plots with our categorical data:
fig, ax = plt.subplots(1, 5, figsize=(20, 5))
for variable, subplot in zip(cat_features, ax.flatten()):
    sns.countplot(data=df_training_dataset_1, x=variable, ax=subplot)
    for label in subplot.get_xticklabels():
        label.set_rotation(90)

We now study how the target varies across the different categories:
fig, ax = plt.subplots(1, 5, figsize=(20, 5))
for variable, subplot in zip(cat_features, ax.flatten()):
    sns.kdeplot(data=df_training_dataset_1, x=target, ax=subplot, hue=variable)
    for label in subplot.get_xticklabels():
        label.set_rotation(90)

g = sns.PairGrid(df_training_dataset_1, y_vars="DIAS_EMP",
                 x_vars=cat_features,
                 height=5, aspect=.5)
# Draw a seaborn pointplot onto each Axes
g.map(sns.pointplot, scale=1.3, errwidth=4, color="xkcd:plum")
sns.despine(fig=g.fig, left=True)
for ax in g.axes.flatten():
    ax.tick_params(rotation=90)

Some observations:
- The target ("DIAS_EMP") varies greatly across the features "GENERO", "NV_ESTUDIO", and "ESTUDIO_PREV".
- The target varies very slightly under "RESIDENCIA" and "TRACK_DH", so we will drop them.
df_training_dataset_1.drop(columns = ["RESIDENCIA", "TRACK_DH"], inplace = True)
Data preprocessing
In this step we will:
- Impute missing data.
- Do feature engineering.
The danger of dropping rows with missing values is that our model may not see enough data to learn all the features properly. Let’s see how many rows we would lose if we simply dropped every row with nulls:
df_training_dataset_1.dropna().shape
#Output: (2863, 8)
We would lose two-thirds of our data by dropping all rows with null values.
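It also helps to know where the missing values are concentrated; a small optional check of null counts per column:
df_training_dataset_1.isnull().sum()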
Imputing missing categorical values
The most popular ways of imputing categorical values are:
- filling missing data with the most frequent category
- filling missing data with a new category (e.g. "missing")
- predicting them based on other variables
In this article, we’re going to try only the first two strategies, plot them to visualize the results, and train an early LR to check performance.
#Print some metrics
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score
def model_metrics(y_test, y_pred):  # Calculate some metrics for the model
    mse = mean_squared_error(y_test, y_pred)
    print("Mean squared error: %.4f" % mse)
    rmse = mse / 2  # note: this is half the MSE, not its square root, despite the "RMSE" label below
    print('RMSE score: %.4f' % rmse)
#Since this is a regression problem, we'll use an LR predictor inside a function for easy testing of the changes
#It just tests the basic LinearRegression model on demand
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
def test_lineal_regression(features, target, model=LinearRegression()):
    #Divide the data set for testing
    X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.25, random_state=133)
    #Create and fit a basic LR model
    model = model.fit(X_train, y_train)
    scores = cross_val_score(model, X_train, y_train, cv=5, n_jobs=-1)
    #Evaluate the model
    y_pred = model.predict(X_test)
    #print(str(len(model.coef_)) + " features: " + str(model.coef_))
    print("Cross Val R2 Score: " + str(scores.mean().round(4)))
    model_metrics(y_test, y_pred)
Let’s first run a Linear Regression model on the categorical variables only, dropping all null values, to see the kind of performance we could be aiming for.
#First training without any missing data
from sklearn.preprocessing import OneHotEncoder
cat_features = ["GENERO", "NV_ESTUDIO", "ESTUDIO_PREV"]
df_cat_impute = df_training_dataset_1[cat_features].copy()
df_cat_impute[target] = df_training_dataset_1[target]
df_cat_dropna = df_cat_impute.dropna()
cat_dropna_target = df_cat_dropna[target]
df_cat_dropna = df_cat_dropna.drop(columns=target)
#df_cat_dropna2 = OneHotEncoder().fit_transform(df_cat_dropna["GENERO"].values.reshape(-1,1))
df_cat_dropna2 = OneHotEncoder().fit_transform(df_cat_dropna)
test_lineal_regression(df_cat_dropna2, cat_dropna_target)
"""
Output:
Cross Val R2 Score: 0.7339
Mean squared error: 8.5234
RMSE score: 4.2617
"""
Not a bad performance considering it uses only the categorical features. Let’s now impute with the "most frequent" category:
from sklearn.impute import SimpleImputer
cat_imputer = SimpleImputer(strategy="most_frequent")
df_cat_impute2 = pd.DataFrame(
    data=cat_imputer.fit_transform(df_cat_impute),
    columns=df_cat_impute.columns)
#Now to visualize its results:
df_cat_impute2[target] = df_training_dataset_1[target]
fig, ax = plt.subplots(1, 3, figsize=(20, 5))
for variable, subplot in zip(cat_features, ax.flatten()):
    sns.countplot(data=df_cat_impute2, x=variable, ax=subplot)
    for label in subplot.get_xticklabels():
        label.set_rotation(90)

Let’s try and train this one:
df_cat_impute3 = df_cat_impute2.drop(columns = [target])
df_cat_impute3 = OneHotEncoder().fit_transform(df_cat_impute3)
test_lineal_regression(df_cat_impute3, df_training_dataset_1[target])
"""
Output:
Cross Val R2 Score: 0.5737
Mean squared error: 13.2366
RMSE score: 6.6183
"""
So we’ve got a difference of -0.16 in the R2 score, which is quite large, considering that almost half the rows had at least one value filled in. Let’s now try imputing by creating a new category: null values are replaced with the constant "MISSING".
#Now fill with "missing values"
df_cat_impute = df_training_dataset_1[cat_features].copy()
miss_imputer = SimpleImputer(strategy="constant", fill_value="MISSING")
df_inputed_missing = pd.DataFrame(data= miss_imputer.fit_transform(df_cat_impute), columns = cat_features)
df_inputed_missing_encoded = OneHotEncoder().fit_transform(df_inputed_missing)
test_lineal_regression( df_inputed_missing_encoded, df_training_dataset_1[target])
"""
Output:
Cross Val R2 Score: 0.618
Mean squared error: 12.2891
RMSE score: 6.1445
"""
This model shows a -0.12 difference in the R2 score. It’s still a big gap, but it beats imputing with the most frequent value, so we’re going to stick with this strategy.
Imputing numerical values
There are multiple strategies we can try, for example:
- Fill with Mean
- Fill with Median
- Fill with Mode
- Predict them
As before, we’re first going to train a model without any missing values to have a baseline for comparison, then we’ll try the different strategies described above.
#Try predict without any missing value for numerical features
df_num_impute = df_training_dataset_1[num_features].dropna().copy()
num_impute_target = df_num_impute[target]
df_num_impute.drop(columns=[target], inplace=True)
test_lineal_regression(df_num_impute, num_impute_target)
"""
Output:
Cross Val R2 Score: 0.182
Mean squared error: 26.1434
RMSE score: 13.0717
"""
So far the performance is very poor; it will definitely improve once we combine the categorical features with the numerical ones. Now we try the different approaches, visualize their effect on the dataset, and train the model.
Filling null values with mean and visualizing:
#Fill missing numeric features with mean values
df_num_impute = df_training_dataset_1[num_features].copy()
mean_imputer = SimpleImputer(strategy="mean")
df_num_impute = pd.DataFrame(data = mean_imputer.fit_transform(df_num_impute), columns = num_features)
sns.pairplot(df_num_impute)

Looking at the distribution plots and the scatter plots for the pairs [EDAD, EXPERIENCIA] and [MINUTES_DH, AVG_DH], we can see that something is wrong: the existing patterns are disrupted. This happened just by imputing with the mean; using any other constant value (like the mode or median) would give a similar result. Let’s measure the performance for this one:
num_mean_target = df_num_impute[target]
df_num_impute2 = df_num_impute.drop(columns = [target])
test_lineal_regression(df_num_impute2, df_training_dataset_1[target])
"""
Output:
Cross Val R2 Score: 0.16
Mean squared error: 27.0971
RMSE score: 13.5485
"""
A difference of -0.022 in the R2 score, not that big, but let’s see what the other strategies offer. Since we’ve already seen the effect of imputing with a constant, let’s jump straight to training with them.
#fill with constants for mode and median
df_num_impute = df_training_dataset_1[num_features].copy()
mode_imputer = SimpleImputer(strategy="most_frequent")
median_imputer = SimpleImputer(strategy="median")
df_num_impute_mode = pd.DataFrame(data = mode_imputer.fit_transform(df_num_impute), columns = num_features)
df_num_impute_median = pd.DataFrame(data = median_imputer.fit_transform(df_num_impute), columns = num_features)
df_num_impute_median2 = df_num_impute_median.drop(columns = [target])
df_num_impute_mode2 = df_num_impute_mode.drop(columns = [target])
print("MODE: ")
test_lineal_regression(df_num_impute_mode2, df_training_dataset_1[target])
print("MEDIAN: ")
test_lineal_regression(df_num_impute_median2, df_training_dataset_1[target])
"""
Output:
MODE:
Cross Val R2 Score: 0.1595
Mean squared error: 27.0820
RMSE score: 13.5410
MEDIAN:
Cross Val R2 Score: 0.1598
Mean squared error: 27.0864
RMSE score: 13.5432
"""
A somewhat similar performance compared to filling with mean.
Let’s now try using a predictor, in this case scikit-learn’s KNNImputer. Since the pairs ["EDAD", "EXPERIENCIA"] and ["MINUTES_DH", "AVG_DH"] are only strongly correlated within each pair, we’re going to impute each pair separately and join the results.
#fill with imputer
from sklearn.impute import KNNImputer
imputer = KNNImputer()
cols_to_impute = ["EDAD","EXPERIENCIA"]
df_edad_exp_filled_knn1 = pd.DataFrame(data = imputer.fit_transform(df_training_dataset_1[cols_to_impute]), columns = cols_to_impute)
cols_to_impute = ["MINUTES_DH","AVG_DH"]
df_dh_filled_knn2 = pd.DataFrame(data = imputer.fit_transform(df_training_dataset_1[cols_to_impute]), columns = cols_to_impute)
df_knn = df_edad_exp_filled_knn1.join(df_dh_filled_knn2)
Visualizing it:
df_knn[target] = df_training_dataset_1[target]
sns.pairplot(df_knn)

Now the plots don’t show any strange artifacts, since the missing values were predicted rather than filled with a constant. Let’s see its performance:
df_knn2 = df_knn.drop(columns = [target])
test_lineal_regression(df_knn2, df_training_dataset_1[target])
"""
Output:
Cross Val R2 Score: 0.1689
Mean squared error: 26.6045
RMSE score: 13.3023
"""
This model improves the R2 score by +0.009 over the previous approaches, so we’re going to stick with this change. We proceed to do some feature transformations.
Feature Engineering
Here we apply some transformations to the numeric data in an attempt to improve performance. We have already seen the relationships between EDAD-EXPERIENCIA and MINUTES_DH-AVG_DH; both look polynomial, which we can confirm with a regression plot:
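The original screenshots show two seaborn regression plots; a minimal sketch that would reproduce them, assuming the df_knn DataFrame built above:
sns.regplot(data=df_knn, x="EDAD", y="EXPERIENCIA", order=3, scatter_kws={"alpha": 0.3})
plt.show()
sns.regplot(data=df_knn, x="MINUTES_DH", y="AVG_DH", order=3, scatter_kws={"alpha": 0.3})
plt.show()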


We can simulate this effect with scikit-learn’s PolynomialFeatures, using a degree of 3 for each pair. However, we must transform the pairs separately, since we are not using "EDAD" to predict "MINUTES_DH". Since this is the final transformation, we also apply a StandardScaler to the numeric features:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
df_knn_imputed = df_knn2.copy()
df_knn_imputed = pd.DataFrame(StandardScaler().fit_transform(df_knn_imputed),columns = df_knn_imputed.columns)
df_knn_imputed_1 = pd.DataFrame(PolynomialFeatures(degree = 3).fit_transform(df_knn_imputed[["EDAD", "EXPERIENCIA"]]))
df_knn_imputed_2 = pd.DataFrame(PolynomialFeatures(degree = 3).fit_transform(df_knn_imputed[[ "MINUTES_DH", "AVG_DH"]]))
df_knn_imputed = df_knn_imputed_2.join(df_knn_imputed_1, rsuffix = "E-")
test_lineal_regression(df_knn_imputed, df_training_dataset_1[target])
"""
Output:
Cross Val R2 Score: 0.1766
Mean squared error: 26.1464
RMSE score: 13.0732
"""
With Polynomial Features we get an improvement of +0.007 in the R2 score over the pure imputations. Now we join the categorical and numerical feature sets to see how the model performs with all the improvements combined.
df_full = df_knn_imputed.join(df_inputed_missing_encoded)
test_lineal_regression(df_full, df_training_dataset_1[target])
"""
Output:
Cross Val R2 Score: 0.8122
Mean squared error: 5.7783
RMSE score: 2.8892
"""
A fairly decent performance. Let’s continue with model exploration.
Model Exploration and Evaluation
Here we will try different scikit-learn regression models and compare them; afterwards we will also compare them against our custom LR model.
Scikit’s models
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn import svm
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
mlp = MLPRegressor()
regr = RandomForestRegressor()
svm = svm.SVR()
en = ElasticNet(random_state = 0)
lasso = Lasso()
r = Ridge()
models = [mlp, regr, svm, en, lasso, r, LinearRegression()]
for m in models:
    print(" - - - - - - ")
    print(m)
    test_lineal_regression(df_full, df_training_dataset_1[target], model=m)

From all the models tried, the basic LinearRegression() and Ridge() linear models are the strongest. Let’s analyze the Ridge model further by tuning its hyperparameters with GridSearchCV, using the R2 metric for scoring:
from sklearn.model_selection import GridSearchCV
params_Ridge = {'alpha': [100, 10, 1, 0.1, 0.01, 0.001, 0.0001, 0],
                "fit_intercept": [True, False],
                "solver": ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga'],
                "tol": [0.01, 0.005, 0.001, 0.005, 0.0001]}
Ridge_GS = GridSearchCV( Ridge(), param_grid = params_Ridge, n_jobs = -1, cv = 5, scoring = 'r2')
Ridge_GS.fit(df_full, df_training_dataset_1[target])
df_results = pd.DataFrame(Ridge_GS.cv_results_)
Select and display the top 15 parameter combinations:
pd.set_option('display.max_rows', 500)
df_results2 = df_results[["param_alpha", "param_fit_intercept", "param_solver", "param_tol", "mean_test_score", "std_test_score", "rank_test_score"]]
df_results2.sort_values(["rank_test_score"]).head(15)

A high mean test score means the model explains a larger share of the variance in the target; a low standard deviation of the test score means performance is consistent across folds, so the model is unlikely to be overfitting. Let’s see the best model:
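The screenshot shows the winning combination; it can also be read directly off the fitted grid search (a minimal sketch):
print(Ridge_GS.best_params_)
print(Ridge_GS.best_score_)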

Now that we got the best hyperparameters for our model, let’s see some other metrics and visualize the results:
X_train, X_test, y_train, y_test = train_test_split(df_full, df_training_dataset_1[target], test_size=0.25, random_state = 133)
model = Ridge(**Ridge_GS.best_params_).fit(X_train, y_train)
y_pred = model.predict(X_test)
model_metrics(y_test, y_pred)
"""
Output:
Mean squared error: 5.7713
RMSE score: 2.8856
"""
Let’s graph some results to compare the predictions vs the actual values:
import matplotlib.pyplot as plt
inicio = 0  # start index of the window to plot
cant = 50   # number of samples to plot
y = y_test[inicio:inicio + cant]
y_p = y_pred[inicio:inicio + cant]
x_ax = range(len(y))
plt.figure(figsize=(20, 4))
plt.plot(x_ax, y, label="original",marker='o')
plt.plot(x_ax, y_p, label="predicted",marker='o')
plt.title("Digital House")
plt.legend()
plt.show()

Custom LR model
In our custom linear regression model, we do some exploration over the "iterations" and "learning rate" hyperparameters:
class LinearRegressionByHand():
    def __init__(self, learning_rate=0.0001, iterations=10000):
        self.learning_rate = learning_rate
        self.iterations = iterations
    def compute_loss(self, y, y_true):
        loss = np.sum(np.square(y - y_true))
        return loss / (2 * self.m)
    def predict(self, X):  # return y = b + w1x1 + w2x2 + ... + wnxn
        y = self.b + np.dot(X, self.W)
        return y
    def gd(self, X, y_true, y_hat):  # Compute the derivatives of the loss function
        db = np.sum(y_hat - y_true) / self.m
        dW = np.sum(np.dot(np.transpose(y_hat - y_true), X), axis=0) / self.m
        return dW, db
    def update_params(self, dW, db):
        self.W = self.W - self.learning_rate * np.reshape(dW, (self.n, 1))
        self.b = self.b - self.learning_rate * db
    def fit(self, x_train, y_train):
        y_train = y_train.values.reshape(-1, 1)
        self.m, self.n = x_train.shape  # m is the number of rows, n the number of features
        self.W = np.random.randn(self.n, 1)  # params
        self.b = np.random.randn()  # bias
        losses = []
        for i in range(self.iterations):
            y_hat = self.predict(x_train)
            loss = self.compute_loss(y_hat, y_train)
            dW, db = self.gd(x_train, y_train, y_hat)
            self.update_params(dW, db)
            losses.append(loss)
        return losses
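For reference, fit above runs plain batch gradient descent on the loss L = (1/2m) Σ (ŷ − y)², so each iteration applies the updates W ← W − α · (1/m) · Xᵀ(ŷ − y) and b ← b − α · (1/m) · Σ (ŷ − y), where α is the learning rate.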
Iterating over its hyperparameters:
results = pd.DataFrame()
iterations = [1000, 10000, 100000]
learning_rates = [10e-4, 10e-5, 10e-6]
for i in iterations:
    for lr in learning_rates:
        print("Training iter " + str(i) + " alpha " + str(lr))
        model = LinearRegressionByHand(iterations=i, learning_rate=lr)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        r2, mse = r2_score(y_test, y_pred), mean_squared_error(y_test, y_pred)
        results = results.append([[i, lr, r2, mse]])
results.columns = ["Iterations", "Learning rate", "R2", "MSE"]  # name the result columns
results.sort_values(["R2"], ascending=False)

The best custom LR model’s performance is comparable to the one we have from scikit.
Additional:
If you want to deploy your model with IBM’s ML services directly from the Jupyter notebook, I recommend looking at the following tutorial.
Conclusions
We have done Exploratory Data Analysis (EDA) on a dataset from a data science competition, and we came up with decent strategies for imputing missing data and further transforming it, all based on observations gathered from different plots of the data. Once the data was ready, we explored several models to find the best possible one, and we also built a custom LR model whose performance is comparable to scikit-learn's.
Further things that could be done to improve the current performance:
- Explore the correlation, or a similar metric, between categorical and numeric features.
- Find a way to predict missing categorical data, if only to compare to the current model.
- Use another ML model to predict the numerical features, like a Linear Regression with Polynomial features, instead of a simple KNN imputer.
- Find interaction variables between numerical and categorical features.
*All images used in this article are direct screenshots from a Jupyter notebook authored by me; the notebook with everything in this article can be found here.