
Understanding the Process of Building a Machine Learning Model for a Real World Scenario

A practical step-by-step guide to building a Linear Regression Model to predict Apparent Temperature from Weather History Data

Photo by Carlos Muza on Unsplash

This article focuses on finding relationships between particular features of a weather history dataset and making predictions with it, after going through all the necessary steps:

  • Data Cleansing
  • Data Transformations
  • Feature Coding
  • Feature Scaling
  • Feature Discretization
  • Dimensionality Reduction
  • Linear Regression Model
  • Model Evaluation

Here the goal is to find out whether there is a relationship between humidity and temperature, and whether there is a relationship between humidity and apparent temperature. We also want to find out whether the apparent temperature can be predicted from the given humidity.

The dataset used here is the weather history dataset from Kaggle. You can find it by clicking here.

First, import the required Python libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import FunctionTransformer
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics

Then import the dataset into Colab and read it.

from google.colab import drive
drive.mount("/content/gdrive")
data_set = pd.read_csv('/content/gdrive/My Drive/data/weatherHistory.csv')
data_set.tail(10)

The last 10 rows of the dataset can be checked with data_set.tail(10), as below.

Output:

Dataset

Then a copy of the original dataset is created as X.

X = data_set.copy()

1. Data Cleansing – Handling missing values, anomalies, and outliers

The dataset should be checked for missing values. The code below returns True if there are any missing values and False otherwise.

X.isnull().values.any()

The above code outputs True, indicating that there are missing values in the dataset. So the columns that have missing values need to be identified.

X.isnull().any()

Output:

This means that only the Precip Type column has missing values; all the other columns are complete.

The percentage of missing values in each column can be checked as below.

print(X.isnull().sum() * 100 / len(X))

Output:

We can see that the percentage is very low, so we can simply drop the rows with missing values using the code below.

X = X.dropna()

Since the missing value problem is resolved, next, we should check for outliers or anomalies.

X.boxplot(figsize=(18,8))

Output:

Boxplot for all the features

We can see that the distribution of the Pressure feature is abnormal, since some Pressure values are 0. We can check how many rows show this anomaly.

X[X['Pressure (millibars)'] == 0].shape[0]

Output: 1288

So these are not genuine outliers; the zero values are more likely the result of human errors or system failures. We cannot simply accept them, but we cannot drop them either, since then we would lose the other feature values in those rows. Instead, we can use the IQR.

The IQR, or interquartile range, is a measure of variability equal to the difference between the third quartile (Q3) and the first quartile (Q1) of the data.

We can calculate a lower limit (Q1 - 1.5*IQR) and an upper limit (Q3 + 1.5*IQR) using these quartiles. Then we replace the values that are less than the lower limit with the lower limit, and the values that are greater than the upper limit with the upper limit. This works for left-skewed and right-skewed data as well.

fig, axes = plt.subplots(1, 2)
plt.tight_layout(pad=0.2)
print("Previous Shape With Outliers: ", X.shape)
sns.boxplot(X['Pressure (millibars)'], orient='v', ax=axes[0])
axes[0].title.set_text("Before")
Q1 = X["Pressure (millibars)"].quantile(0.25)
Q3 = X["Pressure (millibars)"].quantile(0.75)
print(Q1, Q3)
IQR = Q3 - Q1
print(IQR)
lower_limit = Q1 - 1.5*IQR
upper_limit = Q3 + 1.5*IQR
print(lower_limit, upper_limit)
X2 = X.copy()
# Cap the values that fall outside the IQR limits at the limits themselves
X2['Pressure (millibars)'] = np.where(X2['Pressure (millibars)'] > upper_limit, upper_limit, X2['Pressure (millibars)'])
X2['Pressure (millibars)'] = np.where(X2['Pressure (millibars)'] < lower_limit, lower_limit, X2['Pressure (millibars)'])
print("Shape After Capping Outliers:", X2.shape)
sns.boxplot(X2['Pressure (millibars)'], orient='v', ax=axes[1])
axes[1].title.set_text("After")
plt.show()

Output:

Boxplots before and after applying IQR for Pressure

Then we need to check outliers or anomalies in Humidity by plotting the box plot.

humidity_df = pd.DataFrame(X2["Humidity"])
humidity_df.boxplot()

Output:

Boxplot of Humidity

We can see there are humidity values of 0.0. So we should check how many rows have this value.

X2[X2['Humidity'] == 0].shape[0]

Output: 22

Given Earth's climate and weather conditions, it is practically impossible for humidity to be zero. The number of data points with this anomaly is also very small, so we can simply drop them.

X2 = X2.drop(X2[X2['Humidity'] == 0].index)

Now we have completed cleansing the data.

Usually, after data cleansing, you would move on to data transformations. But here I have done feature coding before splitting the data, as this removes the need to do feature coding separately for the training and testing data. The reason for splitting the dataset before applying any transformations is explained later.

2. Feature Coding – Handling categorical features

Categorical features are data that can be grouped into categories. As an example, we can take gender, color, etc. We need to convert those non-numerical values into numerical values in order to train the Machine Learning model. There are 2 traditional techniques to handle categorical data.

  1. One-hot Encoding
  2. Label(Integer) Encoding

I will explain these 2 with examples from our weather history dataset. We have 2 categorical columns in our dataset, Precip Type and Summary. Let’s take the feature ‘Precip Type’.

X2['Precip Type'].value_counts()

Output:

We can see that the data values in ‘Precip Type’ are either rain or snow. So ‘Precip Type’ is a categorical feature.

One-hot Encoding

In this technique, the categorical feature is replaced with new binary features that take the values 0 or 1. Precip Type is replaced with the new features rain and snow, as in the sketch below.
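Here is a minimal, hypothetical illustration (the values below are made up for demonstration and are not taken from the dataset) of what one-hot encoding with pd.get_dummies produces:

toy = pd.DataFrame({'Precip Type': ['rain', 'snow', 'rain', 'rain']})
print(pd.get_dummies(toy['Precip Type']).astype(int))
#    rain  snow
# 0     1     0
# 1     0     1
# 2     1     0
# 3     1     0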

Label Encoding

In this technique, each categorical value is converted to an integer code, as in the sketch below.
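A minimal sketch of label encoding on the same hypothetical values (the integer codes are assigned in alphabetical order of the categories):

toy = pd.DataFrame({'Precip Type': ['rain', 'snow', 'rain', 'rain']})
codes = toy['Precip Type'].astype('category').cat.codes
print(codes.tolist())  # [0, 1, 0, 0] -> rain = 0, snow = 1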

For linear models, One-hot Encoding is usually more suitable. If we apply Label Encoding in such a situation, the model may try to learn an ordering from the codes (0, 1, 2, 3, etc.) that does not actually exist. But for a binary category, we can safely use Label Encoding.

Here since Precip Type has only 2 values I can use Label Encoding.

X2['Precip Type']=X2['Precip Type'].astype('category')
X2['Precip Type']=X2['Precip Type'].cat.codes

But as shown below, Summary has many categories.

X2['Summary'].value_counts()

Output:

So we need to do One-hot Encoding for the Summary column. pd.get_dummies creates a new feature column for each category; then we merge those new columns and drop the original Summary column, as shown in the code below.

dummies_summary = pd.get_dummies(X2['Summary'])
merged_summary = pd.concat([X2,dummies_summary],axis='columns')
X2 = merged_summary.drop(['Summary'], axis='columns')
X2

Output:

Dataset after applying feature coding

3. Splitting the dataset

Data leakage refers to a mistake made by the creator of a machine learning model in which information is accidentally shared between the test and training datasets.

When we apply transformations, discretization, scaling, etc. to the training and test data together (i.e., to the whole dataset), it can cause data leakage. To mitigate this problem, I split the dataset before applying these transformations, so that no information from the test set leaks into the model.

Apparent Temperature is the target variable, so we need to separate it and remove it from the feature dataset. Here the dataset is split into 80% for training and 20% for testing.

final_summary = X2.copy()  # the fully coded feature set
target_column = pd.DataFrame(final_summary['Apparent Temperature (C)'])
final_summary = final_summary.drop(['Apparent Temperature (C)'], axis='columns')
X_train, X_test, y_train, y_test = train_test_split(final_summary, target_column, test_size=0.2)

After splitting, the dataset indexes need to be reset.

X_train=X_train.reset_index(drop=True)
X_test=X_test.reset_index(drop=True)
y_train=y_train.reset_index(drop=True)
y_test=y_test.reset_index(drop=True)

Now we can apply transformations to the training and testing datasets separately.

4. Data Transformation

Variables in real datasets often follow skewed distributions, but many machine learning models learn better when the data are closer to a normal distribution. So we should check how these variables are distributed using histograms and Q-Q plots. Here I explain only the transformations of the training dataset; the same process should be applied to the testing dataset as well.

X3 = X_train.copy()  # copy of the training dataset
Y3 = X_test.copy()   # copy of the testing dataset
stats.probplot(X3["Temperature (C)"], dist="norm", plot=plt)
plt.show()

Output:

Q-Q plot for Temperature

Likewise, we can analyze the distribution of every feature with Q-Q plots to see how the data deviates from the desired distribution (the red line). Here we can see that Temperature is approximately normally distributed. For example, the loop sketched below draws a Q-Q plot for each continuous feature.
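This is just a small sketch, using the continuous column names that appear in this dataset:

continuous_cols = ['Temperature (C)', 'Humidity', 'Wind Speed (km/h)',
                   'Wind Bearing (degrees)', 'Visibility (km)', 'Pressure (millibars)']
for col in continuous_cols:
    stats.probplot(X3[col], dist="norm", plot=plt)  # Q-Q plot against a normal distribution
    plt.title(col)
    plt.show()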

But when we see the Q-Q plot for Wind Speed we can see data is skewed.

Q-Q plot for Wind Speed

Then we can plot histograms as well.

X3.hist(figsize=(15,10))

Output:

Histograms for training data

We can see that Humidity is left-skewed and Wind Speed is right-skewed. Visibility also shows some left-skewness, but applying transformations did not bring it close to a normal distribution, so I did not transform Visibility.

For right-skewed data, we apply logarithmic transformations and for left-skewed data, we can apply exponential transformations.

logarithm_transformer = FunctionTransformer(np.log1p)
data_new1 = logarithm_transformer.transform(X3['Wind Speed (km/h)'])
X3['Wind Speed (km/h)']=data_new1
X3['Wind Speed (km/h)'].hist()

Here I have used np.log1p, since Wind Speed has 0.0 values and np.log is only defined for positive numbers (a quick check of this difference follows the histogram below).

Output:

Histogram for Wind Speed after applying logarithm transformations

Now we can see Wind Speed is much closer to a normal distribution.
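As a quick aside on why np.log1p was used instead of np.log:

# np.log(0) is -inf (and emits a runtime warning), while np.log1p(0) = log(1 + 0) = 0,
# so zero wind speeds stay finite after the transformation.
print(np.log(0))    # -inf
print(np.log1p(0))  # 0.0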

Since Humidity is left-skewed we apply exponential transformations.

exp_transformer = FunctionTransformer(np.exp)
data_new2 = exp_transformer.transform(X3['Humidity'])
X3['Humidity']=data_new2
X3['Humidity'].hist()

Output:

Histogram for Humidity after applying exponential transformations

Below is how the histograms changed after the transformations.

Wind Speed:

Humidity:

From the Q-Q plots and histograms we can see that Loud Cover is always 0.0. Since it carries no information, we need to drop Loud Cover from both the training and testing data.

X3 = X3.drop(['Loud Cover'], axis='columns')
Y3 = Y3.drop(['Loud Cover'], axis='columns')

Since Formatted Date is unique for every row, and Daily Summary is a non-numerical column with a very large number of distinct values, we can drop them both.

X3 = X3.drop(columns=['Daily Summary','Formatted Date'])
Y3 = Y3.drop(columns=['Daily Summary','Formatted Date'])

5. Feature Scaling – Standardization

Feature scaling refers to the methods used to normalize the range of values of independent variables.

Here I have used standardization for scaling. Standardization transforms the data so that each feature has a mean of 0 and a standard deviation of 1. This step is important because it prevents the model from being dominated by variables with large magnitudes.
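As a rough sketch of what this means, the z-score formula is z = (x - mean) / standard deviation. Applied by hand to a single column (using Temperature from the training copy as an example), it is essentially what StandardScaler does per column below; note that StandardScaler uses the population standard deviation (ddof=0):

col = X3['Temperature (C)']
z = (col - col.mean()) / col.std(ddof=0)  # z-score standardization by hand
print(round(z.mean(), 6), round(z.std(ddof=0), 6))  # approximately 0.0 and 1.0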

Standardization is not applied to categorical variables. So they should be kept aside before doing standardization for numeric features. Standardization should also be applied to both training and testing data separately.

Here, note that I have used scaler.fit() only on the training data, since we need the same parameters (mean and standard deviation) learned from the training dataset to be applied to the testing dataset as well.

X4 = X3.copy()
Y4 = Y3.copy()
to_std_train = X4[['Temperature (C)', 'Humidity','Wind Speed (km/h)', 'Wind Bearing (degrees)', 'Visibility (km)','Pressure (millibars)']].copy()
to_std_test = Y4[['Temperature (C)', 'Humidity','Wind Speed (km/h)', 'Wind Bearing (degrees)', 'Visibility (km)','Pressure (millibars)']].copy()
scaler = StandardScaler()
scaler.fit(to_std_train)
train_scaled = scaler.transform(to_std_train)
test_scaled = scaler.transform(to_std_test)
std_df_train = pd.DataFrame(train_scaled, columns = to_std_train.columns)
std_df_test = pd.DataFrame(test_scaled, columns = to_std_test.columns)
X4[['Temperature (C)', 'Humidity','Wind Speed (km/h)','Wind Bearing (degrees)','Visibility (km)','Pressure (millibars)']]=std_df_train
Y4[['Temperature (C)', 'Humidity','Wind Speed (km/h)','Wind Bearing (degrees)','Visibility (km)','Pressure (millibars)']]=std_df_test

After standardizing data, we can use a correlation matrix for correlation analysis.

sns.heatmap(std_df_train.corr(),annot=True);
Correlation matrix

We can see that there is a negative correlation between Humidity and Temperature. So, to answer the question "is there a relationship between humidity and temperature?": yes, there is a negative correlation between humidity and temperature. If we want the exact coefficient rather than reading it off the heatmap, we can compute it directly, as in the snippet below.
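A one-line check on the standardized training data:

# Pearson correlation between Humidity and Temperature in the training set
print(std_df_train['Humidity'].corr(std_df_train['Temperature (C)']))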

6. Discretization

Discretization is used to transform a continuous variable into a discrete variable. When we plot the histogram for Wind Bearing, we can see that it does not fit a normal distribution. Machine learning models can perform better when numerical variables with non-standard probability distributions are made discrete. Here I have used KBinsDiscretizer to bin Wind Bearing into 8 equal-width bins.

X5 = X4.copy()
Y5 = Y4.copy()
data1 = pd.DataFrame(X5, columns=['Wind Bearing (degrees)'])
data1 = data1.dropna()
data2 = pd.DataFrame(Y5, columns=['Wind Bearing (degrees)'])
data2 = data2.dropna()
discretizer = KBinsDiscretizer(n_bins=8, encode='ordinal', strategy='uniform')
discretizer.fit(data1)
_discretize1 = discretizer.transform(data1)
_discretize2 = discretizer.transform(data2)
X_dis = pd.DataFrame(_discretize1)
Y_dis = pd.DataFrame(_discretize2)
X_dis.hist()

Output:

Histogram for Wind Bearing after discretization

We can plot a correlation matrix to identify how significant each feature is for the target column, which is Apparent Temperature.

Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together.

X5['Wind Bearing (degrees)'] = X_dis
d_data = X5[['Temperature (C)', 'Humidity','Wind Speed (km/h)','Wind Bearing (degrees)','Visibility (km)','Pressure (millibars)']].copy()
d_data['Apparent Temperature (C)'] = y_train
d_data.head(10)
print(d_data.corr())
sns.heatmap(d_data.corr(),annot=True)
Correlation matrix

We can see that Humidity and the target variable Apparent Temperature have a strong correlation. So, to answer the question "is there a relationship between humidity and apparent temperature?": yes, there is a negative correlation between humidity and apparent temperature.

7. Dimensionality Reduction – Principal Component Analysis

The importance of dimensionality reduction is that we can compress the dataset by removing redundancy while retaining only useful information. Too many input variables can lead to the curse of dimensionality, in which case the model will not perform well because it also learns the noise in the training dataset and overfits.

Principal Component Analysis is a powerful technique used for dimensionality reduction, increasing interpretability but at the same time minimizing information loss.

pca = PCA(n_components=12)
pca.fit(X5)
X_train_pca = pca.transform(X5)
X_test_pca = pca.transform(Y5)
principalDf = pd.DataFrame(data = X_train_pca)
principalDf.head(20)

Output:

The value of the parameter n_components is decided by analyzing explained_variance_ratio_, as below.

pca.explained_variance_ratio_

Output:

PCA explained variance ratio

When deciding the number of principal components to use, we should consider the above values. The components are ordered by explained variance, and a common rule of thumb is to keep enough components for the cumulative explained variance to reach about 0.95. Here I have used 12 principal components. The sketch below shows one way to pick that number programmatically.
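This is a small sketch, assuming X5 is the same feature frame passed to PCA above:

# Fit PCA with all components, then find the smallest number of components
# whose cumulative explained variance reaches 95%.
pca_full = PCA().fit(X5)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_needed = int(np.argmax(cumulative >= 0.95) + 1)
print("Components needed for 95% of the variance:", n_needed)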

8. Linear Regression Model

After completing all the above-discussed steps, a Linear regression model is applied to the dataset.

lm = linear_model.LinearRegression()
model = lm.fit(X_train_pca,y_train)

9. Model Evaluation

When we evaluate the model, model.score() returns 0.990223. Note that for a regression model, score() is the R² (coefficient of determination) rather than a classification accuracy.

accuracy = model.score(X_test_pca, y_test)
print("Accuracy:", accuracy)

Output:

To further evaluate the model we use mean squared error (MSE).

predictions = lm.predict(X_test_pca)
y_hat = pd.DataFrame(predictions, columns=["predicted"])
mse = mean_squared_error(y_test, y_hat)
print('Test MSE: %.3f' % mse)

Output:

We have also calculated the root mean squared error (RMSE).

rmse = sqrt(mean_squared_error(y_test, y_hat))
print('Test RMSE: %.3f' % rmse)

Output:

We can perform 6-fold cross-validation.

scores = cross_val_score(model, X_test_pca, y_test, cv=6)
print("Cross-validated scores:", scores)

Output:

With the code below, I calculated the R² score of the cross-validated predictions.

predictions = cross_val_predict(model, X_test_pca, y_test, cv=6)
accuracy = metrics.r2_score(y_test, predictions)
print("Cross-Predicted Accuracy:", accuracy)

The accuracy of the model is good, but it is also important to check the weight parameters (coefficients).

The magnitudes of the weight parameters are a useful indicator of how well the model is likely to perform on new data: if the weights are small, the model is less sensitive to small changes in individual features and tends to generalize better to unseen data.

#W parameters of the model
print(lm.coef_)

Output:

Here, the weight parameters are not very low, but we could reduce them further by revisiting some of the preprocessing and transformation steps.

We can visualize the predictions of the model as below. Here I have plotted the graph only for the first 100 data points.

plt.figure(figsize=(15, 7))
plt.plot(y_hat[:100], label = "Predicted")
plt.plot(y_test[:100], label = "Actual")
plt.title('Predictions Vs Actual')
plt.legend()
plt.show()

Output:

Predictions vs Actual data graph

10. Conclusion

In this article, I gave you a step-by-step guide to building a machine learning model, covering data cleansing, data transformations, feature coding, feature scaling, discretization, and dimensionality reduction, and I also evaluated the resulting linear regression model.

I hope this article helped you to gain a good understanding of the important steps we should follow and the main concepts behind them to build a good machine learning model.

Happy Coding!

11. Colab Code

Google Colaboratory

