Approaching a Machine Learning Problem

A BEGINNER-FRIENDLY GUIDE

Rupak Karki
Towards Data Science


So, you’ve learned Python and libraries like NumPy, Pandas, Matplotlib, Seaborn, and scikit-learn, and you’re ready to start building ML projects, but you don’t know where to start or how to turn data into models. In that case, this article has you covered. Here, we will go over the basic steps needed to build a model that generalizes well to unseen data.

Here, we will explore the following topics:

  • Loading and Exploring the data
  • Data Cleaning and Preprocessing
  • Exploratory Data Analysis (EDA)
  • Model Building

What and How?

We will be building a model using the UCI heart disease dataset from Kaggle that can detect whether or not someone has heart disease based on different features. The original source of this dataset is here.

Kaggle is an online community of data scientists and machine learning practitioners where one can find free, public datasets to practice on, as well as connect with like-minded people.

We will be using Pandas to manipulate our data, NumPy for numerical computing, Matplotlib and Seaborn for data visualization, and scikit-learn for model building and training. We will be using Jupyter Notebook for the whole process.
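If these libraries are not already installed, a one-liner along these lines (run from a terminal, or prefixed with ! in a notebook cell) should set everything up:

pip install numpy pandas matplotlib seaborn scikit-learn notebook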

Let’s get started!!

Loading and Exploring the data

I have saved the dataset as heart.csv in the same directory as the notebook. Let’s load it using Pandas.

import pandas as pd
heart = pd.read_csv('heart.csv')

heart here is the pandas DataFrame that is created out of our csv file. Using this DataFrame, we can access several methods and properties of our data. Let us have a look at the shape of our data.

heart.shape
(303, 14)

The tuple (303, 14) means we have 303 rows and 14 columns. To be more specific, our dataset contains 303 records, each with 13 features and a target column. We can use the following code to view the top and bottom five rows of our data.

heart.head()
heart.tail()
print("The study on this data is done on people between {0} and {1} of age.".format(heart['age'].min(),heart['age'].max()))

The print statement prints:

The study on this data was done on people between 29 and 77 years of age.

We can use our heart DataFrame to access the records and their values. Here, heart['age'].min() and heart['age'].max() select the minimum and maximum age in the age column of the dataset respectively.

Kaggle is one of the top platforms to acquire public datasets and practice your skills. You can also look at other people’s notebooks and learn a lot about dealing with certain types of problems.

Data Cleaning and Preprocessing

In this step of our project, we will prepare our data for the next steps. Cleaning and preprocessing generally includes checking for and dealing with null values, handling outliers in our data, and getting the data ready for our model.

The task of data processing and cleaning varies between datasets. In some cases, you may have to do nothing, and in others you may have to alter the whole structure of the data. It depends.

Null Values

In this UCI heart disease dataset, fortunately, there are no null values. To check whether the data has null values, you can do:

heart.isnull().sum()

This will return a series with the number of null values in each column (in our case, 0 for every column).

Showing Null Values (Image by Author)
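Had there been missing values, handling them in pandas usually looks something like the sketch below. The tiny frame here is made up purely for illustration and is not part of our dataset.

import numpy as np
demo = pd.DataFrame({'age': [29, np.nan, 54], 'chol': [210, 250, np.nan]})
demo_dropped = demo.dropna()              # drop any row that has a missing value
demo_filled = demo.fillna(demo.median())  # or impute each column with its median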

Renaming Features

The columns in this data are hard to understand because many of them are abbreviations. Let’s change the column names for better understanding. This will also help us visualize our data better.

col_name_map = {'cp': 'chest_pain_type',
                'trestbps': 'rest_bp',
                'chol': 'serum_cholestrol',
                'fbs': 'fasting_blood_sugar',
                'restecg': 'rest_ecg_level',
                'thalach': 'max_heart_rate',
                'exang': 'exercise_induced_angina',
                'oldpeak': 'st_depression',
                'slope': 'st_slope_type',
                'ca': 'major_vessel_count',
                'thal': 'thalium_stress_test_result',
                'target': 'diagnosis'}
heart.rename(mapper=col_name_map, axis=1, inplace=True)

We mapped the actual column names to names we can understand in a Python dictionary. The rename method of a DataFrame applies this mapping and changes the column names in our data. We specify axis=1 to rename columns and inplace=True to make the change permanent.

Changing categorical data

Categorical features are those whose values are not continuous; they take one of a limited and usually fixed number of values. Our dataset has many categorical features, but they are encoded as numbers, so we need to change them to something we can understand and work with, which will also make our model easier to build.

We use the .loc method of the DataFrame to locate the rows with the values we need to change and assign our category labels, as sketched below.
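As a rough sketch (the label strings here are my own shorthand for the dataset’s documented encodings; the pattern is what matters), two of the columns might be mapped like this:

# make the column dtype flexible first, then overwrite the numeric codes
heart['sex'] = heart['sex'].astype('object')
heart.loc[heart['sex'] == 0, 'sex'] = 'female'
heart.loc[heart['sex'] == 1, 'sex'] = 'male'

heart['fasting_blood_sugar'] = heart['fasting_blood_sugar'].astype('object')
heart.loc[heart['fasting_blood_sugar'] == 0, 'fasting_blood_sugar'] = 'below 120 mg/dl'
heart.loc[heart['fasting_blood_sugar'] == 1, 'fasting_blood_sugar'] = 'above 120 mg/dl'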

Exploratory Data Analysis

EDA is the process of analyzing the dataset to gain insight into the data at hand, often using visual methods. We try to summarize the main characteristics of our data. Basically, we explore every nook and cranny of our dataset. We use tools like Matplotlib and Seaborn to visualize our data, and we also try to find the correlations between features.

Visualization

Let us convert the data type of the categorical features. We make a list of the categorical columns and iterate through them to change the data type.

cat_cols = ['sex', 'chest_pain_type', 'fasting_blood_sugar', 'rest_ecg_level',
            'exercise_induced_angina', 'st_slope_type', 'thalium_stress_test_result']
for col in cat_cols:
    heart[col] = heart[col].astype('object')

Before plotting, let us import some libraries.

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Let us see the countplot of these categorical features.
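The plotting code itself is only sketched here; the subplot layout is an assumption and the notebook in the repo may arrange things differently.

fig, axes = plt.subplots(nrows=4, ncols=2, figsize=(12, 16))
for ax, col in zip(axes.flatten(), cat_cols):
    sns.countplot(x=col, data=heart, ax=ax)  # one countplot per categorical column
plt.tight_layout()
plt.show()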

Code along these lines generates the following plot:

Countplot for categorical columns (Images by Author)

Correlation

# correlation matrix of the numeric features
corr = heart.corr()
plt.figure(figsize=(10, 6))
sns.heatmap(corr, annot=True)

The code above plots the following heatmap of the correlations between features.

Correlation between Features (Image by Author)

There are more plots in the notebook in the GitHub repo, some of them explained as well. You can visit the repository by clicking here and look into it.

Model Building

This is one of the most important steps in the whole process. Here, we try to build a model that generalizes well to unseen data, using the data we cleaned and preprocessed earlier.

Data cleaning, preprocessing and feature engineering are some of the most important steps in machine learning. The quality of data fed into the algorithm can make or break a model.

Getting the data ready

Before we feed the data to the algorithm, we need to perform some additional steps so that our model generalizes better.

1. Separating the Target Variable

# Separating target variable
X = heart.drop('diagnosis', axis=1)
y = heart.diagnosis

We separate the rest of the data from our target. X is the input data and y is the corresponding output for each row.

2. Dummies

# Getting dummies from categorical Values
X = pd.get_dummies(X)

Categorical values need to be handled with extra caution. Our algorithm will only accept numeric values as input, so we use the pandas built-in method get_dummies to deal with them. This is also referred to as dummy coding. You can read more about categorical features and the ways to deal with them here.
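As a tiny, made-up illustration of what get_dummies does (this frame is not part of the project data):

demo = pd.DataFrame({'chest_pain_type': ['typical angina', 'non-anginal pain', 'typical angina']})
pd.get_dummies(demo, dtype=int)
# each category becomes its own 0/1 indicator column:
# chest_pain_type_non-anginal pain, chest_pain_type_typical angina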

3. Split into train and test set

# Splitting Data (by default, 25% of the rows are held out as the test set)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

We also need to split our data into a training set and a test set. Our model will be trained on the training set, and we can later evaluate it on the test set.

Random Forest Classifier

Random forests are built upon the idea of decision trees, which are used for classification, regression and other tasks. During training, the algorithm builds multiple decision trees and, for classification, outputs the mode of the classes predicted by the individual trees. Read more about it here.

RandomForestClassifier is our choice for building this model. We use the classifier variant because the task at hand is classifying whether a diagnosis indicates heart disease or not.

Regression tasks predict continuous values such as house prices, car prices, land prices, etc. Classification tasks predict the class or group an observation falls under.

Training

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
classifier.fit(X_train, y_train)

This will train the RandomForestClassifier with our training data without any hyper-parameter tuning.

To see how well it performs, we can use the following code:

# .score returns the mean accuracy on the given test data
score = classifier.score(X_test, y_test)
print("Score on the test set: {}".format(score))

Conclusion

We looked into some of the basics of approaching a machine learning problem in this article. The topics and steps we covered are for a beginner-level machine learning problem with structured data. In a big project, there are many more steps that need to be performed to build the right model.

What we went through is a general process that can be applied to any ML problem. The vast majority of the time is spent making the data right for the model. We did not cover all of the data cleaning and feature engineering steps here because our data was already structured. Real-world data can be unforgiving, and we may have to do much more to get predictions out of it.

This article is meant as a starting point for beginners looking to start their very first project. Even though it does not cover every topic, it should serve as a guide for those starting on this exciting journey.

The Jupyter Notebook for this project, available in this GitHub repository, has more explanation of the code and the method. Make sure to check it out and open issues if you find any. As always, feedback and suggestions are welcome and encouraged.

To know more about me, click here for my portfolio.
