End-to-End Data Science Example: Predicting Diabetes with Logistic Regression

Keshav Dhandhania
Towards Data Science
8 min read · May 24, 2018

As the title suggests, this tutorial is an end-to-end example of solving a real-world problem using Data Science. We’ll be using Machine Learning to predict whether a person has diabetes, based on information about the patient such as blood pressure, body mass index (BMI), age, etc. The tutorial walks through the various stages of the data science workflow. In particular, it has the following sections:

  • Overview
  • Data Description
  • Data Exploration
  • Data Preparation
  • Training and Evaluating the Machine Learning Model
  • Interpreting the ML Model
  • Saving the Model
  • Making Predictions with the Model
  • Next Steps

Overview

The data was collected and made available by the National Institute of Diabetes and Digestive and Kidney Diseases as part of the Pima Indians Diabetes Database. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females of Pima Indian heritage (a Native American subgroup), aged 21 and above.

We’ll be using Python and some of its popular data science packages. First of all, we will import pandas to read our data from a CSV file and manipulate it for further use. We will also use numpy to convert our data into a format suitable for feeding our classification model. We’ll use seaborn and matplotlib for visualizations. We will then import the LogisticRegression class from sklearn, which will help us build our classification model. Lastly, we will use joblib to save our model for future use.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LogisticRegression
import joblib  # sklearn.externals.joblib was removed in newer scikit-learn versions

Data Description

We have our data saved in a CSV file called diabetes.csv. We first read our dataset into a pandas dataframe called diabetesDF, and then use the head() function to show the first five records from our dataset.

diabetesDF = pd.read_csv('diabetes.csv')
print(diabetesDF.head())
First 5 records in the Pima Indians Diabetes Database

The following features have been provided to help us predict whether a person is diabetic or not:

  • Pregnancies: Number of times pregnant
  • Glucose: Plasma glucose concentration over 2 hours in an oral glucose tolerance test
  • BloodPressure: Diastolic blood pressure (mm Hg)
  • SkinThickness: Triceps skin fold thickness (mm)
  • Insulin: 2-Hour serum insulin (mu U/ml)
  • BMI: Body mass index (weight in kg / (height in m)²)
  • DiabetesPedigreeFunction: Diabetes pedigree function (a function which scores likelihood of diabetes based on family history)
  • Age: Age (years)
  • Outcome: Class variable (0 if non-diabetic, 1 if diabetic)

Let’s also make sure that our data is clean (has no null values, etc).

diabetesDF.info()  # output shown below

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

Note that the data does have some missing values, encoded as zeros (see Insulin = 0 in the samples in the previous figure). Ideally we would replace these 0 values with the mean value for that feature, but we’ll skip that for now.
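
For reference, here is a minimal sketch of that imputation (we don’t apply it here; which columns can’t legitimately be zero is our assumption about this dataset):

zeroAsMissing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
for col in zeroAsMissing:
    # mean of the non-zero (i.e. actually measured) values
    colMean = diabetesDF.loc[diabetesDF[col] != 0, col].mean()
    diabetesDF[col] = diabetesDF[col].replace(0, colMean)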

Data Exploration

Let us now explore our data set to get a feel of what it looks like and get some insights about it.

Let’s start by finding correlation of every pair of features (and the outcome variable), and visualize the correlations using a heatmap.

corr = diabetesDF.corr()
print(corr)
sns.heatmap(corr,
            xticklabels=corr.columns,
            yticklabels=corr.columns)
Output of feature (and outcome) correlations
Heatmap of feature (and outcome) correlations

In the above heatmap, brighter colors indicate stronger correlation. As we can see from the table and the heatmap, glucose level, age, BMI and number of pregnancies all have significant correlation with the outcome variable. Also notice the correlations between pairs of features, like age and pregnancies, or insulin and skin thickness.
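
To make this concrete, we can also print each feature’s correlation with the outcome, sorted from strongest to weakest (a small addition, not in the original):

print(corr['Outcome'].sort_values(ascending=False))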

Let’s also look at how many people in the dataset are diabetic and how many are not. Below is a barplot of the counts:

Barplot visualization of number of non-diabetic (0) and diabetic (1) people in the dataset
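
The plotting code for this figure isn’t shown in the original; a minimal sketch using seaborn would be:

sns.countplot(x='Outcome', data=diabetesDF)
plt.title('Non-diabetic (0) vs. diabetic (1)')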

It is also helpful to visualize relations between a single variable and the outcome. Below, we’ll see the relation between age and outcome; you can similarly visualize other features. The figure is a plot of the mean age for each of the output classes. We can see that the mean age of people having diabetes is higher.

Average age of non-diabetic and diabetic people in the dataset
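
As above, the plotting code isn’t included in the original; one way to produce this figure (a sketch) is:

diabetesDF.groupby('Outcome')['Age'].mean().plot(kind='bar')
plt.ylabel('Mean age (years)')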

As a quick aside, this tutorial is taken from the Data Science Course on Commonlounge. The course includes many hands-on assignments and projects, and 80% of the course contents are available for free. If you’re interested in learning Data Science, we definitely recommend checking it out.

Data Preparation (splitting and normalization)

When using machine learning algorithms we should always split our data into a training set and a test set. (If the number of experiments we are running is large, we should divide our data into three parts: a training set, a development set and a test set.) In our case, we will also separate out some data for manual cross-checking.

The dataset consists of records for 768 patients in total. To train our model we will be using 650 records. We will be using 100 records for testing, and the last 18 records to cross-check our model.

dfTrain = diabetesDF[:650]
dfTest = diabetesDF[650:750]
dfCheck = diabetesDF[750:]
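
Note that we are splitting sequentially, which assumes the records are in no particular order. A common alternative (not used in the original) is scikit-learn’s train_test_split, which shuffles before splitting:

from sklearn.model_selection import train_test_split

# 650 records for training; of the remaining 118, 100 for testing and 18 for cross-checking
dfTrainAlt, dfRest = train_test_split(diabetesDF, train_size=650, random_state=0)
dfTestAlt, dfCheckAlt = train_test_split(dfRest, train_size=100, random_state=0)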

Next, we separate the labels and the features (for both the training and test datasets). In addition, we convert them into NumPy arrays, as our machine learning algorithm processes data in NumPy array format.

trainLabel = np.asarray(dfTrain['Outcome'])
trainData = np.asarray(dfTrain.drop('Outcome', axis=1))
testLabel = np.asarray(dfTest['Outcome'])
testData = np.asarray(dfTest.drop('Outcome', axis=1))

As the final step before using machine learning, we will normalize our inputs. Machine Learning models often benefit substantially from input normalization. It also makes it easier for us to understand the importance of each feature later, when we look at the model weights. We’ll normalize the data such that each variable has zero mean and a standard deviation of 1.

means = np.mean(trainData, axis=0)
stds = np.std(trainData, axis=0)
trainData = (trainData - means)/stds
testData = (testData - means)/stds
# np.mean(trainData, axis=0) => check that new means equal 0
# np.std(trainData, axis=0) => check that new stds equal 1

Training and Evaluating the Machine Learning Model

We can now train our classification model. We’ll be using a simple machine learning model called logistic regression. Since the model is readily available in sklearn, the training process is quite easy and we can do it in a few lines of code. First, we create an instance called diabetesCheck and then use the fit function to train the model.

diabetesCheck = LogisticRegression()
diabetesCheck.fit(trainData, trainLabel)

Next, we will use our test data to find out accuracy of the model.

accuracy = diabetesCheck.score(testData, testLabel)
print("accuracy = ", accuracy * 100, "%")

The print statement will print accuracy = 78.0 %.

Interpreting the ML Model

To get a better sense of what is going on inside the logistic regression model, we can visualize how our model uses the different features and which features have greater effect.

coeff = list(diabetesCheck.coef_[0])
# trainData is a NumPy array, so we take the feature names from the DataFrame instead
labels = list(dfTrain.drop('Outcome', axis=1).columns)
features = pd.DataFrame()
features['Features'] = labels
features['importance'] = coeff
features.sort_values(by=['importance'], ascending=True, inplace=True)
features['positive'] = features['importance'] > 0
features.set_index('Features', inplace=True)
features.importance.plot(kind='barh', figsize=(11, 6),
                         color=features.positive.map({True: 'blue', False: 'red'}))
plt.xlabel('Importance')
Visualization of the weights in the Logistic Regression model corresponding to each of the feature variables

From the above figure, we can draw the following conclusions.

  1. Glucose level, BMI, pregnancies and diabetes pedigree function have significant influence on the model, especially glucose level and BMI. It is good to see our machine learning model match what we have been hearing from doctors our entire lives!
  2. Blood pressure has a negative influence on the prediction, i.e. higher blood pressure is correlated with a person not being diabetic. (Also note that blood pressure is a more important feature than age, because its weight has a larger magnitude.)
  3. Although age was more correlated with the outcome variable than BMI (as we saw during data exploration), the model relies more on BMI. This can happen for several reasons, including that the correlation captured by age is also captured by some other variable, whereas the information captured by BMI is not captured by the other variables.

Note that the above interpretations require our input data to be normalized. Without normalization, we can’t claim that importance is proportional to the weights.

Saving the Model

Now we will save our trained model for future use using joblib. We save the means and stds along with it, because any future input must be normalized with the same statistics.

joblib.dump([diabetesCheck, means, stds], 'diabetesModel.pkl')

To check whether we have saved the model properly or not, we will use our test data to check the accuracy of our saved model (we should observe no change in accuracy if we have saved it properly).

diabetesLoadedModel, means, stds = joblib.load('diabetesModel.pkl')
accuracyModel = diabetesLoadedModel.score(testData, testLabel)
print("accuracy = ", accuracyModel * 100, "%")

Making Predictions with the Model

We will now use our unused data to see how predictions can be made. We have our unused data in dfCheck.

print(dfCheck.head())

We will now use the first record to make our prediction.

# prepare sample
sampleData = dfCheck[:1]
sampleDataFeatures = np.asarray(sampleData.drop('Outcome', axis=1))
sampleDataFeatures = (sampleDataFeatures - means) / stds
# predict
predictionProbability = diabetesLoadedModel.predict_proba(sampleDataFeatures)
prediction = diabetesLoadedModel.predict(sampleDataFeatures)
print('Probability:', predictionProbability)
print('prediction:', prediction)

From the above code we get:

Probability: [[ 0.4385153  0.5614847]]
prediction: [1]

The first element of predictionProbability, 0.438, is the probability of the class being 0, and the second element, 0.561, is the probability of the class being 1; the two probabilities sum to 1. Since class 1 is the more probable, we get [1] as our prediction, which means the model predicts that the person has diabetes.
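
As a sanity check (not part of the original tutorial), we can reproduce the class-1 probability by hand from the logistic regression equation p = 1 / (1 + e^-(w·x + b)), using the model’s learned weights:

# w·x + b for our (already normalized) sample
z = np.dot(sampleDataFeatures, diabetesLoadedModel.coef_[0]) + diabetesLoadedModel.intercept_[0]
print(1 / (1 + np.exp(-z)))  # should match predictionProbability[:, 1], i.e. about 0.561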

Next Steps

There are lots of ways to improve the above model. Here are some ideas.

  1. Input feature bucketing should help, i.e. creating new variables for blood pressure in a particular range, glucose levels in a particular range, and so on (see the sketch after this list).
  2. You could also improve the data cleaning, by replacing 0 values with the mean value.
  3. Read a bit about which metrics doctors rely on the most to diagnose a diabetic patient, and create new features accordingly.
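
For idea 1, here is a minimal sketch of bucketing with pandas (the bin edges are illustrative assumptions, not medical thresholds):

# Bucket glucose into three illustrative ranges; pd.cut returns a categorical column
diabetesDF['GlucoseBucket'] = pd.cut(diabetesDF['Glucose'],
                                     bins=[0, 100, 140, 200],
                                     labels=['low', 'normal', 'high'],
                                     include_lowest=True)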

See if you can get to 85–90% accuracy. You can get started with the Jupyter notebook for this tutorial: pima_indians.ipynb.

Co-authored by Keshav Dhandhania and Bishal Lakha.

Originally published as a tutorial on www.commonlounge.com as part of the Data Science Course.
