
Build Your First Machine Learning Model With Python in 7 minutes

Using Pandas, NumPy, and Scikit-learn

Photo by Marcel Eberle on Unsplash

When I first started to learn about data science, machine learning sounded like an extremely difficult subject. I was reading about algorithms with fancy names such as support vector machines, gradient boosted decision trees, logistic regression, and so on.

It did not take me long to realize that all those algorithms are essentially capturing the relationships among variables or the underlying structure within the data.

Some of the relationships are crystal clear. For instance, we all know that, everything else being equal, the price of a car decreases as it gets older (excluding the classics). However, some relationships are not so intuitive and not easy for us to notice.

Think simple and learn the fundamentals first.

In this article, we will create a simple machine learning model to predict customer churn. I will also explain some fundamental points that you need to pay extra attention to, so we will not only practice but also learn some theory.

We will be using Python libraries: Pandas and NumPy for data wrangling, and Scikit-learn for preprocessing and machine learning tasks.

The dataset is available on Kaggle under a Creative Commons license with no copyright restrictions. Let’s start by reading the dataset.

import numpy as np
import pandas as pd

df = pd.read_csv("/content/Churn_Modelling.csv")

print(df.shape)
# (10000, 14)

df.columns
# Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
#        'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
#        'IsActiveMember', 'EstimatedSalary', 'Exited'],
#       dtype='object')

The dataset contains 10,000 rows and 14 columns. We are expected to predict customer churn (i.e. Exited = 1) using the other 13 columns. The Exited column is called the target or dependent variable. The other columns are called features or independent variables.
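Before dropping anything, it can also help to take a quick look at how many customers actually churned. A one-line check with Pandas (output not shown here) is enough for a first glance:

df["Exited"].value_counts()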

The row number, surname, and customer id are irrelevant features, so we can drop them. A customer’s id or surname has no effect on whether they churn.

df.drop(["RowNumber","CustomerId","Surname"], axis=1, inplace=True)
df.head()
df.head() (image by author)

Encoding categorical variables

A typical dataset contains both categorical and numerical variables. Most machine learning algorithms only accept numerical input, so encoding categorical variables is a common preprocessing task.

Our dataset contains two categorical variables which are geography and gender. Let’s check the distinct values in these columns.

df.Geography.unique()
# array(['France', 'Spain', 'Germany'], dtype=object)

df.Gender.unique()
# array(['Female', 'Male'], dtype=object)

One option for converting these values to numbers is to assign an integer to each one. For instance, France is 0, Spain is 1, and Germany is 2. This process is called label encoding.
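As a quick illustration, label encoding can be done with Scikit-learn’s LabelEncoder. This is only a sketch to show the idea; we will not use it in our pipeline:

from sklearn.preprocessing import LabelEncoder

# Illustration only: map each country name to an integer (e.g. France -> 0)
label_encoder = LabelEncoder()
geography_encoded = label_encoder.fit_transform(df.Geography)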

The problem with this approach is that the algorithm might interpret these numbers as an ordinal relationship, treating Germany as somehow having higher precedence than France even though no such ordering exists.

To overcome this problem, we can use an approach called one-hot encoding. Each distinct value is represented as a binary column. If the value in the geography column is France, then only the France column takes the value 1. The others become zero.

The get_dummies function of Pandas can be used for this task.

geography = pd.get_dummies(df.Geography)
gender = pd.get_dummies(df.Gender)
df = pd.concat([df, geography, gender], axis=1)
df[["Geography","Gender","France","Germany","Spain","Female","Male"]].head()
One-hot encoding (image by author)

Exploratory data analysis

Exploratory data analysis is a crucial step before designing and implementing your model. The goal is to have a comprehensive understanding of the data at hand.

We will go through this step quickly with only a few checks. You will find yourself spending more and more time exploring the data as you gain experience.

A typical task in exploratory data analysis is to check the distribution of the numerical variables. You can also detect outliers (i.e. extreme values) with this approach.

Histograms are great for checking the distribution of variables. You can use a data visualization library such as Seaborn or Matplotlib. I prefer to use Pandas since it is quite simple to produce basic plots.

Let’s start with the balance column.

df.Balance.plot(kind="hist", figsize=(10,6))
Histogram of the balance column (image by author)

It seems like a lot of customers have zero balance. It might be better to convert this column to binary, 0 for no balance and 1 for positive balance.

We can accomplish this task using the where function of NumPy as below:

df.Balance = np.where(df.Balance == 0, 0, 1)

df.Balance.value_counts()
# 1    6383
# 0    3617
# Name: Balance, dtype: int64

About one third of the customers have zero balance.

Let’s also draw the histogram of the age column.

df.Age.plot(kind="hist", figsize=(10,6))
Histogram of the age column (image by author)

It is close to a normal distribution. The values above 80 might be considered outliers, but we will not focus on outlier detection for now.

Feel free to explore the distribution of other numerical variables such as tenure, number of products, and estimated salary.
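If you want a quick way to do that, the hist method of Pandas can draw several histograms at once. A minimal sketch:

# Draw histograms of a few more numerical columns in one figure
df[["Tenure", "NumOfProducts", "EstimatedSalary"]].hist(figsize=(12, 8))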


Scaling numerical variables

As you might notice, the value ranges of the numerical variables are very different. The age values are less than 100 whereas the estimated salaries are more than 10 thousand.

If we use these features as they are, the model might give more importance to the column with higher values. Thus, it is better to scale them to the same range.

A common approach is min-max scaling. The highest and lowest values are scaled to 1 and 0, respectively. The ones in between are scaled accordingly.
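In other words, each value x is replaced with (x - min) / (max - min). A manual sketch on the age column, just to show the idea (we will use Scikit-learn for the actual scaling later):

# Manual min-max scaling of the age column (illustration only)
age_scaled = (df.Age - df.Age.min()) / (df.Age.max() - df.Age.min())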

There are more advanced scaling options as well. For instance, when a column contains extreme values, min-max scaling is not the best option. However, we will stick to the simple case for now. The point here is to emphasize the importance of feature scaling.

We will perform feature scaling after introducing another highly important topic.


Train-test split

A machine learning model learns by training. We feed the model with data and it learns the relationships between variables or the structure within the data.

After a model is trained, it should be tested. However, it is not acceptable to test a model with the data it was trained on; that would be similar to cheating. A model might just memorize everything in the data and give you 100% accuracy.

Thus, before training a model, it is common practice to set aside a part of data for testing. The model performance should be evaluated using the test data.

The important thing here is that a model should not have any information about the test data. Hence, the feature scaling we discussed in the previous step should be done after splitting the train and test sets.

We can do this manually or use the train_test_split function of Scikit-learn.

from sklearn.model_selection import train_test_split
X = df.drop(["Exited","Geography","Gender"], axis=1)
y = df["Exited"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

X contains the features and y contains the target variable. By default, 25% of the data is set aside for testing. You can change this ratio with the test_size or train_size parameters, as shown below.
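For instance, a sketch of keeping 20% of the data for testing would look like this (illustration only; we stick with the default 25% in this article):

# Illustration only: keep 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)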

X_train.shape
# (7500, 13)

X_test.shape
# (2500, 13)

We can now do the feature scaling. The MinMaxScaler class of Scikit-learn can be used for this task. We will create a scaler object and fit it on X_train.

We will then use the trained scaler to transform (or scale) X_train and X_test. Thus, the model will not be given any hint or information about the test set.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_transformed = scaler.transform(X_train)
X_test_transformed = scaler.transform(X_test)
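As a side note, Scikit-learn also provides fit_transform, which combines the two calls on the training set:

# Equivalent shorthand for the training set: fit the scaler and transform in one call
X_train_transformed = scaler.fit_transform(X_train)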

Model training

This part actually includes two steps. The first one is to choose an algorithm. There are many machine learning algorithms, each with its own pros and cons.

We cannot cover all of them in one article, so we will pick one and continue. Logistic regression is a commonly used algorithm for binary classification tasks.

This process with Scikit-learn is as follows:

  • Create a logistic regression object which is our model
  • Train the model with the training set
  • Evaluate its performance on both training and test sets based on a particular metric

The following code will perform all these steps:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create a model and train it on the scaled training set
model = LogisticRegression()
model.fit(X_train_transformed, y_train)

# Make predictions
y_train_pred = model.predict(X_train_transformed)
y_test_pred = model.predict(X_test_transformed)

# Evaluate model performance
print(accuracy_score(y_train, y_train_pred))
# 0.78

print(accuracy_score(y_test, y_test_pred))
# 0.80

Our model achieved an accuracy of 78% on the training set and 80% on the test set. It can definitely be improved.

Some ways to improve model performance are:

  • Collecting more data
  • Trying different algorithms
  • Hyperparameter tuning

We will not get into model improvement in this article. I think the most effective method is to collect more data; the potential improvement from the other two methods is more limited.
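That said, if you are curious what hyperparameter tuning looks like in practice, here is a minimal sketch with Scikit-learn’s GridSearchCV. The parameter grid below is only an illustration:

from sklearn.model_selection import GridSearchCV

# Illustration only: search over a small grid of regularization strengths
param_grid = {"C": [0.01, 0.1, 1, 10]}
grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid_search.fit(X_train_transformed, y_train)
print(grid_search.best_params_)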


Conclusion

We have covered a basic workflow for creating a machine learning model. In a real-life case, each step is more detailed and studied in-depth. We have only scratched the surface.

Once you are comfortable with the basic workflow, you can focus on improving each step.

Building a decent machine learning model is an iterative process. You may have to modify your model or features several times after performance evaluation.

How you evaluate a classification model is also very critical. In many cases, we cannot just use a simple accuracy metric. I previously wrote an article about how to best evaluate a classification model.
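As a small pointer in that direction, Scikit-learn’s confusion_matrix and classification_report give a more complete picture than accuracy alone. A minimal sketch on our test predictions:

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall, and F1, plus the confusion matrix (output not shown)
print(confusion_matrix(y_test, y_test_pred))
print(classification_report(y_test, y_test_pred))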

Thank you for reading. Please let me know if you have any feedback.

