
Exploratory Data Analysis and Data Cleaning Practical Workout

A step-by-step guide to preparing data for ML algorithms.

Photo by fabio on Unsplash

Exploratory Data Analysis (EDA) is a vital part of data science: it helps us discover relationships between the entities in the data we are working with. EDA is especially useful when we encounter a dataset for the first time, and with large datasets, where it is impractical to spot relationships by inspection alone. As we dig deeper into EDA, we find patterns that guide further data processing and modeling.

In this article, we will try to uncover the factors that affect an Indian engineering graduate’s salary and then keep only the categories that have a strong impact on it. The cleaned data will be ready for any regression algorithm used to predict salary.

Dataset

For this EDA, we will be using the ‘Engineering Graduate Salary Prediction’ dataset available on Kaggle. The dataset page contains detailed descriptions of each column if you wish to understand them.

Engineering Graduate Salary Prediction

Import the libraries

import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt

Load the data

df = pd.read_csv('../input/engineering-graduate-salary-prediction/Engineering_graduate_salary.csv')
df.head()

Now let’s analyze the data and get a basic overview of what we’re dealing with.

df.columns

If we observe carefully, some columns obviously do not affect salary, such as ‘ID’, ‘DOB’, and the 10th and 12th boards. We will drop these redundant columns.

df = df.drop(['ID', 'DOB', 'CollegeID', '12graduation', 'GraduationYear',
              '10board', '12board', 'CollegeState', 'CollegeCityID',
              'CollegeCityTier'], axis=1)
df.head()

Now, we should have 24 columns left.

Let’s check whether there are duplicates in our data.

df.duplicated().sum()

On running this code, we see that there are no duplicates to take care of.
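If duplicates had been present, they could be dropped in one line. A minimal sketch on a hypothetical toy frame (not the Kaggle data):

```python
import pandas as pd

# Toy frame whose third row repeats the first
toy = pd.DataFrame({'Gender': ['m', 'f', 'm'],
                    'Salary': [300000, 420000, 300000]})

n_dups = toy.duplicated().sum()   # counts rows identical to an earlier row
deduped = toy.drop_duplicates()   # keeps only the first occurrence of each row
```

`duplicated()` flags a row only if an identical row appeared earlier, so `drop_duplicates()` always keeps the first copy.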

Now, let us check the Specialization column.

df.Specialization.value_counts()

Output:

electronics and communication engineering      670
computer science & engineering                 557
information technology                         506
computer engineering                           415
computer application                           201
mechanical engineering                         155
electronics and electrical engineering         148
electronics & telecommunications                89
electrical engineering                          63
electronics & instrumentation eng               24
information science engineering                 18
electronics and instrumentation engineering     18
instrumentation and control engineering         18
civil engineering                               15
electronics engineering                         13
biotechnology                                   12
other                                           10
industrial & production engineering              8
chemical engineering                             7
mechanical and automation                        5
applied electronics and instrumentation          5
telecommunication engineering                    4
computer science and technology                  4
automobile/automotive engineering                4
aeronautical engineering                         3
electronics and computer engineering             3
mechatronics                                     3
instrumentation engineering                      3
industrial engineering                           2
biomedical engineering                           2
metallurgical engineering                        2
information & communication technology           1
computer and communication engineering           1
information science                              1
mechanical & production engineering              1
ceramic engineering                              1
computer networking                              1
control and instrumentation engineering          1
electrical and power engineering                 1
embedded systems technology                      1
electronics                                      1
industrial & management engineering              1
Name: Specialization, dtype: int64

Looking at this output, there are many specializations with only a handful of samples, which could add noise if we train a model on this data. Let’s group them together, since on their own they do not add much information.

specialization = df.Specialization.value_counts()  # Counts per specialization
lessthan10 = specialization[specialization <= 10]  # Rare specializations

def correctSpecialization(val):
    if val in lessthan10:  # 'in' checks the Series index, i.e. the names
        return 'Others'
    return val

df['Specialization'] = df['Specialization'].apply(correctSpecialization)
df['Specialization'].value_counts()

Cool. This data seems like something we can work with.
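The same grouping can also be done without `apply`, using a vectorized mask. A sketch on a small hypothetical Series (the labels are made up, not from the dataset):

```python
import pandas as pd

s = pd.Series(['cse', 'cse', 'ece', 'it', 'it', 'ceramic', 'mining'])

counts = s.value_counts()
rare = counts[counts <= 1].index            # labels appearing at most once
grouped = s.where(~s.isin(rare), 'Others')  # replace rare labels in one shot
```

`Series.where` keeps values where the condition is True and substitutes ‘Others’ elsewhere, which is typically faster than `apply` on large columns.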

Let’s check the statistics now.

df.describe() #Understand the statistics of given data.

Looking at the output, collegeGPA has a max of 99.93 and a min of 6.63. That is a huge range, and 6.63 is probably a wrong value. We don’t know that yet, but something seems off. Let’s plot the column and see whether there are outliers we can remove.

plt.scatter(df.index,df['collegeGPA'])

Graphs and plots can be relied on heavily when it comes to EDA. Simple steps like this make it easy to spot outliers at first glance. The plot clearly shows that most values lie in the 50–100 range, so we can safely drop the handful of values below 40, which would otherwise introduce unnecessary bias.

df = df[(df['collegeGPA'] > 40)]

You can plot the same graph again and observe that the outliers are gone.
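Instead of eyeballing a cutoff, the lower fence can also be derived from the data with Tukey’s 1.5 × IQR rule. A sketch on a hypothetical handful of GPA values:

```python
import pandas as pd

gpa = pd.Series([6.63, 55.0, 62.5, 70.0, 74.2, 81.0, 99.93])

q1, q3 = gpa.quantile(0.25), gpa.quantile(0.75)
lower = q1 - 1.5 * (q3 - q1)   # Tukey's lower fence
kept = gpa[gpa > lower]        # drop points below the fence
```

On the real column this gives a principled threshold rather than a visually chosen one; both approaches remove the same obvious stragglers here.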

I’ve done similar steps with the other columns; to save time, I’ve only mentioned the ones that need further analysis and cleaning. The remaining columns are quite uniform, but you can analyze them too for practice and handle them with any of these methods.

plt.figure(figsize = (15,15))
plt.subplot(3,3,1)
plt.scatter(df.index, df.ElectronicsAndSemicon)
plt.title('ElectronicsAndSemicon')
plt.subplot(3,3,2)
plt.scatter(df.index, df.Domain)
plt.title('Domain')
plt.subplot(3,3,3)
plt.scatter(df.index, df.ComputerProgramming)
plt.title('ComputerProgramming')
plt.show()

In all three plots we can clearly see outliers. Let’s take care of them.

df = df.replace(-1, np.nan)               # -1 appears to mark missing scores
cols_with_nan = df.columns[df.isna().any()]
for col in cols_with_nan:
    df[col] = df[col].fillna(df[col].mean())

I’ve first replaced the -1 values, which appear to be placeholders for missing scores, with NaN, and then filled each NaN with the mean of its column.
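If the score distributions are skewed, the median is a more robust fill value than the mean. A sketch on a hypothetical toy series using the same sentinel convention:

```python
import numpy as np
import pandas as pd

s = pd.Series([-1, 70, 80, -1, 90])   # -1 marks a missing score in this toy series

s = s.replace(-1, np.nan)             # sentinel -> missing
filled = s.fillna(s.median())         # median is less affected by extreme values
```

Swapping `.mean()` for `.median()` in the loop above is a one-word change if you prefer this behavior.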

Comparing this with the original data, our new data is clearly much cleaner and easier to work with. Its values are uniform and each affects the salary in some way.

Let’s check the number of males and females.

sns.countplot(x='Gender', data=df, palette='inferno')

Let’s see if there is any relation between 10th and 12th percentage.

sns.scatterplot(x='10percentage', y='12percentage', data=df)

Clearly, there is a roughly linear relationship between the 10th and 12th percentages.
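The visual impression of linearity can be quantified with the Pearson correlation coefficient. A sketch on a few hypothetical percentage pairs:

```python
import pandas as pd

# Hypothetical 10th/12th percentage pairs, not the real dataset
toy = pd.DataFrame({'p10': [60.0, 70.0, 80.0, 90.0],
                    'p12': [58.0, 69.0, 81.0, 88.0]})

r = toy['p10'].corr(toy['p12'])   # Pearson correlation, close to 1 when linear
```

On the real columns, `df['10percentage'].corr(df['12percentage'])` gives the same one-number summary of the scatterplot.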

This data can be analyzed further using all the categories and comparing with the salary. For example:

plt.figure(figsize = (10,5))
sns.scatterplot(x='collegeGPA', y='Salary', hue='Degree', data=df, palette='inferno')

Here, we can observe that B.Tech/B.E. students mostly land the higher-paying jobs. The other columns can be compared with salary in the same way.

Now let us see whether there is any relation between ‘openess_to_experience’ (the column name as spelled in the dataset) and salary. It can be observed that graduates with higher scores tend to get higher salaries.

plt.figure(figsize = (10,5))
sns.scatterplot(x='openess_to_experience', y='Salary', data=df)

Data Preprocessing

This last part prepares the final data to be fed into an ML model.

Let’s check the datatypes to see if any processing is needed.

df.info()

Gender, Degree, and Specialization are of type ‘object’, which is not suitable as ML model input, so they need to be encoded.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df.Gender = le.fit_transform(df.Gender)
df.Degree = le.fit_transform(df.Degree)
df.Specialization = le.fit_transform(df.Specialization)
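Note that LabelEncoder assigns arbitrary integers, which implies an order between categories that doesn’t really exist. For nominal columns like these, one-hot encoding with `pd.get_dummies` is a common alternative. A sketch on a hypothetical toy frame:

```python
import pandas as pd

toy = pd.DataFrame({'Degree': ['B.Tech/B.E.', 'MCA', 'B.Tech/B.E.']})

# One column per degree label; avoids implying MCA > B.Tech/B.E. or vice versa
encoded = pd.get_dummies(toy, columns=['Degree'])
```

Tree-based models usually cope fine with label-encoded integers, but linear models generally behave better with one-hot columns.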

Now, let’s check the final data.

df.head()

Yay! The data is finally ready. You can now try various regression algorithms on it for salary prediction. This whole process is definitely worth covering: skipping it leads to poor model predictions and forces you to redo all the work from scratch.

That’s it from my side! Suggestions are always welcome. 🙂
