
Exploratory Data Analysis (EDA) is a vital part of data science: it helps uncover relationships between the entities in the data we are working with. EDA is especially useful when dealing with data for the first time, and with large datasets, where it is practically impossible to spot relationships by inspection alone. As we dive deeper into EDA, we find patterns that prove fruitful for further data processing and modeling.
In this article, we will try to uncover the factors that affect an Indian engineering graduate's salary, and then keep only the relevant categories that have a strong impact on it. The cleaned data will be ready for any regression algorithm used to predict salary.
Dataset
For this EDA, we will use the 'Engineering Graduate Salary Prediction' dataset available on Kaggle. The dataset page contains detailed descriptions of the columns if you wish to understand them better.
Import the libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Load the data
df = pd.read_csv('../input/engineering-graduate-salary-prediction/Engineering_graduate_salary.csv')
df.head()

Now let's analyze the data and get a basic overview of what we're dealing with.
df.columns

If we observe carefully, there are some columns that obviously do not affect salary, such as 'ID', 'DOB', the 10th and 12th boards, and so on. We will get rid of these redundant columns.
df = df.drop(['ID', 'DOB', 'CollegeID', '12graduation', 'GraduationYear',
              '10board', '12board', 'CollegeState', 'CollegeCityID',
              'CollegeCityTier'], axis=1)
df.head()
Now, we should have 24 columns left.
Let’s check whether there are duplicates in our data.
df.duplicated().sum()
On running this code, we see that there are no duplicates to take care of.
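Had there been duplicates, dropping them is a one-liner with `drop_duplicates`. A minimal sketch on a toy frame (the values here are hypothetical, not from the dataset):

```python
import pandas as pd

# Toy frame with one exact duplicate row (hypothetical values, not from the dataset)
toy = pd.DataFrame({'Gender': ['m', 'f', 'm'],
                    'Salary': [300000, 420000, 300000]})
dup_count = toy.duplicated().sum()   # counts fully duplicated rows
toy = toy.drop_duplicates()          # keeps the first occurrence of each row
```

By default both methods consider all columns; pass `subset=` to check only some of them.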
Now, let us check the Specialization column.
df.Specialization.value_counts()
Output:
electronics and communication engineering 670
computer science & engineering 557
information technology 506
computer engineering 415
computer application 201
mechanical engineering 155
electronics and electrical engineering 148
electronics & telecommunications 89
electrical engineering 63
electronics & instrumentation eng 24
information science engineering 18
electronics and instrumentation engineering 18
instrumentation and control engineering 18
civil engineering 15
electronics engineering 13
biotechnology 12
other 10
industrial & production engineering 8
chemical engineering 7
mechanical and automation 5
applied electronics and instrumentation 5
telecommunication engineering 4
computer science and technology 4
automobile/automotive engineering 4
aeronautical engineering 3
electronics and computer engineering 3
mechatronics 3
instrumentation engineering 3
industrial engineering 2
biomedical engineering 2
metallurgical engineering 2
information & communication technology 1
computer and communication engineering 1
information science 1
mechanical & production engineering 1
ceramic engineering 1
computer networking 1
control and instrumentation engineering 1
electrical and power engineering 1
embedded systems technology 1
electronics 1
industrial & management engineering 1
Name: Specialization, dtype: int64
Looking at this output, we can see many categories with only a handful of rows, which can unnecessarily skew the model if we train on this data. Let's group these rare specializations together, since individually they do not add much information.
specialization = df.Specialization.value_counts()  # Store the counts in specialization
lessthan10 = specialization[specialization <= 10]
lessthan10
def correctSpecialization(val):
    if val in lessthan10:
        return 'Others'
    else:
        return val
df['Specialization'] = df['Specialization'].apply(correctSpecialization)
df['Specialization'].value_counts()

Cool. This data seems like something we can work with.
Let's check the statistics now.
df.describe() #Understand the statistics of given data.

Looking at the statistics, collegeGPA has a maximum of 99.93 and a minimum of 6.63, which is a huge range; 6.63 is probably an erroneous value. We don't know that yet, but something definitely seems wrong. Let's analyze this and see if we can get rid of any outliers.
plt.scatter(df.index,df['collegeGPA'])

Graphs and plots are invaluable in EDA: these simple steps make it easy to spot outliers at first glance. This plot clearly shows that most values lie in the 50–100 range, so we can safely drop the low outliers, which would otherwise introduce unnecessary bias.
df = df[(df['collegeGPA'] > 40)]
You can plot the same graph again and observe that the data is uniform now.
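The boolean-mask filter used above can be sketched on toy data (the GPA values here are hypothetical, not from the dataset):

```python
import pandas as pd

# Hypothetical GPA values, including two implausibly low outliers
toy = pd.DataFrame({'collegeGPA': [6.63, 18.0, 65.5, 72.0, 80.1]})
toy = toy[toy['collegeGPA'] > 40]    # boolean mask keeps only plausible rows
```

The mask `toy['collegeGPA'] > 40` is a boolean Series; indexing the frame with it keeps only the rows where it is True.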
I've done similar steps for the other columns; to save time, I've only mentioned the columns that need further analysis and cleaning. The remaining columns are fairly uniform, but you can analyze them too for practice and handle them with any method you prefer.
plt.figure(figsize = (15,15))
plt.subplot(3,3,1)
plt.scatter(df.index, df.ElectronicsAndSemicon)
plt.title('ElectronicsAndSemicon')
plt.subplot(3,3,2)
plt.scatter(df.index, df.Domain)
plt.title('Domain')
plt.subplot(3,3,3)
plt.scatter(df.index, df.ComputerProgramming)
plt.title('ComputerProgramming')
plt.show()

All three plots clearly show the presence of outliers. Let's take care of that.
df = df.replace(-1, np.nan)
cols_with_nan = [col for col in df.columns if df[col].isna().any()]
for col in cols_with_nan:
    df[col] = df[col].fillna(df[col].mean())
I've replaced the -1 values with NaN first, then filled the NaNs with the mean of each column.
Now, if we compare this with the original data, our new data is much cleaner and easier to work with. It contains uniform values that affect the salary in some way.
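The replace-then-impute pattern above can be verified on a toy column (the scores here are hypothetical, not from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical scores where -1 is a placeholder (e.g. the test was not attempted)
toy = pd.DataFrame({'ComputerProgramming': [455.0, -1.0, 545.0, -1.0, 600.0]})
toy = toy.replace(-1, np.nan)                 # mark placeholders as missing
col_mean = toy['ComputerProgramming'].mean()  # mean ignores NaN: (455 + 545 + 600) / 3
toy['ComputerProgramming'] = toy['ComputerProgramming'].fillna(col_mean)
```

Note that `mean()` skips NaN by default, so the placeholders do not drag the imputed value down, which is exactly why we convert -1 to NaN first.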
Let’s check the number of males and females.

sns.countplot(x='Gender', data=df, palette='inferno')

Let’s see if there is any relation between 10th and 12th percentage.
sns.scatterplot(x='10percentage', y='12percentage', data=df)
Clearly, a linear relationship can be seen between the 10th and 12th percentages.
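We can also quantify this linearity with the Pearson correlation coefficient; on the real frame the call would be `df['10percentage'].corr(df['12percentage'])`. A sketch on hypothetical marks:

```python
import pandas as pd

# Hypothetical 10th/12th marks, not from the dataset
toy = pd.DataFrame({'p10': [60.0, 70.0, 80.0, 90.0],
                    'p12': [58.0, 69.0, 81.0, 88.0]})
r = toy['p10'].corr(toy['p12'])   # Pearson correlation; near 1 for a near-linear relation
```

A value close to 1 confirms numerically what the scatter plot shows visually.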
This data can be analyzed further using all the categories and comparing with the salary. For example:
plt.figure(figsize = (10,5))
sns.scatterplot(x='collegeGPA', y='Salary', hue='Degree', data=df, palette='inferno')

Here, we can observe that B.Tech/B.E. students mostly get the jobs with higher salaries. Other columns can be compared with salary in the same way.
Now let us see if there is any relation between 'openess_to_experience' and salary. It can be observed that graduates with higher scores tend to get higher salaries.
plt.figure(figsize = (10,5))
sns.scatterplot(x='openess_to_experience', y='Salary', data=df)

Data Preprocessing
This last part is done so the final data is ready to feed to the ML model.
Let’s check the datatypes to see if any processing is needed.
df.info()

Gender, Degree, and Specialization need to be processed, as they are of type 'object', which is not suitable as input to an ML model.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df.Gender = le.fit_transform(df.Gender)
df.Degree = le.fit_transform(df.Degree)
df.Specialization = le.fit_transform(df.Specialization)
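As a side note, LabelEncoder assigns arbitrary integer codes, which can imply an ordering between categories that doesn't exist; one-hot encoding with `pd.get_dummies` is a common alternative. A minimal sketch with hypothetical Degree values:

```python
import pandas as pd

# Hypothetical Degree values, not the full set from the dataset
toy = pd.DataFrame({'Degree': ['B.Tech/B.E.', 'MCA', 'B.Tech/B.E.']})
encoded = pd.get_dummies(toy, columns=['Degree'])   # one binary column per category
```

Tree-based models usually cope fine with label encoding, while linear models tend to benefit from one-hot columns.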
Now, let’s check the final data.
df.head()

Yay! The data is finally ready. You can now try various algorithms on it for salary prediction! This entire process is definitely worth covering: skipping it leads to bad model predictions and forces you to redo all the work from scratch.
That’s it from my side! Suggestions are always welcome. 🙂