Exploratory Data Analysis — Unravelling a story with data

In this article, we will explore how to unearth hidden patterns in data.

Pritha Saha
Towards Data Science

--

I found the Housing Prices competition on Kaggle a great place to hone my analytical skills and start thinking like a Data Scientist. For some context, do read the Ames Housing Data documentation. The 80 explanatory variables are typically the kind of factors home owners would cite as influencing the price. However, some variables will influence the Sale Price more than others, and some will not influence it at all.

We will begin by studying the overall data, using the describe and info methods. As you will see, the dataset has 80 columns. For context about the variables, do refer to the data description file. The target variable is Sale Price, which is right-skewed, as shown below:
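As a minimal sketch of that first look (using a tiny stand-in DataFrame, since the real train.csv from Kaggle has far more rows and 80 columns), it might go something like this:

```python
import pandas as pd

# Tiny stand-in for train = pd.read_csv("train.csv"); column names follow the Ames data,
# the values are made up for illustration.
train = pd.DataFrame({
    "LotArea":   [8450, 9600, 11250, 9550],
    "SalePrice": [208500, 181500, 223500, 755000],
})

train.info()                      # column dtypes and non-null counts
print(train.describe())           # summary statistics per numeric column
print(train["SalePrice"].skew())  # a positive value confirms the right skew
```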

To normalise the target we take the logarithm of the Sale Price. The Kaggle evaluation metric is also the RMSE between the logarithms of the observed and predicted prices, so modelling on the log scale matches the metric. After taking the log, we see the graph below:
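A minimal sketch of the transformation (np.log1p is one common choice; it computes log(1 + x), which also guards against zero values — the toy prices below stand in for the real SalePrice column):

```python
import numpy as np
import pandas as pd

# Toy right-skewed prices standing in for train["SalePrice"].
prices = pd.Series([208500, 181500, 223500, 140000, 755000])

log_prices = np.log1p(prices)  # log(1 + x); safe even if a value were zero
print(prices.skew())           # strongly positive
print(log_prices.skew())       # noticeably reduced after the transform
```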

Coming to the variables, we see that there are both numerical and categorical features. It would be a good idea to keep a note of the different features.

import numpy as np

# Numeric features
numeric_features = train.select_dtypes(include=[np.number])
print(numeric_features.dtypes)

# Categorical features
categoricals = train.select_dtypes(exclude=[np.number])
print(categoricals.dtypes)
categoricals.describe()
There are 43 categorical variables; the rest are numeric.

We will begin studying the variables shortly, but before that we should have a look at the null values in the train dataset.

import seaborn as sns
sns.heatmap(train.isnull(), cbar=False)
PoolQC and Alley have the most missing values

Note that PoolQC data could be missing simply because most houses have no pool. This can be double-checked against the PoolArea column: almost all its values are zero. Hence we can safely assume that these two columns will probably not be very important for predicting the Sale Price.
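That double-check is a one-liner. A sketch with a toy PoolArea column (on the real data the zero share is the overwhelming majority of rows):

```python
import pandas as pd

# Toy stand-in for the PoolArea column of the real train DataFrame.
train = pd.DataFrame({"PoolArea": [0, 0, 0, 512, 0, 0, 0, 0]})

zero_share = (train["PoolArea"] == 0).mean()  # fraction of houses with no pool
print(zero_share)
```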

However, if you look at the variable Alley, missing data indicates that there is no alley access. We will see shortly how Alley influences the Sale Price.

The correlation map below summarises the relations between all the numeric variables:

import matplotlib.pyplot as plt
import seaborn as sns

corr = numeric_features.corr()
plt.subplots(figsize=(20, 9))
sns.heatmap(corr, square=True)

Notice how our assumption about PoolArea is proved right: its correlation with the Sale Price is indeed approximately zero!

MSSubClass is actually a categorical variable, but since its categories are numbers, it gets wrongly labelled as a numeric one. It is very easy to miss this fact if you do not go through the data description text!
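One hedged fix is to recast the column as strings before any dtype-based split, so that select_dtypes treats it as categorical — a sketch:

```python
import pandas as pd

# Toy stand-in: MSSubClass holds numeric codes that are really category labels.
train = pd.DataFrame({"MSSubClass": [20, 60, 50, 20]})

# Recast the codes as strings so dtype-based selection no longer sees them as numeric.
train["MSSubClass"] = train["MSSubClass"].astype(str)
print(train["MSSubClass"].dtype)  # no longer a numeric dtype
```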

Also, some variables seem to display multicollinearity, i.e., they are correlated among themselves (denoted by the white squares).

- TotRmsAbvGrd & GrLivArea: generally a big house will have a larger living area and more rooms, so the data seems consistent.
- GarageArea & GarageCars: this sounds logical; more garage area translates to more car capacity, and vice versa.
- 1stFlrSF & TotalBsmtSF: ideally the basement will be the same size as the first floor, and the reverse is also true in most cases.
- GarageYrBlt & YearBuilt: building a garage depends on the house being built, but not vice versa! Hence this warrants some investigation.

On plotting a scatter plot of YearBuilt vs GarageYrBlt, note that in some cases YearBuilt is greater than GarageYrBlt, i.e., the garage was supposedly built before the house. It is a clear case of wrong data entry.
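Counting such rows is a quick filter. A sketch on a toy frame (on the real data, compare train["GarageYrBlt"] against train["YearBuilt"] the same way):

```python
import pandas as pd

# Toy stand-in; the years are made up for illustration.
houses = pd.DataFrame({
    "YearBuilt":   [2003, 1976, 2001, 1915],
    "GarageYrBlt": [2003, 1976, 2001, 1910],
})

# A garage recorded as built before the house itself is a suspect entry.
suspect = houses[houses["GarageYrBlt"] < houses["YearBuilt"]]
print(len(suspect))  # 1
```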

We will tackle the time variables later while feature engineering.

Moving on to the other variables, I actually plotted each and every one of them to check its distribution and how it influences the Sale Price. It is advisable to check which variables influence the Sale Price the most. The top ones are (ignoring Sale Price itself):

print(corr['SalePrice'].sort_values(ascending=False)[:5], '\n')

SalePrice      1.000000
OverallQual    0.790982
GrLivArea      0.708624
GarageCars     0.640409
GarageArea     0.623431

On plotting the GrLivArea vs logarithm of Sale Price, this is what I found:

Overall, it shows an increasing trend, but there are clearly some outliers, which must be removed before proceeding with training.
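A sketch of one way to drop them, on a toy frame. The threshold used here (GrLivArea above 4000 with a comparatively low price) is a common heuristic for these Ames outliers, not an official rule:

```python
import pandas as pd

# Toy stand-in: two very large houses with unusually low prices play the outliers.
train = pd.DataFrame({
    "GrLivArea": [1500, 1700, 2100, 4676, 5642],
    "SalePrice": [180000, 200000, 250000, 184750, 160000],
})

mask = (train["GrLivArea"] > 4000) & (train["SalePrice"] < 300000)
clean = train[~mask]  # keep everything except the flagged rows
print(len(clean))     # 3
```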

It would be impossible to find the outliers without studying the data well. Hence, although tedious, I would recommend exploring each and every variable before preparing the data for machine learning.

We had parked the variable Alley for further investigation. Let’s check whether it influences Sale Price. On plotting a box plot of Alley vs Sale Price, this is what we find:

It definitely looks like paved alleys command higher prices than gravel ones. Hence houses with no alley access would probably be priced even lower than gravelled ones. This is definitely an important column that you would want to retain.
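Since a missing Alley literally means "no alley access", it helps to encode that explicitly before comparing the groups. A sketch with made-up prices:

```python
import pandas as pd

# Toy stand-in; in the real data NaN in Alley means "no alley access".
train = pd.DataFrame({
    "Alley":     ["Pave", "Grvl", None, "Pave", None, "Grvl"],
    "SalePrice": [220000, 140000, 120000, 240000, 110000, 150000],
})

train["Alley"] = train["Alley"].fillna("None")  # make the missing category explicit
print(train.groupby("Alley")["SalePrice"].median())
```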

Again, while exploring categorical variables, you will find many with data dominated by one category. For instance, on checking the dominance of ExterCond types in the combined data (train + test) with a countplot, this is what I found:

Most houses have “TA”, which means typical condition, as per the data description. Now on plotting a box plot of ExterCond with Sale Price, we get the below graph:

It is clear from this plot that "TA", "Gd" and "Ex" houses end up with similar Sale Prices, which means ExterCond does not really influence our target variable. This sounds logical: how often do we hear of the present condition of the exterior material influencing the Sale Price?
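Both checks — the dominance of one category and the similar medians — can be sketched without a plot. The toy values below mirror the pattern described above and are made up for illustration:

```python
import pandas as pd

# Toy stand-in: "TA" dominates, and the medians for "TA", "Gd" and "Ex" are close.
train = pd.DataFrame({
    "ExterCond": ["TA", "TA", "TA", "TA", "Gd", "Ex", "Fa"],
    "SalePrice": [180000, 175000, 182000, 178000, 181000, 183000, 90000],
})

print(train["ExterCond"].value_counts(normalize=True))  # category shares
print(train.groupby("ExterCond")["SalePrice"].median()) # per-category medians
```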

But we have definitely experienced locality influencing the house price, which is reaffirmed by the graph of Neighborhood vs Sale Price below:
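A quick numeric companion to that box plot is a sorted table of per-neighbourhood medians. A sketch using three of the actual Ames neighbourhood codes with made-up prices:

```python
import pandas as pd

# Toy stand-in; the neighbourhood codes are real Ames labels, the prices are invented.
train = pd.DataFrame({
    "Neighborhood": ["NridgHt", "NridgHt", "OldTown", "OldTown", "MeadowV"],
    "SalePrice":    [315000, 300000, 120000, 125000, 88000],
})

medians = train.groupby("Neighborhood")["SalePrice"].median().sort_values(ascending=False)
print(medians)  # priciest neighbourhood first
```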

In summary, the basic behavioural quality one needs when approaching EDA is curiosity. Discovering null values, outliers or irrelevant features are just outcomes of that quality. This article is aimed at kick-starting the thought process that will make you a successful data scientist.

Stay tuned for my next blog on data wrangling/feature engineering and eventually machine learning.
