
Exploratory Data Analysis: Unraveling the Story Within Your Dataset

The secret art of exploring data: understanding, cleaning, and unveiling the hidden insights within your dataset

As a data enthusiast, exploring a new dataset is an exciting endeavour. It allows us to gain a deeper understanding of the data and lays the foundation for successful analysis. Getting a good feeling for a new dataset is not always easy and takes time. However, a good and thorough exploratory data analysis (EDA) helps you understand your dataset, get a feeling for how things are connected, and work out what needs to be done to process it properly.

In fact, you will probably spend 80% of your time on data preparation and exploration and only 20% on actual data modelling. For other types of analysis, exploration might take an even larger proportion of your time.

**The What.**

Exploratory Data Analysis, simply put, refers to the art of exploring data. It is the process of investigating data from different angles to enhance your understanding, explore patterns, establish relationships between variables and, if required, enhance the data itself.

It’s like going on a ‘blind’ date with your dataset, sitting across the table from this enigmatic collection of numbers and texts, yearning to understand it before embarking on a serious relationship. Just like a blind date, EDA allows you to uncover the hidden facets of your dataset. You observe patterns, detect outliers, and explore the nuances before making any significant commitments. It’s all about getting acquainted and building trust with the numbers, ensuring you’re on solid ground before drawing conclusions.

We’ve all been there; knowingly or unknowingly, delving into statistical tools or sifting through reports – we’ve all explored some kind of data at some point!


**The Why.**

As analysts and data scientists, we are supposed to understand the data best. We must become the experts when it comes to understanding and interpreting it. Whether it is machine learning models, experimentation frameworks or simple analytics – the outcome is only as good as the data on which it is based.

Remember, Garbage In, Garbage Out!

EDA empowers data analysts and scientists to explore, understand, and derive meaningful insights from the data. Just when you think you’ve got it all figured out, the dataset throws you a curveball. You find missing values, inconsistencies, and messy data. It’s like discovering that your date has a secret pet alligator or a collection of unicorn figurines. Exploratory Data Analysis gives you the tools to clean up the mess and make sense of it all.

— It’s like giving your dataset a makeover, transforming it from a dishevelled mess to a dazzling companion.

In the end, Exploratory Data Analysis is all about getting to know your data on a deeper level, having fun along the way, and building a strong foundation for further analysis. So grab your detective hat and embark on this exciting adventure with your dataset. Who knows, you might just find a hidden treasure or even true love!


**The How.**

Exploratory Data Analysis, as the name suggests, is analysis to explore the data. It consists of a number of components; not all of them are essential all the time, nor do they all carry equal importance. Below, I list a few components based on my experience. Please note that it is by no means an exhaustive list, but a guiding framework.

1. Understand the lay of the land.

You don’t know what you don’t know – but you can explore! The first and foremost thing to do is to get a feel of the data – look at the data entries, eyeball the column values, and see how many rows and columns you have.

  • a retailer dataset might tell you – Mr X visited store #2000 on the 1st of Aug 2023 and purchased a can of Coke and one pack of Walker Crisps
  • a social media dataset might tell you – Mrs Y logged onto the social networking website at 09:00 am on the 3rd of June and browsed A, B, and C sections, searched for her friend Mr A and then logged out after 20 mins.

It’s beneficial to get the business context of the data you have and to know the source and mechanism of data collection (e.g. survey data vs. digitally collected data).
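If you work in Python with pandas, a minimal first-look sketch might look like the one below (the file name `retail_data.csv` is just a hypothetical placeholder for your own data):

```python
import pandas as pd

# Load the dataset (file name is a hypothetical placeholder)
df = pd.read_csv("retail_data.csv")

# Eyeball a handful of entries
print(df.head(10))

# How many rows and columns do you have?
print(df.shape)

# Column names, data types and non-null counts in one view
df.info()
```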

2. Double-click into variables

Variables are the voice of a dataset; they are continuously talking to you. You just need to ask the right questions and listen carefully.

→ Questions to ask: What do the variables mean/represent?

  • Are the variables continuous or categorical? Is there any inherent order?
  • What are the possible values they can take?

→ Action:

  • For continuous variables – check distributions using histograms and box-plots, and carefully study the mean, median, standard deviation, etc.
  • For categorical / ordinal variables – find out their unique values, and build a frequency table to check the most and least occurring ones.
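A rough sketch of these checks with pandas and matplotlib (the column names `basket_value` and `store_region` are hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("retail_data.csv")  # hypothetical placeholder file

# Continuous variable: distribution and summary statistics
print(df["basket_value"].describe())           # mean, median (50%), std, min/max
df["basket_value"].plot(kind="hist", bins=30)  # histogram
plt.show()
df["basket_value"].plot(kind="box")            # box-plot
plt.show()

# Categorical / ordinal variable: unique values and frequency table
print(df["store_region"].nunique())
print(df["store_region"].value_counts())       # most / least occurring labels
```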

You may or may not understand all variables, labels and values – but try to get as much information as you can.

3. Look for patterns/relationships in your data

Through EDA, you can discover patterns, trends, and relationships within the data.

→ Questions to ask: Do you have any prior assumptions/hypotheses about relationships between variables?

  • Is there any business reason for some variables to be related to one another?
  • Do variables follow any particular distributions?

Data Visualisation techniques, summaries, and correlation analysis help reveal hidden patterns that may not be apparent at first glance. Understanding these patterns can provide valuable insights for decision-making or hypothesis generation.

→ Action: Think visual bivariate analysis.

  • For continuous variables – use scatter plots and create correlation matrices / heat maps.
  • For a mixture of continuous and ordinal/categorical variables – consider plotting bar or pie charts, and create good old contingency tables to visualise the co-occurrence.
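A sketch of such bivariate checks using pandas, seaborn and matplotlib (again, all column names are hypothetical):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("retail_data.csv")  # hypothetical placeholder file

# Continuous vs. continuous: scatter plot and correlation heat map
df.plot(kind="scatter", x="visits_per_month", y="basket_value")
plt.show()
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True, cmap="coolwarm")
plt.show()

# Categorical vs. categorical: good old contingency table
print(pd.crosstab(df["store_region"], df["loyalty_tier"]))

# Categorical vs. continuous: bar chart of group means
df.groupby("store_region")["basket_value"].mean().plot(kind="bar")
plt.show()
```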

EDA allows you to validate statistical assumptions, such as normality, linearity, or independence, for analysis or data modelling.

4. Detecting anomalies.

Here’s your chance to become Sherlock Holmes on your data and look for anything out of the ordinary! Ask yourself:

– Are there any duplicate entries in the dataset?

Duplicates are entries that represent the same sample point multiple times. Duplicates are not useful in most cases as they do not give any additional information. They might be the result of an error and can mess up your mean, median and other statistics. → Check with your stakeholders and remove such errors from your data.
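A quick sketch for spotting and dropping exact duplicates in pandas (assuming a DataFrame loaded as in the earlier sketches):

```python
import pandas as pd

df = pd.read_csv("retail_data.csv")  # hypothetical placeholder file

# How many rows are exact duplicates of an earlier row?
print(df.duplicated().sum())

# Inspect all copies before removing anything
print(df[df.duplicated(keep=False)].head())

# Drop them once stakeholders confirm they are errors
df = df.drop_duplicates()
```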

– Labelling errors for categorical variables?

Look for unique values of categorical variables and create a frequency chart. Look for misspellings and labels that might represent the same thing.
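For example, a small sketch for surfacing suspicious labels (the column name and the replacement map are purely illustrative):

```python
import pandas as pd

df = pd.read_csv("retail_data.csv")  # hypothetical placeholder file

# Frequency table of a categorical column - misspellings usually show up as rare labels
print(df["store_region"].value_counts())

# Normalise obvious formatting noise (case, stray whitespace)
df["store_region"] = df["store_region"].str.strip().str.lower()

# Map labels that represent the same thing onto one canonical value
df["store_region"] = df["store_region"].replace(
    {"n-west": "north west", "nrth west": "north west"}
)
```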

– Do some variables have Missing Values?

This can happen to both numeric and categorical variables. Check:

  • Are there rows which have missing values for a lot of variables (columns)? This means there are data points which have blanks across the majority of columns → they are not very useful, we may need to drop them.
  • Are there variables (or columns) which have missing values across multiple rows? This means there are variables which do not have any values/labels across most data points → they cannot add much to our understanding, we may need to drop them.

Action:

  • Count the proportion of NULL or missing values for all variables. Variables with more than 15%-20% missing values should make you suspicious.
  • Filter out rows with missing values for a column and check how the rest of the columns look. Do the majority of columns have missing values together? Is there a pattern?
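A minimal sketch of this missing-value audit in pandas (the column name is hypothetical):

```python
import pandas as pd

df = pd.read_csv("retail_data.csv")  # hypothetical placeholder file

# Proportion of missing values per column - above ~15-20% should raise suspicion
print(df.isna().mean().sort_values(ascending=False))

# Proportion of missing values per row - rows that are mostly blank add little
print(df.isna().mean(axis=1).describe())

# For rows where one column is missing, how do the other columns look?
print(df[df["basket_value"].isna()].head())
```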

– Are there Outliers in my dataset?

Outlier detection is about identifying data points that do not fit the norm. You may see very high or extremely low values for certain numerical variables, or an unusually high/low frequency for certain categorical classes.

  • What seems an outlier can be a data error. While outliers are data points that are unusual for a given feature distribution, unwanted entries or recording errors are samples that shouldn’t be there in the first place.
  • What seems an outlier can just be an outlier. In other cases, we might simply have data points with extreme values and perfectly fine reasoning behind them.

Action:

Study the histograms, scatter plots, and frequency bar charts to understand if there are a few data points which are farther from the rest. Think through:

  • Can they be true and take these extreme values?
  • Is there a business reason or justification for these extremes?
  • Would they add value to your analysis at a later stage?
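One common (though not the only) way to flag candidate outliers is the IQR rule, sketched below with a hypothetical column:

```python
import pandas as pd

df = pd.read_csv("retail_data.csv")  # hypothetical placeholder file

# Flag values more than 1.5 * IQR beyond the middle 50% of the distribution
q1, q3 = df["basket_value"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["basket_value"] < lower) | (df["basket_value"] > upper)]
print(f"{len(outliers)} candidate outliers out of {len(df)} rows")

# Inspect them before deciding: data error, or genuine extreme value?
print(outliers.head())
```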

5. Data Cleaning.

Data cleaning refers to the process of removing unwanted variables and values from your dataset and getting rid of any irregularities in it. These anomalies can disproportionately skew the data and hence adversely affect the results of our analysis from this dataset.

Remember: Garbage In, Garbage Out

– Course correct your data.

  • Remove duplicate entries, missing values and outliers that do not add value to your dataset. Get rid of unnecessary rows/columns.
  • Correct any misspellings or mislabelling you observe in the data.
  • Remove any data errors you spot which are not adding value to the data.

– Cap Outliers or let them be.

  • In some data modelling scenarios, we may need to cap outliers at either end. Capping is often done at the 99th/95th percentile for the higher end or the 1st/5th percentile for the lower-end capping.
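A sketch of percentile capping (winsorising) with pandas’ `clip` (the column name is hypothetical):

```python
import pandas as pd

df = pd.read_csv("retail_data.csv")  # hypothetical placeholder file

# Cap a continuous variable at the 1st and 99th percentiles
low, high = df["basket_value"].quantile([0.01, 0.99])
df["basket_value_capped"] = df["basket_value"].clip(lower=low, upper=high)
```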

– Treat Missing Values.

We generally drop data points (rows) with a lot of missing values across variables. Similarly, we drop variables (columns) which have missing values across a lot of data points.

If there are only a few missing values, we might look to plug those gaps or simply leave them as they are.

  • For continuous variables with missing values, we can plug the gaps using mean or median values (possibly within a particular stratum).
  • For categorical missing values, we might assign the most used ‘class’ or maybe create a new ‘not defined’ class.
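Sketched in pandas (all column names hypothetical), these imputation choices might look like:

```python
import pandas as pd

df = pd.read_csv("retail_data.csv")  # hypothetical placeholder file

# Continuous variable: plug gaps with the median within a stratum (here: per region)
df["basket_value"] = df.groupby("store_region")["basket_value"] \
                       .transform(lambda s: s.fillna(s.median()))

# Categorical variable: use the most frequent class ...
df["loyalty_tier"] = df["loyalty_tier"].fillna(df["loyalty_tier"].mode()[0])
# ... or create an explicit 'not defined' class
df["payment_type"] = df["payment_type"].fillna("not defined")
```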

– Data enrichment.

Based on the needs of future analysis, you can add more features (variables) to your dataset, such as (but not restricted to):

  • Creating binary variables indicating the presence or absence of something.
  • Creating additional labels/classes by using IF-THEN-ELSE clauses.
  • Scale or encode your variables as per your future analytics needs.
  • Combine two or more variables – use a range of mathematical functions like sum, difference, mean, log and many other transformations.
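A few illustrative enrichment steps in pandas/numpy (all column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("retail_data.csv")  # hypothetical placeholder file

# Binary variable indicating the presence or absence of something
df["bought_snacks"] = (df["snack_spend"] > 0).astype(int)

# Additional label via an IF-THEN-ELSE style rule
df["basket_size"] = np.where(df["basket_value"] > 50, "large", "small")

# Scale / transform an existing variable
df["log_basket_value"] = np.log1p(df["basket_value"])

# Combine two or more variables
df["spend_per_visit"] = df["total_spend"] / df["visits_per_month"]

# Encode a categorical variable for downstream modelling
df = pd.get_dummies(df, columns=["store_region"], drop_first=True)
```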

Summary

EDA enables data scientists to uncover valuable insights, address data quality issues, and lay a strong foundation for further analysis and modelling. It ensures that the outcomes of data analysis are reliable, accurate, and impactful.

Key Components of EDA:

  1. Understand the source and ‘meaning’ of your data.
  2. Know all variables, their distributions, labels/classes in and out.
  3. Look for patterns/relationships between variables to validate any prior hypothesis or assumptions.
  4. Detect any anomalies – data errors, outliers, missing values.
  5. Data Cleaning – remove or course-correct any data errors/anomalies, cap outliers, fill in missing values (if needed), scale/transform existing variables and create additional derived ones enriching your dataset for subsequent analysis.

Connect, Learn & Grow ..

If you like this article and are interested in similar ones, follow me on Medium, LinkedIn, connect with me 1:1, join my email list and (..if you already are not..) hop on to become a member of the Medium family to get access to thousands of helpful articles. (I will get ~50% of your membership fees if you use the above link.)

.. Keep learning and keep growing!

