Imagine a client approaches you with a business problem that you need to solve using your skills and expertise; the work involves reporting, analysis, and building machine learning models. You get started on the project and take a look at the data, only to find that it is unclean: values are missing and there are discrepancies.
Before diving into data analysis, you need to process the data, fill in the missing values (if possible), and convert it into a usable format. The question is "HOW?"
![Image by Author, inspired by source [https://www.canva.com/]](https://towardsdatascience.com/wp-content/uploads/2021/02/1LwINM88DWJ7bCcwAsS8ADg.png)
In this article, I am going to show you the steps I use to get data into a usable form, and why I call the approach "The Data Quadrants".
The Axes
There are two fundamental things I look at when I study the data. I call them:
- NEEDS – the data that you need for analysis and for building machine learning models
- HAVES – the data that the client has provided to you, irrespective of whether you can use it or not
These act as our main axes, as you can see in the following figure:

Not every company in the world has the best data storage practices, so you will encounter lots of missing data. You will find yourself in all of the quadrants.
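To make the two axes concrete, here is a minimal sketch of how features could be sorted into the four quadrants with plain Python sets. The function `classify_features` and all feature names are hypothetical, chosen only for illustration; in a real project the NEEDS list would come from the business problem and the HAVES list from the columns the client actually delivered.

```python
def classify_features(needs, haves, candidates):
    """Assign every feature to a quadrant of the NEEDS/HAVES axes."""
    needs, haves = set(needs), set(haves)
    quadrants = {"I": [], "II": [], "III": [], "IV": []}
    # Look at every feature we know about, in a stable (sorted) order
    for feature in sorted(set(candidates) | needs | haves):
        if feature in needs and feature in haves:
            quadrants["I"].append(feature)    # need it and have it
        elif feature in needs:
            quadrants["II"].append(feature)   # need it, but do not have it
        elif feature in haves:
            quadrants["IV"].append(feature)   # have it, but do not need it
        else:
            quadrants["III"].append(feature)  # neither need it nor have it
    return quadrants


# Hypothetical feature names, for illustration only
needs = {"customer_age", "annual_income", "region"}
haves = {"customer_age", "region", "internal_id"}
candidates = {"favorite_color"}  # features that came up in discussion only

result = classify_features(needs, haves, candidates)
for quadrant, features in result.items():
    print(f"Quadrant {quadrant}: {features}")
```

A lookup like this is only a bookkeeping aid. Deciding what belongs in NEEDS still depends on the business problem and on conversations with the client.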
Let’s take a deeper dive into each quadrant.
The Quadrants
Quadrant I: The Ideal Scenario

This is the dream scenario for every data or business analyst, data engineer, machine learning engineer, and data scientist: the data given by the client is top-notch, high quality, clean, and has no missing values.
When we start learning Data Science, we always start from this quadrant where the data is clean and complete. However, in the real world, this is not the case.
As I said, this is an "ideal" scenario. We have to work our way to this quadrant before we start analyzing the data and building our ML models.
Quadrant III: The Comfortable Quadrant

Based on the data, you will find a few (or many) features that you neither need nor have. It’s best not to use those features. However, depending on the business problem, it is better to first get in touch with your client to learn more about them.
If the client confirms that this set of features should not be used, you can discard them. If the client says they are important, you end up in Quadrant II.
We’ll talk about Quadrant II in a while. As they say, "save the best for last".
Quadrant IV: The "Depends" Quadrant

It’s always interesting to notice that there are a few features that you have but don’t need. Would it be better to simply discard them? Well, not always.
This is where your domain expertise helps. Based on the business problem, you can decide whether to use these features or drop them. If you are not a domain expert, you can always have a conversation with your client to understand whether these features are important, and HOW they may affect the analysis.
DRUMROLL
Quadrant II: The Playground

Almost always, you will find yourself in Quadrant II: data you need but don’t have. This is the quadrant where you will find a lot of features with missing values. As you may have heard, "80% of the time is invested in 'wrangling' or 'munging' data before it’s ready to be analyzed" [1].
The good part is that you know which features are needed for data analysis. The challenging part is filling in the missing values. One of the methods used to fill them in is called data enrichment, a regular practice for enhancing the quality of the data.
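Before filling anything in, it helps to measure how deep into Quadrant II each needed feature sits. The following is a rough sketch using pandas; the helper `missing_share`, the column names, and the values are all made up for illustration. It treats a needed column that is entirely absent from the data as 100% missing.

```python
import pandas as pd


def missing_share(df, needed):
    """Return the fraction of missing values for each needed feature.

    A needed column that does not exist in the data counts as fully missing.
    """
    shares = {}
    for col in needed:
        if col in df.columns:
            shares[col] = float(df[col].isna().mean())
        else:
            shares[col] = 1.0
    # Worst offenders first
    return pd.Series(shares).sort_values(ascending=False)


# Hypothetical client data, for illustration only
df = pd.DataFrame({
    "customer_age": [34, None, 51, 28],
    "region": ["north", "south", None, None],
})
needed = ["customer_age", "region", "annual_income"]

audit = missing_share(df, needed)
print(audit)
```

Sorting by missing share gives a simple priority list: the features at the top are the ones where enrichment effort is most likely to matter.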
There are several techniques that I have used:
- Web Scraping
- Using REST APIs
- Purchasing bulk data from third-party data providers
- Imputation
I wouldn’t say these would get you directly to Quadrant I, because the techniques and data sources mentioned above are not guaranteed to contain, or let you extract, complete information. But they will surely lead to better data quality for analysis and for building ML models.
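As a small sketch of how enrichment and imputation can work together, the example below uses pandas. The `external` table is a stand-in for whatever a scrape, an API call, or a purchased dataset would return; its contents, the column names, and the choice of median imputation are assumptions made for illustration, not a recommendation for every dataset.

```python
import pandas as pd

# Client data with gaps (hypothetical values)
clients = pd.DataFrame({
    "company_id": [1, 2, 3, 4],
    "employees": [120.0, None, None, 45.0],
    "sector": ["retail", None, "tech", "retail"],
})

# Stand-in for an external source (scrape, API, or purchased data).
# Note that it is incomplete too: company 3 has no employee count.
external = pd.DataFrame({
    "company_id": [2, 3],
    "employees": [300.0, None],
    "sector": ["tech", "tech"],
})

# Enrichment: keep the client's values and fill gaps from the external
# source, matching rows on company_id.
enriched = (
    clients.set_index("company_id")
    .combine_first(external.set_index("company_id"))
    .reset_index()
)

# Count what enrichment could not fill
remaining = int(enriched["employees"].isna().sum())

# Imputation: flag the rows first, then fill the rest with the median
enriched["employees_imputed"] = enriched["employees"].isna()
enriched["employees"] = enriched["employees"].fillna(
    enriched["employees"].median()
)

print(f"Gaps left after enrichment: {remaining}")
print(enriched)
```

Keeping a flag column such as `employees_imputed` lets the later analysis distinguish observed values from imputed ones, which is useful because imputed values are estimates rather than facts.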
Final Steps
It certainly takes time to understand the data and the features, their importance, and their overall effect on the target feature. This method not only saves time but also helps you understand and prioritize your features and plan the next steps for you and your team.
I hope you found this article useful. Feel free to leave your feedback in the comment section below.
References: