The Data Quadrants of Haves and Needs

An offbeat approach to prioritize your features for data analysis and building machine learning models

Neeraj Khadagade
Towards Data Science

--

Imagine a client has approached you with a business problem that you need to solve using your skills and expertise: reporting, analysis, and building machine learning models. You get started on the project and take a look at the data, only to notice that it is unclean, has missing values, and contains discrepancies.

Before diving into data analysis, you need to process the data, fill the missing values (if possible), and convert it into a usable format. The question is “HOW?”

Image by Author, inspired by source [https://www.canva.com/]

In this article, I am going to show you the steps I use to prepare the data in a usable form, and why I call it “The Data Quadrants.”

The Axes

There are two fundamental requirements I consider while studying the data. I term them:

  1. NEEDS — the data that you need for analysis and for building machine learning models
  2. HAVES — the data that has been provided to you by the client, irrespective of whether you can use it or not

They are going to act as our main axes, as you can see in the following figure:

The Quadrants (Image by author)

Not all companies have the best data storage practices, so you will encounter a lot of missing data, and you will find yourself in every quadrant at some point.
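To make the idea concrete, here is a minimal sketch of how the two axes split features into quadrants. The feature names and sets below are purely hypothetical; in practice, they would come from your project requirements and the client’s data dictionary.

```python
# Minimal sketch of the two axes, using purely hypothetical feature names.
# "needs" = features required for the analysis/model; "haves" = columns the client provided.
needs = {"customer_id", "age", "income", "churn_flag"}
haves = {"customer_id", "age", "signup_channel"}

quadrant_1 = needs & haves   # need and have  -> the ideal scenario
quadrant_2 = needs - haves   # need but don't have -> must be enriched or imputed
quadrant_4 = haves - needs   # have but don't need -> "depends" on the business problem
# Quadrant III (features you neither need nor have) isn't visible in these two
# sets; it surfaces in conversations with the client about unused features.

print("Quadrant I :", sorted(quadrant_1))
print("Quadrant II:", sorted(quadrant_2))
print("Quadrant IV:", sorted(quadrant_4))
```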

Let’s take a deeper dive into each quadrant.

The Quadrants

Quadrant I: The Ideal Scenario

Quadrant I (Image by author)

This is every data/business analyst’s, data engineer’s, machine learning engineer’s, and data scientist’s dream scenario, where the data given by the client is top-notch: high quality, clean, and with no missing values.

When we start learning data science, we always start in this quadrant, where the data is clean and complete. In the real world, however, this is rarely the case.

As I said, this is an “ideal” scenario. We have to work our way toward this quadrant before we start analyzing the data and building our ML models.

Quadrant III: The Comfortable Quadrant

Quadrant III (Image by author)

Based on the data, you will find a few (or many) features that you neither need nor have. It’s best not to use those features. However, depending on the business problem, it is better to first get in touch with your client to learn more about them.

If the client confirms that this set of features is not needed, you can discard them. If the client says they are important, you end up in Quadrant II.
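As a minimal sketch, assuming a pandas DataFrame and a hypothetical list of columns the client has confirmed as unnecessary, discarding them is a one-liner:

```python
import pandas as pd

# Hypothetical data: "legacy_code" is a feature the client confirmed is not needed.
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, 45, 29],
    "legacy_code": ["A", "B", "C"],
})

confirmed_unnecessary = ["legacy_code"]  # list agreed upon with the client
df = df.drop(columns=confirmed_unnecessary)
```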

We’ll talk about Quadrant II in a while. As they say, “Save the Best for Last.”

Quadrant IV: The “Depends” Quadrant

Quadrant IV (Image by author)

It’s always interesting to notice that there are a few features that you have but don’t need. Would it be better to discard them and not use them at all? Well, not always.

This is where your domain expertise helps. Based on the business problem, you can take a call on whether to use these features or drop them. If you are not a domain expert, you can always have a conversation with your client to understand whether these features are important, and HOW they may affect the analysis.

**DRUMROLL**

Quadrant II: The Playground

Quadrant II (Image by author)

Almost always, you will find yourself in Quadrant II: data you need, but you don’t have. This is the quadrant where you will find a lot of features with missing values. As you may have heard, “80% of the time is invested in ‘wrangling’ or ‘munging’ data before it’s ready to be analyzed” [1].

The good part is that you know which features are needed for the analysis. The challenging part is filling in these missing values. One method used to fill missing values is termed data enrichment, and it is a regular practice for enhancing the quality of the data.

There are a few solutions/techniques that I have used:

  1. Web Scraping
  2. Using REST API
  3. Purchasing bulk data from third-party data providers
  4. Imputation (sketched briefly below)

I wouldn’t say this will directly get you to Quadrant I, because the techniques and data sources mentioned above do not guarantee complete information. But they will surely lead to better data quality for analysis and for building ML models.
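As an illustration of the imputation item above, here is a minimal sketch using scikit-learn’s SimpleImputer on a hypothetical DataFrame; it is one possible way to fill missing values, not a prescribed workflow.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical Quadrant II data: features we need, with missing values.
df = pd.DataFrame({
    "age": [34, np.nan, 29, 41],
    "income": [52000, 61000, np.nan, 48000],
    "city": ["Pune", np.nan, "Mumbai", "Pune"],
})

num_cols = ["age", "income"]  # numeric features: fill with the median
cat_cols = ["city"]           # categorical features: fill with the most frequent value

df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])

print(df)
```

The choice of strategy (median, mean, most frequent, or a model-based approach) depends on the feature and the business problem, which is again where a conversation with the client pays off.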

Final Steps

It surely takes time to understand the data and the features, their importance, and their overall effect on the target feature. This method not only saves time but also helps you understand and prioritize your features and plan the next steps for you and your team.

I hope you found this article useful. Feel free to leave your feedback in the comment section below.
