The importance of data preparation in the process of model creation

Introduction
A common misconception is that AI is a magical black box that turns your data into gold! This is partially true. AI does turn your data into gold (not literally, of course), provided you feed it good quality data. But in the real world, truly good quality data is largely a fiction. So how do you overcome this? How do you feed that "good quality" data to your model so that you can reap all the great benefits of AI? This is exactly where data preparation comes in. But before we jump into the fun part, what exactly is "bad data"?

The real world is a complete mess, and so is the data it generates. It comes in all shapes and forms, ranging from a few lines of text, maybe a tweet by a celebrity or a review from an angry customer, to a few million rows of numbers, perhaps a national census or sensor readings from an IoT device. It could be anything in between as well. With data sources this varied, there is little chance the data arrives neatly wrapped and delivered. Nope! All kinds of inconsistencies manage to creep in. There could be emojis in the text, or shorthand abbreviations like gm, np, lol, etc. (thanks to the 280 character limit on Twitter). There could be human error in the surveys too, because making mistakes is only human, isn't it? There could be data missing because some gentleman didn't want to disclose his age, or there could be wrong data because some lady shaved a couple of years off hers! So you see, inconsistent data is common, and you can expect some form of "bad data" whenever you begin a data project, unless you download a near-perfect dataset from Kaggle (no offense!). It's our responsibility to prepare the data before we even think of preparing any fancy model. So what do we do? Well, we prepare the data!
Steps involved in data preparation

You have probably heard this a thousand times already, but let me say it again: data preparation is the most crucial stage of any data project and can take up to 70% of the project time! But what exactly do we do in data preparation? There isn't a magic wand that prepares your data for you. I wish there were, but there isn't. So what happens in the data preparation stage? There are a few steps we need to follow while preparing the data.
Data Cleaning
This is the first step in the data preparation stage. Before we can even think of moving forward with any other steps, we need to make sure that the data is clean. We don't want any missing data, abnormal values, or noisy data. In this stage, we handle missing values, deal with noisy data and outliers, and remove any unwanted data. Each of these tasks has steps of its own and comes with its own challenges. Let's look into each of them.
Handling the outliers

In statistics, an outlier is a data point that differs significantly from the other observations in our dataset. Outliers can arise for many reasons, and it's important to deal with them properly because they can be either an essential part of the data or simply an error made while collecting it. For example, when collecting salary information in a company, the salary of a senior executive could be significantly higher than that of other employees. This data point is clearly an outlier, as there would be very few such cases, but it is critical for the analysis because it is perfectly valid and an important part of the data. In contrast, when collecting the heights of middle school students, a value of 55.6 ft. is clearly impossible. It might originally have been 5.56 ft., and because of an error during data collection, this single value could throw the entire analysis off! So how do we deal with outliers? There are a few techniques, but it takes domain knowledge to tell whether an outlier is valid or invalid. And to deal with outliers, we first need to identify them.
Outliers can be identified by plotting a histogram or a box plot, or by performing cluster analysis. A common rule of thumb is that a data point is considered an outlier when it is greater than the 3rd quartile + (1.5 x IQR) or less than the 1st quartile − (1.5 x IQR).
Now that we have identified the outliers, what do we do next? Two common practices are to remove them or to smooth them with a transformation such as a log or Box-Cox transform. But we should be careful when choosing to remove outliers, as they could be valid observations of the data. Sound domain knowledge helps in choosing the appropriate course of action.
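Here is a minimal pandas sketch of the IQR rule and the two handling options described above. The salary column and the numbers in it are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy salary data with one extreme value (illustrative only)
df = pd.DataFrame({"salary": [42_000, 48_000, 51_000, 55_000, 60_000, 450_000]})

q1 = df["salary"].quantile(0.25)
q3 = df["salary"].quantile(0.75)
iqr = q3 - q1

# Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["salary"] < lower) | (df["salary"] > upper)]
print(outliers)

# Option 1: drop the outliers (only if domain knowledge says they are invalid)
cleaned = df[(df["salary"] >= lower) & (df["salary"] <= upper)]

# Option 2: keep them but reduce the skew with a log transform
df["log_salary"] = np.log1p(df["salary"])
```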
Handling the missing values

Missing values occur when no value is stored for a variable in the dataset. Just like outliers, they can occur for various reasons: a mistake by the researcher or surveyor, or an error in the instrument used to collect the data. Whatever the reason, missing values can be a serious problem and must be dealt with. Why? Well, some ML models simply can't work with missing data. Unlike outliers, identifying missing values is quite easy; it's how we handle them that makes all the difference. There are some standard techniques for handling missing data, but choosing the right one for your case requires some domain knowledge (there, I said it again). Without further ado, let's look at these "standard" techniques; a short code sketch follows the list.
1. Remove the rows with missing values
This is the simplest of all the techniques, but it is feasible only if the number of records with missing values is small. If many records have missing values, simply removing them would cause considerable data loss.
2. Fill them with a global constant
Another technique often used to fill missing values is a global constant, such as zero (0). This is usually done when the missing value cannot be reasonably predicted, or when the context is known (sometimes a value that should have been recorded as zero ends up recorded as null instead).
3. Fill the missing values with mean or median
This is also a common practice while preparing the data. The missing values are filled with the mean or median of the attribute concerned. The downside of this approach, however, is that it ignores the context provided by the other attributes in the dataset.
4. Use Forward or Backward filling
The missing values are filled with the value from the previous row or the next row. These techniques are called forward filling (ffill) and backward filling (bfill).
5. Create a model to fill the missing values
This is the most effective, yet most complicated, technique. A machine learning model is trained to learn patterns from the data and predict the missing values.
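Here is the sketch promised above, walking through the five techniques with pandas and scikit-learn. The DataFrame, its age and income columns, and the values are all made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, np.nan, 40],
                   "income": [30_000, 42_000, np.nan, 55_000, 61_000]})

# 1. Remove the rows with missing values
dropped = df.dropna()

# 2. Fill them with a global constant (here, zero)
zero_filled = df.fillna(0)

# 3. Fill with the mean (or median) of each column
mean_filled = df.fillna(df.mean(numeric_only=True))

# 4. Forward fill / backward fill
ffilled = df.ffill()
bfilled = df.bfill()

# 5. Model-based imputation (scikit-learn's experimental IterativeImputer)
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
model_filled = pd.DataFrame(IterativeImputer().fit_transform(df),
                            columns=df.columns)
```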
Handling Data Redundancy

In a real-world scenario, a database contains hundreds of interlinked tables, and many data sources are typically used to gather the required data (discussed in the next step). In such cases, there is a chance of data redundancy. But first, what is data redundancy? Data redundancy is when the same information is stored in multiple places. For example, when joining an employees table and a salaries table that both contain an employee ID, duplicate columns carrying the same information (the employee ID) can end up in the resulting table. Such redundant data has to be identified and removed, and correlation analysis can help identify it.
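As a small sketch of the employees/salaries example above, the table contents and column names here are hypothetical; the point is only to show a duplicated key column and a quick correlation check.

```python
import pandas as pd

employees = pd.DataFrame({"employee_id": [1, 2, 3], "name": ["Asha", "Ben", "Cara"]})
salaries = pd.DataFrame({"employee_id": [1, 2, 3], "salary": [50_000, 60_000, 70_000]})

# Joining on the shared key keeps a single employee_id column ...
merged = employees.merge(salaries, on="employee_id")

# ... whereas a careless column-wise concat duplicates it
redundant = pd.concat([employees, salaries], axis=1)
deduped = redundant.loc[:, ~redundant.columns.duplicated()]  # drop duplicate columns

# Correlation analysis: highly correlated numeric columns often carry the same information
print(merged.corr(numeric_only=True))
```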
Data Integration

In this digital age, data is collected everywhere, from your Google searches to the heart rate recorded by the fitness band you wear. In practice, any business will have its data spread across multiple sources. As an example, an e-commerce site may keep customer details in its own database, use a CRM tool to manage customer interactions with the site, use a different tool for customer support, and rely on a payment gateway for digital payments. Different types of data are collected in each of these places. To go the extra mile, you could also pull in tweets that mention the store. To get a complete picture of what's happening in the company, you need a 360-degree view of the data: everything you need is already available, just not in a unified view. While preparing the data for an ML model, you typically need to integrate the data from all these different sources into a single dataset. This stage of data preparation, where you combine data from different sources, is called data integration.
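A minimal sketch of that integration step, using tiny in-memory stand-ins for the different sources; the table names, columns, and the customer_id key are all assumptions made for illustration.

```python
import pandas as pd

# Tiny stand-ins for the different sources (columns and values are made up)
orders = pd.DataFrame({"customer_id": [1, 2], "order_total": [99.0, 45.5]})       # site database
crm = pd.DataFrame({"customer_id": [1, 2], "segment": ["gold", "new"]})           # CRM tool
payments = pd.DataFrame({"customer_id": [1, 2], "last_payment": ["card", "upi"]}) # payment gateway

# Integrate everything into one unified view keyed on the shared customer_id
unified = (orders
           .merge(crm, on="customer_id", how="left")
           .merge(payments, on="customer_id", how="left"))
print(unified)
```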
Data Transformation

Before feeding the data into an ML model, it is important to apply the necessary transformations. Again, there are a few common techniques and methods in data transformation.
Aggregation
Data aggregation is the process of storing data in a summarized form. Data collected from several sources is first integrated and then aggregated wherever and however required. This is an important step in the data transformation process, because the quality of the model ultimately depends on the quality of the data (and that's exactly the point of this article!).
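A short sketch of aggregation with pandas; the transaction-level sales data, its columns, and the monthly-per-region summary are invented just to show the idea.

```python
import pandas as pd

# Toy transaction-level sales data (illustrative)
sales = pd.DataFrame({
    "date": pd.to_datetime(["2021-05-01", "2021-05-01", "2021-05-02", "2021-05-02"]),
    "region": ["North", "South", "North", "South"],
    "revenue": [120.0, 80.0, 150.0, 95.0],
})

# Aggregate the rows into one summarized row per region and month
monthly = (sales
           .groupby([sales["date"].dt.to_period("M"), "region"])["revenue"]
           .agg(["sum", "mean", "count"]))
print(monthly)
```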
Discretization
The data is converted into a set of finite intervals. As an example, an age attribute can be discretized into intervals such as (1–13), (13–19), (20–30), etc., or labels such as Kids, Teens, and Young. Binning is often used as a discretization technique. Discretization can significantly improve the efficiency of the model.
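A minimal sketch of binning an age column into the labels mentioned above, using pandas; the exact bin edges and sample ages are assumptions.

```python
import pandas as pd

ages = pd.Series([5, 11, 14, 17, 23, 28])

# Discretize the continuous ages into finite, labeled intervals
age_groups = pd.cut(ages,
                    bins=[1, 13, 19, 30],
                    labels=["Kids", "Teens", "Young"])
print(age_groups)
```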
Attribute Creation
Oftentimes, new attributes are created from the existing attributes in the dataset. An example is deriving an age attribute from an existing date-of-birth attribute. This, too, requires sound domain knowledge (ah, did I say it again?!).
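A quick sketch of that date-of-birth example with pandas; the column name, the sample dates, and the fixed reference date are assumptions, and the year calculation is deliberately approximate.

```python
import pandas as pd

df = pd.DataFrame({"date_of_birth": pd.to_datetime(["1990-05-14", "2002-11-02"])})

# Derive a new age attribute from the existing date-of-birth attribute
today = pd.Timestamp("2021-05-14")  # fixed reference date for reproducibility
df["age"] = (today - df["date_of_birth"]).dt.days // 365  # approximate age in whole years
print(df)
```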
Normalization
This is a critical part of the data transformation process, because real-world data is often measured on different scales. As a common example, the height of a person is measured in feet, whereas the length of a road is measured in miles or kilometers. The process of bringing all the attributes in the dataset onto a comparable scale is called normalization. Not all normalization requirements are this straightforward, though. There are several normalization techniques, such as min-max normalization and z-score normalization.
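A minimal sketch of the two techniques named above, using scikit-learn scalers; the two columns and their values are made up to mimic attributes on very different scales.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two attributes measured on very different scales (values are made up)
df = pd.DataFrame({"height_ft": [5.1, 5.9, 6.2, 5.5],
                   "road_length_km": [1.2, 250.0, 42.5, 980.0]})

# Min-max normalization: rescale every attribute to the [0, 1] range
min_max = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Z-score normalization: zero mean, unit variance per attribute
z_score = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
```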
Final Thoughts
Data preparation is a critical step in the process of creating a machine learning or AI model. As the title suggests and as discussed above, the quality of your AI model completely depends on the quality of the data you feed it. Firm knowledge of the domain is crucial for preparing the data successfully, as mentioned several times in this article.
Originally published on my blog at https://praneethvasarla.com on May 14, 2021