The Definitive Guide to Data Preparation That Beginners Should Read

The best tips and steps to start any Data Science project from scratch.

Adrian PD
Towards Data Science



In my humble opinion, data scientists can be classified into:

  • People who focus on testing many different algorithms without paying attention to the distributions those algorithms require. The algorithm is a black box for them; they just memorize the recipes.
  • Old-school scientists who address every challenge with wonderful (but rare) R algorithms, without considering how the problem scales to real (and huge) datasets or how difficult it is to find those libraries in Spark.
  • Fancy guys who flatly refuse to use linear regressions because Ockham was wrong and, for them, the simplest way to do things is never the right one.
  • People obsessed with drawing the best plots ever, in order to tell the story behind the numbers, but using the same models over and over.
  • The good ones. The Data Scientists who pay attention to every detail, from using efficient data structures in their notebooks to understanding what the algorithm does, taking care of documentation, writing clean code and thinking in advance about the later deployment stage. Sadly, they are a small slice of the cake.

Which one do you want to be? I am sure you are answering “the good one”. However, to get such expertise, it is necessary not to put the cart before the horse. The first stage in any Data Science project is the Data Preparation stage. Most people pay much more attention to later stages of the data pipeline than to analyzing and preparing the data. Training a Machine Learning model is easy; the challenging part is to understand the distribution of your data and to choose the correct algorithm based on that distribution (or to prepare your data for the chosen algorithm). The Data Preparation stage, with a dose of common sense, is still not replaceable by automatic tools, and it is what lets Data Scientists earn their money.

The goal of this article is to give you some tips: how to process the data of your project before you start thinking about models, and how to process the data after you have chosen the model.

Tip 1: Plot data. Never rely on aggregated metrics.

First of all, you should review your statistics notes and completely understand what the mode, the median and the mean are. You should also review the concepts of skewed and long-tail distributions. Why do I say this?

Most beginners try to figure out how the data are distributed just by analyzing the values of their aggregated metrics. However, if you have read about Anscombe’s quartet, you know that the mean/variance/correlation of your data do not tell the whole story. If you haven’t, let me show you the following four distributions:

Anscombe’s Quartet (Image by author)

The four datasets have the same mean, variance and correlation. Nevertheless, the reader can observe that the four distributions are completely different: the first one matches a linear regression; the second one should be approximated by a polynomial; the third one follows a linear growth whose regression slope is slightly affected by an outlier; and the last one is centered on x=8, but the effect of one outlier clearly introduces noise into the regression. Only by plotting, or carefully analyzing, the data will you be able to notice this issue.
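Anscombe’s quartet actually ships with base R as the anscombe data frame, so you can reproduce this check yourself. A minimal sketch:

    # Anscombe's quartet: four x/y pairs with nearly identical aggregated metrics
    data(anscombe)
    sapply(1:4, function(i) {
      x <- anscombe[[paste0("x", i)]]
      y <- anscombe[[paste0("y", i)]]
      c(mean_y = mean(y), var_y = var(y), cor_xy = cor(x, y))
    })

    # Only plotting reveals how different the four datasets really are
    op <- par(mfrow = c(2, 2))
    for (i in 1:4) {
      x <- anscombe[[paste0("x", i)]]
      y <- anscombe[[paste0("y", i)]]
      plot(x, y, main = paste("Dataset", i))
      abline(lm(y ~ x), col = "red")   # essentially the same fitted line in all four panels
    }
    par(op)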

Tip 2: Do hypothesis testing.

Can I assume that my data follow a normal distribution? This is a requirement of many algorithms. Do these two samples belong to the same population distribution? This is especially useful when you are preprocessing data and are not 100% sure about the data sources, or when working with stratified sampling. Does my recommendation engine work well?

There are many scenarios where testing your hypothesis at some level of significance is critical, so this is a must. Let me summarize some tests which you may find useful; there is life beyond the Kolmogorov-Smirnov and Student’s t tests:

Hypothesis testing
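As a quick, hedged sketch of what these look like in practice, base R already ships several of them (the data below are simulated purely for illustration):

    # A few tests beyond Kolmogorov-Smirnov and Student's t, all in base R
    set.seed(42)
    x <- rnorm(200)                    # simulated "normal" sample
    y <- rexp(200)                     # simulated skewed sample

    shapiro.test(x)                    # Shapiro-Wilk: can I assume normality?
    ks.test(x, y)                      # two-sample KS: same underlying distribution?
    wilcox.test(x, y)                  # Mann-Whitney: non-parametric two-sample comparison
    t.test(x, y)                       # Student's/Welch's t: difference in means

    obs <- table(sample(c("A", "B"), 200, TRUE),
                 sample(c("yes", "no"), 200, TRUE))
    chisq.test(obs)                    # chi-squared: independence of two categoricals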

Additionally, before running your A/B testing, I strongly recommend running a power analysis with the expected effect in order to obtain the minimal sample size of your test/control groups.
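For instance, a minimal power-analysis sketch with base R’s power.t.test (the effect size, power and significance level below are illustrative assumptions, not values from the article):

    # Minimal sample size per group to detect a 0.2-sd lift
    # with 80% power at a 5% significance level (two-sided t-test)
    power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.8)
    # ... n is roughly 394 observations per group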

Related to hypothesis testing, it is very important to understand the Central Limit Theorem (CLT):

Take a sample containing many observations, which are independent and whose variance is neither null nor infinite. The distribution of the sample mean approaches a normal distribution as the sample size gets larger.

There is no assumption on the probability distribution of the variables; the theorem remains true no matter what the underlying distribution is. This very powerful theorem is core in Data Science, since it helps us make inferences from our samples.

Statistical diagram (Image by author)

In addition to the CLT, understanding Confidence Intervals (CI) is necessary. Once we have obtained a metric value from our sample data, we can infer the approximate value of this metric in the population. This is done at a given confidence level (generally 99%, 95% or 90%). If I take an infinite number of samples and calculate the confidence interval for a metric in each, then 99/95/90% of those intervals will contain the real population metric. Thus, 1/5/10% of the samples will yield intervals that do not contain the real population value for that metric. This should not be ignored: we cannot guarantee that the real value is inside our interval.

Let me show you an interesting shortcut that I have applied for obtaining the minimal sample size when the whole dataset is available (millions of rows), in order to train a model that works on average values when computing resources are limited (time benefits from small sample sizes). If I know that my dataset follows a normal distribution and the features are (mostly) independent of each other, I can use the formula for the CI on the population mean when the standard deviation is known. For that, I just take the most important feature (column) of my whole dataset. Then, I obtain its average (let’s say 32.87), its standard deviation (26.59) and the margin of error which I am willing to accept (for example 3%). Et voilà, I can save time and get a reliable result by taking a sample of only 2793 elements:

A trick when my computation is bounded by dataset size and certain requisites are fulfilled (Image by author)
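For reference, this is the calculation behind the figure above, written as a short sketch (a 95% confidence level and a 3% margin of error around the mean are the stated assumptions):

    # Minimal sample size n = (z * sd / E)^2 when the sd is (approximately) known
    z <- qnorm(0.975)       # 95% confidence level -> z is about 1.96
    s <- 26.59              # standard deviation of the key feature
    E <- 0.03 * 32.87       # margin of error: 3% of the observed mean
    n <- (z * s / E)^2
    round(n)                # about 2793 rows are enough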

Step 1. Carefully inspect each feature (column) of your dataset

If you have reached this point of the text, I assume you have read Tip 1 and Tip 2. Now, it is time to put all the theory into practice. I will focus on R in my examples, but do not worry, equivalent syntax exists in Python.

In addition to using the summary() command, plotting each feature of the dataset is mandatory, as it will help us understand the distribution of the data. For that, I always recommend the Histogram, the Density Plot, the Box Plot and the Bar Plot. Furthermore, it is also a good idea to use the Missing Plot.

The histogram is quite useful for detecting outliers, although it only applies to numerical attributes. The hist command in R will draw it for you. To improve your analysis, you should also draw a Density Plot, with the command plot(density(…)), since it is a smoothed version of the histogram that helps you recognize the shape of the distribution without being affected by the choice of bins. You can also run a Bar Plot analysis (command barplot) for categorical features.

Histogram and Density Plot (Image by author)
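A minimal sketch of this first inspection, using the built-in iris dataset purely as an illustration:

    # Quick per-feature inspection
    summary(iris)

    hist(iris$Sepal.Width, main = "Histogram")           # numeric feature: spot outliers
    plot(density(iris$Sepal.Width), main = "Density")    # smoothed shape, independent of bins
    barplot(table(iris$Species), main = "Bar plot")      # categorical feature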

The Box Plot is much more interesting (command boxplot): it tells us about symmetry, whether the data are skewed and how the data are grouped. The interquartile range (IQR) is the difference between the upper quartile and the lower quartile (Q3-Q1), while the whiskers extend the data up to 1.5*IQR. Outside this limit, we obtain the candidates for outliers. In the case of a normal distribution, as we see in the figure below, 50% of the data is contained within the IQR, while outliers amount to about 0.7% of the data.

The Box-Plot on a Normal distribution (Source: KDNuggets)
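A short sketch of the box plot and the 1.5*IQR fence rule, again with iris as a stand-in dataset:

    boxplot(Sepal.Width ~ Species, data = iris)          # symmetry, spread, outlier candidates

    q      <- quantile(iris$Sepal.Width, c(0.25, 0.75))
    iqr    <- q[2] - q[1]                                # Q3 - Q1
    fences <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)
    iris$Sepal.Width[iris$Sepal.Width < fences[1] |
                     iris$Sepal.Width > fences[2]]       # values outside the whiskers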

After this, you can continue your analysis with a Missing Plot (command missmap). It helps you see the portion of missing data in your dataset, but also easily identify the features with the most NA occurrences. The x-axis shows the features, while the y-axis shows the row number; that means horizontal lines are the missing attributes for a row, and vertical blocks show the amount of missing data for a specific feature.
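missmap() lives in the Amelia package; a tiny sketch, assuming Amelia is installed and injecting some NAs into a toy copy of iris:

    library(Amelia)

    df <- iris
    set.seed(1)
    df[sample(nrow(df), 20), "Sepal.Width"] <- NA        # inject missing values for the demo
    missmap(df, main = "Missingness map")                # features on x, row numbers on y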

Finally, there are more plots for showing the interactions of features with each other: Correlation plot, Scatterplot, etc.

Step 2. Formatting the data

At this point, you should have an idea about the distribution, the outliers and the missing values of your data.

The first step is to be sure about the consistency of your data. For that, you should ask yourself:

  • Do all data follow the business constraints? (Validity)
  • If data are missing, can I re-check the source? (Completeness)
  • Do values contradict each other? (Consistency)
  • Do the instances have the same unit of measure? (Uniformity)

After that, remove extra white space and pad string values. Also, fix typos (you can find an interesting article on CRAN for that). As a piece of advice, the order of columns matters, especially when computing correlations or working with dummies; try to sort your dataset, placing the numeric attributes at the beginning. It is also good practice to make sure you are not working with duplicated entries (use group-by or unique routines on your data).
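A small, hedged sketch of this kind of cleanup on a made-up data frame:

    df <- data.frame(id   = c(1, 2, 2, 3),
                     name = c(" Alice ", "Bob", "Bob", "  carol"),
                     stringsAsFactors = FALSE)

    df$name <- trimws(df$name)                        # strip extra white space
    df      <- df[!duplicated(df), ]                  # drop exact duplicate rows
    df      <- df[, order(!sapply(df, is.numeric))]   # numeric columns first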

The next thing to do is to deal with missing values. This is one of the most challenging aspects of the Data Preparation stage. Here you can find an extensive article. Basically, check whether the missing values correspond to a pattern or whether they are random among your instances. If random, you can choose between dropping the whole instance or using statistical information to re-generate its value. Here, the median is a good candidate for numerical values, since it is more robust than the mean; the mean can be used if the data are not skewed. For categorical values, the mode is a good approximation.
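As an illustration, a minimal imputation sketch (the age and city columns are hypothetical, not from any dataset in the article):

    # Median for numeric features, mode for categorical ones
    impute_median <- function(x) { x[is.na(x)] <- median(x, na.rm = TRUE); x }
    impute_mode   <- function(x) { x[is.na(x)] <- names(which.max(table(x))); x }

    df$age  <- impute_median(df$age)     # hypothetical numeric column
    df$city <- impute_mode(df$city)      # hypothetical categorical column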

How can you detect whether the missing values of column A follow a pattern? Assuming that no entries of column B are missing, take a sample from the instances which have no missing values for column A. Take also a sample from the instances with missing values for column A. Then, analyze the distribution of column B for both samples. Do they follow the same distribution? Repeat this operation for the other columns with no missing values.

If you find a pattern (i.e., a cluster), you can re-generate the missing values with a KNN, a linear regressor, or the statistical metrics of the involved cluster.

What about outliers? The most accurate rule here is common sense, and for the sake of common sense, it is necessary to understand the distribution of the data first. Do not drop the outliers if the resulting model is critical (if you work at NASA or Airbus or similar, please think twice before dropping anything). Otherwise, if the outlier is there because of a mistake during the process, then drop it. Drop the outliers if they do not affect the accuracy or the final result. However, if dropping the outliers changes the results, run an analysis before and after dropping them in order to understand what is happening.

If you finally decide to drop the outlier, then you have two options. On the one hand, you can eliminate the whole row (instance), if you are not completely certain about the cause but you have enough data. On the other hand, you can try to fill the gap: the median/mode are good options if the value was a data-entry mistake, but use the nearest value otherwise. If you have decided to keep the outlier, then transform the data: log, sqrt and softmax work well. Also PCA.

And how can you identify outliers? If your data follow a normal distribution, use Peirce’s criterion, Chauvenet’s criterion, Grubbs’s test or Dixon’s Q test. An easier way is the 68-95-99.7 rule: if your distribution is approximately normal, the sd, 2*sd and 3*sd ranges around the mean will comprise 68%, 95% and 99.7% of the dataset, respectively. Instances farther than 3*sd from the mean can be considered outliers. Another technique is to use the Box Plot we have seen above. For other kinds of distributions, read about DBSCAN, the Isolation Forest or the Random Cut Forest for detecting outliers.

The 99.7% rule for detecting outliers in Normal distributions (Source: TowardsDataScience)
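A sketch of the 3*sd rule on an approximately normal feature (iris again as a stand-in):

    x  <- iris$Sepal.Width
    mu <- mean(x)
    s  <- sd(x)
    x[abs(x - mu) > 3 * s]        # candidates beyond three standard deviations from the mean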

Regarding the question of working with the whole dataset or just a sample of it, I strongly recommend training on and analyzing all the data. Using sampling to save time and costs was fine some years ago, but computing resources are not a limit anymore thanks to the Cloud. Sampling requires making statistical inferences about the population, and that inference carries errors: selection bias and sampling error. Here you can find parametric and non-parametric methods for inference and sample size calculations.

As you likely know, in order to work with categorical values, we use dummy variables, which are artificial variables created to represent an attribute with 2 or more distinct levels. This way of coding is also known as one-hot encoding. If your data column has K different levels, then you must use K-1 dichotomous variables to represent it, as the K-th variable is redundant and creates multicollinearity.
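In R, model.matrix() applies exactly this K-1 coding by dropping one reference level; a tiny sketch:

    colour <- factor(c("red", "green", "blue", "green"))
    model.matrix(~ colour)        # intercept plus 2 dummy columns for 3 levels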

Finally, Feature Selection is another core part of the Data Preparation stage. You can do this in several ways, using backward or forward techniques, but an interesting approach is automatic feature elimination (basically backward, often called Recursive Feature Elimination). It iterates over the number of features: on each iteration, a model is trained and tested with the current dataset, and each attribute is scored by importance; the least important one is removed for the next iteration. Basically, it looks for the attributes that contribute most to the prediction of the target variable. This is an easy way to find the importance of your attributes. However, if the attributes are highly correlated, its ability to identify the strong ones might be affected.
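The caret package implements this backward scheme as rfe(); a hedged sketch, assuming caret and randomForest are installed:

    library(caret)

    set.seed(7)
    ctrl    <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
    results <- rfe(x = iris[, 1:4], y = iris$Species, sizes = 1:4, rfeControl = ctrl)

    print(results)            # accuracy for each subset size
    predictors(results)       # the attributes that survive the elimination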

Step 3. Transforming the data

Some Machine Learning algorithms need data in a specific form, whereas other algorithms can perform better if the data are properly transformed. However, these transforms are application-specific. Here is a list of transformations you can find in R (a short sketch follows the list):

  • Scale: Divides each value by the standard deviation of the attribute.
  • Center: Subtracts the mean of the attribute from each value.
  • Standardize: Scale + Center.
  • Normalize: Scales values into the range [0,1].
  • Log, SQRT.
  • Box-Cox: Transforms non-normal dependent variables into a normal shape. Only for positive values.
  • Yeo-Johnson: Similar to Box-Cox, but it also handles negative and zero values.
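Most of these are available through caret’s preProcess(); a minimal sketch, assuming caret is installed (Box-Cox is used here because the iris columns are strictly positive; swap in "YeoJohnson" for features with zeros or negative values):

    library(caret)

    pp          <- preProcess(iris[, 1:4], method = c("BoxCox", "center", "scale"))
    transformed <- predict(pp, iris[, 1:4])
    summary(transformed)      # standardized, roughly symmetric features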

For example, the log transformation is very useful when you have a long-tail distribution (mainly caused by the effect of outliers, which is very common in Pareto distributions) and you want to use an algorithm that expects a normal curve. In such a case, the log transformation can help. However, once you have applied the transformation, you should check whether the resulting feature follows the normal curve. You can do it in two ways: comparing graphically what you got against what you expected, or running a hypothesis test.

In the picture below, see how much money people spend on pizza depending on the season. As you can see, this is a long-tail distribution: few people spend a lot of money, while most people spend much less. If we apply log(x), the resulting distribution should look like a normal one. Actually, when you plot it, it seems so. But this is not enough: use a Kolmogorov-Smirnov test to check whether you can reject the normality hypothesis (the null hypothesis).

Transforming a tail distribution into normal with Log (Image by the author)
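A sketch of that workflow on simulated long-tailed “spend” data (the numbers are made up; note that, strictly speaking, estimating the parameters from the same sample makes the KS p-value optimistic):

    set.seed(3)
    spend <- rlnorm(1000, meanlog = 3, sdlog = 0.8)     # simulated long-tailed spending

    log_spend <- log(spend)
    hist(log_spend)                                     # visual check against a normal shape
    ks.test(log_spend, "pnorm", mean(log_spend), sd(log_spend))   # can we reject normality?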

Another issue is working with unbalanced datasets, especially with classification algorithms (think about decision trees). In order not to fool ourselves, we should work with robust metrics like Precision, Recall, the F1 Score, Kappa or ROC curves (analyzing sensitivity and specificity), rather than Accuracy. It may also be interesting to consider resampling for balancing the different classes of the dataset. Up-sampling and down-sampling are common options, but please also dive into the ROSE and SMOTE techniques, which create synthetic values that balance the dataset. In my experience, SMOTE works better than ROSE, since it creates fewer unrealistic values. However, think about the consequences of doing this; it is not free at all.
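As a sketch of simple up/down-sampling with caret (ROSE and SMOTE live in their own packages and are not shown here):

    library(caret)

    df <- iris[1:80, ]                         # 50 setosa vs 30 versicolor: unbalanced
    df$Species <- droplevels(df$Species)

    balanced_up   <- upSample(x = df[, 1:4], y = df$Species)
    balanced_down <- downSample(x = df[, 1:4], y = df$Species)
    table(balanced_up$Class)                   # both classes now have 50 rows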

Finally, as I briefly mentioned above, many algorithms can suffer if the attributes are highly correlated. One option is to find highly correlated attribute pairs (>0.7) and discard one attribute of each pair. However, a more sophisticated way is to transform the attributes in order to make them uncorrelated. This is possible with techniques such as Principal Component Analysis (aka PCA).
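The first option can be sketched with caret’s findCorrelation() (assuming caret is installed):

    library(caret)

    num_cols <- iris[, 1:4]
    corr_mat <- cor(num_cols)
    to_drop  <- findCorrelation(corr_mat, cutoff = 0.7)   # indices of attributes to discard
    reduced  <- num_cols[, -to_drop]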

PCA is an important mechanism from linear algebra which works with orthogonal projections in order to produce linearly uncorrelated attributes. This may be understood as taking a picture of someone: depending on the angle of the photo, you zoom in or out on each dimension of the body. If you find the correct angles (orthogonal projections in PCA), you will know how much each dimension contributes to the overall body (think of plan, elevation and section drawings).

Orthogonal projections (Source: builtin.com)

Before applying PCA, the deviation is more or less the same for the projection of each instance directly onto each axis (see the figure below). However, if we project each instance vector onto an orthogonal basis (uncorrelated components), we will realize that some attributes (components) have more variance than others. The principal components are constructed in such a manner that the first principal component accounts for the largest possible variance in the dataset.

PCA transformation (Image by author)

An interesting application of PCA is dimensionality reduction. Among all my components, I will only keep the ones with the highest variance (as they represent most of the significance in our dataset). Of course, the loss will not be important, and we will still have most of the information, which is carried by the components with the highest variance. In any case, we should apply standardization prior to PCA, as this technique is quite sensitive to the variances of the initial variables. Other recommended techniques are LDA and Auto-Encoders, with which you can do the same. I invite you to explore these methods in greater detail.
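A minimal PCA sketch with base R’s prcomp(), standardizing first as recommended (iris is again just a stand-in):

    pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

    summary(pca)                 # proportion of variance explained by each component
    reduced <- pca$x[, 1:2]      # keep only the two components with the highest variance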

To conclude, I would like to say that each project requires a great deal of context understanding and common sense, spending a lot of time analyzing the data you have rather than being an expert in Data Science. Anyway, I think you can use these Tips and Steps as a first approach to fully understand the data with a guarantee of success.

Adrian Perez works as a Data Scientist and has a PhD in Parallel Algorithms for Supercomputing. You can check out more of his Spanish and English content on his Medium profile.
