Table of Contents
- Introduction
- What is Data Preparation
- Exploratory Data Analysis (EDA)
- Data Preprocessing
- Data Splitting
Introduction
Before we get into this, I want to make it clear that there is no rigid process when it comes to data preparation. How you prepare one set of data will most likely be different from how you prepare another, so this article aims to provide an overarching framework that you can refer to when preparing any particular dataset.
Before we get into the guide, I should probably go over what Data Preparation is…
What is Data Preparation?
Data preparation is the step after data collection in the Machine Learning life cycle and it’s the process of cleaning and transforming the raw data you collected. By doing so, you’ll have a much easier time when it comes to analyzing and modeling your data.
There are three main parts to data preparation that I’ll go over in this article:
- Exploratory Data Analysis (EDA)
- Data preprocessing
- Data splitting
1. Exploratory Data Analysis (EDA)
Exploratory data analysis, or EDA for short, is exactly what it sounds like: exploring your data. In this step, you're simply getting an understanding of the data you're working with. In the real world, datasets are not as clean or intuitive as Kaggle datasets.
The more you explore and understand the data you’re working with, the easier it’ll be when it comes to data preprocessing.
Below is a list of things that you should consider in this step:
Feature and Target Variables
Determine what the feature (input) variables are and what the target variable is. Don’t worry about determining what the final input variables are, but make sure you can identify both types of variables.
Data Types
Figure out what type of data you're working with. Are the variables categorical, numerical, or neither? This is especially important for the target variable, as its data type will narrow down which machine learning models you may want to use. Pandas functions like df.describe() and df.dtypes are useful here.
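For example, here's a minimal sketch of those two calls on a small made-up DataFrame (the column names and values are hypothetical, just for illustration):

```python
import pandas as pd

# Hypothetical DataFrame, purely for illustration
df = pd.DataFrame({
    "car_colour": ["red", "green", "blue", "red"],
    "price": [15000, 22000, 18000, 15500],
})

print(df.dtypes)      # data type of each column (object, int64, ...)
print(df.describe())  # summary statistics for the numerical columns
```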
Check for Outliers
An outlier is a data point that differs significantly from other observations. In this step, you’ll want to identify outliers and try to understand why they’re in the data. Depending on why they’re in the data, you may decide to remove them from the dataset or keep them. There are a couple of ways to identify outliers:
- Z-score/standard deviations: if we know that 99.7% of the data in a normally distributed dataset lies within three standard deviations of the mean, then we can calculate the size of one standard deviation, multiply it by 3, and identify the data points outside of this range. Equivalently, we can calculate the z-score of a given point, and if it's beyond +/- 3, then it's an outlier. Note that there are a few contingencies to consider when using this method: the data must be normally distributed, it's not applicable to small datasets, and the presence of too many outliers can throw off the z-score.
- Interquartile Range (IQR): the IQR, the concept used to build boxplots, can also be used to identify outliers. The IQR is equal to the difference between the 3rd quartile and the 1st quartile. A point is then flagged as an outlier if it is less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR, which corresponds to approximately 2.698 standard deviations on a normal distribution. Both methods are sketched in the code below.
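Here's a rough sketch of both methods on a made-up numeric column (the values are hypothetical). Note that on a toy sample this small, the z-score test may not flag anything, which is exactly the small-sample caveat mentioned above:

```python
import pandas as pd

# Hypothetical numeric column; replace with your own data
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

# Z-score method (assumes roughly normal data and a reasonably large sample)
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() >= 3]

# IQR method
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers)    # likely empty on a tiny sample like this one
print(iqr_outliers)  # flags the 95
```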
Ask Questions
There's no doubt that you'll have questions regarding the data you're working with, especially for a dataset outside of your domain knowledge. For example, when Kaggle ran a competition on NFL analytics and injuries, I had to do some research to understand what the different positions were and what function they served for the team.
2. Data Preprocessing
Once you understand your data, you move on to data preprocessing, which is where you'll spend the majority of your time as a data scientist. This is when you manipulate the data so that it can be modeled properly. As I said before, there is no universal way to go about this. However, there are a number of essential things that you should consider, which we'll go through below.
Feature Imputation
Feature imputation is the process of filling in missing values. This is important because most machine learning models don't work when there is missing data in the dataset.
One of the main reasons that I wanted to write this guide is specifically for this step. Many articles say that you should default to filling missing values with the mean or simply remove the row, and this is not necessarily true.
Ideally, you want to choose the method that makes the most sense – for example, if you were modeling people’s age and income, it wouldn’t make sense for a 14-year-old to be making a national average salary.
All things considered, there are a number of ways you can deal with missing values (a few of which are sketched in code after this list):
- Single value imputation: replacing missing values with the mean, median, or mode of a column
- Multiple value imputation: modeling features that have missing data and imputing missing data with what your model finds.
- K-Nearest neighbor: filling data with a value from another similar sample
- Deleting the row: this isn't an imputation technique, but it tends to be okay when the sample size is large enough that you can afford to lose a few rows.
- Others include: random imputation, moving window, most frequent, etc…
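As a rough sketch of the first few options, here's how they might look with scikit-learn's SimpleImputer and KNNImputer on a small made-up DataFrame (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical data with missing values, for illustration only
df = pd.DataFrame({
    "age": [14, 35, np.nan, 52],
    "income": [5000, 60000, 45000, np.nan],
})

# Single value imputation: fill each column's gaps with its median
median_imputed = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(df), columns=df.columns
)

# K-nearest neighbours: fill gaps using values from similar rows
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)

# Deleting rows with any missing value (only sensible with a large sample)
dropped = df.dropna()
```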
Feature Encoding
Feature encoding is the process of turning values (i.e. strings) into numbers, since most machine learning models require all values to be numeric.
There are a few ways that you can go about this:
- Label Encoding: Label encoding simply converts a feature's non-numerical values into numerical values, whether the feature is ordinal or not. For example, if a feature called car_colour had distinct values of red, green, and blue, then label encoding would convert these values to 1, 2, and 3 respectively. Be wary when using this method, because the numbers imply an order that may not actually exist; some ML models will be able to make sense of the encoding, but others won't.
- One Hot Encoding (aka get_dummies): One hot encoding works by creating a binary feature (1, 0) for each non-numerical value of a given feature. Reusing the example above, if we had a feature called car_colour, then one hot encoding would create three features, car_colour_red, car_colour_green, and car_colour_blue, each with a 1 or 0 indicating whether the row has that colour. Both approaches are sketched below.
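Here's a minimal sketch of both encodings using scikit-learn's LabelEncoder and pandas' get_dummies, reusing the hypothetical car_colour example. One small difference from the description above: LabelEncoder assigns integers starting from 0 rather than 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical feature, for illustration only
df = pd.DataFrame({"car_colour": ["red", "green", "blue", "green"]})

# Label encoding: each distinct value gets an integer (0, 1, 2, ...)
df["car_colour_label"] = LabelEncoder().fit_transform(df["car_colour"])

# One hot encoding: one binary column per distinct value
one_hot = pd.get_dummies(df["car_colour"], prefix="car_colour")
print(one_hot)
```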
Feature Normalization
Most machine learning algorithms don't perform well when numerical values are on different scales, e.g. height in centimeters and weight in pounds. The k-nearest neighbors algorithm is a prime example of one that suffers when features are on different scales. Normalizing or standardizing the data can help with this problem.
- Feature normalization rescales the values so that they're within a range of [0, 1].
- Feature standardization rescales the data to have a mean of 0 and a standard deviation of 1. Both are sketched in the code below.
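A minimal sketch of both, using scikit-learn's MinMaxScaler for normalization and StandardScaler for standardization on made-up height and weight columns:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales
df = pd.DataFrame({"height_cm": [150, 180, 165], "weight_lbs": [120, 200, 150]})

# Normalization: rescale each feature to the [0, 1] range
normalized = MinMaxScaler().fit_transform(df)

# Standardization: rescale each feature to mean 0 and standard deviation 1
standardized = StandardScaler().fit_transform(df)
```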
Feature Engineering
Feature engineering is the process of transforming raw data into features that better represent the underlying problem that one is trying to solve. There’s no specific way to go about this step but here are some things that you can consider:
- Converting a DateTime variable to extract just the day of the week, the month of the year, etc…
- Creating bins or buckets for a variable. (eg. for a height variable, can have 100–149cm, 150–199cm, 200–249cm, etc.)
- Combining multiple features and/or values to create a new one. For example, one of the most accurate models for the Titanic challenge engineered a new variable called "Is_women_or_child", which was True if the person was a woman or a child and False otherwise. These ideas are sketched in the code below.
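Here's a rough sketch of what these three ideas could look like in pandas; the column names and thresholds are made up for illustration:

```python
import pandas as pd

# Hypothetical raw data, for illustration only
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-01-04", "2021-06-15"]),
    "height_cm": [148, 201],
    "sex": ["female", "male"],
    "age": [30, 9],
})

# Extract parts of a DateTime variable
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["month"] = df["timestamp"].dt.month

# Bin a numerical variable into buckets
df["height_bucket"] = pd.cut(df["height_cm"], bins=[100, 150, 200, 250],
                             labels=["100-149", "150-199", "200-249"])

# Combine features into a new one (in the spirit of "Is_women_or_child")
df["is_woman_or_child"] = (df["sex"] == "female") | (df["age"] < 16)
```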
Feature Selection
Next is feature selection, which is choosing the most relevant/valuable features of your dataset. There are a few methods that I like to use that you can leverage to help you with selecting your features:
- Feature importance: some algorithms like random forests or XGBoost allow you to determine which features were the most "important" in predicting the target variable’s value. By quickly creating one of these models and conducting feature importance, you’ll get an understanding of which variables are more useful than others.
- Dimensionality reduction: one of the most common dimensionality reduction techniques, Principal Component Analysis (PCA), takes a large number of features and uses linear algebra to reduce them to fewer features. Both approaches are sketched below.
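As a sketch of both approaches, here's a quick random forest feature importance ranking and a PCA reduction on one of scikit-learn's bundled example datasets:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Example dataset bundled with scikit-learn, used only for illustration
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Feature importance from a quick random forest
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
importances = sorted(zip(X.columns, forest.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)
print(importances[:5])  # the five most "important" features

# Dimensionality reduction with PCA (in practice, scale the data first)
X_reduced = PCA(n_components=5).fit_transform(X)
print(X_reduced.shape)
```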
Dealing with Data Imbalances
One other thing that you'll want to consider is data imbalance. For example, if there are 5,000 examples of one class (e.g. not fraudulent) but only 50 examples of another class (e.g. fraudulent), then you'll want to consider one of the following:
- Collecting more data – this always works in your favor but is usually not possible or too expensive.
- You can oversample or undersample the data using the imbalanced-learn Python package (part of scikit-learn-contrib), as sketched below.
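The package installs as imbalanced-learn and imports as imblearn. Here's a minimal sketch of over- and undersampling on a synthetic imbalanced dataset:

```python
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Synthetic imbalanced dataset, for illustration only
X, y = make_classification(n_samples=5050, weights=[0.99, 0.01], random_state=42)
print(Counter(y))  # roughly 5,000 of one class and 50 of the other

# Oversample the minority class
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)

# Or undersample the majority class
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
```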
3. Data Splitting
Last comes splitting your data. I'm just going to give a generic, generally agreed-upon framework that you can use here.
Typically you’ll want to split your data into three sets:
- Training Set (70–80%): this is what the model learns on
- Validation Set (10–15%): the model’s hyperparameters are tuned on this set
- Test set (10–15%): finally, the model's performance is evaluated on this set. If you've prepared the data correctly, the results from the test set should give a good indication of how the model will perform in the real world. A sketch of one way to produce this split follows below.
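Here's a minimal sketch of one common way to produce such a split with two calls to scikit-learn's train_test_split (the exact percentages are just one reasonable choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# First carve out the test set (15%), then split the rest into train/validation
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.176, random_state=42
)
# 0.176 of the remaining 85% is roughly 15% of the full dataset,
# leaving about 70% for training
```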
Thanks for Reading!
I hope you’ve learned a thing or two from this. By reading this, you should now have a general framework in mind when it comes to data preparation. There are many things to consider, but having resources like this to remind you is always helpful.
If you follow these steps and keep these things in mind, you’ll definitely have your data better prepared, and you’ll ultimately be able to develop a more accurate model!
Terence Shin