
Data Cleaning and Feature Engineering: The Underestimated Parts of Machine Learning

Data Cleaning and Feature Engineering Applied to the Messy Chocolate Bar Ratings Data Set

Photo by Vicky Yu on Unsplash

Introduction

Everyone wants to jump into the field of machine learning, load the data, and directly train a machine learning algorithm on it. In university, we only worked with clean data sets like MNIST or CIFAR-10, where you can start right away with applying different machine learning algorithms and comparing them. In reality, however, you rarely get a perfect, cleaned data set. You normally get a quite messy data set that needs a lot of data cleaning and feature engineering before machine learning can be applied. In general, the machine learning itself is the smaller part of the whole process, while data cleaning and feature engineering make up the larger part. This article showcases data cleaning and feature engineering on the chocolate bar ratings data set from Kaggle. This data set is quite messy and allows me to present some data cleaning ideas. The notebooks can be found on my GitHub page. Some ideas for feature engineering are inspired by this notebook on Kaggle.

Data Set

The chocolate bar ratings data set contains information about chocolate bars and their ratings. The goal is to use the given information about a chocolate bar to predict its rating. Table 1 shows all available features with a short description of each.

Table 1: An overview of all features available in the chocolate bar ratings data set (table by author).

The description of the data set states that each chocolate is evaluated using a combination of objective qualities and subjective interpretation. The rating itself is done using a chocolate bar from one batch, with batch numbers, vintages and review dates being included in the data set. Table 2 shows an overview of the different rating levels and their meaning.

Table 2: An overview of all rating levels and their meaning (table by author).

Exploratory Data Analysis, Data Cleaning and Feature Engineering

This chapter describes the process of exploring the data set, cleaning the data and creating some new features using feature engineering. The goal is to prepare the data such that it can directly be used for machine learning afterwards. The data is loaded using Pandas and stored in a Pandas data frame.

As a first step, the Pandas head() function is used to get an initial view of the data. The column names are not well formatted and contain newline characters. Figure 1 shows the head of the data frame.

Figure 1: Head of Pandas data frame containing the chocolate bar ratings data set (image by author).

Therefore, the column names are renamed first (Figure 2).
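A minimal sketch of this renaming step could look as follows. The file name and the data frame name df are assumptions, as are the names REF and Review_Date; the other column names are the ones used throughout this article, and the actual code is shown in Figure 2:

```python
import pandas as pd

# Load the raw data set (file name as provided by Kaggle, here an assumption).
df = pd.read_csv("flavors_of_cacao.csv")

# The raw column names contain newline characters, so assign clean names.
df.columns = [
    "Company", "Spec_Bean_Origin_or_Bar_Name", "REF", "Review_Date",
    "Cocoa_Percent", "Company_Location", "Rating", "Bean_Type",
    "Broad_Bean_Origin",
]
```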

Secondly, the Pandas info() function is used to quickly check which data types are present and whether data is missing. The columns Company, Spec_Bean_Origin_or_Bar_Name, Cocoa_Percent, Company_Location, Bean_Type and Broad_Bean_Origin are categorical features and have to be transformed. Looking at the missing values, only the features Broad_Bean_Origin and Bean_Type contain one missing value each out of 1795 total samples. However, looking at the data frame head (Figure 1), the first five rows of the feature Bean_Type are empty and should therefore also count as missing values. Hence, the first entry of Bean_Type is fetched to check its value and to use it for replacing these values with NaN (Figure 3).
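A sketch of this check and replacement, under the same naming assumptions; in this data set the "empty" cells turn out to contain a whitespace-like placeholder rather than real NaN values:

```python
import numpy as np

# Quick overview of data types and non-null counts.
df.info()

# Fetch the first Bean_Type entry to see what the "empty" cells actually contain.
empty_value = df["Bean_Type"].iloc[0]
print(repr(empty_value))  # reveals a whitespace-like placeholder, not NaN

# Replace this placeholder with a proper NaN in both affected columns.
df["Bean_Type"] = df["Bean_Type"].replace(empty_value, np.nan)
df["Broad_Bean_Origin"] = df["Broad_Bean_Origin"].replace(empty_value, np.nan)
```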

After replacing the empty values of Bean_Type with NaN, the info() function is called again and now shows that the feature contains 891 missing values, which is almost 50% of the total number of samples. The feature Broad_Bean_Origin also had some of these empty values and now contains 74 missing values.

As a next step, the feature Cocoa_Percent is transformed by removing the trailing percent sign and casting the column to a numerical data type.
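This transformation is essentially a one-liner in Pandas (a sketch, same assumptions as above):

```python
# Drop the trailing "%" and cast the column from string to float.
df["Cocoa_Percent"] = df["Cocoa_Percent"].str.rstrip("%").astype(float)
```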

Afterwards, lists of numerical and categorical features are created and a histogram is plotted for each numerical feature (Figure 4 and Figure 5).
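A minimal sketch of how these feature lists and histograms could be produced, using the column data types to separate the two groups:

```python
import matplotlib.pyplot as plt

# Split the columns into numerical and categorical features by dtype.
numerical_features = df.select_dtypes(include="number").columns.tolist()
categorical_features = df.select_dtypes(include="object").columns.tolist()

# One histogram per numerical feature.
df[numerical_features].hist(bins=30, figsize=(12, 8))
plt.show()
```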

Figure 5: Histograms of numerical features of chocolate bar ratings data set (image by author).

As one can see, the review period in the data set goes from 2006 to 2017. The distribution of the feature Cocoa_Percent seems to be slightly skewed and can be transformed later. Most of the ratings are between three and four, meaning that most chocolate bars are "Satisfactory".

A closer examination of the categorical features shows that they contain a lot of categories (Table 3), and some categories contain only a single example, which is not very helpful for a machine learning algorithm. Therefore, these features are transformed first.

Table 3: An overview of all categorical features and the number of different categories for each feature (image by author).

Transforming the Bean Type

The first categorical feature to be transformed is Bean_Type. Figure 6 shows the histogram of all available bean types.

Figure 6: Histogram of all the available bean types after the transformations (image by author).

The most frequent bean types are Criollo, Trinitario and Forastero. Most of the other bean types are simply mixtures of these types. Blend also appears often, but only indicates that the bean type is a mixture of other bean types. Some other, less common bean types are Beniano, Matina, EET, Nacional and Amazon. Therefore, one column is created for each of the bean types mentioned here, and each sample is then checked for the presence of these categories. If a category occurs, a one is added to the corresponding category column for that sample. If a sample contains multiple bean types, multiple ones are set. This way, the intermixed bean types can be mapped to categorical features. The bean type also contains a lot of missing values. Here, the missing values are mapped to the category Unknown_Bean_Type, because other data imputation methods are hard to apply when almost 50% of the data is missing.

It could also be useful to know whether a chocolate bar contains a mixture of bean types or not. Maybe this already has an influence on the final rating. Therefore, a new feature called Num_Beans is added, which contains the number of different bean types a chocolate bar contains.

Finally, a feature Is_Blend is added that indicates whether a chocolate bar contains a mixture of beans, and samples with bean type Blend also get a one for this feature.

Figure 7 shows the code for performing the steps mentioned above.
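Since Figure 7 is not reproduced here, the following is a condensed sketch of these steps; the list of bean types and the matching logic are simplified compared to the notebook:

```python
# The named bean types to encode as indicator columns.
named_types = ["Criollo", "Trinitario", "Forastero",
               "Beniano", "Matina", "EET", "Nacional", "Amazon"]

# Almost 50% of the bean types are missing, so imputation is difficult;
# map missing values to an explicit Unknown_Bean_Type category instead.
df["Bean_Type"] = df["Bean_Type"].fillna("Unknown_Bean_Type")

# One indicator column per bean type: 1 if the (possibly intermixed)
# entry mentions that type, 0 otherwise. Multiple ones are possible.
for bean in named_types + ["Unknown_Bean_Type"]:
    df[bean] = df["Bean_Type"].str.contains(bean, regex=False).astype(int)

# Num_Beans: how many different bean types a chocolate bar contains.
df["Num_Beans"] = df[named_types].sum(axis=1)

# Is_Blend: one if several types are mixed or the entry is labelled "Blend".
df["Is_Blend"] = ((df["Num_Beans"] > 1)
                  | df["Bean_Type"].str.contains("Blend", regex=False)).astype(int)
```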

Transforming the Specific Geo-Region of Origin for the Bar

The next feature to be transformed is Spec_Bean_Origin_or_Bar_Name. It has 1039 different categories, which is a lot. Many categories also contain only one sample, which would not provide any useful information for a machine learning algorithm.

A closer examination makes clear that a lot of country and city names are followed by a comma and information that is not useful. Therefore, all values are split at the comma and only the first part is kept. This already reduces the number of categories to 682. In addition, some categories contain a "w/ nibs" suffix after the country or city name. This is also removed and reduces the number of categories to 672. Finally, the categories occurring only once are mapped to the category "Other", which reduces the number of different categories to 209. Figure 8 shows the code for these transformations.
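A sketch of these three transformations (the exact code is in Figure 8); collapse_singletons is a helper name I made up for the recurring "map rare categories to Other" step:

```python
def collapse_singletons(series, label="Other"):
    """Map categories that occur only once to a catch-all category."""
    counts = series.value_counts()
    singletons = counts[counts == 1].index
    return series.where(~series.isin(singletons), label)

col = "Spec_Bean_Origin_or_Bar_Name"

# Keep only the part before the first comma.
df[col] = df[col].str.split(",").str[0].str.strip()

# Remove the "w/ nibs" suffix some entries carry.
df[col] = df[col].str.replace("w/ nibs", "", regex=False).str.strip()

# Map the categories occurring only once to "Other".
df[col] = collapse_singletons(df[col])
```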

Transforming Broad Bean Origin

As a next feature, the broad bean origin (Broad_Bean_Origin) is examined more closely. In total, there are 100 different categories, and some categories are again only comma-separated lists of other, more common categories. In addition, some countries are written differently in different rows (e.g. Dominican Republic, D.R., D. Republic, Dominican R.). Therefore, regular expressions are used to ensure that all spellings are mapped to the same country.
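As an illustration, a single regular expression can collapse the different spellings of the Dominican Republic; the pattern below is my own sketch, and the notebook handles more countries and spellings:

```python
# Map "Dominican Republic", "D.R.", "D. Republic" and "Dominican R."
# to one canonical spelling.
pattern = r"Dominican\s*R\w*\.?|D\.\s*R\w*\.?"
df["Broad_Bean_Origin"] = df["Broad_Bean_Origin"].str.replace(
    pattern, "Dominican Republic", regex=True
)
```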

The broad bean origin feature also contains 74 missing values. These missing values are replaced by the category Unknown_Bean_Origin.

After formatting the countries and replacing the missing values, the comma-separated lists are split to get a list of categories, and then the same approach as for the feature Bean_Type is applied.

After all these transformations, only 51 different categories are left. Figure 9 shows the final histogram after all transformations on that feature have been applied. The complete code is not included here because it is quite long. If you are interested in seeing how I made these transformations, feel free to check the Jupyter notebook on my GitHub page.

Figure 9: Histogram of all the available broad bean origins after the transformations (image by author).

Transforming the Company Location

The feature Company_Location is the last categorical feature to be transformed. Here, only the company locations occurring a single time are again replaced with the category "Other".
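Reusing the hypothetical collapse_singletons helper from the sketch above, this step becomes a single line:

```python
# Map company locations that occur only once to "Other".
df["Company_Location"] = collapse_singletons(df["Company_Location"])
```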

Final Transformations

Finally, all remaining categorical columns are transformed using the Pandas get_dummies() function.
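A sketch, recomputing the remaining object-typed columns first:

```python
# One-hot encode all remaining categorical (object-typed) columns.
remaining_categorical = df.select_dtypes(include="object").columns
df = pd.get_dummies(df, columns=remaining_categorical)
```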

As a last transformation, the skewed feature Cocoa_Percent is transformed using a log transformation. This reduces the skewness from 1.06 to 0.3. Figure 10 shows the histograms before and after the transformation. As one can see, the transformed histogram looks far less skewed than the one before the transformation.
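A sketch of the transformation and the skewness check; Cocoa_Percent is strictly positive, so a plain log is safe here:

```python
import numpy as np

print(df["Cocoa_Percent"].skew())   # about 1.06 before the transformation
df["Cocoa_Percent"] = np.log(df["Cocoa_Percent"])
print(df["Cocoa_Percent"].skew())   # about 0.3 afterwards
```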

As a last step, the data is split into a training set and a hold-out test set, using 20% of the data for testing.
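A sketch using scikit-learn; the file names and the fixed random seed are my choices:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the samples as test set; a fixed seed keeps the split reproducible.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Store both sets as CSV files for the machine learning notebook.
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
```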

Figure 10: Histograms of numerical feature cocoa percent before the log transformation (left) and after the transformation (right) (image by author).

Conclusion

The messy data set of chocolate bar ratings is now cleaned and some potentially useful features have been added. The categorical features are encoded, and a training set plus a hold-out test set are created and stored in CSV files. They can now be used directly to perform some machine learning on this data set. If you are interested in seeing how I applied machine learning to this data set, I encourage you to take a look at the corresponding Jupyter notebook on my GitHub page.

Future Work

Currently, categories occurring only once are mapped to a new category "Other". But maybe there is another way in which these rare categories can be further reduced and mapped to more meaningful categories. The same applies to missing values, which are currently mapped to their own "Unknown" categories. Here, data imputation strategies could be used, and the performance of the machine learning algorithms could be compared to the strategies presented here in order to check whether this further boosts performance.

