Deep Learning with Tabular Data: Visualization, Data Processing, Categorical Embeddings

Sisil Mehta
Towards Data Science
Oct 6, 2018


(Reach out if you'd like to collaborate on projects.)

As with other machine learning algorithms, it's important to understand your data when building deep learning models. Let's use a simple tabular dataset to visualize the data, draw some conclusions, and see how different processing techniques can improve the performance of a deep learning model.

The King County House Prices dataset has 21613 data points about the sale prices of houses in King County. It has about 19 feature columns, shown below, which are a mix of date, numerical and categorical data. Using the data out of the box, we get an MAE of 149167.61. However, with some data processing, we can reach an MAE as low as 69471.41. That's more than a 50% improvement.

Data Visualization

Let's dive into the data and see if we can get some insights before we build our model. We will be using Seaborn, a Python statistical visualization library built on top of matplotlib. Below is the distribution of each of the features with respect to the price, plotted with a sketch like the one that follows.
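As a rough sketch of how these plots can be produced, the snippet below loads the dataset and plots the distribution of a few numeric columns. The file name kc_house_data.csv and the column names are assumptions based on the public Kaggle version of the King County dataset; adjust them to match your copy.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed file name for the Kaggle King County house prices CSV.
df = pd.read_csv("kc_house_data.csv")

# A few numeric features plus the sale price (column names assumed from the Kaggle dataset).
features = ["bedrooms", "bathrooms", "sqft_living", "sqft_above", "sqft_basement", "price"]

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, col in zip(axes.ravel(), features):
    # Histogram with a kernel density estimate overlaid for each feature.
    sns.histplot(df[col], kde=True, ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```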

  • The year-of-sale data is bimodal because the dataset covers two years: 2014 and 2015.
  • Features such as the number of bedrooms, number of bathrooms, square footage of the living area, square footage above ground, square footage of the basement, and the sale price are unimodal, and we can already tell that for houses with higher values of these features the model may have a higher mean absolute error.

Let's look at the correlation of each feature with the label column, that is, the sale price of the house.
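A minimal way to compute this, continuing from the snippet above (dropping the id and date columns, which are assumptions about which columns are non-predictive here):

```python
# Keep only the numeric, predictive columns before computing correlations.
numeric_df = df.drop(columns=["id", "date"])

# Correlation of every feature with the sale price, sorted from strongest to weakest.
corr_with_price = numeric_df.corr()["price"].sort_values(ascending=False)
print(corr_with_price)

# The full correlation matrix as a heatmap.
plt.figure(figsize=(12, 10))
sns.heatmap(numeric_df.corr(), cmap="coolwarm")
plt.show()
```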

Building a deep learning model

I'll be using the Keras 2.0 library to build a sequential neural network that learns a regression model for predicting the house prices.

Import the dataset and then divide it into training, validation and test sets.
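One way to do this split is sketched below. The 70/15/15 ratio and the random seed are assumptions; the article does not state the exact split it used.

```python
from sklearn.model_selection import train_test_split

# Features and label; the dropped column names are assumptions from the Kaggle dataset.
X = df.drop(columns=["id", "date", "price"])
y = df["price"]

# 70% train, then split the remaining 30% evenly into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
```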

The model has two hidden dense layers and a final linear layer that produces the regression output. We'll use the mean squared error as the loss function.
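A minimal sketch of such a model is shown below. The layer widths, optimizer, epoch count and batch size are illustrative assumptions, since the article does not specify them.

```python
from keras.models import Sequential
from keras.layers import Dense

# Two hidden dense layers and a single linear output unit for the regression.
model = Sequential([
    Dense(64, activation="relu", input_shape=(X_train.shape[1],)),
    Dense(32, activation="relu"),
    Dense(1),
])

# Mean squared error loss; track mean absolute error so we can compare runs.
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

model.fit(X_train.values, y_train.values,
          validation_data=(X_val.values, y_val.values),
          epochs=50, batch_size=64)
```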

Using the data just out of the box gives us an MAE of 149167.61. Now let's look at some data processing techniques and how they can help improve the performance.

Whitening the data

Some of our features are on the order of 0.1, some on the order of 10, some on the order of hundreds, and some on the order of tens of thousands. This disparity in scale can cause the larger-valued features to dominate the others. 'Whitening' the data helps normalize the scale of the different features.

The whitening operation takes the data in the eigenbasis and divides every dimension by the square root of its eigenvalue to normalize the scale. The geometric interpretation of this transformation is that if the input data is a multivariate Gaussian, then the whitened data will be a Gaussian with zero mean and an identity covariance matrix. Assuming our data is roughly Gaussian, we'll apply a simpler per-column version of this idea and compute the z-score for each column.

An important point about the preprocessing is that any preprocessing statistics (e.g. the data mean) must be computed only on the training data, and then applied to the validation/test data.
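A sketch of the z-score normalization, reusing the training/validation/test frames from the split above:

```python
# Compute the mean and standard deviation on the training set ONLY,
# then apply those same statistics to the validation and test sets.
train_mean = X_train.mean()
train_std = X_train.std()

X_train_norm = (X_train - train_mean) / train_std
X_val_norm = (X_val - train_mean) / train_std
X_test_norm = (X_test - train_mean) / train_std
```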

Once we re-run the model with the normalized data, our MAE goes down to 99622.24. This is a huge improvement over using the data as is.

Converting ‘long’ and ‘lat’ to distance

While longitude and latitude can be used as is, we can get more insight by transforming them into the distance of the house from a fixed location, plus booleans representing the direction in which the house sits relative to that location.
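One possible implementation of this feature engineering is below. The article does not say which fixed reference point it used, so the mean latitude/longitude of the training set is an assumption, and the degree-space Euclidean distance is only a rough proxy for actual distance.

```python
import numpy as np

# Assumed reference point: the mean location of the training data.
ref_lat = X_train["lat"].mean()
ref_long = X_train["long"].mean()

def add_location_features(frame):
    """Replace raw lat/long with a distance and four direction booleans."""
    frame = frame.copy()
    # Euclidean distance in degree space; a crude but usable "how far" signal.
    frame["distance"] = np.sqrt((frame["lat"] - ref_lat) ** 2 +
                                (frame["long"] - ref_long) ** 2)
    frame["greater_long"] = (frame["long"] > ref_long).astype(int)
    frame["less_long"] = (frame["long"] < ref_long).astype(int)
    frame["greater_lat"] = (frame["lat"] > ref_lat).astype(int)
    frame["less_lat"] = (frame["lat"] < ref_lat).astype(int)
    return frame.drop(columns=["lat", "long"])

X_train = add_location_features(X_train)
X_val = add_location_features(X_val)
X_test = add_location_features(X_test)
```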

Using the new features ‘distance’, ‘greater_long’, ‘less_long’, ‘greater_lat’ and ‘less_lat’, we train the model again over the dataset. The resulting MAE over the training set goes further down to 87231.59.

Working with categorical data

Zip code is a feature column in the data. We have been treating it as a normal numerical value; however, the different zip code values don't have an ordinal relationship. They also encode deeper relationships, for example, some zip codes have more expensive houses because they are closer to schools or transportation.

Categorical data can be expressed richly with the help of embeddings. You can read more about embeddings in this article from fast.ai.

In order to use embeddings in Keras 2.0, we have to use the functional API. One of the inputs will be the zip code data, which we will pass through an embedding layer. The other input will be a vector of the remaining features.
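A sketch of such a two-input model is shown below, building on the earlier snippets. The embedding size (8), the layer widths, and the layer name "zip_embedding" are assumptions, not values from the article.

```python
from keras.models import Model
from keras.layers import Input, Embedding, Flatten, Dense, concatenate

# Map each zip code to a contiguous integer index for the Embedding layer.
zip_codes = sorted(df["zipcode"].unique())
zip_to_idx = {z: i for i, z in enumerate(zip_codes)}

# Input 1: the zip code index, embedded into an 8-dimensional vector.
zip_input = Input(shape=(1,), name="zipcode")
zip_emb = Embedding(input_dim=len(zip_codes), output_dim=8,
                    name="zip_embedding")(zip_input)
zip_emb = Flatten()(zip_emb)

# Input 2: all remaining (normalized) numeric features.
numeric_cols = [c for c in X_train_norm.columns if c != "zipcode"]
num_input = Input(shape=(len(numeric_cols),), name="numeric")

# Concatenate the embedding with the numeric features and regress on the price.
x = concatenate([zip_emb, num_input])
x = Dense(64, activation="relu")(x)
x = Dense(32, activation="relu")(x)
output = Dense(1)(x)

model = Model(inputs=[zip_input, num_input], outputs=output)
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
```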

This model is easier to follow if we print its summary and visualize it.

Training this model and running it on the test dataset, the MAE drops drastically to 69471.41.

Let's visualize the zip code embeddings using t-SNE, like so:
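A minimal sketch of that visualization, assuming the embedding layer was named "zip_embedding" as in the functional-API snippet above:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Pull the learned embedding matrix out of the trained model: one row per zip code.
emb_weights = model.get_layer("zip_embedding").get_weights()[0]

# Project the embeddings down to two dimensions with t-SNE.
emb_2d = TSNE(n_components=2, random_state=42).fit_transform(emb_weights)

plt.figure(figsize=(10, 8))
plt.scatter(emb_2d[:, 0], emb_2d[:, 1])
for (px, py), z in zip(emb_2d, zip_codes):
    plt.annotate(str(z), (px, py), fontsize=7)
plt.title("t-SNE projection of the learned zip code embeddings")
plt.show()
```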

We can see a cluster of zip codes towards the left whose houses had higher sale prices.

Conclusion

So while we started off with an MAE of 149167.61, after normalizing the data, engineering the location features, and using categorical embeddings, we have reduced the MAE to 69471.41.
