Source: 4047259 at pixabay.com

Data Science for Startups: Deep Learning

Ben Weber
Towards Data Science
13 min read · Jun 23, 2018

Part ten of my ongoing series about building a data science discipline at a startup, and the first article ported from R to Python. You can find links to all of the posts in the introduction, and a book based on the R series on Amazon.

This blog post is a brief introduction to using the Keras deep learning framework to solve classic (shallow) machine learning problems. It presents a case study from my experience at Windfall Data, where I worked on a model to predict housing prices for hundreds of millions of properties in the US.

I recently started reading “Deep Learning with R”, and I’ve been really impressed with the support that R has for digging into deep learning. However, now that I’m porting my blog series to Python, I’ll be using the Keras library directly, rather than the R wrapper. Luckily, there’s also a Python version of the book.

One of the use cases presented in the book is predicting prices for homes in Boston, which is an interesting problem because homes can have such wide variations in value. This is a machine learning problem that is probably best suited for classical approaches, such as XGBoost, because the data set is structured rather than perceptual. However, it’s also a data set where deep learning provides a really useful capability: the ease of writing new loss functions that may improve the performance of predictive models. The goal of this post is to show how deep learning can potentially be used to improve shallow learning problems by using custom loss functions.

One of the problems that I’ve encountered a few times when working with financial data is that often you need to build predictive models where the output can have a wide range of values, across different orders of magnitude. For example, this can happen when predicting housing prices, where some homes are valued at $100k and others are valued at $10M. If you throw standard machine learning approaches at these problems, such as linear regression or random forests, often the model will overfit the samples with the highest values in order to reduce metrics such as mean absolute error. However, what you may actually want is to treat the samples with similar weighting, and to use an error metric such as relative error that reduces the importance of fitting the samples with the largest values.

# Standard approach to linear regression
fit <- lm(y ~ x1 + x2 + x3 + ... + x9, data=df)
# Linear regression with a log-log transformation
fit <- nls(log10(y) ~ log(x1*b1 + x2*b2 + ... + x9*b9),
           data = df, start = list(b1=1, b2=1, ..., b9=1))

I’ve actually done this explicitly in R, using packages such as nonlinear least squares (nls). Python has nonlinear least squares libraries as well, but I didn’t explore that option while working on the housing problem. The code sample above shows how to build a linear regression model using R’s built-in optimizer, which will overweight samples with large label values, and the nls approach, which applies a log transformation to both the predicted values and the labels, giving the samples relatively equal weight. The problem with the second approach is that you have to explicitly state how to use the features in the model, creating a feature engineering problem. An additional problem is that this approach cannot be applied directly to other algorithms, such as random forests, without writing your own likelihood function and optimizer. This is for a specific scenario where you want to have the error term outside of the log transform, not a scenario where you can simply apply a log transformation to the label and all input variables.
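
For reference, a similar log-transformed fit can be set up in Python with scipy.optimize.curve_fit. The sketch below uses synthetic stand-in data and applies log10 to both sides of the model, so it illustrates the idea rather than reproducing the exact nls fit above:

import numpy as np
from scipy.optimize import curve_fit

# Synthetic stand-in for nine housing features (x1..x9) and labels (y);
# in practice these would come from the same data frame used above
rng = np.random.RandomState(0)
X = rng.uniform(1, 10, size=(1000, 9))
y = X @ np.arange(1, 10, dtype=float)

def log_linear(X, *betas):
    # Linear combination of the features, compared on a log scale
    return np.log10(X @ np.array(betas))

# Fit log10(y) ~ log10(x1*b1 + ... + x9*b9), analogous to the nls call above
betas, _ = curve_fit(log_linear, X, np.log10(y), p0=np.ones(9))
print(betas)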

Deep learning provides an elegant solution to handling these types of problems, where instead of writing a custom likelihood function and optimizer, you can explore different built-in and custom loss functions that can be used with the different optimizers provided. This post will show how to write custom loss functions in Python when using Keras, and show how using different approaches can be beneficial for different types of data sets. I’ll first present a classification example using Keras, and then show how to use custom loss functions for regression.

The image below is a preview of what I’ll cover in this post. It shows the training history of four different Keras models trained on the Boston housing prices data set. Each model uses a different loss function, but all are evaluated on the same performance metric, mean absolute error. For the original data set, the custom loss functions do not improve the performance of the model, but on a modified data set, the results are more promising.

Performance of the 4 loss functions on the original housing prices data set. All models used MAE for the performance metric.

Installation

The first step in getting started with deep learning is setting up an environment. I covered setting up Jupyter on an AWS EC2 instance in a past post. We’ll install two additional libraries for Python: tensorflow and keras. Also, it’s useful to spin up a larger machine, such as t2.xlarge, when working on deep learning problems. Here are the steps I used to set up a deep learning environment on EC2. However, this configuration does not support GPU acceleration.

# Jupyter setup 
sudo yum install -y python36
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
sudo python36 get-pip.py
pip3 install --user jupyter
# Deep Learning set up
pip3 install --user tensorflow
pip3 install --user keras
pip3 install --user matplotlib
pip3 install --user pandas
# Launch Jupyter
jupyter notebook --ip Your_AWS_Private_IP

Once you have connected to Jupyter, you can test your installation by running the following commands:

import keras
keras.__version__

The output should print that the TensorFlow backend is being used.
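
You can also query the backend directly as a quick sanity check:

from keras import backend as K

# Should return 'tensorflow' when the TensorFlow backend is active
print(K.backend())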

Classification with Keras

To get started with deep learning, we’ll build a binary classifier that predicts which users are most likely to purchase a specific game, given past purchases. We’ll use the data set that I presented in my post on recommender systems. Each row in the data set contains a label indicating whether the player purchased the game, and columns with values of 0 or 1 indicating purchases of other titles. The goal is to predict which users will purchase the game. The complete notebook for the code presented in this section is available here.

The general process for building models with Keras is as follows (a minimal code skeleton is sketched after the list):

  1. Set up the structure of the model
  2. Compile the model
  3. Fit the model
  4. Evaluate the model
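
As a quick preview, the sketch below shows how these four steps map onto Keras calls, using placeholder layer sizes and metrics, with the data-dependent steps commented out. The actual model for this problem is built step by step in the rest of this section.

from keras import models, layers

# 1. Set up the structure of the model
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10,)))
model.add(layers.Dense(1, activation='sigmoid'))

# 2. Compile the model
model.compile(optimizer='rmsprop', loss='binary_crossentropy',
              metrics=['accuracy'])

# 3. Fit the model (requires training data x, y)
# history = model.fit(x, y, epochs=100, batch_size=100, validation_split=0.2)

# 4. Evaluate the model (requires holdout data x_test, y_test)
# model.evaluate(x_test, y_test)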

I’ll discuss each of these steps in more detail below. First, we need to include the necessary libraries for keras and plotting:

import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import keras
from keras import models, layers
keras.__version__

Next, we download the data set and create training and test data sets. I’ve held out 5000 samples that we’ll use as a holdout data set. For the training data set, I split the data frame into input variables (x) and labels (y).

df = pd.read_csv(
"https://github.com/bgweber/Twitch/raw/master/Recommendations/games-expand.csv")
train = df[5000:]
test = df[:5000]
x = train.drop(['label'], axis=1)
y = train['label']

Now we can create a model to fit the data. The model below uses three layers of fully-connected neurons with relu activation functions. The input structure is specified in the first layer, which needs to match the width of the input data. The output layer is a single neuron with a sigmoid activation, since we are performing binary classification.

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10,)))
model.add(layers.Dropout(0.1))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dropout(0.1))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

Next, we specify how to optimize the model. We’ll use rmsprop for the optimizer and binary_crossentropy for the loss function. Instead of using accuracy for the metric, we’ll use ROC AUC since the data set has a large class imbalance. In order to use this metric, we can use the auc function provided by tensorflow.

def auc(y_true, y_pred):
    auc = tf.metrics.auc(y_true, y_pred)[1]
    keras.backend.get_session().run(
        tf.local_variables_initializer())
    return auc

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=[auc])

The last step is to train the model. The code below shows how to fit the model using the training data set, 100 training epochs with a batch size of 100, and a cross validation split of 20%.

history = model.fit(x,
                    y,
                    epochs=100,
                    batch_size=100,
                    validation_split=0.2,
                    verbose=0)

The progress of the model will be displayed during training if verbose is set to 1 or 2. To plot the results, we can use matplotlib to display the loss values for the training and validation data sets:

loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)
plt.figure(figsize=(10,6))
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.legend()
plt.show()

The resulting plot is shown below. While the loss value for the training data set continued to decrease with more epochs, the loss on the validation data set flattened out after about 10 epochs.

Plotting the loss values for the binary classifier.
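
One optional way to act on this observation, not used in the runs shown here, is Keras’s EarlyStopping callback, which halts training once the validation loss stops improving. A minimal sketch:

from keras.callbacks import EarlyStopping

# Stop once the validation loss has not improved for 5 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=5)
history = model.fit(x, y, epochs=100, batch_size=100,
                    validation_split=0.2, verbose=0,
                    callbacks=[early_stop])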

We can also plot the value of the AUC metric after each epoch, as shown below. Unlike the loss value, the AUC metric of the model on the validation data set continued to improve with additional training.
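
The plotting code mirrors the loss plot above. Keras stores metric histories under the metric function’s name, so for the auc metric defined earlier the keys should be 'auc' and 'val_auc' (an assumption that may vary with the Keras version):

auc_values = history.history['auc']
val_auc_values = history.history['val_auc']
epochs = range(1, len(auc_values) + 1)
plt.figure(figsize=(10,6))
plt.plot(epochs, auc_values, 'bo', label='Training AUC')
plt.plot(epochs, val_auc_values, 'b', label='Validation AUC')
plt.legend()
plt.show()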

Plotting the AUC metric for the binary classifier.

A final step is evaluating the performance of the model on the holdout data set. The loss value and AUC metric can be calculated for the holdout data using the code shown below, which results in an AUC of ~0.82.

x_test = test.drop(['label'], axis=1)
y_test = test['label']
results = model.evaluate(x_test, y_test, verbose = 0)
results

This section discussed building a simple classifier using a deep learning model with the Keras framework. Generally, deep learning won’t perform as well as XGBoost on shallow learning problems like this, but it’s still a useful approach to explore. In the next section, I discuss how custom loss functions can be used to improve model training.

Custom Loss Functions

One of the great features of deep learning is that it can be applied both to deep problems with perceptual data, such as audio and video, and to shallow problems with structured data. For shallow learning (classic ML) problems, you can often see improvements over shallow approaches, such as XGBoost, by using a custom loss function that provides a useful signal.

However, not all shallow problems can benefit from deep learning. I’ve found custom loss functions to be useful when building regression models that need to create predictions for data with different orders of magnitude, for example when predicting housing prices in an area where the values can range significantly. To show how this works in practice, we’ll use the Boston housing data set provided by Keras:

This data set includes housing prices for a suburb of Boston during the 1970s. Each record has 13 attributes that describe properties of the home, and there are 404 records in the training data set and 102 records in the test data set. The data set can be loaded in Keras with boston_housing.load_data(). The labels in the data set represent the prices of the homes, in thousands of dollars. The prices range from $5k to $50k, and the distribution of prices is shown in the histogram on the left. The original data set has values with similar orders of magnitude, so custom loss functions may not be useful for fitting this data. The histogram on the right shows a transformation of the labels which may benefit from using a custom loss.

The Boston data set with original prices and the transformed prices.

To transform the data, I converted the labels back into absolute prices, squared the result, and then divided by a large factor. This results in a data set where the difference between the highest and lowest prices is 100x instead of 10x. We now have a prediction problem that can benefit from the use of a custom loss function. The Python code to generate these plots is shown below.

# Load the housing data (also used in the next section)
from keras.datasets import boston_housing
(x_train, y_train), (x_test, y_test) = boston_housing.load_data()

# Original Prices
plt.hist(y_train)
plt.title("Original Prices")
plt.show()

# Transformed Prices
plt.hist((y_train*1000)**2/2500000)
plt.title("Transformed Prices")
plt.show()

Loss Functions in Keras

Keras includes a number of useful loss functions that can be used to train deep learning models. Approaches such as mean_absolute_error() work well for data sets where the values are of roughly equal orders of magnitude. There are also functions such as mean_squared_logarithmic_error() which may be a better fit for the transformed housing data. Here are some of the loss functions provided by Keras:

mean_absolute_error()
mean_absolute_percentage_error()
mean_squared_error()
mean_squared_logarithmic_error()

To really understand how these work, we’ll need to jump into the Python losses code in Keras. The first loss function we’ll explore is the mean squared error, defined below. This function computes the difference between predicted and actual values, squares the result (which makes all of the values positive), and then calculates the mean value. Note that the function uses backend operations that operate on tensor objects rather than Python primitives.

def mean_squared_error(y_true, y_pred):
    return K.mean(K.square(y_pred - y_true), axis=-1)

The next built-in loss function we’ll explore calculates the error based on the difference between the natural log of the predicted and target values. It is defined here and shown below. The function uses the clip operation to make sure that negative values are not passed to the log function, and adding 1 to the clip result makes sure that all log transformed inputs will have non-negative results. This function is similar to the one we will define.

def mean_squared_logarithmic_error(y_true, y_pred):
    first_log = K.log(K.clip(y_pred, K.epsilon(), None) + 1.)
    second_log = K.log(K.clip(y_true, K.epsilon(), None) + 1.)
    return K.mean(K.square(first_log - second_log), axis=-1)

The two custom loss functions we’ll explore are defined in the Python code segment below. The first function, mean log absolute error (MLAE), computes the difference between the log transform of the predicted and actual values, and then averages the result. Unlike the built-in function above, this approach does not square the errors. Another difference from the log function above is that this function applies an explicit scaling factor to the data, transforming the housing prices back to their original range (5,000 to 50,000) rather than (5, 50). This is useful, because it reduces the impact of adding +1 to the predicted and actual values.

from keras import backend as K

# Mean Log Absolute Error
def MLAE(y_true, y_pred):
    first_log = K.log(K.clip(y_pred*1000, K.epsilon(), None) + 1.)
    second_log = K.log(K.clip(y_true*1000, K.epsilon(), None) + 1.)
    return K.mean(K.abs(first_log - second_log), axis=-1)

# Mean Squared Log Absolute Error
def MSLAE(y_true, y_pred):
    first_log = K.log(K.clip(y_pred*1000, K.epsilon(), None) + 1.)
    second_log = K.log(K.clip(y_true*1000, K.epsilon(), None) + 1.)
    return K.mean(K.square(first_log - second_log), axis=-1)

Like the Keras functions, the custom loss functions need to operate on tensor objects rather than Python primitives. In order to perform these operations, you need a reference to the backend, which is obtained with the from import statement at the top of the code segment. In my system configuration, this returns a reference to TensorFlow.

The second function, MSLAE, computes the square of the log error, and is similar to the built-in function. The main difference is that I’m scaling the values, which is specific to the housing data set.
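
A quick way to sanity-check these definitions is to evaluate them on a couple of toy tensors with the backend. This is just a convenience sketch, assuming the MLAE and MSLAE functions above have been defined:

# Toy labels and predictions, in the same units as the housing labels (thousands)
y_true = K.constant([10.0, 20.0, 40.0])
y_pred = K.constant([12.0, 18.0, 80.0])

# Evaluate the custom losses as plain numpy values
print(K.eval(MLAE(y_true, y_pred)))
print(K.eval(MSLAE(y_true, y_pred)))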

Evaluating Loss Functions

We now have four different loss functions that we want to evaluate the performance of on the original and transformed housing data sets. This section will walk through loading the data, compiling a model, fitting the model, and evaluating performance. The complete code listing for this section is available on github.

After following the installation steps in the prior section, we’ll load the data set and apply our transformation to skew housing prices. The last two operations can be commented out to use the original housing prices.

# load the data set
from keras.datasets import boston_housing
(x_train, y_train), (x_test, y_test) = boston_housing.load_data()
# transform the training and test labels
y_train = (y_train*1000)**2/2500000
y_test = (y_test*1000)**2/2500000

Next, we’ll create a Keras model for predicting housing prices. I’ve used the network structure from the sample problem in “Deep Learning with R”. The network includes two layers of fully-connected relu activated neurons, and an output layer with no transformation.

# The model as specified in "Deep Learning with R"
model = models.Sequential()
model.add(layers.Dense(64, activation='relu',
                       input_shape=(x_train.shape[1],)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1))

To compile the model, we’ll need to specify an optimizer, loss function, and a metric. We’ll use the same metric and optimizer for all of the different loss functions. The code below defines a list of loss functions, and for the first iteration the model uses mean squared error.

# Compile the model, and select one of the loss functions
losses = ['mean_squared_error', 'mean_squared_logarithmic_error',
          MLAE, MSLAE]
model.compile(optimizer='rmsprop',
              loss=losses[0],
              metrics=['mae'])

The last step is to fit the model and then evaluate the performance. I used 100 epochs with a batch size of 5, and a 20% validation split. After training the model on the training data set, the performance of the model is evaluated using the mean absolute error on the test data set.

# Train the model with validation
history = model.fit(x_train,
                    y_train,
                    epochs=100,
                    batch_size=5,
                    validation_split=0.2,
                    verbose=0)
# Calculate the mean absolute error
results = model.evaluate(x_test, y_test, verbose = 0)
results

After training the model, we can plot the results using matplotlib. The plot below shows the loss values for the training and validation data sets.

loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)
plt.figure(figsize=(10,6))
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.legend()
plt.show()

Loss values for the training and validation data sets.

I trained four different models with the different loss functions, and applied this approach to both the original housing prices and the transformed housing prices. The results for all of these different combinations are shown below.

Performance of the loss functions on the housing price data sets.
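
For reference, the four models can be trained and compared with a simple loop over the losses list defined earlier. This is a sketch of the procedure rather than the exact code used to produce the results above; it rebuilds the network for each loss so that every run starts from fresh weights.

test_mae = {}
for loss in losses:
    # Rebuild the network so each loss function starts from fresh weights
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu',
                           input_shape=(x_train.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop', loss=loss, metrics=['mae'])
    model.fit(x_train, y_train, epochs=100, batch_size=5,
              validation_split=0.2, verbose=0)
    # Record the test-set mean absolute error for this loss function
    name = loss if isinstance(loss, str) else loss.__name__
    test_mae[name] = model.evaluate(x_test, y_test, verbose=0)[1]

print(test_mae)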

On the original data set, applying a log transformation in the loss function actually increased the error of the model. This isn’t really surprising given that the data is somewhat normally distributed and within a single order of magnitude. For the transformed data set, the squared log error approach outperformed the mean squared error loss function. This indicates that custom loss functions may be worth exploring if your data set doesn’t work well with the built-in loss functions.

The model training histories for the four different loss functions on the transformed data set are shown below. Each model used the same error metric (MAE), but a different loss function. One surprising result was that the validation error was much higher for all of the loss functions that applied a log transformation.

Performance of the 4 loss functions on the transformed housing prices data set. All models used MAE for the performance metric.

Deep learning can be a useful tool for shallow learning problems, because you can define custom loss functions that may substantially improve the performance of your model. This won’t work for all problems, but may be useful if you have a prediction problem that doesn’t map well to the standard loss functions.

Ben Weber is a principal data scientist at Zynga. We are hiring!
