
In Part 1 of this article I explained how to obtain sleep data from Fitbit, load it into Python and preprocess it for further analysis. In this part I will explain how and why we split the data into training, validation and test sets, how we can select features for our Machine Learning models, and then train three different models: Multiple Linear Regression, Random Forest Regressor and Extreme Gradient Boosting Regressor. I will briefly explain how these models work and define performance measures to compare them. Let’s get started.
Separating the data into training, validation and test set
Before we do any further analysis using our data we need to split the entire data set into three different subsets: training set, validation set and test set. The following image displays this process well:

The test set is also referred to as the hold-out set. Once we split it off from the remaining data, we do not touch it again until we have trained and tweaked our Machine Learning models to the point where we think they will perform well on data they have never seen before.
We split the remaining data into a training and a validation set. This allows us to train our models on the training data and then evaluate their performance on the validation data. In theory, we can then tweak our models, evaluate them on the validation data again, and thereby find ways to improve model performance. This process often leads to overfitting, meaning that we tune the model so much towards performing well on the validation set that it performs poorly on data it has never seen before (such as the test set).
In part 3 of this article I explain how we can reduce overfitting while making sure that the models still perform well. For now, we will follow the above approach of a simple split of the data set into training, validation and test set.
I want to split the data so that the training set makes up 60% of the total data set and the validation and test sets each make up 20%. This code achieves the correct percentage splits:
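Something along these lines should do the trick, assuming the preprocessed data from Part 1 sits in a DataFrame called sleep_data with the target in a "Sleep Score" column (both names are placeholders):

```python
from sklearn.model_selection import train_test_split

# Features and target (column name assumed from Part 1)
X = sleep_data.drop("Sleep Score", axis=1)
y = sleep_data["Sleep Score"]

# First split: 80% training + validation, 20% test (the hold-out set)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: 0.25 of the remaining 80% equals 20% of the original data
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)
```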
In the first split the test_size parameter is set to 0.2, which splits the data into 80% training data and 20% test data. To then split that 80% into training and validation data while ensuring that the validation data amounts to 20% of the original data set, the test_size parameter needs to be 0.25 (20% is one quarter, or 0.25, of 80%).
Before moving on I want to emphasise one important point. It is crucial to split the data before performing any further transformations, such as scaling, because we want to prevent any information about the test set from spilling over into our training and validation sets. Data scaling is often done using statistics of the data set as a whole, such as the mean and standard deviation. Because we want to measure how well our Machine Learning models perform on data they have never seen before, we have to make sure that no information from the test data influences how the scaling or any other transformation is done.
Scaling features, defining performance metrics and a baseline
Although feature scaling is not strictly required for the Machine Learning models in this project, it is considered best practice when comparing the performance of different models.
In this code, I use MinMaxScaler, which I fit on the training data and then use to scale the training, validation and test data:
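A sketch of that step, wrapping the scaled arrays back into DataFrames so that features can still be dropped by name later on:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Fit on the training data only, then apply the same transformation
# to the validation and test data to avoid information leakage
X_train = pd.DataFrame(scaler.fit_transform(X_train),
                       columns=X_train.columns, index=X_train.index)
X_val = pd.DataFrame(scaler.transform(X_val),
                     columns=X_val.columns, index=X_val.index)
X_test = pd.DataFrame(scaler.transform(X_test),
                      columns=X_test.columns, index=X_test.index)
```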
Performance measures
Next, let’s define some performance measures that we can use to evaluate and compare our models. Because the Sleep Score is a continuous variable (even though only integer Sleep Scores occur), the problem at hand is a regression problem. There are many different performance measures for regression problems; in this analysis I will use Mean Absolute Error, Mean Squared Error and R-squared. Additionally, I compute an accuracy measure for the models’ predictions.
Accuracy is typically used as a performance measure in classification problems rather than regression problems because it refers to the proportion of correct predictions that a model makes. The way I use accuracy for the regression models in this analysis is different: here, accuracy measures how far off (in percentage terms) the predicted Sleep Score is from the actual Sleep Score, on average. For example, if the actual Sleep Score is 80 and the model has an accuracy of 96%, meaning that on average it is 4% off, the model is expected to predict a Sleep Score in the range of 76.8 (80 – (80 x 0.04)) to 83.2 (80 + (80 x 0.04)).
Here is the function that evaluates a model’s performance. It takes as inputs the model at hand, the features and the labels to evaluate on:
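A sketch of such a function, assuming accuracy is computed as 100% minus the mean absolute percentage error (the function name is illustrative):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(model, features, labels):
    """Print MAE, MSE, R-squared and the accuracy measure described above."""
    predictions = model.predict(features)
    mae = mean_absolute_error(labels, predictions)
    mse = mean_squared_error(labels, predictions)
    r2 = r2_score(labels, predictions)
    # Accuracy = 100% minus the mean absolute percentage error
    accuracy = 100 - np.mean(np.abs((labels - predictions) / labels)) * 100
    print(f"Mean Absolute Error: {mae:.2f}")
    print(f"Mean Squared Error: {mse:.2f}")
    print(f"R-squared: {r2:.3f}")
    print(f"Accuracy: {accuracy:.2f}%")
    return mae, mse, r2, accuracy
```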
But how do we know which scores are good or bad for these different measures? For example, is an accuracy of 90% good or bad? What about R-squared? To have a reference point, we will first come up with a baseline model against which we can compare the performance of all later models.
Baseline performance
In order to evaluate the Machine Learning models we are about to build we want to have a baseline that we can compare their performance to. Generally, a baseline is a simplistic approach that generates predictions based on a simple rule. For our analysis, the baseline model always predicts the median Sleep Score of the training set. If our Machine Learning model is not able to outperform this simple baseline it would be rather useless.
Let’s see what the performance of the baseline looks like:
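One way to build this baseline is scikit-learn's DummyRegressor with the median strategy (a sketch; the original may have computed the median manually), evaluated on the validation set with the function defined above:

```python
from sklearn.dummy import DummyRegressor

# Baseline: always predict the median Sleep Score of the training set
baseline = DummyRegressor(strategy="median")
baseline.fit(X_train, y_train)

evaluate(baseline, X_val, y_val)
```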

While the accuracy may seem decent, looking at the other performance measures tells a very different story. The R-squared is negative, which is a strong indication of extremely poor model performance.
Now that we have split our data into different subsets, scaled the features, defined performance metrics and come up with a baseline model, we are almost ready to start training and evaluating our Machine Learning models. Before we move on to the models, let’s first select the features that we want to use in them.
Feature Selection using Lasso Regression
There are two questions that you might have after reading that heading: Why do we need to select features and what the hell is Lasso Regression?
Feature Selection
There are multiple reasons for selecting only a subset of the available features.
Firstly, feature selection enables the Machine Learning algorithm to train faster because it is using less data. Secondly, it reduces model complexity and makes it easier to interpret the model. In our case this will be important because apart from predicting Sleep Scores accurately we also want to be able to understand how the different features impact the Sleep Score. Thirdly, feature selection can reduce overfitting and thereby improve the prediction performance of the model.
In part 1 of this article we saw that many of the features in the sleep data set are highly correlated, meaning that the more features we use the more multicollinearity will be present in the model. This is generally speaking not an issue if we only care about prediction performance of the model but it is an issue if we want to be able to interpret the model. Feature selection will also help reduce some of that multicollinearity.
For more information on feature selection see this article.
Lasso Regression
Before we move on to Lasso Regression let’s briefly recap what a linear regression does. Fitting a linear regression minimises a loss function by choosing coefficients for each feature variable. One problem with that is that large coefficients can lead to overfitting, meaning that the model will perform well on the training data but poorly on data it has never seen before. This is where regularisation comes in.
Lasso Regression is a type of regularisation regression that penalises the absolute size of the regression coefficients through an additional term in the loss function. The loss function for a Lasso regression can be written like this:
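In standard notation, that loss function reads:

$$\min_{\beta} \; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert$$

where the $\hat{y}_i$ are the model’s predictions and the $\beta_j$ are the feature coefficients.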

The first part of the loss function is equivalent to the loss function of a linear regression, which minimises the sum of squared residuals. The additional part is the penalty term, which penalises the absolute value of the coefficients. Mathematically, this is equivalent to minimising the sum of squared residuals with the constraint that the sum of absolute coefficient values has to be less than a prespecified parameter. This parameter determines the amount of regularisation and causes some coefficients to be shrunk to close to, or exactly, zero.
In the above equation, λ is the tuning parameter which determines the strength of the penalty, i.e. the amount of shrinkage. Setting λ=0 would result in the loss function for a linear regression and as λ increases, more and more coefficients are set to zero and the remaining coefficients are therefore "selected" by the Lasso Regression as being important.
Fitting a Lasso regression on the training data and plotting the resulting coefficients looks like this:
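A sketch of that step (the regularisation strength alpha is an assumption; the original analysis may have used a different value):

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso

# Fit a Lasso regression on the scaled training data
lasso = Lasso(alpha=0.1)  # alpha chosen for illustration
lasso.fit(X_train, y_train)

# Plot the coefficient assigned to each feature
plt.bar(X_train.columns, lasso.coef_)
plt.xticks(rotation=45, ha="right")
plt.ylabel("Lasso coefficient")
plt.tight_layout()
plt.show()
```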

The Lasso Regression algorithm has reduced the coefficients of Time in Bed and Minutes Light Sleep to close to zero, deeming them less important than the other four features. This comes in handy as we would face major multicollinearity issues if we included all of the features in our models. Let’s drop these two features from our data sets:
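A sketch, assuming the column names match the feature names used in Part 1:

```python
# Drop the two features the Lasso shrank towards zero
to_drop = ["Time in Bed", "Minutes Light Sleep"]
X_train = X_train.drop(columns=to_drop)
X_val = X_val.drop(columns=to_drop)
X_test = X_test.drop(columns=to_drop)
```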
Now that we have selected a set of four features we can move on to building some Machine Learning models that will use those four features to predict Sleep Scores.
Multiple Linear Regression
In summary, Multiple Linear Regression (MLR) is used to estimate the relationship between one dependent variable and two or more independent variables. In our case, it will be used to estimate the relationship between Sleep Score and Minutes Asleep, Minutes Awake, Minutes REM Sleep and Minutes Deep Sleep. Note that MLR assumes that the relationship between these variables is linear.
Let’s train a MLR model and evaluate its performance:
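A minimal sketch using scikit-learn's LinearRegression together with the evaluate helper defined earlier:

```python
from sklearn.linear_model import LinearRegression

mlr = LinearRegression()
mlr.fit(X_train, y_train)

evaluate(mlr, X_val, y_val)
```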

All performance measures are substantially better than those of the baseline model (thank god). The accuracy in particular seems very high, but this can be misleading, which is why it is important to consider multiple measures. One of the most important measures of regression performance is the R-squared. Generally speaking, the R-squared measures the proportion of the variance of the dependent variable that is explained by the independent variables. In our case, it is a measure of how much of the variance in Sleep Scores is explained by our features. A value of roughly 0.76 is already decent, but let’s see if we can do better with different models.
Regression statistics
Before we move on to other Machine Learning models I would like to take a look at the regression output for the Multiple Linear Regression on our training data:
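One way to produce such an output is statsmodels' OLS (a sketch; the original table may have been generated differently):

```python
import statsmodels.api as sm

# Add an intercept term and fit an ordinary least squares model
# on the (scaled) training data
X_train_const = sm.add_constant(X_train)
ols_model = sm.OLS(y_train, X_train_const).fit()
print(ols_model.summary())
```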

A few things to note regarding the regression output:
- All coefficients are statistically significant.
- Minutes Asleep, Minutes REM Sleep and Minutes Deep Sleep have positive coefficients, meaning that an increase in these variables increases Sleep Scores.
- Minutes Awake has a negative coefficient, indicating that more time awake decreases the sleep score.
- Based on the magnitude of the coefficients, REM sleep seems to have a bigger positive impact on Sleep Score than Deep sleep.
The regression output provides a good starting point for understanding how the different sleep statistics may affect Sleep Score. More time asleep increases Sleep Score. This makes sense because more sleep (up until a certain point) will generally be beneficial. Similarly, more time spent in REM and Deep Sleep increase the Sleep Score as well. This also makes sense because both of these sleep stages provide important restorative benefits. For the computation of Sleep Score, Fitbit seems to consider REM sleep to be more important than Deep sleep (higher magnitude of the coefficient), which to me is one of the most interesting outcomes of the regression analysis. Finally, more time awake decreases Sleep Score. Again, that makes perfect sense because spending more time awake during one’s sleep window indicates restlessness and takes away from the restorative powers that time spent asleep provides.
For those people that are interested in understanding the importance of different sleep stages and of sleep in general, I highly recommend "Why We Sleep" by Matthew Walker. It is a brilliantly written book with fascinating experiments and insights!
All that being said, it is important to note that the interpretability of the above output is somewhat limited because of the correlation that is present between features. In Multiple Linear Regression, the coefficient tells you how much the dependent variable is expected to increase when that independent variable increases by one unit, holding all the other independent variables constant. In our case, because the independent variables are correlated, we could not expect one variable to change without the others changing and therefore cannot reliably interpret the coefficients in this way. Always look out for multicollinearity when interpreting your models!
Let’s see if other Machine Learning models perform better than Multiple Linear Regression.
Random Forest Regressor
Random Forests are one of the most popular Machine Learning models because of their ability to perform well on both classification and regression problems. In summary, a Random Forest is an ensemble technique that leverages multiple decision trees through Bootstrap Aggregation, also called "bagging". What exactly does that mean?
In order to understand this better we first need to understand how Decision Tree Regression works.
Decision Tree Regression
As the name suggests, decision trees build prediction models in the form of a tree structure that may look like this:

In the above example the decision tree iteratively splits the data set based on various features in order to come up with a prediction of how many hours will be spent playing. But how does the tree know what features to split on first and which ones to split on further down the tree? After all, the predictions could be different if we change the sequence of the features used to make the split.
In a regression problem, the most common criterion for deciding which feature to split on at a given node is Mean Squared Error (MSE). The decision tree tries out different features it could split the data set on and computes the resulting MSEs. The feature that leads to the lowest MSE is chosen for the split at hand. This process continues until the tree reaches a leaf (an end point) or a predetermined maximum depth. Maximum depths can be used to reduce overfitting, because if a decision tree is allowed to continue splitting until it reaches a leaf, it may strongly overfit the training data. Using maximum depths in this way is referred to as "pruning" the tree.
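A single decision tree is not one of the models compared in this article, but for illustration, here is a minimal scikit-learn sketch of a depth-limited regression tree (the maximum depth is arbitrary):

```python
from sklearn.tree import DecisionTreeRegressor

# MSE-based splitting is the default criterion; limiting the depth
# ("pruning") helps to reduce overfitting
tree = DecisionTreeRegressor(max_depth=4, random_state=42)
tree.fit(X_train, y_train)

evaluate(tree, X_val, y_val)
```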
There are two major limitations with decision trees:
- Greediness – Decision trees are not always globally optimal because they always choose the feature that yields the largest reduction in MSE at the current split, without considering whether a locally suboptimal split could lead to even better splits further down the tree (a "greedy" strategy).
- Overfitting – The structure of the tree is often too dependent on the training data, and pruning the tree is often not enough to overcome this issue.
Random Forests address both of those limitations.
Random Forests
As the "Forest" in Random Forest suggests, they are made up of many decision trees and their predictions are made by averaging the predictions of each decision tree in the forest. Think of this as a Democracy. Having only one person vote on an important issue may not be representative of how the entire community really feels, but collecting votes from many randomly selected members of the community may provide an accurate representation.
But what exactly does the "Random" in Random Forest represent?
In a Random Forest, every decision tree is created using a randomly chosen subset of the data points in the training set. This way every tree is different but all trees are still created from a portion of the same training data. Subsets are randomly selected with replacement, meaning that data points are "put back in the bag" and can be picked again for another decision tree.
In addition to choosing different random subsets for each tree, the decision trees in a Random Forest only consider a subset of randomly selected features at each split. The best feature is chosen for the split at hand and at the next node, a new set of random features is evaluated, etc.
By constructing decision trees using these "bagging" techniques, Random Forests address the limitations of individual decision trees well and manage to turn what would be a weak predictor in isolation into a strong predictor in a group, similar to the voting example.

Random Forest Regression in Python
Using the scikit-learn library in Python, most Machine Learning models are built in the same way: first you initialise the model, then you train it on the training set, and finally you evaluate it on the validation set. Here is the code:
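A sketch with default hyperparameters, reusing the evaluate helper from earlier:

```python
from sklearn.ensemble import RandomForestRegressor

# Initialise the model, train it on the training set,
# then evaluate it on the validation set
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)

evaluate(rf, X_val, y_val)
```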

Similar to the Multiple Linear Regression, the Random Forest performs vastly better than the baseline model. That being said, its R-squared and accuracy are lower than those of the MLR. So, what is all the hype around Random Forests about?
The answer to that question can be found here (hint: Hyperparameter Optimisation):
Cross-Validation and Hyperparameter Tuning: How to Optimise your Machine Learning Model
Extreme Gradient Boosting Regressor
Similar to Random Forests, Gradient Boosting is an ensemble learner, meaning that it creates a final model based on a collection of individual models, usually decision trees. What is different in the case of Gradient Boosting compared to Random Forests is the type of ensemble method. Random Forests use "Bagging" (described previously) and Gradient Boosting uses "Boosting".
Gradient Boosting
The general idea behind Gradient Boosting is that the individual models are built sequentially by putting more weight on instances with wrong predictions and high errors. The model therefore "learns from its past mistakes".
The model minimises a cost function through gradient descent. In each round of training, the weak learner (decision tree) makes a prediction, which is compared to the actual outcome. The distance between prediction and actual outcome represents the error of the model. The errors can then be used to calculate the gradient, i.e. the partial derivative of the loss function, to figure out in which direction to change the model parameters in order to reduce the error. The below graph visualises how this works:

The rate with which these adjustments will be made ("Incremental Step" in the above graph) can be set through the hyperparameter "learning rate".
Extreme Gradient Boosting
Extreme Gradient Boosting improves upon Gradient Boosting in two main ways: it uses the second partial derivative of the cost function, which helps it find the minimum of the cost function more efficiently, and it applies advanced regularisation, similar to the penalty described for Lasso Regression, which improves model generalisation.
In Python, training and evaluating Extreme Gradient Boosting Regressor follows the same fitting and scoring process as the Random Forest Regressor:
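A sketch, assuming the xgboost package is installed, again with default hyperparameters:

```python
from xgboost import XGBRegressor

# Default hyperparameters; learning_rate, n_estimators etc. are left untouched
xgb = XGBRegressor(random_state=42)
xgb.fit(X_train, y_train)

evaluate(xgb, X_val, y_val)
```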

The performance metrics are extremely close to those of the Random Forest, i.e. it performs decently but still not as well as our good old Multiple Linear Regression.
Where to go from here?
So far, we have not provided any hyperparameters in the Random Forest or Extreme Gradient Boosting Regressor. The respective libraries provide sensible default values for the hyperparameters of each model but there is no one-size-fits-all. By tweaking some of the hyperparameters we could potentially greatly improve the performance of these two models.
Furthermore, our performance evaluation so far has relied only on the models’ performance on one relatively small validation set. The results are therefore highly dependent on how representative this validation set is of sleep data as a whole.
In the third part of this article I address both of these issues and boost the performance of the Random Forest and the Extreme Gradient Boosting Regressor. See here:
Cross-Validation and Hyperparameter Tuning: How to Optimise your Machine Learning Model