Does Machine Learning know when I’m scared? Anomaly detection in wearable health data

An opportunity to build autoencoders in Keras using real-life Fitbit measurements.

Elliot Humphrey
Towards Data Science



Wearable health devices like Fitbit offer large amounts of information about our fitness and day-to-day activities. This provides a great opportunity to work with large datasets, where personal information can be used to gain a deeper understanding of practical real-world machine learning applications beyond examples like the Human Activity Recognition dataset.

The dataset used in this article is from my personal Fitbit device, where calories and heart rate data are recorded as a time series over a period of nearly three weeks. The goal will be to build and use an autoencoder model using Keras to identify anomalies based on heart rate and calories over time.

Summary of work:

· We will begin by loading the datasets and seeing how heart rate and calories vary during our sample period.

· We will also examine the dataset that contains anomalies to help us understand what potential anomalies could look like. This will come in useful once we begin to evaluate our model’s performance.

· Data will be cleaned, scaled and reshaped to be compatible with our autoencoder model.

· Our model will be built and evaluated to measure prediction error, which will later be used to decide whether anomalies exist.

· Finally, we will look into the detected anomalies to try and unravel why they might have occurred.

Loading and visualising:

The dataset was downloaded from my personal Fitbit account, where files are exported in a JSON format. Information about downloading your own data can be found here. We can start by loading these files to produce DataFrames of heart rate and calories, which are both time series data, using Pandas after we import all the relevant modules:

#Importing the relevant modules
import glob
import os
import pandas as pd
import numpy as np
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
from tensorflow import keras
from sklearn.preprocessing import MinMaxScaler
sns.set_palette("hls", 2)
#Loading in the heart rate and calories data
heart = pd.DataFrame()
calories = pd.DataFrame()
datasets = ['heart_rate-2021-03-*.json', 'calories-2021-02-*.json']
for datatype in datasets:
    file_list = []
    path = '/Physical Activity/'
    os.chdir(path)
    for file in glob.glob(datatype):
        file_list.append(file)
    dfs = []
    for file in file_list:
        data = pd.read_json(path + file)
        print('Reading: ' + str(file))
        dfs.append(data)
    concatenated = pd.concat(dfs, ignore_index=True)
    concatenated["value"] = concatenated["value"].apply(str)
    if concatenated['value'].str.contains("bpm").any():
        heart = pd.concat([heart, concatenated], axis=1)
    else:
        calories = pd.concat([calories, concatenated], axis=1)

Let’s have a look at the data to see what kind of cleaning we may need to do:

print(heart.head(5))
print(calories.head(5))
              dateTime                         value
0  2021-03-08 21:00:07  {'bpm': 57, 'confidence': 3}
1  2021-03-08 21:00:17  {'bpm': 56, 'confidence': 3}
2  2021-03-08 21:00:32  {'bpm': 56, 'confidence': 3}
3  2021-03-08 21:00:42  {'bpm': 57, 'confidence': 3}
4  2021-03-08 21:00:57  {'bpm': 58, 'confidence': 3}
             dateTime  value
0 2021-02-18 00:00:00   1.19
1 2021-02-18 00:01:00   1.19
2 2021-02-18 00:02:00   1.19
3 2021-02-18 00:03:00   1.19
4 2021-02-18 00:04:00   1.19

Heart rate is measured in beats per minute (BPM) and sampled at irregular intervals of 10–15 seconds. The calories data records calories burned, calculated from a combination of a person's basal metabolic rate (BMR) and activity level. We can resample the heart rate data so that both datasets contain one observation per minute. The value columns also need their text removed so they can be read as numbers:

#Cleaning the training data
heart = heart.sort_values(by="dateTime")
heart = heart.set_index('dateTime')
heart["value"] = heart["value"].apply(str)
heart["value"] = heart["value"].str.split("{'bpm':").str[1]
heart["value"] = heart["value"].str.split(",").str[0]
heart["value"] = heart["value"].astype(int)
heart = heart.resample('1Min').mean()
heart['value'] = heart['value'].round(0)
calories = calories.sort_values(by="dateTime")
calories["value"] = calories["value"].astype(float)
calories.columns = ['time', 'value']
calories = calories.set_index('time')

We can now merge heart rate and calories together to create a single DataFrame of time series data covering the sample period. Creating an empty DataFrame with a DateTimeIndex covering our sample period will make things more convenient:

#Merge the datasets together
empty = pd.DataFrame(pd.date_range(start='2021-03-01', end='2021-03-20', freq='1min'))
empty.columns = ['datetimeindex']
empty = empty.set_index('datetimeindex')
from functools import reduce
df = reduce(lambda left,right: pd.merge(left,right,left_index=True, right_index=True), [heart, calories, empty])
df.columns = ['Heart Rate (BPM)', 'Calories Used']
df = df.dropna()

Our training data will come from the consecutive sample period, while the test data comes from a day approximately in the middle of our sample, so we can assume some continuity in the conditions affecting the heart rate and calories data:

#Splitting the test day
df_test = df.loc[(df.index > '2021-03-12 00:00:00') & (df.index <= '2021-03-12 23:59:59')]
df = df.loc[(df.index < '2021-03-12 00:00:00') | (df.index > '2021-03-12 23:59:59')]

We are now at a stage to finally visualise what the Fitbit data looks like, and will start with the training data. The gap in the training data represents the test day that we previously extracted:

#Visualising training and testing data
datasets = [df, df_test]
for dfs in datasets:
    fig, axs = plt.subplots(nrows=2, ncols=1, figsize=(12, 6))
    axs[0].plot(dfs['Heart Rate (BPM)'])
    axs[1].plot(dfs['Calories Used'])
    plt.setp(axs[0].get_xticklabels(), visible=False)
    axs[0].tick_params(axis='x', rotation=70)
    axs[1].tick_params(axis='x', rotation=70)
    axs[0].set(ylabel='Heart Rate (BPM)')
    axs[1].set(ylabel='Calories Used', xlabel='Date')
    plt.tight_layout()
    plt.show()

Finally, some graphs!

Training dataset of heart rate and calories used over time

The data show a cyclical trend in both heart rate and calories: both increase in the morning, remain fairly stable during the day, then gently decrease in the evening. This is the expected pattern, as it correlates with waking up, working during the day, then relaxing in the evening before bed.

Moving on to the test dataset, which will be used to validate whether anomalies are detected. This data covers a single day, where we can try to guess what a potential anomaly might look like based on what we have seen in the training data:

Test dataset of heart rate and calories used over time

Interestingly, the test data for this day has two afternoon peaks where heart rate increases significantly without a corresponding increase in calories. This indicates that the rise in heart rate is likely not associated with exercise. The added benefit of working with personal data is that I know what happened on this day: a trip to the hospital. As someone who isn't a huge fan of hospital visits, my Fitbit clearly captured my nervousness, which makes this an excellent example of the kind of anomaly we will later try to identify.

We can make a simple correlation matrix between heart rate and calories to see their relationship:

#Make a correlation coefficient
print(df.corr())
                  Heart Rate (BPM)  Calories Used
Heart Rate (BPM)          1.000000       0.102251
Calories Used             0.102251       1.000000

There is a slight positive correlation between heart rate and calories used. According to Fitbit's documentation, heart rate feeds into the calorie calculation, so this positive correlation is to be expected.
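For a more visual check of this relationship, a simple scatter plot of the two columns works well (a minimal sketch, reusing the merged training DataFrame df and the seaborn/matplotlib imports from earlier):

#Optional: visualise the heart rate vs calories relationship directly
ax = sns.scatterplot(x='Heart Rate (BPM)', y='Calories Used', data=df, s=10, alpha=0.3)
ax.set_title('Heart Rate vs Calories Used (training data)')
plt.show()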

Preprocessing:

Getting back to the code, we now need to scale our data so that we can reduce the impact of any outliers within our training data that may impede a model’s ability to detect anomalies. We will use the MinMaxScaler as this will change both the heart rate and calories data to be between 0 and 1, which will be important later when we calculate prediction errors.

#Scale the data
scaler = MinMaxScaler()
scaler = scaler.fit(df)
scale_train = pd.DataFrame(scaler.transform(df))
scale_test = pd.DataFrame(scaler.transform(df_test))

Now that the data has been scaled, we need to reshape it to be compatible with our autoencoder model. The input to our model will be a three-dimensional array with a shape of (number of samples, sequence length, number of features). We will use a sequence of 30 time steps (i.e. a half-hour window), since we assume anomalies are short-lived.

# reshape to [samples, time_steps, n_features]
def create_dataset(X, y, time_steps=1):
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        v = X.iloc[i:(i + time_steps)].values
        Xs.append(v)
        ys.append(y.iloc[i + time_steps])
    return np.array(Xs), np.array(ys)

time_steps = 30
X_train, y_train = create_dataset(scale_train, scale_train, time_steps)
X_test, y_test = create_dataset(scale_test, scale_test, time_steps)
print(X_train.shape, y_train.shape)

This produces training inputs with a shape of (23333, 30, 2) and corresponding labels with a shape of (23333, 2): each sample is a sliding window of 30 one-minute time steps across the two features.

Model building:

Our data has been successfully reshaped and is now ready for building our autoencoder. This article won't go into the details of autoencoders and LSTMs, but I highly recommend reading up on them both.

The autoencoder will be composed of LSTM layers split into an encoder and a decoder. First, the encoder compresses the input sequence into a simplified representation of its key features; the decoder then learns how to convert this simplified representation back into the original input. This approach lets the model build a representation of 'normal' patterns in the dataset, so when an anomalous sequence is fed in, the model will struggle to reconstruct it, leading to a high reconstruction error.

As our input uses two features, the error from the combined reconstruction (i.e. the error from both the heart rate and calories sequences) will be used as our method for detecting anomalies.
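To make this concrete, here is a minimal sketch (with made-up scaled values, not model output) of how a combined per-time-step error can be computed by averaging the absolute reconstruction error across both features; this mirrors the calculation applied to the real predictions later:

#A minimal sketch: per-time-step reconstruction error averaged over both features
import numpy as np

actual = np.array([[0.50, 0.20],   # [heart rate, calories] at t0 (scaled, illustrative)
                   [0.55, 0.22]])  # t1
reconstructed = np.array([[0.48, 0.21],
                          [0.70, 0.25]])

mae_per_step = np.abs(reconstructed - actual).mean(axis=1)
print(mae_per_step)  # [0.015 0.09] - the second step has a larger combined error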

Our model architecture uses multiple LSTM layers in the encoder and decoder, separated by a RepeatVector layer that repeats the compressed representation produced by the last encoder LSTM layer. A final TimeDistributed layer at the end of the decoder rebuilds the output to the original input shape.

#model building
model = keras.Sequential()
model.add(keras.layers.LSTM(256, input_shape=(X_train.shape[1], X_train.shape[2]), return_sequences=True, name='encoder_1'))
model.add(keras.layers.Dropout(rate=0.2))
model.add(keras.layers.LSTM(128, return_sequences=True, name='encoder_2'))
model.add(keras.layers.Dropout(rate=0.2))
model.add(keras.layers.LSTM(64, return_sequences=False, name='encoder_3'))
model.add(keras.layers.RepeatVector(n=X_train.shape[1], name='encoder_decoder'))
model.add(keras.layers.LSTM(64, return_sequences=True, name='decoder_1'))
model.add(keras.layers.Dropout(rate=0.2))
model.add(keras.layers.LSTM(128, return_sequences=True, name='decoder_2'))
model.add(keras.layers.Dropout(rate=0.2))
model.add(keras.layers.LSTM(256, return_sequences=True, name='decoder_3'))
model.add(keras.layers.TimeDistributed(keras.layers.Dense(units=X_train.shape[2])))
model.compile(loss='mae', optimizer='adam')
model.summary()
Model summary

The model will use the Adam optimiser and the mean absolute error (MAE) loss. We will fit the model to our training data; note that the training data is used as both the input and the target when fitting an autoencoder. A batch size of 256 will be used for a training duration of 25 epochs.

#fitting on training data
history = model.fit(X_train, X_train, epochs=25, batch_size=256,
validation_split=0.1,verbose=1,shuffle=False)
#plotting loss
fig=plt.figure()
plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.ylabel('Loss')
plt.xlabel('No. Epochs')
plt.legend()
plt.show()
Training & validation loss during training

We will now use the MAE between the training data and its reconstructions to build a distribution of losses, which will serve as our way of defining an anomaly. The MAE is calculated per time step by averaging the error across both the heart rate and calories features.

#predicting on the training data
X_pred = model.predict(X_train)
X_pred_2d = pd.DataFrame(X_pred[:,0,:]).astype(float)
X_pred_2d.columns = ['HR Pred', 'Calories Pred']
X_train_2d = pd.DataFrame(X_train[:,0,:]).astype(float)
X_train_2d.columns = ['HR Test', 'Calories Test']
#Plot the test data together
fig, axs = plt.subplots(4, figsize=(12,6))
axs[0].plot(X_pred_2d['HR Pred'])
axs[1].plot(X_train_2d['HR Test'])
axs[2].plot(X_pred_2d['Calories Pred'])
axs[3].plot(X_train_2d['Calories Test'])
plt.setp(axs[0].get_xticklabels(), visible=False)
plt.setp(axs[1].get_xticklabels(), visible=False)
plt.setp(axs[2].get_xticklabels(), visible=False)
axs[0].tick_params(axis='x', rotation=70)
axs[1].tick_params(axis='x', rotation=70)
axs[0].set(ylabel= 'HR Prediction')
axs[1].set(ylabel= 'HR Training')
axs[2].set(ylabel= 'Calories Prediction')
axs[3].set(ylabel= 'Calories Training', xlabel= 'Time Step (per minute)')
plt.tight_layout()
#calculate error
predictions = pd.concat([X_pred_2d['HR Pred'], X_pred_2d['Calories Pred']], axis = 1)
train_inputs = pd.concat([X_train_2d['HR Test'], X_train_2d['Calories Test']], axis = 1)
anomaly = pd.DataFrame(np.abs(predictions.values - train_inputs.values))
anomaly = anomaly.mean(axis=1)
ax = sns.distplot(anomaly, bins=50, kde = True)
ax.set_title('Training Data Loss Distribution')
ax.set_xlabel('Loss')
ax.set_ylabel('Frequency')
fig = ax.get_figure()
Training data loss distribution

Looking at the loss distribution, we can see a long tail of higher loss values, indicating times when the model struggled to reconstruct the input sequence. To define the loss threshold used to judge whether a prediction is anomalous, we will take the 99th percentile of the training loss.

thres = round(np.quantile(anomaly, 0.99), 3)
print('99th percentile loss value from training: ' + str(thres))
99th percentile loss value from training: 0.177

Predictions:

Now that we have built our model and defined a loss threshold of 0.177, we can move on to testing our model to see whether it can detect the anomalies we recognised earlier in the test data. We will reuse our model with the previously reshaped test data and repeat the same process for calculating the prediction loss.

#Predicting
X_pred = model.predict(X_test)
X_pred = pd.DataFrame(X_pred[:,0,:]).astype(float)
X_pred.columns = ['HR Pred', 'Calories Pred']
X_test_data = pd.DataFrame(X_test[:,0,:]).astype(float)
X_test_data.columns = ['HR Test', 'Calories Test']

We use the previously defined loss threshold as a cut-off for defining a data point as anomalous. To better visualise our anomalous points we combine the anomaly column with the original test dataset to help put things into context.

difference = pd.DataFrame(np.abs(X_pred.values - X_test_data.values))
difference['mae loss'] = difference.mean(axis=1)
difference['threshold'] = thres
difference['anomaly'] = difference['mae loss'] > difference['threshold']
difference['index'] = difference.index
X_pred['index'] = X_pred.index
X_test_data['index'] = X_test_data.index
X_test_data = X_test_data.join(difference['anomaly'])

The heart rate and calories data can be returned to their original scale using the previously made scaler, which will help us better understand our results:

X_test_data_original = pd.DataFrame(scaler.inverse_transform(X_test_data[['HR Test','Calories Test']]))
X_test_data = pd.concat([X_test_data, X_test_data_original], axis = 1)
X_test_data.columns = ['HR Test', 'Calories Test', 'Index', 'Anomaly', 'Heart Rate (BPM)', 'Calories Used']

Our final steps will be focused on visualising the prediction results. We will begin by plotting the average loss of each data point, colour coded by whether an anomaly was identified:

g = sns.lmplot(x='index', y='mae loss', data=difference,
               fit_reg=False, hue='anomaly', scatter_kws={"s": 10}, legend=True, legend_out=False, height=5, aspect=2)
g.set(xlabel='Time Steps (per minute)', ylabel='MAE Loss')
Predicted Mean Absolute Error over time

As we can see, our previously defined anomaly cut-off has been used to detect anomalous data points.

Looking at the distribution of model loss during the test day, it is clear that anomalies are unlikely in the early morning, since the loss for data points below roughly 500 minutes (around 08:30 am) is consistently low. This makes sense, as I'm asleep and obviously not going to the gym in the middle of the night.

Things become slightly different during the rest of the day, where the spread of loss values becomes larger and some data points are flagged as anomalous. The larger spread of loss in the afternoon (between roughly 720 and 1140 minutes) can potentially be attributed to variations in my afternoon schedule.
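To put rough numbers on this timing, one option (a minimal sketch using the difference DataFrame built above; the hour buckets are approximate because the time steps are offset by the 30-step window) is to count flagged points per hour of the test day:

#A rough check: count flagged anomalies per hour (each time step is one minute)
anomalies_per_hour = difference.groupby(difference['index'] // 60)['anomaly'].sum()
print(anomalies_per_hour)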

The anomalous points are spread across multiple time intervals. To understand why these anomalous events occur, we can look at the heart rate and calories data:

g = sns.lmplot(x='Index', y='Heart Rate (BPM)', scatter_kws={"s": 10}, data=X_test_data,
               fit_reg=False, hue='Anomaly', legend=True, legend_out=False, height=5, aspect=2)
g.set(xlabel='Time Steps (per minute)', ylabel='Heart Rate (BPM)')
Test dataset of heart rate coloured by detected anomalies

The heart rate data has been colour coded by the anomalies detected in the previous loss plot. The obvious anomalies occur when my heart rate is unusually high, reaching over 140 BPM and forming two peaks between 800 and 1000 minutes. This fits nicely with the timing of my aforementioned hospital visit.

Interestingly, anomalies also occur at normal heart rates (judging by the neighbouring heart rate values). Since our model uses both heart rate and calories, this suggests that these anomalies are likely associated with the calories used. We need to look at the calories data to gain more information:

g = sns.lmplot(x='Index', y='Calories Used', scatter_kws={"s": 10}, data=X_test_data,
               fit_reg=False, hue='Anomaly', legend=True, legend_out=False, height=5, aspect=2)
g.set(xlabel='Time Steps (per minute)', ylabel='Calories Used')
Test dataset of calories used coloured by detected anomalies

The calories data shows that anomalies are detected during peaks, typically above six calories per minute, which likely correspond to walking. The anomalies associated with high calorie use occur at times when heart rate is considered normal. This combination, high calories but an average heart rate, can be considered anomalous because most calories are typically burned when heart rate is higher (also supported by the positive correlation between heart rate and calories).

The inverse is observed when heart rate is high while calorie use is low, suggesting that the elevated heart rate is not associated with physical activity. As this is my own data, I know that in the afternoon I went to the hospital and, due to a fear of needles, my heart rate was very high. This is a great example of an anomaly and the reason I selected this day as our test dataset.

Conclusion:

We have learnt how to develop an autoencoder model using LSTM networks to detect anomalies in health time series data. We used our model to learn what counts as normal behaviour, and then used this to detect potential anomalies. The model was tested on a day of anomalous data and correctly identified some interesting patterns between heart rate and calories, associated with spontaneous walks and a fearful trip to the hospital!

This project hopefully highlights the fun of working on personal data, letting us interpret the model outcomes to a much higher level compared to some classic readily available datasets. Thanks for reading!
