Exploratory Sensor Data Analysis in Python

Mabel González Castellanos
Towards Data Science
10 min read · Jan 21, 2022


Exploratory Data Analysis (EDA) aims to expose the main characteristics of a dataset through statistical and visual tools. Commonly, this is the first step in approaching a problem and, when used adequately, it can contribute significantly to designing a proper solution. The term EDA was coined in the book of the same name [1], written by the mathematician and statistician John Wilder Tukey in 1977. He promoted and encouraged data analysis, and his legacy has only become more popular as time has passed.

While a lot has been published about traditional EDA techniques, time series data brings special challenges to the analysis. More specifically, sensor data is time series data with some peculiar characteristics, which can be summarised as:

· Data is multidimensional: either the sensors have more than one channel, or several sensors record at the same time, and sometimes both.

· Time series are long: the data is recorded at a certain frequency for a certain period, which determines the resulting number of data points.

· Several time series (files) form a dataset, as is the case for motion sensors, for example. In this domain, the number of time series depends on the number of recordings, normally related to the number of movements and the persons performing them.

In the scenario described, EDA is no longer a straightforward process, and the goal of this post is to offer some practical steps to explore a sensor dataset. To illustrate the proposed methodology, I use a dataset [2] also available in the UCI repository [3]. All relevant code and data used in this article are stored in this GitHub repository (shortcut to the Jupyter Notebook file).

Table of contents:

  1. Essential Visualisations
  2. Correlations
  3. Distribution Analysis
  4. Final Remarks
  5. References

1. Essential Visualisations

The dataset we will use contains motion data from 14 people between 66 and 86 years old who performed broadly scripted activities wearing a battery-less sensor on top of their clothing. Data were collected in two clinical room settings (S1 and S2). Setting S1 uses 4 RFID reader antennas around the room for data collection, whereas setting S2 uses 3 RFID reader antennas (two at ceiling level and one at wall level).

The data is spread across 60 and 27 recordings from rooms 1 and 2, respectively. In this kind of scenario, it is recommended to keep all data in one data structure. I created a dictionary with the content of all files, using the original file name as the key. Usually the file name contains information relevant to the problem; in this specific application, it encodes the room number and the gender of the volunteer. The time column is set as the index of every data frame.
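The loading step can be sketched as follows. The file names and column layout here are illustrative assumptions, and two tiny fabricated files stand in for the real recordings:

```python
import pandas as pd
from pathlib import Path
from tempfile import TemporaryDirectory

# Illustrative stand-in for the real recordings: file names such as
# "d2p01F" encode the room number and the volunteer's gender, and each
# file holds a time column plus the sensor channels (assumed layout).
SAMPLE = ("time,frontal,vertical,lateral,label\n"
          "0.00,0.2,-0.9,0.1,3\n"
          "0.25,0.3,-0.8,0.1,3\n")

with TemporaryDirectory() as tmp:
    for name in ["d1p33F", "d2p01F"]:
        Path(tmp, name).write_text(SAMPLE)

    # One dictionary keeps the whole dataset together: the original
    # file name is the key and the time column becomes the index.
    dataset = {
        path.name: pd.read_csv(path, index_col="time")
        for path in sorted(Path(tmp).iterdir())
    }

print(list(dataset))            # ['d1p33F', 'd2p01F']
print(dataset["d2p01F"].shape)  # (2, 4)
```

Keeping everything in one dictionary makes the later per-file analyses a simple loop over `dataset.items()`.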

Now, it is time to create the first visualisations of the data. I start by counting the number of observations in every file contained in the dataset.

Key Insights — Bar plot (Figure 1):

Figure 1. Observations contained in every sensor file — Image created by Author
  • Bars represent the number of observations recorded in every file.
  • Recording durations range from a handful to thousands of observations.
  • It seems the volunteers didn’t follow a rigid timeline during the experiment.
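Counting the observations reduces to a dictionary comprehension over the structure described earlier; the two files and their lengths below are fabricated for illustration:

```python
import pandas as pd

# Fabricated stand-ins for two recordings of very different length.
dataset = {
    "d1p33F": pd.DataFrame({"frontal": [0.1] * 120}),
    "d2p01F": pd.DataFrame({"frontal": [0.2] * 390}),
}

# Observations per file; sorting makes the bar plot easier to read.
counts = pd.Series({name: len(df) for name, df in dataset.items()}).sort_values()
print(counts.to_dict())  # {'d1p33F': 120, 'd2p01F': 390}
# counts.plot.bar() then renders a chart like Figure 1.
```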

Now, let’s take one file as an example (“d2p01F”, collected in room 2 from a female volunteer) to recreate the activities recorded in it and their durations. An activity plot is used to understand the script followed during the recording.

Key Insights — Activity plot (Figure 2):

Figure 2. Activities performed during one recording, file “d2p01F” — Image created by Author
  • Most of the time the volunteer is lying in bed.
  • Lying state changes to sit and ambulate (lying -> sit on bed -> ambulate).
  • This sequence is performed two times during the recording.
  • The activities sit and ambulate only last a few seconds.
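One way to recover such an activity script programmatically is to split the label column into contiguous segments. A minimal sketch with fabricated labels, assuming the codes 1 = sit on bed, 2 = ambulate, 3 = lying (the real coding may differ):

```python
import pandas as pd

# Fabricated label column indexed by time in seconds.
labels = pd.Series([3, 3, 3, 1, 2, 3, 3, 1, 2],
                   index=[0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])

# A new segment starts wherever the label differs from its predecessor.
segment_id = (labels != labels.shift()).cumsum()
segments = [
    (int(chunk.iloc[0]), chunk.index[0], chunk.index[-1])
    for _, chunk in labels.groupby(segment_id)
]
print(segments)
# [(3, 0.0, 1.0), (1, 1.5, 1.5), (2, 2.0, 2.0), (3, 2.5, 3.0), (1, 3.5, 3.5), (2, 4.0, 4.0)]
```

Each tuple is (activity, start time, end time); plotting these segments as coloured bands over time gives an activity plot like Figure 2.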

To see how those activities are reflected by the sensors, let’s focus on the accelerometer and a traditional time series plot. The plot shows the accelerometer values on the three axes during the recording time. Vertical lines mark the moments when an activity change occurs.

Key Insights — Time series plot (Figure 3):

Figure 3. Frontal, vertical and lateral acceleration values for file “d2p01F” — Image created by Author
  • The accelerometer values seem very sensitive to those changes in the three axes.
  • There are no values recorded by the accelerometer between approximately 100 and 200 seconds.
  • The frontal axis shows more variation during the lying period compared with the others.
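The change markers in such a plot can be derived directly from the label column. A sketch with fabricated data standing in for “d2p01F”:

```python
import pandas as pd

# Fabricated accelerometer values and activity labels, indexed by time.
df = pd.DataFrame(
    {"frontal":  [0.2, 0.3, 0.9, 0.8, 0.2],
     "vertical": [-0.9, -0.8, -0.1, -0.2, -0.9],
     "lateral":  [0.1, 0.1, 0.0, 0.1, 0.1],
     "label":    [3, 3, 1, 1, 3]},
    index=[0.0, 0.25, 0.5, 0.75, 1.0],
)

# Time stamps where the activity label differs from its predecessor.
changed = df["label"].ne(df["label"].shift()) & df["label"].shift().notna()
changes = df.index[changed]
print(list(changes))  # [0.5, 1.0]
# Plot sketch: ax = df[["frontal", "vertical", "lateral"]].plot(), then
# ax.axvline(t, linestyle="--") for each t in `changes` gives Figure 3.
```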

The gap observed in this recording seems to be related to the sudden movement executed to get out of bed. To be aware of this kind of situation, it is important to analyse the sampling rate. An easy way to do so is to analyse the differences between consecutive records. Normally, most of the differences cluster around the same value, but outliers can indicate anomalies during the recording.
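With the time column as index, those differences are one diff() away. A sketch with fabricated time stamps containing one large gap:

```python
import pandas as pd

# Nominal 0.25 s sampling period with one large recording gap.
times = pd.Series([0.0, 0.25, 0.5, 0.75, 121.0, 121.25])

# Differences between consecutive records: the bulk reveals the
# sampling rate, the extremes reveal anomalies.
diffs = times.diff().dropna()
print(diffs.median())  # 0.25
print(diffs.max())     # 120.25
# diffs.plot.box() then draws a boxplot like Figure 4.
```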

Key Insights — Boxplot (Figure 4):

Figure 4. Time differences between every consecutive pair of observations, file “d2p01F” — Image created by Author
  • The box is almost a line, which means most of the values are concentrated around the median (0.25 seconds).
  • There are some outliers (circles) close to this value, except for one that is more than 120 seconds apart.

Such an extreme outlier could be considered a serious anomaly in the recording. This analysis can be extended to the whole dataset.

Key Insights — Multiple Boxplot (Figure 5):

Figure 5. Boxplot of the time differences for all the files in the dataset — Image created by Author
  • The sampling rate is consistent in all the files.
  • Extreme outliers occur in most of them.
  • The highest differences are concentrated in the samples from room 2.

2. Correlations

Correlation analysis is a common topic to investigate during EDA, and it is very useful when exploring sensor data. In this domain, finding strong correlations between features is quite common. The results of this analysis can guide future decisions about the applicability of dimensionality reduction techniques.

We can measure the correlation at different levels. At the file level, we investigate how the features relate within a file by analysing the multivariate time series recorded in it. On the other hand, the time window level, or rolling correlation, explores the relation between two time series as a rolling window calculation. This section addresses both levels.

2.1 File Level

I continue analysing the file “d2p01F” and compute the pairwise Pearson correlation between all features, excluding the nominal ones (id_antenna and label). The results are shown using a heatmap plot, with the correlation values included in each cell.
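The computation itself is a one-liner in pandas. The numbers below are fabricated, and the heatmap call is only sketched in a comment since plotting backends vary:

```python
import pandas as pd

# Fabricated numerical features for one recording (nominal columns
# such as id_antenna and label are dropped beforehand).
df = pd.DataFrame({
    "frontal":  [0.2, 0.3, 0.9, 0.8, 0.2],
    "vertical": [-0.1, -0.3, -0.8, -0.9, -0.2],
    "lateral":  [0.1, 0.0, 0.2, 0.1, 0.1],
})

# Pairwise Pearson correlation between all numerical features.
corr = df.corr(method="pearson")
print(corr.round(2))
# seaborn's sns.heatmap(corr, annot=True) renders it as in Figure 6.
```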

Key Insights — Heatmap plot (Figure 6):

Figure 6. Pairwise correlation between all numerical features, file “d2p01F” — Image created by Author
  • There are no strong correlations in this example, neither negative nor positive.
  • Pairs frontal-vertical and rssi-frequency show some negative correlation.

To get a general idea about the correlation in this dataset we need to extend the analysis to the rest of the files. Multiple heatmap plots are used to show the results.

Key Insights — Multiple Heatmap plots (Figure 7):

Figure 7. Pairwise correlation computed for all the files — Image created by Author
  • Most of the pairs are uncorrelated (values around 0).
  • The correlations vary between the files. For example, the pair frontal-vertical is uncorrelated (“d2p05F”), negatively correlated (“d2p06F”) or positively correlated (“d1p12F”), depending on the file.
  • Those variations seem to be produced by specific gestures performed by the volunteers.

2.2 Rolling Time Window

To compute the rolling correlation, I use the built-in corr method from the pandas library, specifying a rolling window of 50. The results for the correlation between frontal and vertical in file “d2p01F” are shown using a single-row heatmap plot.
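Here is a runnable sketch of the same idea, with two fabricated signals that switch from negative to positive coupling near the end, mimicking the pattern described below:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300

# `vertical` mirrors `frontal` for most of the recording, then tracks
# it during the last 50 samples, so the rolling correlation flips sign.
frontal = pd.Series(rng.normal(size=n))
vertical = -frontal + 0.1 * rng.normal(size=n)
vertical.iloc[250:] = frontal.iloc[250:] + 0.1 * rng.normal(size=50)

# pandas' built-in rolling correlation with a 50-observation window;
# the first 49 values are NaN because the window is not yet full.
roll = frontal.rolling(window=50).corr(vertical)
print(roll.isna().sum())  # 49
print(roll.iloc[100] < -0.9, roll.iloc[-1] > 0.9)  # True True
```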

Key Insights — Heatmap plot with single row (Figure 8):

Figure 8. Rolling correlation window between frontal and vertical features in file “d2p01F” — Image created by Author
  • There isn’t any valid value for the first 49 observations (white) because the window size is 50.
  • Most of the observations show negative correlations (yellow).
  • Between observations 250 and 300 approximately, the correlations are strong and positive (blue) due to the person’s movement.

This is a good example of how variable the correlations can be in the sensor data domain. The rolling correlation reveals specific periods of time where the correlation changes drastically compared with the correlation at file level.

3. Distribution Analysis

In order to get more insights about the sensor dataset, it is important to know which distributions we are working with. In addition, many Machine Learning models were designed under the assumption of specific distributions. To determine the distribution that best fits the data, we use the distfit package [4]. It tries out many well-known distributions and returns the one that best matches the data. A plot function is also provided to visualise the results. The method is applied to the feature lateral from the example file.
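Since the distfit package may not be installed everywhere, the sketch below reproduces its core idea with scipy alone: fit several candidate distributions and keep the one whose pdf deviates least from the empirical histogram. The data, parameters and candidate list are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Fabricated stand-in for the `lateral` feature: heavy-tailed values
# centred around -1.05, roughly matching the range in Figure 9.
x = -1.05 + 0.05 * rng.standard_t(df=3, size=1000)

# Core of what distfit automates: fit candidate distributions and
# score each by the squared error against the empirical histogram.
hist, edges = np.histogram(x, bins=50, density=True)
centres = (edges[:-1] + edges[1:]) / 2

best_name, best_rss = None, np.inf
for name in ["norm", "t", "uniform"]:
    dist = getattr(stats, name)
    params = dist.fit(x)
    rss = np.sum((hist - dist.pdf(centres, *params)) ** 2)
    if rss < best_rss:
        best_name, best_rss = name, rss

print(best_name)  # a bell-shaped candidate, not 'uniform'
```

distfit wraps this pattern over many more scipy distributions and adds the plotting shown in Figure 9.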

Key Insights — Distribution plot (Figure 9):

Figure 9. Fitting of lateral feature with a t-distribution, file “d2p01F” — Image created by Author
  • The feature lateral matches very well with a t-distribution.
  • Values are concentrated mostly between -1.2 and -0.9.

We apply this process to all files and numerical attributes, obtaining a summary of the distribution types selected most often. Multiple bar plots show a compact view per feature. The bars represent the number of times each distribution was selected across all the files.

Key Insights — Multiple bar plots (Figure 10):

Figure 10. Distribution summary by feature — Image created by Author
  • The dweibull-distribution is the most popular selection for the attributes frontal, vertical, rssi and phase, being the best match for dozens of files in each case.
  • Another popular distribution for the features vertical and lateral is the t-distribution.
  • The beta-distribution is the most selected option for the frequency feature.

To better understand why so much of the data corresponds to the dweibull-distribution, we need to analyse specific examples in more detail. Let’s take one of those from the attribute frontal and plot the fitted distribution.

Key Insights — Distribution plot (Figure 11):

Figure 11. Fitting of frontal feature with a dweibull-distribution, file “d1p02M” — Image created by Author
  • The distribution is not unimodal; several peaks are clearly visible in the empirical distribution.
  • The dweibull-distribution was selected as the best match, but there are big differences with respect to the empirical distribution.

To better understand what caused this shape, I use the kernel density estimation method to fit the data, this time splitting the distribution by class.
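A per-class density estimate can be sketched with scipy’s gaussian_kde. The class codes and values below are fabricated (lying far from a movement class) just to reproduce the multi-peak effect:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)

# Fabricated `frontal` values for two classes of one recording.
frontal_by_class = {
    3: rng.normal(loc=-1.0, scale=0.05, size=500),  # lying
    2: rng.normal(loc=0.4, scale=0.15, size=200),   # a movement class
}

# One kernel density estimate per class instead of one global fit;
# each class contributes its own peak to the overall shape.
grid = np.linspace(-1.5, 1.0, 500)
peaks = {}
for label, values in frontal_by_class.items():
    kde = gaussian_kde(values)
    peaks[label] = grid[np.argmax(kde(grid))]

print(peaks)  # lying peaks near -1.0, the movement class near 0.4
```

Overlaying the per-class curves (e.g. with seaborn’s kdeplot and a hue argument) yields a plot like Figure 12.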

Key Insights — Kernel density estimation plot (Figure 12):

Figure 12. Kernel density estimation plot for frontal feature, file “d1p02M” — Image created by Author
  • The peaks correspond to different classes, which explains the difficulty of fitting the data with only one distribution.
  • The separation is most noticeable for the class lying (3), which has very different values compared with the other three classes, all of which imply movement.
  • This behaviour indicates that the attribute frontal seems to discriminate the class lying well from the rest.

On the other hand, if the goal is to compare different files while considering all numeric variables, a more advanced plot is needed. We discuss a solution based on the plot proposed in [5]. Its main advantage is that the distributions from different series can partially cover each other. This setup allows for multiple time series, which is very convenient for time series with high dimensionality. The original code can be found here. The next figure shows the distributions for three files: “d2p01F”, “d2p02F” and “d2p03F”. All time series were z-normalised, which centres the data around 0. The number at the right of every pdf shows the x-value with the highest probability; it helps to compare different distributions for a common feature.
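The z-normalisation step mentioned above is a simple column-wise transform; the values below are fabricated:

```python
import pandas as pd

# Fabricated numerical features from one recording.
df = pd.DataFrame({"frontal": [0.2, 0.4, 0.6, 0.8],
                   "vertical": [-1.0, -0.5, 0.0, 0.5]})

# z-normalisation: subtract each column's mean and divide by its
# standard deviation, centring the data around 0 with unit variance.
z = (df - df.mean()) / df.std()
print(abs(z["frontal"].mean()) < 1e-12)       # True
print(abs(z["frontal"].std() - 1.0) < 1e-12)  # True
```

This is what makes the peak locations printed next to each pdf directly comparable across files.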

Key Insights — Zorder distribution plot (Figure 13):

Figure 13. Distributions of numerical features for three files: “d2p01F”, “d2p02F” and “d2p03F” — Image created by Author
  • Feature frequency looks similar across the different files, also the numbers [-0.9, -1.0, -0.9] are close to each other.
  • Feature rssi is clearly bimodal, but the location of the highest peak changes in the case of the third file (“d2p03F”).
  • The feature vertical shows values of -0.4, 0.6 and -0.2; since these differ, the shapes show dissimilarities, which we can confirm visually.
  • In contrast, the feature lateral shows values of -0.4, -0.4 and -0.3; these are close to each other, and the distribution shapes are not very different.

4. Final Remarks

Sensor data analysis is not a trivial task. I hope you have found some inspiration and ideas for handling it. Some points to keep in mind:

  • Don’t rush to mix all the files; instead, try to understand what you have got first.
  • Treat every file as a mini dataset; understanding the data at that level will help you understand the whole picture later.
  • Don’t forget to analyse correlations. Even if you are confident about how the data should behave, you may be in for some surprises.
  • Include data distribution in your analysis. It is one of the best ways to find differences between the files, even in cases where you are not expecting them.

5. References

[1] Tukey, John W., Exploratory Data Analysis (1977), Pearson. ISBN 978–0201076165.

[2] Shinmoto Torres, R. L., Ranasinghe, D. C., Shi, Q., Sample, A. P., Sensor enabled wearable RFID technology for mitigating the risk of falls near beds (2013), In 2013 IEEE International Conference on RFID (pp. 191–198). IEEE.

[3] Dua, D. and Graff, C., UCI Machine Learning Repository (2019), Irvine, CA: University of California, School of Information and Computer Science.

[4] Taskesen. E., Distfit (2019), https://github.com/erdogant/distfit.

[5] Rougier, N. P., Scientific Visualization: Python + Matplotlib (2021), ISBN 978-2-9579901-0-8.
