
Self-Supervised Representation Learning from Wearable Data in Federated Setting



Photo by Jorge Ramirez on Unsplash

Smartphones, wearables, and Internet of Things (IoT) devices produce a wealth of data that cannot be accumulated in a centralized repository for learning supervised models due to privacy, bandwidth limitations, and the prohibitive cost of annotations. Federated Learning provides a compelling framework for learning models from decentralized data, but conventionally, it assumes the availability of labeled samples, whereas on-device data are generally either unlabeled or cannot be annotated readily through user interaction.

To address these issues, a self-supervised approach termed scalogram-signal correspondence learning [1], based on the wavelet transform, is proposed; it learns useful representations from unlabeled sensor inputs, such as electroencephalography, blood volume pulse, accelerometer, and WiFi channel state information.

The proposed auxiliary task requires a deep temporal neural network to determine whether a given pair of a signal and its complementary view (i.e., a scalogram generated with a wavelet transform) align with each other, by optimizing a contrastive objective. The effectiveness of the method is demonstrated by learning representations from unlabeled inputs and using them to solve downstream tasks: training a linear classifier over the pretrained network, evaluating usefulness in the low-data regime, transfer learning, and cross-validation.

Self-Supervised Learning

The field of self-supervised learning exploits the natural supervision available within the input signal to define an auxiliary task that can enable the network to learn broadly-usable representations. In the past few years, several self-supervised methods have been developed for vision, audio, language modeling, and other domains.

The prominent approaches for learning from traditional input modalities include colorization of grayscale images [2], predicting the relative location of image patches [3], audio-visual synchronization [4], temporal alignment in videos through cycle-consistency, and robotic imitation learning via time-contrastive networks [5].

However, little to no attention has been paid to other sensing modalities, such as electroencephalography, IMUs, and blood volume pulse. Here, we seek to learn representations from time-series data produced by sensors at the edge, as obtaining a large amount of labeled data of this kind is time-consuming and extremely costly.

Federated Learning

Autonomous vehicles, wearables, smartphones, and IoT sensors are examples of modern distributed devices producing a wealth of data every second. This massive amount of data offers an excellent opportunity for learning models to solve a diverse range of tasks. The applications of interest include customized fitness plans, personalized language models, and contextual awareness for driving automation.

The growing computational power of edge devices allows us to leave the data decentralized and push the network computation to the client, which is also ideal from a privacy standpoint. The expanding area of federated learning [6, 7] explores methods for learning from highly distributed and heterogeneous data by aggregating locally trained models from remote devices, such as smartphones and wearables.
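
To make the aggregation step concrete, here is a minimal sketch of federated averaging (FedAvg [6]) in Python. Weighting each client by its local sample count follows the general recipe of [6]; the function name, the PyTorch-style state dicts, and the assumption of float parameters are illustrative choices rather than the exact setup of [1].

```python
import copy

def federated_average(client_states, client_sizes):
    """FedAvg [6]: average client parameters, weighted by local dataset size.

    client_states: list of model state_dicts (float parameter tensors)
    client_sizes: number of training samples on each client
    """
    total = sum(client_sizes)
    avg_state = copy.deepcopy(client_states[0])
    for key in avg_state:
        avg_state[key] = sum(
            state[key] * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return avg_state

# One communication round (hypothetical server-side usage):
# global_model.load_state_dict(federated_average(client_states, client_sizes))
```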

Scalogram-Signal Correspondence Learning (SSCL)

Learning multi-sensor representations with deep networks requires a large amount of well-curated data, which is made difficult by the diversity of device types, environmental factors, inter-personal differences, privacy issues, and annotation cost. In [1], a self-supervised auxiliary task is proposed whose objective, at a high level, is to contrast raw signals with their corresponding scalograms (visual representations of the wavelet transform [8]) so that a network learns to discriminate between aligned and unaligned scalogram-signal pairs.
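
As a quick illustration of how such a complementary view can be produced, the snippet below computes a scalogram with the continuous wavelet transform using PyWavelets. The Morlet wavelet, scale range, and sampling rate are illustrative choices, not necessarily the configuration used in [1].

```python
import numpy as np
import pywt

fs = 32                                # sampling rate in Hz (assumed)
t = np.arange(0, 10, 1 / fs)           # a 10-second segment
signal = np.sin(2 * np.pi * 1.5 * t)   # toy stand-in for a sensor channel

scales = np.arange(1, 65)              # wavelet scales to evaluate
coeffs, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1 / fs)
scalogram = np.abs(coeffs) ** 2        # power at each (scale, time) bin
print(scalogram.shape)                 # (64, 320): scales x time steps
```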

In the absence of semantic labels, our methodology can be leveraged to generate an endless stream of labeled data, and it can therefore train the network without any human involvement, which is particularly attractive for on-device learning.
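
A sketch of this self-labeling step is given below, under the assumption (consistent with the aligned/unaligned pairs described above) that a positive pair couples a segment with its own scalogram and a negative pair couples it with the scalogram of a different, randomly drawn segment. The helper name is hypothetical.

```python
import random

def make_pairs(signals, scalograms):
    """Generate (signal, scalogram, label) triples with no human labeling."""
    pairs = []
    n = len(signals)
    for i in range(n):
        pairs.append((signals[i], scalograms[i], 1))   # aligned pair
        j = random.choice([k for k in range(n) if k != i])
        pairs.append((signals[i], scalograms[j], 0))   # unaligned pair
    return pairs
```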

Figure 1: Scalogram contrastive network [1].

The idea behind SSCL is to learn network parameters with a self-supervised objective that determines whether a raw signal and a scalogram correspond (or align) with each other. Given a multi-sensor dataset with fixed-length input segments of multiple modalities, we train a multimodal contrastive network to synchronize representations of the raw input with their corresponding scalograms using a contrastive loss function.

In the broadest sense, the SSCL task requires a semantic understanding of how the time-frequency information presented in a scalogram relates to a raw input signal, thus enabling the model to learn general-purpose embeddings from a complementary view of the original input. We give a high-level overview of our approach in Figure 1.
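
The following PyTorch sketch conveys the overall shape of such a model: a temporal encoder for the raw signal, a 2-D encoder for the scalogram, and a head that scores whether the two views correspond. The layer sizes, fusion by concatenation, and binary cross-entropy objective are simplifying assumptions for exposition; see [1] for the actual architecture and contrastive objective.

```python
import torch
import torch.nn as nn

class SSCLNet(nn.Module):
    def __init__(self, emb_dim=64):
        super().__init__()
        # temporal encoder for the raw 1-D signal: (B, 1, T) -> (B, emb_dim)
        self.signal_enc = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=8, stride=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )
        # 2-D encoder for the scalogram: (B, 1, scales, T) -> (B, emb_dim)
        self.scalogram_enc = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )
        # binary head: do the two views correspond?
        self.head = nn.Linear(2 * emb_dim, 1)

    def forward(self, signal, scalogram):
        z_sig = self.signal_enc(signal)
        z_sca = self.scalogram_enc(scalogram)
        return self.head(torch.cat([z_sig, z_sca], dim=1))

model = SSCLNet()
criterion = nn.BCEWithLogitsLoss()  # trained on aligned/unaligned pairs
```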

Given that SSCL does not require labeled data, it can be used directly in a federated setting to learn representations from wearable data collected from multiple sensors. The t-SNE embeddings learned with the scalogram contrastive network on three different datasets are shown in Figure 2. Likewise, cross-validation results based on a user split of the datasets are presented in Table 1, where a linear classifier trained on top of a pretrained, frozen network is compared with fully-supervised and autoencoder models.

To understand the evaluation strategy and see the complete set of experiments in both centralized and federated settings, please see [1].
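
For readers who want the gist of the linear-evaluation protocol, here is a hedged sketch: the pretrained signal encoder is frozen and only a linear classifier is fit on its features. `model` refers to the SSCLNet sketch above; `labeled_loader` and `num_classes` are assumed to come from the downstream task.

```python
import torch
import torch.nn as nn

encoder = model.signal_enc           # pretrained via the SSCL task
for p in encoder.parameters():
    p.requires_grad = False          # keep pretrained weights frozen
encoder.eval()

num_classes = 5                      # e.g., number of activity classes (assumed)
clf = nn.Linear(64, num_classes)     # 64 matches the encoder's emb_dim
optimizer = torch.optim.Adam(clf.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for x, y in labeled_loader:          # small labeled downstream dataset
    with torch.no_grad():
        feats = encoder(x)           # frozen features
    loss = loss_fn(clf(feats), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```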

Table 1: Comparison of self-supervised representations to a fully-supervised network and pre-training with autoencoder using cross-validation.
Figure 2: t-SNE embedding learned with scalogram contrastive network [1].

Conclusion

We reviewed a self-supervised method for learning representations from unlabeled multi-sensor data, which is typical in the IoT setting. The method utilizes the wavelet transform to generate a complementary view of the input (i.e., a scalogram) to define an auxiliary task of scalogram-signal correspondence. This procedure is specifically designed to work in a federated learning setting, allowing networks to be trained on widely distributed and unannotated data, as the labels can be readily extracted from the data without a human in the loop. The efficacy of the developed technique is shown on several publicly available datasets involving diverse sensory streams, such as electroencephalogram, blood volume pulse, and IMUs.


References

[1] Saeed, Aaqib, et al. "Federated Self-Supervised Learning of Multi-Sensor Representations for Embedded Intelligence." IEEE Internet of Things Journal (2020).

[2] Larsson, Gustav, Michael Maire, and Gregory Shakhnarovich. "Colorization as a proxy task for visual understanding." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

[3] Noroozi, Mehdi, and Paolo Favaro. "Unsupervised learning of visual representations by solving jigsaw puzzles." European Conference on Computer Vision. Springer, Cham, 2016.

[4] Owens, Andrew, and Alexei A. Efros. "Audio-visual scene analysis with self-supervised multisensory features." Proceedings of the European Conference on Computer Vision (ECCV). 2018.

[5] Sermanet, Pierre, et al. "Time-contrastive networks: Self-supervised learning from video." 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018.

[6] McMahan, Brendan, et al. "Communication-efficient learning of deep networks from decentralized data." Artificial Intelligence and Statistics. PMLR, 2017.

[7] Kairouz, Peter, et al. "Advances and open problems in federated learning." arXiv preprint arXiv:1912.04977 (2019).

[8] Daubechies, Ingrid. "The wavelet transform, time-frequency localization and signal analysis." IEEE Transactions on Information Theory 36.5 (1990): 961–1005.

