Convolutional Neural Networks for EEG Brain-Computer Interfaces

With code examples in PyTorch and TensorFlow

Tim de Boer
Towards Data Science


Deep learning (DL) has seen an enormous increase in popularity in various fields. DL has been used for brain-computer interfaces (BCIs) with electroencephalography (EEG) as well. However, DL models needed to be adapted for EEG data. How has this been done, and how successful are DL approaches in the field of BCIs?

Figure 1. Photo by Josh Riemer on Unsplash

In this post, we first explain why DL can be advantageous compared to traditional machine learning (ML) methods for BCIs. We explain how convolutional neural networks (CNNs) work, and how they have been adapted for EEG data. We then dive into a specific network, EEGNET, and provide code examples of EEGNET in both PyTorch and TensorFlow. Lastly, we discuss how one may go about implementing deep transfer learning for EEG data.

This post is organized as follows:

  1. Why use deep learning?
  2. Specialized CNNs for EEG data
  3. Explaining the most popular CNN for EEG data: EEGNET
  4. Code examples of EEGNET in TensorFlow and PyTorch
  5. Is deep transfer learning possible?

Enjoy!

Why Deep Learning?

In the BCI field, two common approaches for classification are common spatial patterns (CSP) and Riemannian geometry (RG). Both methods are preceded by extensive pre-processing, with frequency filtering, a step used to extract features, being the most important part. CSP and RG are explained in more detail here, and frequency filtering is explained in more detail in this post.

The filter range for the feature-extraction step has to be chosen manually by the researcher, which introduces a subjective bias. In addition, the frequency range containing the most informative brain signals differs between subjects. As manually finding the optimal range for each subject can be quite laborious, researchers often choose a general range (8–40 Hz, for example) and hope it is sufficient for all subjects.

This is where DL comes into play.

The proposed advantage of DL is that it is an end-to-end approach, where feature extraction and classification are combined in one model. DL models can extract features themselves from raw EEG data, meaning the data can be pre-processed without filters manually chosen by the researcher. Some studies use raw EEG data [1], while others only apply a band-pass filter in a broad range of 1–100 Hz to minimize very low and high frequency noise artifacts [2].

Having this end-to-end approach takes away the subjective bias of the researcher for frequency filtering, with the DL model being able to learn the optimal range per individual subject.

Now that we know the advantages of DL, let’s see how it is used in the BCI field!

Convolutional Neural Networks for EEG

The most prominent example of DL in the BCI field is the application of convolutional neural networks (CNNs), which were originally developed for computer vision tasks on images, and have also been applied to audio signals.

Images and audio signals often have a hierarchical structure, where nearby features matter most for the current feature, and far-away features less so. When EEG data is seen as a 2D array, with the number of timepoints as the width and the number of electrodes as the height (as seen in Figure 2), it has similar characteristics to an image or audio signal. Data from nearby timepoints is important for the current datapoint, as is data from the other channels at the same timepoint. Using convolutions and non-linearities, a CNN can learn local non-linear features and local patterns in these types of data.

Figure 2: An example of EEG data. EEG data can be seen as a 2D array, with the rows being the electrode channels, and the columns the timepoints. Image by author.

A CNN works by using a kernel. A kernel is a sliding window over the data, scanning from left to right and from top to bottom. For each scan, the dot product of the data in that window and the values of the kernel is calculated, essentially summarizing the information in that window. A visual example is given in Figure 3 below.

Figure 3: The sliding window of a CNN. Image by author, inspired by source.
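
To make this concrete, here is a minimal PyTorch sketch of what one window position computes (the toy data and kernel values are made up purely for illustration), compared against the full convolution:

```python
import torch
import torch.nn.functional as F

# Toy "EEG" segment: 3 channels x 5 timepoints, plus a 2 x 3 kernel (values are arbitrary).
data   = torch.arange(15, dtype=torch.float32).reshape(1, 1, 3, 5)  # (batch, maps, rows, cols)
kernel = torch.tensor([[[[1., 0., -1.],
                         [1., 0., -1.]]]])                          # (out_maps, in_maps, 2, 3)

# What the sliding window does at the top-left position: a plain dot product.
window = data[0, 0, 0:2, 0:3]
manual = (window * kernel[0, 0]).sum()

# The same operation applied at every window position at once.
out = F.conv2d(data, kernel)

print(manual.item(), out[0, 0, 0, 0].item())  # identical values
print(out.shape)                              # torch.Size([1, 1, 2, 3])
```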

Regarding CNNs for EEG data, the most popular models are developed with so-called temporal and spatial convolutions.

A temporal convolution has a kernel size of 1 x T (one channel high, T timepoints wide): the sliding window moves over time within each channel, summarizing the EEG data in that timeframe for each channel separately.

A spatial convolution is applied over all channels, for each timepoint, and thus summarizes information over all channels. The convolutions can be applied multiple times with different kernel values, creating different types of summaries of the original data (called feature maps).

Figure 4: A temporal convolution and spatial convolution applied to EEG data. Image by author.
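
As a minimal sketch of these two operations in PyTorch (the numbers of feature maps are chosen arbitrarily here), using the 8 channels x 500 timepoints shape used later in this post:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 500)  # (batch, feature maps, channels, timepoints)

# Temporal convolution: a 1 x 64 kernel slides over time within each channel.
temporal = nn.Conv2d(1, 4, kernel_size=(1, 64), padding='same', bias=False)

# Spatial convolution: an 8 x 1 kernel combines all channels at each timepoint.
spatial = nn.Conv2d(4, 8, kernel_size=(8, 1), bias=False)

h = temporal(x)
print(h.shape)           # torch.Size([1, 4, 8, 500])
print(spatial(h).shape)  # torch.Size([1, 8, 1, 500])
```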

The goal of such convolutions is to mimic the CSP pipeline: the temporal convolutions take the role of frequency filtering, and the spatial convolutions take the role of spatial filtering.

One of the most popular DL models for EEG classification is EEGNET [1]. Known for being a compact network with a relatively small number of parameters, EEGNET has been used extensively in recent studies.

Let’s explain EEGNET in detail, alongside code examples!

EEGNET

Figure 5: The adapted EEGNET network for our study. Image by author.

EEGNET consists of a temporal and spatial convolution, but also has another form of convolution, called a separable convolution. In the following sections, EEGNET is explained.

Please note that the original EEGNET differs a bit from our implementation explained here. For example, the authors of the original paper applied the model to EEG data of 64 electrode channels x 128 timepoints, while we used EEG data of 8 electrode channels x 500 timepoints. In general, it is recommended to play around and experiment with the kernel sizes and parameter values when applying the network to your own data.

The first layer of the network is a temporal convolution. The kernel size of this convolution was kept the same as in the original EEGNET, at 1 x 64. The number of feature maps in this layer, named the filter size (fz), was chosen based on a hyperparameter search. Each convolution layer is followed by batch normalization, which normalizes its output so that the next layer receives a normalized input.

The second layer is a spatial convolution of size 8 x 1. The first dimension is equal to the number of electrode channels. The number of feature maps from the previous layer is multiplied by a depth parameter (D), which was also chosen based on the hyperparameter search. After batch normalization, a non-linearity is applied in the form of an exponential linear unit (ELU). ELU keeps the output x the same when x > 0, and for x ≤ 0 it applies exp(x) − 1.
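
As a quick sanity check of this ELU definition, a tiny snippet comparing PyTorch's built-in ELU with the formula above:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
manual = torch.where(x > 0, x, torch.exp(x) - 1)  # x for x > 0, exp(x) - 1 for x <= 0

print(nn.ELU()(x))  # tensor([-0.8647, -0.3935,  0.0000,  1.5000])
print(manual)       # same values
```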

Then, temporal average pooling with a kernel size of 1 x 5 and a stride of 5 was applied, averaging the data over every 5 timepoints to reduce dimensionality. As the input size in our study (500) was not divisible by the stride value of 8 used in the original EEGNET, we chose a stride value of 5.

After average pooling, a dropout layer followed. During training, dropout randomly zeroes elements of the input with a certain probability pdrop. This reduces overfitting by discouraging individual nodes from relying too heavily on specific nodes in earlier layers; such strong dependencies can make the model latch onto features that are specific to the training data and absent from the validation and test data. The value of pdrop was found by hyperparameter search.

Next, a separable convolution layer was applied. It consists of a depthwise temporal convolution with kernel size 1 x 16 (as used in the original EEGNET), directly followed by a 1 x 1 pointwise convolution that combines the outputs across all feature maps, essentially summarizing the output of the temporal convolution over the feature maps. This layer is again followed by batch normalization and an ELU.
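
In PyTorch, such a separable convolution can be built from a grouped (depthwise) convolution followed by a 1 x 1 pointwise convolution. A minimal sketch, with illustrative filter counts (fz = 4, D = 2) rather than the values from our hyperparameter search:

```python
import torch
import torch.nn as nn

fz, D, F2 = 4, 2, 8                 # illustrative sizes, not the tuned values
x = torch.randn(1, fz * D, 1, 100)  # example shape after the spatial convolution and pooling

separable = nn.Sequential(
    # Depthwise temporal convolution: each feature map is filtered independently (groups = maps).
    nn.Conv2d(fz * D, fz * D, kernel_size=(1, 16), groups=fz * D, padding='same', bias=False),
    # Pointwise 1 x 1 convolution: summarizes the result across all feature maps.
    nn.Conv2d(fz * D, F2, kernel_size=(1, 1), bias=False),
    nn.BatchNorm2d(F2),
    nn.ELU(),
)

print(separable(x).shape)  # torch.Size([1, 8, 1, 100])
```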

After that, another 1 x 5 average pooling was applied, followed by a dropout layer. Lastly, the data was flattened and a linear layer was applied.

As in the original EEGNET, all convolution layers explained above were applied with a stride of 1 and without bias. For the temporal convolutions, ‘same’ padding was used, where zeros are added to the left and right of the input so that the output has the same size as the input after the convolution.

As optimization method, the Adam optimizer was used. The learning rate lr of Adam was also found by hyperparameter search.

EEGNET in PyTorch and TensorFlow

Now that all the explanation is out of the way, let’s see how to code this model up!

The original authors have provided their model implementation in their Github repository. It is written in TensorFlow, and boils down to the following code:
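
The snippet below is a condensed sketch along those lines, using the layer sizes from the paper [1] (64 channels, 128 timepoints, F1 = 8, D = 2, F2 = 16); details of the authors' version, such as the max-norm weight constraints, are left out here:

```python
from tensorflow.keras import layers, models

def EEGNet(n_classes, chans=64, samples=128, dropout_rate=0.5,
           kern_length=64, F1=8, D=2, F2=16):
    inputs = layers.Input(shape=(chans, samples, 1))

    # Block 1: temporal convolution followed by a depthwise spatial convolution.
    x = layers.Conv2D(F1, (1, kern_length), padding='same', use_bias=False)(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.DepthwiseConv2D((chans, 1), depth_multiplier=D, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('elu')(x)
    x = layers.AveragePooling2D((1, 4))(x)
    x = layers.Dropout(dropout_rate)(x)

    # Block 2: separable convolution (depthwise temporal + pointwise 1 x 1).
    x = layers.SeparableConv2D(F2, (1, 16), padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('elu')(x)
    x = layers.AveragePooling2D((1, 8))(x)
    x = layers.Dropout(dropout_rate)(x)

    # Classification head.
    x = layers.Flatten()(x)
    x = layers.Dense(n_classes)(x)
    outputs = layers.Activation('softmax')(x)
    return models.Model(inputs=inputs, outputs=outputs)
```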

Our implementation was developed in PyTorch, and comes down to:
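
The sketch below follows the architecture described above (8 channels, 500 timepoints, temporal kernel 1 x 64, spatial kernel 8 x 1, 1 x 5 pooling with stride 5, separable convolution with a 1 x 16 kernel). The default values for fz, D and p_drop are placeholders; the actual values came from the hyperparameter search:

```python
import torch
import torch.nn as nn

class EEGNet(nn.Module):
    def __init__(self, n_classes=2, n_channels=8, n_timepoints=500, fz=8, D=2, p_drop=0.25):
        super().__init__()
        F2 = fz * D  # number of feature maps after the separable convolution

        self.block1 = nn.Sequential(
            # Temporal convolution (1 x 64), 'same' padding, no bias.
            nn.Conv2d(1, fz, (1, 64), padding='same', bias=False),
            nn.BatchNorm2d(fz),
            # Spatial (depthwise) convolution over all channels, with depth parameter D.
            nn.Conv2d(fz, fz * D, (n_channels, 1), groups=fz, bias=False),
            nn.BatchNorm2d(fz * D),
            nn.ELU(),
            # Temporal average pooling over 5 timepoints (stride 5), then dropout.
            nn.AvgPool2d((1, 5)),
            nn.Dropout(p_drop),
        )
        self.block2 = nn.Sequential(
            # Separable convolution: depthwise temporal (1 x 16) + pointwise 1 x 1.
            nn.Conv2d(fz * D, fz * D, (1, 16), groups=fz * D, padding='same', bias=False),
            nn.Conv2d(fz * D, F2, (1, 1), bias=False),
            nn.BatchNorm2d(F2),
            nn.ELU(),
            nn.AvgPool2d((1, 5)),
            nn.Dropout(p_drop),
        )
        # After two 1 x 5 poolings, 500 timepoints are reduced to 500 / 25 = 20.
        self.classify = nn.Linear(F2 * (n_timepoints // 25), n_classes)

    def forward(self, x):
        # x: (batch, 1, channels, timepoints), e.g. (batch, 1, 8, 500)
        x = self.block1(x)
        x = self.block2(x)
        x = torch.flatten(x, start_dim=1)
        return self.classify(x)
```

Training then only requires a standard loop with a cross-entropy loss and, as described above, torch.optim.Adam(model.parameters(), lr=lr), with lr taken from the hyperparameter search.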

Transfer learning?

EEGNET will most probably perform rather poorly when trained on data from a single individual only. Why? Simply because that amount of data is not enough for EEGNET to calibrate properly. For this reason, deep transfer learning (DTL) has been used in the BCI field, where a model first learns from data of multiple subjects before being applied to a new subject. However, first experiments have shown that DTL for EEG data comes with quite a few difficulties and problems. Let’s go over them.

Noise variability: EEG captures electrical activity originating from your brain. However, EEG electrodes are not smart devices: they cannot distinguish electrical signals originating from the brain from other electrical signals. And there can be a lot of those signals; think of your mobile phone, a monitor, air-conditioning, you name it. The difficult part here is that these sources of noise will differ each day and will be different in other environments. This variability of noise can cause differences between subjects and sessions within your own experiments, and the differences between datasets, captured in totally different environments, are even larger. Although this noise will only influence the EEG data by a small amount, it still introduces difficulties when training a general model over multiple datasets, or over data of multiple subjects [3].

The EEG device: Another source of variability is the EEG device itself, the main factor being the placement of the cap on the head of the subject. Although there is a standardized approach for electrode placement (the 10–20 international system), measuring the distances on the head of the subject and the subsequent placement of the cap is a slightly subjective process. This causes small differences in cap placement across subjects or sessions, leading to differences in EEG signals, as electrodes may end up closer to or further away from the signal sources in the brain. Again, this problem becomes bigger when comparing EEG data across multiple datasets: other experiments may have used different EEG devices with different materials and slightly different cap placement.

Neural variability: The brain is a complex organ. Capturing the thoughts of a person with EEG may seem straightforward, but the underlying neural mechanisms cause large differences between subjects, and even day-to-day differences within the same subject.

It has been found that even for the same subject, performing the same task under the same circumstances, brain signals differ from day to day [2]. This can cause problems in the DTL process, and also makes it difficult to use the same model across multiple days. Therefore, models should be fine-tuned or re-trained each day.

What makes DTL even more difficult, is the fact that some subjects do not exhibit a robust enough brain signal to be recognized as patterns by the DL models [4]. Having such subjects in your dataset can confuse the DL model during training.

Lastly, the way we think differs between persons, too. If we take motor imagery as an example, one could perform it in two ways:

  • Kinesthetic: imagining executing a motor task from a first-person view.
  • Visual: imagining someone (or yourself) performing the motor task from a third-person view.

It has been found that the first approach results in clearer patterns in brain activity than the latter [5]. If datasets contain a mixture of both approaches, this too can cause problems in the learning process for DTL.

An approach for EEG transfer learning: To wrap up the topic of transfer learning, here is a list of guidelines for applying transfer learning to EEG data:

  • To ensure data is similar between subjects, the best thing to do is to collect the data yourself. During this data collection, you have control over the instructions to your subjects (for example, instructing them to only perform kinesthetic motor imagery). You can also use the same EEG device, keep the environment as similar as possible, and control the cap placement.
  • To perform DTL, train your model on data of multiple subjects. For a specific subject or session, always collect a small amount of data to perform fine-tuning. Applying the general model directly to a new subject or session will most likely give bad results.
  • During the training process, have multiple subjects in your training set, and keep 1 subject in your validation set. After each epoch, simulate the fine-tuning process by fine-tuning the current model on a small amount of data from this subject, and then compute your validation accuracy by applying the fine-tuned model to the remaining data of that subject (a sketch of this loop is given below).
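
As a rough PyTorch sketch of that last guideline: the data loaders (train_loader for the pooled training subjects, finetune_loader for the small calibration set of the validation subject, val_loader for that subject's remaining data) and the learning rates are hypothetical placeholders you would fill in from your own recordings:

```python
import copy
import torch

def train_with_simulated_finetuning(model, train_loader, finetune_loader, val_loader,
                                    n_epochs=50, lr=1e-3, finetune_lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(n_epochs):
        # 1. Train on the pooled data of multiple subjects.
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()

        # 2. Simulate the per-subject fine-tuning step on a copy of the model,
        #    using a small calibration set from the held-out validation subject.
        finetuned = copy.deepcopy(model)
        ft_optimizer = torch.optim.Adam(finetuned.parameters(), lr=finetune_lr)
        finetuned.train()
        for x, y in finetune_loader:
            ft_optimizer.zero_grad()
            loss = loss_fn(finetuned(x), y)
            loss.backward()
            ft_optimizer.step()

        # 3. Validate the fine-tuned copy on the remaining data of that subject.
        finetuned.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for x, y in val_loader:
                correct += (finetuned(x).argmax(dim=1) == y).sum().item()
                total += y.numel()
        print(f"epoch {epoch}: validation accuracy {correct / total:.3f}")
```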

Conclusion

In this post, we covered:

  • How deep learning made its way to the BCI field due to the advantages of end-to-end learning
  • Why convolutional neural networks are the most popular type of deep learning models in the BCI field, and how they work
  • An in-depth explanation about EEGNET, the most popular model for EEG data
  • The code for EEGNET in both TensorFlow and PyTorch
  • Lastly, the problems one may encounter when applying deep transfer learning, and how to overcome them with a specialized training process

Thank you for reading. If interested, more details about all steps needed to build a BCI system can be found in my other post here.

References

  1. Vernon J. Lawhern, Amelia J. Solon, Nicholas R. Waytowich, Stephen M. Gordon, Chou P. Hung, and Brent J. Lance. EEGNet: a compact convolutional neural network for EEG-based brain–computer interfaces. Journal of Neural Engineering, 15(5):056013, 2018.
  2. Ce Zhang, Young-Keun Kim, and Azim Eskandarian. EEG-Inception: an accurate and robust end-to-end neural network for EEG-based motor imagery classification. Journal of Neural Engineering, 18(4):046014, 2021.
  3. Yalda Shahriari, Theresa M. Vaughan, L. M. McCane, Brendan Z. Allison, Jonathan R. Wolpaw, and Dean J. Krusienski. An exploration of BCI performance variations in people with amyotrophic lateral sclerosis using longitudinal EEG data. Journal of Neural Engineering, 16(5):056031, 2019.
  4. Benjamin Blankertz, Claudia Sanelli, Sebastian Halder, E. Hammer, Andrea Kübler, Klaus-Robert Müller, Gabriel Curio, and Thorsten Dickhaus. Predicting BCI performance to study BCI illiteracy. BMC Neuroscience, 10(Suppl 1):P84, 2009.
  5. Christa Neuper, Reinhold Scherer, Miriam Reiner, and Gert Pfurtscheller. Imagery of motor actions: differential effects of kinesthetic and visual–motor mode of imagery in single-trial EEG. Cognitive Brain Research, 25(3):668–677, 2005.
