Identifying the Number of Open Ion Channels with Hidden Markov Models

A write-up on how we almost won the “University of Liverpool — Ion Switching” Kaggle competition

Gilles Vandewiele
Towards Data Science


EDIT 04/06/2020: We have now released our code!

On the 24th of February, Kaggle released a new “Research” competition, with a prize pool of $25,000, in collaboration with the University of Liverpool. In this competition, we were provided with electrical signals corresponding to ion channel data, and our goal was to create an algorithm that could automatically identify the number of channels that were open at each time point. Initially, the competition launched with an exotic “Quadratic Weighted Cohen Kappa Score (QWK)” metric, but this resulted in near-perfect scores moments after release, which made me reluctant to join. After a while, the competition metric was changed to a more reasonable macro F1-score, after which it was clear that we were still far away from a perfect solution. I decided to join for several reasons:

  • It is timeseries data.
  • It is only a single timeseries, which allows us to focus more on pre-processing and modelling than on feature engineering.
  • The dataset was rather small, with a total of 5 million measurements in the training set and 2 million measurements in the test set. The total storage needed was around 125 MB.

I had the honor to team up with two of the brightest people I have ever had the chance of working with: Kha Vo and Zidmie. In the end, we managed to secure the third spot and $5,000, coming out on top of over 2,600 teams. The teams that finished first and second found a leak in the private test data towards the end of the competition that we did not discover ourselves (we only found it right after learning of its existence). Another way to phrase this is that we had the best non-leaky solution! In the end, we were able to find this leak ourselves in a short amount of time, which makes it even more frustrating…

This blog post provides a detailed overview of our solution and the insights that were required to come up with it.

Introduction

Problem statement

Ion channels are pore-forming membrane proteins, present in animals and plants, that allow ions to pass through the channel pore. They encode learning and memory, help fight infections, enable pain signals, and stimulate muscle contraction. These ion channels can be recorded through patch clamp techniques and analyzed to infer certain properties of diseases.

Schematic diagram of an ion channel from Wikipedia.

The organizers describe the challenge as follows: “When ion channels open, they pass electric currents. Existing methods of detecting these state changes are slow and laborious. Humans must supervise the analysis, which imparts considerable bias, in addition to being tedious. These difficulties limit the volume of ion channel current analysis that can be used in research. Scientists hope that technology could enable rapid automatic detection of ion channel current events in raw data. Technology to analyze electrical data in cells has not changed significantly over the past 20 years. If we better understand ion channel activity, the research could impact many areas related to cell health and migration. From human diseases to how climate change affects plants, faster detection of ion channels could greatly accelerate solutions to major world problems.”

We can clearly see an increase in our current when ion channels open.

A look at the data & competition metric

The data provided to us is rather simple, which is one of the nice things about this competition. It contains only three columns:

  • A timestamp
  • Ampere values (current) for each timestamp. These values were measured at a sampling frequency of 10 kHz.
  • The corresponding number of open channels at that timestamp (only provided for the train data).

The training dataset consists of 5 million rows while the testing dataset consists of 2 million rows. The goal of the competition is to maximize the macro F1-score on those 2 million rows. The F1 score of a certain class is defined as the harmonic average of the precision and recall:
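$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$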

The “macro” F1 is simply the average of the per-class F1 scores. So if we have K classes, we calculate K F1 scores and then take the mean of those K scores. If we inspect the provided data, we can already conclude a few interesting things…

The provided original data (train and test)

First, by inspecting the open channels in the training data, we can see that the data consists of batches of 500K measurements; these batch transitions are visible in the electrical signal as well. In the test set, we find batches of 100K. Moreover, as it turns out, the 500K batches in the training set actually consist of five 100K batches each.

The training consists of multiple batches of 500K. Taken from “One Feature Model” by Chris Deotte.

By inspecting the open channels in the training set further, we can distinguish five different groups in the training set (plus one extra category in the test set):

  • A group with open channels equal to 0 or 1, with many 0’s and few 1’s (category 1)
  • A group with open channels equal to 0 or 1, with many 1’s and few 0’s (category 2)
  • A group with open channels equal to 0, 1, 2 or 3 (category 3)
  • A group with open channels equal to 0, 1, 2, 3, 4 or 5 (category 4)
  • A group with open channels equal to 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 (category 5)
  • Surprisingly, the test data contained a lot of data that looked very similar to category 1, but with some spikes in the electrical signal that do not appear in the train data. Hence, we classify these as a separate category (category 6)

When combining this insight with the competition metric, it becomes clear that the group with up to 10 open channels (category 5) has the largest impact on the macro F1 average. It has 5 unique classes (6, 7, 8, 9, 10) that do not appear in other groups; as such, this group of data contributes more than 50% of our macro F1 score. During the course of the competition, Waylon Wu discovered, by probing, that the public LB consisted of the first 30% (600,000) of the test values, while the private LB consisted of the remaining 70% (1,400,000).

Pre-processing

Pre-processing I: removing drift

We noticed that weird drift is present at the beginning and the end of our training signal, and in different places in the test signal… We noticed two types of drift: curved drift (e.g. at the end of the training signal) and linear drift (e.g. at the start of the training signal, around time=50).

Two subsamples from the signal illustrating the two types of drift.

In the end, it turned out that the “linear” drift was not linear at all, but was actually just a small part of the “curved” drift (e.g. the first half of our curved drift looks linear as well). Shortly after the start of the competition, Eunho Lee and Markus F discovered that the competition organizers had introduced this artificial drift by using a sine function with a very low frequency (a period of 1,000,000 values). Chris Deotte bundled the insights from the two independent analyses to perform the cleanest drift removal and turned it into a dataset that all other competitors could use (thanks Chris!). Our new data now looks as follows:

The data after drift removal

We can see that something is still going wrong from time=360 until time=390. We decided not to bother with this, since such an anomaly was not present in the test set, and simply removed this data from our training set.

Pre-processing II: linear transformation

By inspecting the data, it becomes clear that the train signal and number of open channels are extremely correlated. As a matter of fact, applying a linear transformation on the electrical signal and then rounding the resulting values already performs rather well.

The electrical signal and the number of open channels are heavily correlated.

We decided to learn a linear transformation per category of data that best maps the electrical signal values to the number of open channels; this allowed easier analysis in later steps. The transformation is rather simple: we learned two parameters for the following function (1D linear regression):

1D Linear Regression to better align the electrical signal (x) with its respective number of open channels. o is the offset and s is the slope of our linear regression.

Our slope was the same for each category (0.8107), while the offset depended on the category (2.221318 for categories 2, 3, 4 and 6; 4.4288541 for category 5; and 2.180783 for category 1).

The aligned data. Notice how the electrical signal now already corresponds very well to the respective number of open channels. Rounding the signal values would already result in a strong baseline.
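As a rough sketch of this step (using np.polyfit as a stand-in for whichever fitting routine; the exact values we used are in the notebook mentioned below):

```python
import numpy as np

def fit_alignment(signal, open_channels):
    """1D linear regression: find slope s and offset o such that
    s * signal + o best matches the number of open channels."""
    slope, offset = np.polyfit(signal, open_channels, deg=1)
    return slope, offset

def align(signal, slope, offset):
    """Apply the learned per-category transformation to the electrical signal."""
    return slope * signal + offset
```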

We provide a notebook with python code that illustrates this.

Pre-processing III: power line interference removal

As I mentioned in the previous section, rounding our aligned electrical signal values would already result in a strong baseline. Unfortunately, it would not result in a perfect score… That is due to the fact that noise is present in our dataset (luckily, or this would have been quite a trivial challenge)! We can inspect this noise by simply calculating the following:
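In other words (with $\hat{x}_t$ the aligned signal and $y_t$ the true number of open channels, which are available for the training data):

$$\text{noise}_t = \hat{x}_t - y_t$$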

Below is a plot of the first 10000 noise values of our training set:

The first 10000 noise values of our training set

While it may be hard to see with the eye, there is somewhat of a wave pattern present within our noise. This becomes even more clear if we take a moving average with a centered window of size 50:

A moving average (centered window of size 50) of our 10000 noise values.

We can clearly see a wave with a peak every 200 values on average. A periodic signal, measured at a sampling frequency of 10 kHz, with a period of around 200, corresponds to a frequency of 50 Hz. It turns out that the AC power grid in the UK (the competition is organized by the University of Liverpool) has a frequency of 50 Hz! In order to remove this pattern from our noise, we decided to fit a sine function on each batch of 100,000 noise values (you can use scipy.optimize for that). While signal processing techniques such as bandpass and bandstop filters could do this as well, we had better results with a simple sine fit. It also turned out that using multiple components in the sine function (i.e. multiple sine waves with different amplitudes, phases and frequencies) worked best.

10000 noise values and the corresponding fitted sine function.
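A minimal sketch of such a per-batch fit (single sine component for brevity; parameter names and initial guesses are illustrative):

```python
import numpy as np
from scipy.optimize import curve_fit

def sine(t, amplitude, freq, phase, offset):
    """A single sine component; in practice we summed several of these."""
    return amplitude * np.sin(2 * np.pi * freq * t + phase) + offset

def remove_power_line_interference(t, noise):
    """Fit a roughly 50 Hz sine to one batch of noise values (t in seconds,
    sampled at 10 kHz) and subtract the fitted wave from the noise."""
    p0 = [0.05, 50.0, 0.0, 0.0]                  # initial guess: amplitude, frequency, phase, offset
    params, _ = curve_fit(sine, t, noise, p0=p0)
    return noise - sine(t, *params), params
```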

Again, a notebook is provided to demonstrate this with python code.

Category 5 = Category 4 + Category 4

It seemed that our noise removal technique worked really well for all categories except one… There was something unique about the data in category 5 that distinguished it from all the rest. Through some experimentation, we noticed that we could mimic data from category 5 very well by simply adding two random batches from category 4 together.

Generating synthetic data of category 5 by adding two batches of category 4 together.

Under this assumption, we can also assume that the noise in category 5 is the sum of two sine functions. Moreover, this insight allows for data augmentation, which could be useful for, e.g., neural networks.
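A sketch of this kind of augmentation (assuming two drift-free category 4 batches with their labels; the function name is illustrative):

```python
import numpy as np

def make_synthetic_category5(signal_a, channels_a, signal_b, channels_b):
    """Create a synthetic category 5 batch (up to 10 open channels) by summing
    two category 4 batches (up to 5 open channels each). Both inputs are
    assumed to be drift-free; the noise of the two batches adds up as well."""
    return (np.asarray(signal_a) + np.asarray(signal_b),
            np.asarray(channels_a) + np.asarray(channels_b))
```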

As it turns out, this insight was actually the leak as well. We did not pursue this avenue long enough, but the private test data was EXACTLY the sum of two category 4 batches, one of which can be found in the TRAINING set... After the competition, my teammate Zidmie discovered (part of) this leak in roughly 30 minutes of looking for it (improving our private score to 0.95430). It is a bit frustrating to know that we improved our score more in those 30 minutes, thanks to a leak, than with our two months of work. Later, he discovered the entire leak: y_test[5700000:5800000] = predict(X_test[5700000:5800000] - X_train[4000000:4100000]) + y_train[4000000:4100000]. Since you are now predicting category 4 data (with up to 5 open channels), it becomes a much easier task.

Reproducing the leaks used by #1 and #2. #1 extracted X_train from X_test while #2 extracted y_train from X_test. Extracting X_train is better than y_train, because you also reduce the noise by half!

Modeling

Hidden Markov Model: 1-channel data

This meme was too good of a fit to not be using it. In the end, it turned out that a leak in the private test data was the actual winner (perhaps I should create a new meme…).

Sequential state space models, such as Hidden Markov Models, were the state of the art for these types of problems for a long time. Only more recently have deep learning variants been proposed (e.g. this paper by the competition organizers), as these are often more robust against artifacts such as drift. Since we pre-process our data to remove these artifacts, we decided to go for Hidden Markov Models as our modeling technique in this competition.

There are many great introductions to Hidden Markov Models out there, so I will not go into much detail here. All we need to know is that a Hidden Markov Model has the following parameters:

  • The number of hidden states (= K)
  • Initialization probabilities (vector of size K)
  • A transition matrix (K x K matrix)
  • Emission probabilities. Each hidden state emits a certain observable value. These emissions can have different distributions, such as a multinomial for discrete emissions or a Gaussian for continuous emissions (which we use here, since our electrical signal is continuous).

In short, the open channels (hidden variables) follow a certain Markov process and transition from one value to another with probabilities determined by the transition matrix. When the Markov process is in a certain hidden/latent state, it emits an observable value, which here is the electrical signal.

As an example, we are going to focus on category 2 which has a binary response variable (open channels = 0 or 1), and assume our Markov Model to have 2 hidden states. A hidden state for when the number of open channels is equal to 0 and one hidden state for when it is equal to 1. We can estimate both the transition matrix & the emission distributions easily by looking at our training data (i.e. count transitions from one state to the other and look at signal values corresponding to a certain number of open channels):

The estimated Gaussians and transition probabilities for category 2 of our training data.

This means that, for example, if the number of open channels equals 1, the signal value will be around 1 as well (with a bit of variance). The next value will be equal to 1 with a probability of 93.6% or to 0 with a probability of 6.4%. These parameters can be estimated by using the training data, as illustrated above, but also by applying the Baum-Welch algorithm. This algorithm is a special case of the Expectation-Maximization algorithm and is unsupervised (no cross-validation was needed, which allowed faster iterations). In a nutshell (and put simplistically), it makes predictions based on its initial parameters and then uses these predictions to update the parameters until convergence. Once the parameters are learned, we can perform inference by applying one of two algorithms: the Forward-Backward algorithm or the Viterbi algorithm.

Initial experiments already showed that the Forward-Backward algorithm worked slightly better (albeit slower) than Viterbi. Note that these algorithms are also used inside the Baum-Welch algorithm. A great Python library that implements all of these algorithms is hmmlearn.

Let’s fit a 2-state HMM on our pre-processed data of category 2:

Creating a Hidden Markov Model with 2 hidden states, we estimate the parameters of our emission distributions and the transition matrix based on our training data. The “lengths=[100000]*10” comes from the fact that there are 2 batches of data of category 2. Each batch is 500K long and consists again of 5 batches of 100K. As such, we have 10 sequences of length 100000 in total.
The output of our script. The hmmlearn library performs the Baum-Welch algorithm for 5 iterations to adapt our parameters, after which it converges. The F1 score on this category of data is equal to 0.9961
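As a rough, self-contained sketch of what such a fit looks like with hmmlearn (the function name, the number of Baum-Welch iterations and the decoding details are illustrative; our actual code is in the notebook linked below):

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM
from sklearn.metrics import f1_score

def fit_two_state_hmm(signal, open_channels, n_sequences=10, seq_len=100_000):
    """Fit a 2-state Gaussian HMM on one category of the pre-processed signal
    and score the decoded states against the true open-channel counts."""
    X = np.asarray(signal).reshape(-1, 1)
    lengths = [seq_len] * n_sequences                  # e.g. 10 sequences of 100K values

    model = GaussianHMM(n_components=2, covariance_type="full", n_iter=50)
    model.fit(X, lengths=lengths)                      # Baum-Welch (EM), unsupervised

    # map every hidden state to an open-channel count through its emission mean
    order = np.argsort(model.means_.ravel())           # states sorted by emission mean
    state_to_channels = np.argsort(order)              # rank of each state = channel count

    posteriors = model.predict_proba(X, lengths=lengths)   # forward-backward posteriors
    predictions = state_to_channels[posteriors.argmax(axis=1)]
    return predictions, f1_score(open_channels, predictions, average="macro")
```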

So by using this rather simple approach, we can already achieve an F1 score of 0.9961 on category 2 of our data. It should be noted that categories 1 and 2 are by far the easiest ones, with near-perfect performance. The F1 score quickly drops as the maximum number of open channels increases; as an example, we could not reach an F1 score of 0.9 on category 5 of our data. Please see this notebook for a more elaborate example.

“Hidden” Hidden Markov Model

When reading a paper published by the organizers, it becomes clear that while the number of open channels follows a Markovian process, it is in turn generated by a Hidden Markov Model:

The ground truth (open channels) is generated by a Markov process; this can either be synthetic (which was the case for this competition) or real (as it is assumed that real ion channels follow a Markov process). This ground truth was then played through a patch clamp amplifier and recorded back, which generated our observable data with noise (the electrical signal). Figure taken from a paper by the organizers.

This Hidden Markov Model consists of more hidden states than the number of unique open channel values. This was pointed out during the competition by Markus F.

A “vanilla” HMM on the left, and a 2-layer or Hidden Hidden Markov Model on the right.

A simple way to approach this is by ignoring the middle layer (y(t)) in our 2-layer model. As such, we have a hidden Markovian process with more hidden states than unique open-channel values. Let’s apply this insight to our category 2 data. We will use 4 hidden states in our HMM (we experimented with other numbers, but 4 worked best). The first two hidden states correspond to when the number of open channels is equal to 1; the two following hidden states correspond to when it is equal to 0. Estimating the initial transition matrix now becomes more difficult, as we cannot directly estimate it from our training data; we therefore tuned the transition matrix manually.

In order to fit a 4-state HMM, we initialize the transition matrix manually and re-use our means and covariances for different hidden states.
The output of our script. Notice how the negative log-likelihood, which is being printed by hmmlearn during the iterations, is twice as large as the log-likelihood with 2 hidden states.
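To make the idea concrete, a sketch of such a manually initialized 4-state model could look as follows; the transition probabilities below are illustrative placeholders, not the values we actually tuned:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

# 4 hidden states: states 0 and 1 both correspond to "1 open channel",
# states 2 and 3 both correspond to "0 open channels".
model = GaussianHMM(n_components=4, covariance_type="full",
                    n_iter=50, init_params="sc")   # we provide transmat and means ourselves
model.transmat_ = np.array([                       # illustrative values, not our tuned matrix
    [0.93, 0.02, 0.04, 0.01],
    [0.03, 0.92, 0.01, 0.04],
    [0.05, 0.01, 0.91, 0.03],
    [0.01, 0.05, 0.03, 0.91],
])
model.means_ = np.array([[1.0], [1.0], [0.0], [0.0]])   # re-use the 2-state emission means
state_to_channels = np.array([1, 1, 0, 0])

# model.fit(X, lengths=[100_000] * 10) then refines these initial estimates with
# Baum-Welch; decoding works exactly as in the 2-state example above.
```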

As can be noticed, we now achieve an F1 score of 0.9972, an increase of roughly 0.001 compared to using only 2 hidden states. This improvement is very significant, especially considering that it grows even larger for higher numbers of open channels. A difference of 0.001 in F1 score is the difference between a silver and a gold medal in the end. The F1 score could probably be improved even more by introducing more hidden states or by adjusting the initial transition matrix, but a smaller transition matrix had several advantages, which will become clear later. Python code for this can be found in this notebook. For more information on how we manually tuned the transition matrix, we refer the reader to this notebook.

Extending Hidden Markov Models to K channels

It seems that we can very accurately model categories 1 and 2 of our data, which have up to 1 open channel. The $25,000 question is now whether we can apply this approach to the other categories of our data with more than 1 open channel. It turns out that this is far from trivial… Designing a large initial transition matrix and determining which hidden states correspond to which numbers of open channels is hard and expensive (our search space is effectively infinite).

As such, we made the assumption that when the maximum number of open channels in a batch of data equals K, we are dealing with K independent binary Markovian processes. Moreover, each of these Markovian processes has very similar parameters to category 2 of our data. Having several independent hidden processes that together produce one observable variable is a perfect fit for Factorial Hidden Markov Models. Unfortunately, we did not find a good library in Python: there is factorialHMM, which is a great package, but we did not get it working quickly enough during the competition, and there is Pyro.ai, but its learning curve is very steep (at least to me).

With no implementations available, we realized that it was going to be a difficult and long process to implement this ourselves. We therefore decided to “hack” our way around the independence aspect by converting the smaller transition matrices, used to model data with up to 1 open channel, into larger transition matrices. Each of the hidden states in our new, large Markov process corresponds to a possible combination of hidden states of the independent processes. The order does not matter (i.e. (0, 2), which means that one process is in hidden state 0 and the other in hidden state 2, is the same as (2, 0)). This construction is done recursively, as illustrated below:

In order to model 2 independent processes, we first extend our 4 hidden states to 16 hidden states that capture all possible combinations of hidden states. Afterwards we group similar states together (marked with same color). We could then continue further to model 3 processes by expanding our new 10x10 matrix into a 40x40 and reducing it back to a 20x20 (and so on). There are transitions between all states in the Markov Chain (I just did not draw all of them).
Expanding our 4x4 transition matrix to model independent processes.

In each iteration, we first expand our transition matrix. If the current transition matrix is N x N, then our newly expanded matrix will be N*K x N*K, capturing all possible interactions of our N current states with the K possible states of the newly added process. With 2 independent processes, an example would be that process 1 is in hidden state 0 and process 2 is in hidden state 1. After this expansion, we group similar states together (such as the green cells in our figure). Using our example, process 1 being in hidden state 0 and process 2 being in hidden state 1 is equivalent to process 1 being in hidden state 1 and process 2 being in hidden state 0. After this procedure, our newly generated matrix will be of dimensions C x C, with C equal to the number of combinations with repetition when drawing P states (one per independent process) out of K possible states. As an example, with 4 hidden states and 3 independent processes, C would be equal to 20 (Comb_with_Repetition(4, 3) = Comb(4 + 3 − 1, 3) = 20).

Below is Python code to do this:
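The sketch below illustrates this recursive expand-and-group construction in a readable (but unoptimized) form; the function and variable names are our own for illustration, and this is not the optimized implementation we actually used (see the note on runtimes below):

```python
import numpy as np

def expand_transition_matrix(T_single, n_processes):
    """Build the grouped transition matrix for `n_processes` independent,
    identical Markov processes, each following the K x K matrix `T_single`.

    A combined state is a sorted tuple (multiset) of individual hidden states,
    because the order of identical independent processes does not matter.
    Returns the list of combined states and the grouped transition matrix.
    """
    K = T_single.shape[0]
    states = [(s,) for s in range(K)]          # combined states for a single process
    M = np.array(T_single, dtype=float)

    for _ in range(n_processes - 1):
        # 1) expand: pair every current combined state with every state of the
        #    newly added process; 2) group pairs that form the same multiset.
        new_states = sorted({tuple(sorted(s + (i,))) for s in states for i in range(K)})
        old_index = {s: k for k, s in enumerate(states)}
        new_index = {s: k for k, s in enumerate(new_states)}
        new_M = np.zeros((len(new_states), len(new_states)))

        for g, grouped in enumerate(new_states):
            # any decomposition of the multiset works; take "old part" + last element
            a = old_index[grouped[:-1]]
            i = grouped[-1]
            for b, old_next in enumerate(states):
                for j in range(K):
                    h = new_index[tuple(sorted(old_next + (j,)))]
                    new_M[g, h] += M[a, b] * T_single[i, j]

        states, M = new_states, new_M

    return states, M
```

For a 4x4 initial matrix and 10 processes, this yields a 286 x 286 grouped matrix (Comb_with_Repetition(4, 10) = Comb(13, 10) = 286), and every row of the result still sums to 1.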

It should be noted that this algorithm was not that straightforward to implement: a simple (but much more readable) implementation resulted in estimated runtimes of a few weeks when the number of processes was equal to 10. Even the optimized implementation we ended up with took around 2 minutes to generate the transition matrix for 10 processes from a 4x4 initial matrix, and much longer for a 5x5 matrix (hence why we stuck with a 4x4).

Introducing memory in Hidden Markov Models

We implemented the forward-backward algorithm ourselves, which gave us a lot of freedom. We tried many different adjustments to the vanilla implementation of the algorithm, but the one that resulted in the most significant improvement was the introduction of memory: when calculating the probabilities at timestep t in the backward pass, we use the probabilities calculated at timestep t in the forward pass:

Calculating the backward probabilities (betas). Psig is our emission probability, Ptran our transition probability, c a coefficient (the higher, the more impact the forward probabilities have).
Calculating our forward probabilities (alphas).

In total, we perform three passes: one forward pass, whose alphas we use in the subsequent backward pass, and finally a last forward pass using the newly calculated betas. Our Psig is calculated as follows:

Calculating our Psig uses a Gaussian distribution with a mean of 0 and a sigma which can be tuned. This results in |States| probabilities.
Our adapted function, which was both used for the forward and backward pass. For the backward pass, we just transposed the transition matrix and reversed the signal.
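The sketch below captures the structure of this three-pass scheme. The exact way the memory term and the coefficient c enter the recursion differs from our implementation (see the working example linked below), so treat the blending step as an assumption rather than our precise formula:

```python
import numpy as np

def emission_probabilities(signal, state_values, sigma):
    """Psig: Gaussian likelihood of the aligned signal under each hidden state,
    where state_values holds the expected signal value of every hidden state."""
    diff = signal[:, None] - np.asarray(state_values)[None, :]
    return np.exp(-0.5 * (diff / sigma) ** 2)

def forward_pass(psig, ptran, memory=None, c=0.5):
    """One (optionally memory-augmented) forward recursion.

    psig   : (T, K) emission probabilities
    ptran  : (K, K) transition matrix
    memory : optional (T, K) probabilities from a previous pass; blending them
             in via a power c is an illustrative assumption, not our exact formula.
    """
    T, K = psig.shape
    out = np.zeros((T, K))
    alpha = np.full(K, 1.0 / K)
    for t in range(T):
        alpha = psig[t] * (ptran.T @ alpha)
        if memory is not None:
            alpha = alpha * memory[t] ** c
        alpha = alpha / (alpha.sum() + 1e-300)   # normalize for numerical stability
        out[t] = alpha
    return out

def forward_backward_with_memory(psig, ptran, c=0.5):
    """Three passes: forward, backward (reversed signal + transposed transition
    matrix, using the forward alphas as memory), and a final forward pass that
    uses the newly calculated betas."""
    alphas = forward_pass(psig, ptran)
    betas = forward_pass(psig[::-1], ptran.T, memory=alphas[::-1], c=c)[::-1]
    return forward_pass(psig, ptran, memory=betas, c=c)
```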

A working example of this custom forward-backward algorithm can be found here.

Post-processing

Post-processing the posterior probabilities

After our forward-backward algorithm, we are left with a T x K matrix containing the probability of each possible hidden state at each timestep. As we have multiple hidden states that correspond to the same number of open channels, we need another post-processing step to convert these probabilities into predictions. As a first step, we took the dot product of the K probabilities at each timestep with our K states, which is the same as a weighted sum and thus results in a continuous value Y. We then decided to learn thresholds for each class and each category of data: if Y falls between two thresholds, it is assigned to the corresponding class.

To determine these thresholds, we used somewhat of an unsupervised technique. First, we selected the signal values that were close to their rounded value (as not a lot of noise was present in these) and used their rounded values as labels. Then, we determined the distribution of these rounded values and picked the thresholds such that they would produce a similar distribution.
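A minimal sketch of this conversion (the function and argument names are hypothetical; our actual code is in the notebook linked below):

```python
import numpy as np

def posteriors_to_channels(posteriors, state_channels, thresholds):
    """Turn forward-backward posteriors into open-channel predictions.

    posteriors     : (T, K) probability of each hidden state at each timestep
    state_channels : (K,) number of open channels associated with each hidden state
    thresholds     : sorted class boundaries learned per category of data
    """
    # weighted sum = expected number of open channels, a continuous value Y
    y_continuous = posteriors @ np.asarray(state_channels, dtype=float)
    # the predicted class is the threshold interval that Y falls into
    return np.digitize(y_continuous, thresholds)
```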

Please check out this notebook with our post-processing code.

A bit of blending

Many variants of this HMM were created over the course of this competition. My teammate Zidmie used a counter in his scripts, and he was at 342 in the end (so 342 different scripts were written by him alone). Throughout the competition, we kept marginally improving both our CV and public LB scores. Often, the number of predictions that differed between consecutive versions was a lot higher than the difference in CV score, which means that not all of the changes made in a new version were actually improvements. We therefore blended the outputs of different forward-backward runs together (by using their calculated probabilities). This gave us some marginal gains.

Concluding notes

I am extremely proud of our final result. I think the solution is rather elegant, given that it is not the typical ensemble of 1001 different models that you often see on Kaggle, and that it runs in less than an hour. We also never used the code of the competition organizers to create extra data or to reverse engineer the data generation process. In the end, I was a bit disappointed to drop two spots due to a leak that was discovered by other teams and not by us. Nevertheless, I learned a whole lot during this competition, as I had never used a Hidden Markov Model before, and got to know two great and bright people!

Keep an eye on this page, as it may receive minor updates over the next few days. Also, we are working towards releasing our code!

Thank you again, Kha Vo and Zidmie, for a lot of fun during this competition!

Acknowledgements: the authors would like to thank Marco Gorelli for his constructive feedback.
