Check out what I learned and how the top solution for Kaggle’s challenging $25,000 Covid competition works

If you are genuinely interested in machine learning, you must have heard about the amazing data science competitions website, kaggle.com. Last month, Stanford hosted an urgent competition about predicting the degradation of the new mRNA Covid vaccine. To back up a little, the new Covid vaccine relies on messenger RNA molecules instead of DNA, mainly because the vaccine needs to be developed and deployed very rapidly.
The data provided by Stanford was tabular data that mainly contained the RNA sequences, their length, reactivity and several other features. One of the main challenges was that the competition was much shorter than usual Kaggle competitions: participants had to quickly develop robust solutions in one month using only about 3,000 RNA sequences. As a beginner in machine learning, I only managed to finish in the top 37%.
The solution:
Before diving into the details of this solution, there are 2 important points to note. I recently started embracing the fact that not all machine learning solutions are model-centric; some of them (especially on Kaggle) are data-centric. This means that most of the solution’s value comes from data engineering tricks rather than a complex and powerful model. The solution that I am going to explain here is data-centric and is actually built on top of several public kernels from the competition (which is one of the best things about Kaggle!).
I won’t be covering the tricks or details that are quite specific to the competition, as I want most of the audience to be able to apply these techniques to their own ML projects.
Here are the most significant and relevant aspects of the solution:
- Attention
- Graph Neural Networks
- Pre-processing autoencoders
- Pseudo-labelling
- Data augmentation
Attention:
Attention is probably one of the hottest topics in NLP at the moment.
A neural network is considered to be an effort to mimic human brain actions in a simplified manner. Attention Mechanism is also an attempt to implement the same action of selectively concentrating on a few relevant things, while ignoring others in deep neural networks.
Source: Analytics Vidhya
I am not going to explain how it works in depth, since this has been covered in a lot of great articles, and my point here is to show you that it is used to win data science competitions. If you want to learn more, I suggest reading this:
Intuitive Understanding of Attention Mechanism in Deep Learning
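Still, to give a rough flavour of the idea in the quote above, here is a minimal NumPy sketch of scaled dot-product attention (one common formulation). The function and variable names are my own illustration, not code from any competition kernel.

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Weight each value by how relevant its key is to each query."""
    d_k = keys.shape[-1]
    # Similarity between every query and every key
    scores = queries @ keys.T / np.sqrt(d_k)
    # Softmax turns the scores into attention weights that sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output is a weighted mix of the values: relevant positions dominate
    return weights @ values

# Toy example: 5 sequence positions, 8-dimensional embeddings, self-attention
x = np.random.randn(5, 8)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (5, 8)
```

This is the "selective concentration" from the quote in miniature: positions with higher similarity scores contribute more to the output.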
Graph Neural Networks:
This was one of my favorite things about this competition: so many different approaches being used! I didn’t fully understand them at the time, but I think I understand them a bit better now.
In computer science, graph theory is about modelling a problem as a set of vertices (nodes) connected by edges; a very common example is the Travelling Salesman Problem. This modelling technique then allows you to use tons of graph traversal algorithms that can be quite useful. To further elaborate, many successful solutions in this competition modelled the structure of the mRNA molecules as graphs and used neural networks to extract meaningful features from them.
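As a toy illustration of that idea (my own sketch, not code from any particular solution), the base-pairing structure of an RNA molecule can be encoded as an adjacency matrix, and a single graph-convolution step then mixes each base’s features with those of its neighbours:

```python
import numpy as np

# Hypothetical toy graph: 4 RNA bases, backbone edges (0-1, 1-2, 2-3)
# plus one base pair (0-3). Real solutions built this from the provided structures.
adjacency = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
], dtype=float)

node_features = np.random.randn(4, 8)   # one 8-dim feature vector per base
weights = np.random.randn(8, 16)        # learnable projection in a real model

# Add self-loops and normalise so each node averages over itself and its neighbours
a_hat = adjacency + np.eye(4)
a_hat /= a_hat.sum(axis=1, keepdims=True)

# One graph-convolution step: aggregate neighbour features, then project (ReLU)
hidden = np.maximum(a_hat @ node_features @ weights, 0)
print(hidden.shape)  # (4, 16)
```

Stacking several such steps lets information flow along both the backbone and the base pairs, which is exactly the structural signal the graph-based solutions were after.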
Autoencoders can be magical sometimes:
Autoencoders are neural networks that aid in feature engineering by compressing datasets into a latent representation, effectively performing dimensionality reduction. I have already used autoencoders heavily in ML projects, but it was great to see them being used in this competition to achieve high scores.
Essentially, a lot of people were pre-training autoencoders on their data first before passing it to RNNs; this allowed the RNNs to perform a lot better since they were processing fewer dimensions.
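A stripped-down version of that pre-training step might look like the Keras sketch below; the layer sizes and the simple dense architecture are placeholders I chose for illustration, not what the actual kernels used.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 64   # illustrative input width
latent_dim = 16   # compressed representation fed to the downstream model

# Encoder: compress the input into a small latent vector
encoder = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(latent_dim, activation="relu"),
])

# Decoder: try to reconstruct the original input from the latent vector
decoder = keras.Sequential([
    layers.Input(shape=(latent_dim,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(n_features),
])

autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

x = np.random.randn(1000, n_features).astype("float32")  # stand-in for real features
autoencoder.fit(x, x, epochs=5, batch_size=64, verbose=0)

# After pre-training, the encoder alone produces the compressed features
compressed = encoder.predict(x)
print(compressed.shape)  # (1000, 16)
```

The encoder’s output would then serve as the lower-dimensional input handed to the downstream RNN.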

Data extrapolation tricks: Pseudo labelling and data augmentation
The main point of these 2 techniques is to virtually increase the size and variation of the dataset. I have been reviewing a lot of top Kaggle competition solutions lately, and I can safely tell you that almost all of them use data augmentation. Common data augmentation tricks include the following (a small sketch of a couple of them follows the list):
- Random Flips (Horizontal & Vertical)
- Random brightness
- Random noise (such as gaussian)
- Inverting the data points
- Perturbation (which was used in the solution)
- and many more.
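Here is a quick NumPy sketch of what a couple of these tricks look like on an image-style array (my own toy example, not code from the competition):

```python
import numpy as np

rng = np.random.default_rng()

def augment(image):
    """Apply a random horizontal flip, a brightness shift and gaussian noise."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                            # random horizontal flip
    out = out * rng.uniform(0.8, 1.2)                 # random brightness
    out = out + rng.normal(0.0, 0.05, out.shape)      # random gaussian noise
    return np.clip(out, 0.0, 1.0)

image = np.random.rand(32, 32)    # stand-in for a real normalised input
augmented = augment(image)
```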
Perturbation was quite new to me and I think it’s very interesting. The following quote sums it up quite nicely:
The algorithm to conjure up this deceptive distribution is very simple. Take an existing dataset, perturb it slightly, and continue to maintain specific statistical properties. This is done by randomly selecting a point, adding a small perturbation, and then validating that the statistics are within targeted bounds.
Source: Medium
If you want to learn about the details, check out this article:
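In code, the loop described in that quote might look roughly like this (a simplified sketch of my own, with the perturbation scale and tolerance chosen arbitrarily):

```python
import numpy as np

def perturb_dataset(data, n_steps=10_000, scale=0.05, tol=0.01,
                    rng=np.random.default_rng(0)):
    """Perturb points one at a time while keeping the mean/std close to the original."""
    target_mean, target_std = data.mean(), data.std()
    perturbed = data.copy()
    for _ in range(n_steps):
        i = rng.integers(len(perturbed))             # randomly select a point
        candidate = perturbed.copy()
        candidate[i] += rng.normal(0.0, scale)       # add a small perturbation
        # Keep the change only if the statistics stay within the targeted bounds
        if (abs(candidate.mean() - target_mean) < tol and
                abs(candidate.std() - target_std) < tol):
            perturbed = candidate
    return perturbed

original = np.random.randn(500)
augmented = perturb_dataset(original)
```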
As an unsupervised learning fan, I was quite fascinated to see pseudo-labelling (which is a semi-supervised learning trick) being used in a top solution. Pseudo-labelling means using a model’s predictions on unlabelled data points as if they were real labels, effectively extending the labelled dataset. Of course, if this isn’t done properly, you will end up with data points that have incorrect labels.
This is usually done by training the network on labelled and unlabelled data simultaneously in each batch. I also found this quote by the solution’s author to be quite powerful:
Personally, I think using pseudo labelled (PL) random dataset, together with different/amazing architectures contributed by other people, have potential of huge improvements (if we still need better predictions to fight COVID now). The reason is, with PL random dataset we now have access to unlimited "data" – labeled with blending of forecasts from different architectures (better than any single model).
Source: Jiayang Gao on Kaggle
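Stripped of the competition-specific details, one common pseudo-labelling recipe looks roughly like the scikit-learn sketch below; note that this variant retrains on the combined data rather than mixing labelled and pseudo-labelled examples within each batch, and the confidence threshold is an arbitrary choice of mine, not the author’s pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: a small labelled set and a larger unlabelled set
rng = np.random.default_rng(0)
x_labelled = rng.normal(size=(200, 10))
y_labelled = (x_labelled[:, 0] > 0).astype(int)
x_unlabelled = rng.normal(size=(2000, 10))

# 1. Train on the labelled data only
model = LogisticRegression().fit(x_labelled, y_labelled)

# 2. Predict "pseudo-labels" for the unlabelled data, keeping only confident ones
probs = model.predict_proba(x_unlabelled).max(axis=1)
confident = probs > 0.9                        # arbitrary confidence threshold
pseudo_labels = model.predict(x_unlabelled[confident])

# 3. Retrain on labelled + confidently pseudo-labelled data together
x_combined = np.vstack([x_labelled, x_unlabelled[confident]])
y_combined = np.concatenate([y_labelled, pseudo_labels])
model = LogisticRegression().fit(x_combined, y_combined)
```

In the solution quoted above, the pseudo-labels came from a blend of several different architectures, which is what made the "unlimited data" claim work so well.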
Finally, here are the main lessons that I have learned from this competition:
1. Feature engineering is key
As an enthusiastic beginner in machine learning, I was very eager to start testing and prototyping quickly with complex neural networks. However, I now understand that you need to thoroughly examine, understand and analyze the data you are given if you wish to optimize the given metric. In fact, the top competitors were the ones with the best feature engineering rather than the most complicated models.
To rewind for a second here, let’s explain what typically happens in a machine learning project:
- First you are either given a dataset or you are required to gather and label a dataset.
- You now need to carefully investigate which of the features or attributes in this dataset are the best to reach your required goal or target.
- Then, you have to make an educated decision on the best machine learning model that can make sense of this data and perform the required analysis such as regression or classification.
It is very important that you take your time with the second step, although it can be tempting to jump straight to the third one. For instance, although the 1st place solution in this competition used a fairly standard model, it relied on data augmentation and pseudo-labelling, which are part of that second step.
2. What Recurrent Neural Networks (RNNs) and LSTMs are
The main machine learning task in this competition was text regression, which is quite a new area for me. One of the most standard and effective machine learning models in this subdomain is the RNN.
Although it may not be obvious, the selling point of RNNs over normal NNs is data persistence. RNNs have memory units, or hidden state cells, that allow them to "remember" previous data, which in text analysis is a significant advantage. This is because in text the different data batches (or windows) usually depend on one another, unlike, for example, classifying batches of images of cats and dogs. This allows the text to be analyzed within its original context. Basic NNs do still learn from the data, of course, but only within the same epoch/batch; RNNs, however, carry information from prior inputs while generating the current output.
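To make the sequence-regression setup a bit more concrete, here is a minimal Keras sketch of a basic recurrent model; the shapes and layer sizes are placeholders I picked for illustration, not the competition architecture (an LSTM layer, discussed next, would slot into the same place).

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

seq_len, n_bases, n_targets = 100, 4, 3   # illustrative shapes, not the real ones

# A basic recurrent model for sequence regression: the hidden state carries
# information from earlier positions in the sequence to later ones.
model = keras.Sequential([
    layers.Input(shape=(seq_len, n_bases)),        # one-hot encoded sequence
    layers.SimpleRNN(64, return_sequences=True),   # hidden state acts as "memory"
    layers.Dense(n_targets),                       # per-position regression outputs
])
model.compile(optimizer="adam", loss="mse")

x = np.random.rand(32, seq_len, n_bases).astype("float32")     # fake batch
y = np.random.rand(32, seq_len, n_targets).astype("float32")   # fake targets
model.fit(x, y, epochs=1, verbose=0)
```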

Long Short-Term Memory networks, or LSTMs, are an upgrade of basic RNNs. They have special gates that enhance the data persistence capabilities. These gates have additional parameters that the network can learn to filter the data along the way, essentially discarding useless features and keeping the most relevant ones. This is a bit similar to how our memory works: when we read a paragraph, we often remember keywords instead of the whole paragraph. In the computer world, this is done simply by multiplying the irrelevant word vectors by 0 (to get rid of them), by 1 (to keep them), or by a value in between.
To sum up, I have found that a significant part of the learning you do when participating in data science competitions happens after the competition actually ends, when you look at other people’s solutions. I guess the main point of this story is to share this reflection and learning experience. If you keep going through similar experiences, such as coding competitions, and you never reflect on them, you are most likely missing out on a lot of learning!