
How I Won a National Level ML Competition with my Unique “Informal Approach”

Think like a Data Science Hacker – You don't need to play by the rules to be victorious

DEEP LEARNING

Source: Image by the author (made using Canva)

Last year, in 2019, a company (I can't disclose their name!) conducted a National Level Machine Learning competition for hiring purposes. The competition itself was hosted on HackerEarth.

On the page it said:

[We are] looking for passionate ML Developers with hands-on skills in Machine Learning and Analytical skills to join their dynamic team in Hyderabad.

Eligibility Criteria:

Years of Experience: 0 to 2 years

Skills: Machine Learning and Analytical skills

I was really excited about this 13-day competition because of my strong interest in the fields of Machine Learning and Deep Learning. ^_^

So let’s dive into the details to see how I hacked my way through the problem and won the competition.


The Problem – "the mission"

Snippet from the Challenge Page:

MCQs are a widely-used question format that is used for general assessment on domain knowledge of candidates. Most of the MCQs are created as paragraph-based questions.

A paragraph or code snippet forms the base of such questions. These questions are created based on the three or four options from which one option is the correct answer. The other remaining options are called distractors which means that these options are nearest to the correct answer but are not correct.

You are provided with a training dataset of questions, answers, and distractors to build and train an NLP model. The test dataset contains questions and answers. You are required to use your NLP model to create up to three distractors for each question-answer combination that is provided in the test data.

Genre: NLP

Problem type: Contextual Semantic Similarity, Auto-generate Text-based answers

At first, the problem seemed simple. Inputs were clear, outputs were clear, and all I had to do was build the model. Right?

Hmmmmmmm …. No

Here is the structure of the training dataset:

Source: Image by the author (made using Canva)

And the structure of the evaluation dataset:

Source: Image by the author (made using Canva)

As you can see, I had to do a lot of preprocessing and restructuring before I could even begin.

It was essential to restructure the given training and testing data according to the formulation of the problem.

Since every approach I could think of needed a different formulation of the dataset, there were multiple paths I could take.

You can find the entire source code of the solution to the problem here.

So let’s skip to the fun part. Shall we?


Data Preprocessing – Where it all begins

As the saying goes,

"Garbage In, Garbage Out"

Nowhere does this hold more true than in Deep Learning and Machine Learning tasks like this one. For the NLP domain, NLTK (Natural Language Toolkit) is my weapon of choice.

I used it to remove unwanted symbols, punctuation, and stopwords for analysis and training purposes. Tokenization, POS tagging, edit distance, and Jaccard similarity calculations can also be done with NLTK.
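For illustration, here is a minimal sketch of that kind of NLTK cleanup and analysis (the toy sentences and the exact stopword choices are mine, not the competition pipeline's):

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.metrics import edit_distance, jaccard_distance

# One-time downloads for the tokenizer, stopword list, and POS tagger
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def clean_tokens(text):
    # Lowercase, tokenize, then drop English stopwords and punctuation
    stops = set(stopwords.words("english"))
    tokens = nltk.word_tokenize(text.lower())
    return [t for t in tokens if t not in stops and t not in string.punctuation]

question = "What is the color of the apple?"
answer = "The color of the apple is red."

q_tokens, a_tokens = clean_tokens(question), clean_tokens(answer)
print(nltk.pos_tag(q_tokens))                               # POS tagging
print(edit_distance("color", "colour"))                     # edit distance: 1
print(1 - jaccard_distance(set(q_tokens), set(a_tokens)))   # Jaccard similarity
```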

After all the cleaning, the next step in data preprocessing was making vectors of the given question, answer, and distractors. And for this, I had to choose a word embedding.


The Word Embedding Dilemma

Traditionally, people tend to use one-hot vectors for smaller datasets in text generation tasks. But for medium to large datasets, it is absolutely necessary to use embeddings of fixed dimensions to save memory and train the network efficiently.

I used GloVe (Global Vectors) and Word2Vec, both with a dimension size of 300, to convert my token vectors, which pass through the Embedding Layer of the neural architecture, into embedded vectors.

Unlike one-hot vectors, these vectors also capture the semantics of the sentence and its context.

Source: developers.google.com (redesigned using Canva)

The only problem here was properly calculating the loss and updating the gradients.

This can be done by using sparse_categorical_crossentropy as the loss function instead of categorical_crossentropy, and sparse_categorical_accuracy as the metric.
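To make the difference concrete, here is a tiny, self-contained comparison (toy numbers, not the competition model): the sparse variant consumes the integer word index directly, while the plain variant needs a one-hot target the size of the vocabulary.

```python
import tensorflow as tf

vocab_size = 5
probs = tf.constant([[0.1, 0.6, 0.1, 0.1, 0.1]])  # model output for one step

# Sparse variant: the target is just the integer word index...
sparse = tf.keras.losses.sparse_categorical_crossentropy(tf.constant([1]), probs)

# ...while the plain variant needs a one-hot vector of length vocab_size.
onehot = tf.keras.losses.categorical_crossentropy(
    tf.one_hot([1], depth=vocab_size), probs
)

print(float(sparse[0]), float(onehot[0]))  # same loss value, much cheaper targets
```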


Pickling and Saving Embeddings to Speed Up the process

Since it was pretty expensive to create the whole word-embedding matrix every time the program ran, I built the corpus of words from both the training and testing datasets and developed the embedding matrix once and for all.

After creating the embedding_matrix file using pickle, I could check for its local existence and load it into memory in a significantly shorter time.

Hence, speeding up the program.
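The pattern itself is simple; a minimal sketch (the file name and builder function here are illustrative, not the repo's exact names):

```python
import os
import pickle

EMB_PATH = "embedding_matrix.pkl"  # illustrative cache-file name

def load_or_build_embedding_matrix(build_fn):
    # Reuse the pickled matrix if it exists; otherwise build it once and cache it
    if os.path.exists(EMB_PATH):
        with open(EMB_PATH, "rb") as f:
            return pickle.load(f)
    matrix = build_fn()
    with open(EMB_PATH, "wb") as f:
        pickle.dump(matrix, f)
    return matrix
```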


Generating Sequences

Now, just before feeding our neural network, we need our data to be in the form of sequences.

Consider a function f. To produce the output y, it needs n inputs. Let the inputs be represented by x. Mathematically, we can write this as

y = f(x1, x2, x3, …, xn)

So here, [x1, x2, x3, …, xn] is the input sequence of length n, y is the output sequence of length 1, and f is the function determined by the neural net.

In my Python implementation, sequences were built using NumPy arrays. NumPy is a package of powerful tools for mathematical operations, and NumPy arrays are what go into the neural network directly.

Here, the sequence length I used was n = 5.

In short, a sequence of 5 words, when fed into the network, predicts the 6th word.
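A minimal NumPy sketch of this windowing (toy word indices; the repo's version differs in detail):

```python
import numpy as np

def make_sequences(token_ids, n=5):
    # Slide a window of n tokens over the stream; the (n+1)-th token is the target
    X, y = [], []
    for i in range(len(token_ids) - n):
        X.append(token_ids[i:i + n])
        y.append(token_ids[i + n])
    return np.array(X), np.array(y)

ids = [4, 17, 9, 2, 31, 8, 25, 3]   # toy stream of word indices
X, y = make_sequences(ids, n=5)
print(X.shape, y.shape)             # (3, 5) (3,)
```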


My Formal Approaches

After grasping the uniqueness of the given problem, I tried several different approaches in the quest for the highest-quality distractors. My definition of the perfect distractor was:

Distractor : A sentence in continuation of the question/phrase/stem which is highly similar, but not identical, to the answer/key.

Source: Image by the author (made using Canva)

My Top Five Approaches were:

Approach-1:

Idea: Using a 2-layer RNN-LSTM network with a word2vec embedding trained on the given dataset (NOT pre-trained), using fixed-length token vectors, before making sequences (of fixed length) out of them to feed into the neural network.

Embedding Layer >> LSTM >> Dropout >> LSTM >> Dropout >> Dense
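In Keras, that stack looks roughly like this (layer sizes, dropout rates, and the random stand-in matrix are my placeholders, not the competition hyperparameters):

```python
import numpy as np
import tensorflow as tf

vocab_size, emb_dim, seq_len = 10000, 300, 5
embedding_matrix = np.random.rand(vocab_size, emb_dim)  # stand-in for real vectors

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len,)),
    tf.keras.layers.Embedding(
        vocab_size, emb_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    ),
    # For Approach-3 below, wrap each LSTM in tf.keras.layers.Bidirectional(...)
    tf.keras.layers.LSTM(256, return_sequences=True),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(256),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["sparse_categorical_accuracy"])
```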

Flaw: The only flaw in this approach was that the word2vec embedding vectors could not be trained well on the given dataset, due to its semi-structured nature. Principal Component Analysis (PCA) and 2D scatter-graph plotting showed unrelated words (indeed, all words in the corpus) clustering together.

Hence, I moved to Approach-2.

Accuracy: between 12% and 17% on the evaluation criteria

Note: The evaluation criteria for the problem were based on Cosine Similarity, so this 12–17% accuracy was not that bad.


Approach-2:

Idea: Using a 2-layer RNN-LSTM network with GloVe pre-trained embedding vectors, using variable-length token vectors to make smaller fixed-length sequences and feeding them to the network.

Embedding Layer >> LSTM >> Dropout >> LSTM >> Dropout >> Dense

Flaw: This approach was much better but hit a roadblock when it got stuck in local minima, and it was slower at picking up context.

Hence, I moved to Approach-3.

Accuracy: between 16% and 20% on the evaluation criteria


Approach-3:

Idea: Using a 2-layer Bidirectional-LSTM network with GloVe pre-trained embedding vectors, using variable-length token vectors to make smaller fixed-length sequences and feeding them to the network.

Embedding Layer >> Bi-LSTM >> Dropout >> Bi-LSTM >> Dropout >> Dense

Flaw: This approach performed the best, but for predictions, the optimal temperature value was difficult to find.

Accuracy: above 17% on the evaluation criteria
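For context, temperature sampling rescales the model's next-word distribution before sampling: low temperatures make predictions conservative and repetitive, high ones make them diverse but noisy, which is why tuning it was fiddly. A generic sketch:

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0):
    # T < 1 sharpens the distribution (safer picks); T > 1 flattens it (riskier picks)
    logits = np.log(np.asarray(probs) + 1e-8) / temperature
    scaled = np.exp(logits - np.max(logits))
    scaled /= scaled.sum()
    return np.random.choice(len(scaled), p=scaled)

next_word_probs = [0.05, 0.70, 0.10, 0.15]      # toy softmax output
print(sample_with_temperature(next_word_probs, temperature=0.5))
```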


Approach-4:

Idea: Using a LogisticRegression and Random Forest Classifier (LR-RF) cascading model to rank the highest-quality distractors from a distractor pool built from the training dataset.

This approach was inspired by the research paper "Distractor Generation for Multiple Choice Questions Using Learning to Rank" by Chen Liang, Xiao Yang, Neisarg Dave, Drew Wham, Bart Pursel, and C. Lee Giles.

Accuracy: between 15% and 18% on the evaluation criteria
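The paper's feature engineering is richer than this, but the cascade idea itself is compact. A hedged sketch with toy features (in the real setting each row would hold similarity features between question, answer, and candidate):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Toy data: one row of features per (question, answer, candidate-distractor)
# triple; label 1 means the candidate was an actual distractor for that question.
rng = np.random.default_rng(42)
X_train, y_train = rng.random((500, 6)), rng.integers(0, 2, 500)
X_pool = rng.random((100, 6))                        # candidates from the pool

lr = LogisticRegression().fit(X_train, y_train)
keep = lr.predict_proba(X_pool)[:, 1] > 0.5          # stage 1: prune the pool

rf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
scores = rf.predict_proba(X_pool[keep])[:, 1]        # stage 2: rank the survivors
top3 = np.argsort(scores)[::-1][:3]                  # pick the 3 best candidates
```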


Approach-5:

Idea: Using an Encoder-Decoder model with Luong Attention, previously used in Neural Machine Translation (NMT), to generate distractors from the input of the answer keys alone.

This approach was inspired by the paper "Attention Is All You Need" by Vaswani et al. of the Google Brain team.

Encoder(Embedding >> LSTM)

LuongAttention

Decoder(Embedding Layer >> LSTM >> LuongAttention)

The neural architecture for Approach 5 is harder to describe here due to its non-sequential nature. However, you can see it in the model.py file, where Encoder, Decoder, and LuongAttention are implemented as classes; each defines its layers inside its call() function.
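As a rough illustration of what such a LuongAttention class computes, here is a minimal Luong-style ("general" multiplicative) attention layer; it is a sketch, not the repo's exact implementation:

```python
import tensorflow as tf

class LuongAttention(tf.keras.layers.Layer):
    # Luong "general" score: score(h_t, h_s) = h_t^T · W · h_s
    def __init__(self, units):
        super().__init__()
        self.W = tf.keras.layers.Dense(units)

    def call(self, decoder_state, encoder_outputs):
        # decoder_state: (batch, units); encoder_outputs: (batch, src_len, units)
        query = tf.expand_dims(self.W(decoder_state), axis=2)       # (batch, units, 1)
        score = tf.matmul(encoder_outputs, query)                   # (batch, src_len, 1)
        weights = tf.nn.softmax(score, axis=1)                      # attend over source
        context = tf.reduce_sum(weights * encoder_outputs, axis=1)  # (batch, units)
        return context, weights
```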


Note:

Defusing Gradient Explosions

This was a small hiccup I faced when I moved to LSTM and Bidirectional-LSTM networks. To prevent gradient explosions, I halved the batch size to 128, decreased the learning rate to 0.0001, and applied gradient clipping (clipnorm = 1.0 and clipvalue = 0.5) while updating the gradient values through the Adam optimizer.
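In Keras terms, those optimizer settings look something like this (note that some newer Keras releases accept only one of clipnorm/clipvalue at a time, while older ones applied both):

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.0001,
    clipnorm=1.0,    # rescale any gradient whose L2 norm exceeds 1.0
    clipvalue=0.5,   # clip each gradient component into [-0.5, 0.5]
)
```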

I made several informal trials based on the insights I gathered from the dataset.

The uploaded results were a combination of the approaches that scored the highest accuracy (3, 4, and 5).


Adopting a Highly Modular Program Structure

Planning is crucial. Before writing a single line of code, you must strategize your way to the end.

In this competition, I followed a middle path between object-oriented and procedural styles. I tried to develop the whole program as something like an API, to increase its reusability.

Planning the development took a bit more time in the beginning but was worth doing in the long run.

Give me six hours to chop down a tree and I will spend the first four sharpening the axe – Abraham Lincoln

Because of this modular approach, my workflow was much faster, and I could try many more ideas and approaches on this problem, since I did not have to develop certain things from the ground up again.

Note: This design scheme was implemented in Approaches 1, 2, and 3 only. For Approaches 4 and 5, a more compact design was chosen because of time constraints and the need for rapid development.

Details

My software consisted of four main files, plus some small scripts for restructuring and preprocessing the data and building the corpus.

  1. One file at the top of the hierarchy, which imports the APIs and uses them as required.
  2. A file containing the higher-level abstraction of the workflow processes, like training, testing, comparing, and data loading tasks. (APIs)
  3. A file that contains the tasks related to data manipulation and file creations and deletions, such as data preprocessing, embedding matrix creation, sampling for making distinct predictions, and many other helper functions.
  4. A file containing the model, where all the magic happens. This file contained the neural network architecture implementation in Keras and all the hyperparameters in the form of a single function.
Source: Image by the author (made using Canva)

Among all the above-mentioned files, the API file was the one that integrated all the components.

Regardless, all files were designed so that they could very easily be used independently of each other.

Again, if you want a more detailed explanation of each file’s usage, go to my repository. There you’ll also find a blueprint of the entire program (for every approach).

I’m trying to cut down the boring part here!

Note: I used a GPU-enabled device because training and prediction can take too long on CPU-only devices. To utilize an NVIDIA GPU, use the CUDA toolkit provided by NVIDIA.


The Informal Approach – The Winning Technique

In my opinion – no offense to the creators of the challenge – the evaluation criteria for this problem were NOT well-suited to this particular task. That is what I exploited, rather than attacking the problem head-on.

See the evaluation criteria for yourself …

Evaluation Criteria

For each question-answer combination, the distractors are converted to count vectors of words, and the cosine similarity is calculated between the predicted and actual values:

score = 100 * mean([cosine_similarity(text_to_vector(actual_value), text_to_vector(predicted_value))])
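My reconstruction of that scoring (the organizer's exact code may differ) makes the flaw obvious: repeating frequent words buys similarity for free.

```python
import math
import re
from collections import Counter

def text_to_vector(text):
    # Bag-of-words counts: frequency only, no word order, no semantics
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_similarity(v1, v2):
    dot = sum(v1[w] * v2[w] for w in set(v1) & set(v2))
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

actual = "The color of the apple is red"
print(cosine_similarity(text_to_vector(actual), text_to_vector("the the the the")))      # ~0.67
print(cosine_similarity(text_to_vector(actual), text_to_vector("a ripe crimson fruit"))) # 0.0
```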

Since the cosine similarity of count vectors does not capture the semantic idea that the sentences are trying to convey, a better approach would be to use GloVe vectors or any other word embedding.

An embedding that can truly capture the semantic meaning, with the cosine similarity (or something like it) calculated on top.

Loopholes

Along the way, I found some mathematical loopholes and exploited them to test my hypotheses (as I said earlier, I gathered insights from the dataset).

Let me tell you about the (hilarious) results that won the competition:

  1. 32.13% (the highest accuracy scored in the entire competition) for copying the correct answer plus the top 6 most frequent words – "the, of, to, a, is, in". Can you believe it? I won the competition with this!
  2. 28.49% accuracy for just copying the correct answer as the distractor, 3 times.
  3. 20.01% accuracy for printing the top 8 most frequent words from the corpus, 3 times, as the distractors.
  4. 16.63% accuracy for printing the most frequent word, "the", 6 times in a string, 3 times, as the distractors.

These results clearly show that the above-mentioned distractors have absolutely no chance of distracting a person taking the MCQ test.

But they still score higher according to the so-called "Evaluation Criteria", whereas options that have a significantly higher chance of confusing the test-taker score lower on the evaluation scale.

Simply put,

For a question such as "What is the color of the apple?", the string "the the the the the the" is more distracting/convincing than "The color of the apple is blue", according to the evaluation scale.

In short, the evaluation criteria lose their credibility at this point.


One Informal Approach that didn’t score well…

"Markov Chain Sentence generation" technique gave pretty good results too. But unfortunately, didn’t get a high enough score to be presented as a formal approach. It deserves some recognition and further development for the remarkable job it did. So it was worth mentioning.

The code for this approach is also included in the repo.
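For the curious, a bare-bones version of such a Markov-chain generator looks like this (a sketch under my own naming, not the repo's code):

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    # Map every run of `order` words to the words observed right after it
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, order=2, max_words=12):
    # Start from a random state, then repeatedly sample an observed follower
    state = random.choice(list(chain))
    out = list(state)
    while len(out) < max_words:
        followers = chain.get(tuple(out[-order:]))
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

corpus = "the color of the apple is red and the color of the sky is blue"
print(generate(build_chain(corpus)))
```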


Did I get the Job? – Aftermath

Nope.

Although I won the competition, at the national level, with the highest accuracy score of 32.13%, the company told me that I didn't have the "experience" to get the job (I'm a fresher from the 2020 batch).

So here I am, writing about my experience of the competition.

I hope you’ll learn something from this story, too.


Closing thoughts

On the evaluation criteria, my "informal" approaches did the best, whereas, in the real-world application of the problem, my formal approaches (3, 4, and 5) jointly won the race.

As explained earlier, the evaluation criteria were biased toward word frequency. They could not correctly measure contextual similarity, and hence the value of my deep learning models in Approaches 1, 2, 3, and 5, which are scalable in terms of learning, remained undetermined.

Still, after reading many of the generated distractors myself, I concluded that the results were up to the mark.

All in all, I immensely enjoyed the challenge and learned a lot of new things along the way. I’ll be looking forward to some more challenging tasks in the future.

Thanks for Reading & Have a nice day!


If you enjoy reading these stories, then I’m sure you would love to be a Medium paying member. It’s only $5 per month, and you’ll get unlimited access to thousands and thousands of stories and writers. You can support me by signing up using this link, and I’ll earn a small commission that will help me grow and publish more stories like this.


Other articles that you may enjoy –

Medium API Documentation

10 Game-changing AI Breakthroughs Worth Knowing About

Why it’s Super Hard to be an ML Researcher or Developer?

Mathematics of Music [In Python]

