Generating Titles for Kaggle Kernels with LSTM

Small Deep Learning Project with PyTorch

Alexandra Deis
Towards Data Science
5 min read · Aug 31, 2019


Introduction

When I first found out about sequence models, I was amazed by how easily we can apply them to a wide range of problems: text classification, text generation, music generation, machine translation, and others. In this article, I would like to focus on the step-by-step process of creating a model and won’t cover the theory behind sequence models and LSTMs.

I got the idea to use the Meta Kaggle dataset to train a model that generates new kernel titles for Kaggle. Kernels are notebooks in R or Python published on Kaggle by users. Kaggle users can upvote kernels, and depending on the number of upvotes, kernels receive medals. A model that generates kernel titles can help capture trends in Kaggle kernels and serve as inspiration for writing new, medal-worthy kernels. In this article:

  • I describe how to load and preprocess kernel data from the Meta Kaggle dataset.
  • I demonstrate how to train a PyTorch LSTM model to generate new Kaggle titles and show the results.

The full code for this small project is available on GitHub, or you can play with the code on Kaggle.

Load Data

First, I need to load the data. I load the Kernels and KernelVersions tables, which contain information on all kernels, the total number of votes per kernel (I explain later why we need this), and kernel titles.
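
A minimal sketch of this step with pandas (the file paths and column names follow my reading of the Meta Kaggle schema, so treat them as assumptions):

```python
import pandas as pd

# Assumed locations of the Meta Kaggle CSV files
kernels = pd.read_csv("../input/meta-kaggle/Kernels.csv")
kernel_versions = pd.read_csv("../input/meta-kaggle/KernelVersions.csv")

# The kernel title lives in KernelVersions; attach it to each kernel
# through the id of its current version.
kernels = kernels.merge(
    kernel_versions[["Id", "Title"]],
    left_on="CurrentKernelVersionId",
    right_on="Id",
    suffixes=("", "Version"),
)[["Id", "Title", "TotalVotes"]]
```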

Make a List of Popular Kernel Titles

The next step is to make a list of the most popular kernel titles, which will then be converted into word sequences and passed to the model. It turns out that kernel titles are extremely untidy: they contain misspelled words, foreign words, and special symbols, or have generic names like `kernel678hggy`.

That is why:

  • I drop kernels without votes from the analysis. I assume that upvoted kernels should be of better quality and have more meaningful titles.
  • I sort kernels by the total number of votes and take only the most voted ones (see the sketch after this list).
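
A sketch of this filtering, continuing from the merged `kernels` frame above (the number of titles to keep is my own choice):

```python
# Keep only kernels with at least one vote, then take the most voted titles
voted = kernels[kernels["TotalVotes"] > 0]
top_kernels = voted.sort_values("TotalVotes", ascending=False).head(10000)
titles = top_kernels["Title"].astype(str).tolist()
```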

Preprocess Kernel Titles and Create a Vocabulary

I decided to try a word-based model. That’s why, in the next step, I need to create a vocabulary that will be used to encode word sequences.

To create the vocabulary, I have to do the following steps:

  • Clean each title to remove punctuation and lowercase all the words.
  • Split each title into words and add each word to the vocabulary.
  • Introduce a symbol that denotes the end of the title (I chose `.`, but it can be changed), and add it to the vocabulary.

Let’s introduce a simple function to clean kernel titles:
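
A minimal version of such a function (the function name and regular expressions are my own choices) could look like this:

```python
import re

def clean_title(title):
    """Lowercase a title and keep only letters, digits and spaces."""
    title = title.lower()
    title = re.sub(r"[^a-z0-9 ]", " ", title)   # drop punctuation and special symbols
    return re.sub(r"\s+", " ", title).strip()   # collapse repeated whitespace
```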

Now let’s introduce a symbol for the end of title and a word extraction function:
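
Sketched out, with the names being my own choices:

```python
# Symbol that marks the end of a title; any token that cannot clash with a real word works
END_OF_TITLE = "."

def extract_words(title):
    """Clean a title and return its words, terminated by the end-of-title symbol."""
    return clean_title(title).split() + [END_OF_TITLE]
```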

The next step is to make a vocabulary consisting of extracted words:
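
A straightforward way to build it from the cleaned titles (assuming the `titles` list and `extract_words` from the sketches above):

```python
# Map every distinct word (including the end-of-title symbol) to an integer index
vocab = {}
for title in titles:
    for word in extract_words(title):
        if word not in vocab:
            vocab[word] = len(vocab)

index_to_word = {index: word for word, index in vocab.items()}
vocab_size = len(vocab)
```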

Prepare the Training Set

In this section, I create a training set for our future model:

  • Introduce functions that encode each word into a tensor using the vocabulary created above. I use one-hot encoding: each word is represented as a tensor of all zeros with a single one at the position corresponding to the index of the word in the vocabulary. Using word embeddings instead of one-hot encoding would undoubtedly be an improvement to my approach.
  • Generate sequences out of kernel titles. The length of the sequence is a hyperparameter. I chose a sequence length of 3, so the model receives a tensor containing the encodings of 3 words and a prediction target, which contains the index of the following (4th) word.

The following functions encode words into tensors:
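
A minimal sketch of these functions (the tensor shapes match the `(seq_len, batch, input_size)` layout that `nn.LSTM` expects by default):

```python
import torch

def word_to_index(word):
    """Index of a word in the vocabulary."""
    return vocab[word]

def word_to_tensor(word):
    """One-hot encode a single word as a (1, vocab_size) float tensor."""
    tensor = torch.zeros(1, vocab_size)
    tensor[0, word_to_index(word)] = 1.0
    return tensor

def words_to_tensor(words):
    """Stack one-hot encodings of a word sequence into a (len(words), 1, vocab_size) tensor."""
    return torch.stack([word_to_tensor(word) for word in words])
```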

Now let’s generate word sequences out of titles of the most popular kernels:
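
A sketch of the sequence generation (sequence length 3, as mentioned above):

```python
SEQUENCE_LENGTH = 3  # number of input words per training example

inputs, targets = [], []
for title in titles:
    words = extract_words(title)
    # Slide a window of SEQUENCE_LENGTH words over the title;
    # the word right after the window becomes the prediction target.
    for i in range(len(words) - SEQUENCE_LENGTH):
        inputs.append(words_to_tensor(words[i:i + SEQUENCE_LENGTH]))
        targets.append(word_to_index(words[i + SEQUENCE_LENGTH]))
```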

Build the Sequence Model

The next step is building a simple LSTM model:

  • Input and output sizes of the model should be equal to the size of the vocabulary because we are trying to predict the next word for a sequence;
  • LSTM block with 128 hidden units;
  • One linear layer to translate from the hidden size to the output size;
  • A softmax activation on the output.

So let’s define and initialize a model with PyTorch:
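
A minimal sketch of such a model (128 hidden units as listed above; I return log-probabilities via `LogSoftmax` so the output can be fed directly to `NLLLoss` during training, which is one way to realize the softmax bullet):

```python
import torch.nn as nn

class TitleLSTM(nn.Module):
    """LSTM language model over the kernel-title vocabulary."""

    def __init__(self, vocab_size, hidden_size=128):
        super().__init__()
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size=vocab_size, hidden_size=hidden_size)
        self.fc = nn.Linear(hidden_size, vocab_size)
        self.log_softmax = nn.LogSoftmax(dim=1)

    def forward(self, sequence, hidden=None):
        # sequence: (seq_len, batch=1, vocab_size); hidden defaults to zeros
        output, hidden = self.lstm(sequence, hidden)
        # Predict the next word from the output of the last time step
        return self.log_softmax(self.fc(output[-1])), hidden

model = TitleLSTM(vocab_size)
```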

I will also need a utility function to convert the output of the model into a word:
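
For example, picking the most likely word (`index_to_word` comes from the vocabulary sketch above):

```python
def output_to_word(output):
    """Return the most likely word from the model output (log-probabilities over the vocabulary)."""
    _, top_index = output.topk(1)
    return index_to_word[top_index[0].item()]
```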

Train the Model

Now the dataset and the model are ready for training. One more thing I need to do before training is to introduce a function that translates the index of a word in the vocabulary into a tensor:
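
A one-liner is enough here (the target needs to be a long tensor for `NLLLoss`):

```python
def index_to_target_tensor(index):
    """Wrap a vocabulary index into the LongTensor shape that NLLLoss expects."""
    return torch.tensor([index], dtype=torch.long)
```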

The next step is to set up hyperparameters and the device (CPU or GPU if available):
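
For instance (the concrete values and the choice of optimizer and loss are my own assumptions):

```python
N_EPOCHS = 20
LEARNING_RATE = 0.005

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

criterion = nn.NLLLoss()                                        # pairs with the LogSoftmax output
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)
```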

Define the model training procedure:
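
A minimal single-example training step could look like this:

```python
def train_step(sequence, target_index):
    """Run one (sequence, next word) example through the model and update the weights."""
    model.train()
    optimizer.zero_grad()

    sequence = sequence.to(device)
    target = index_to_target_tensor(target_index).to(device)

    output, _ = model(sequence)          # hidden state starts at zeros
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    return loss.item()
```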

Now everything is ready for the training itself:
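
The loop itself simply iterates over the prepared sequences for the chosen number of epochs:

```python
losses = []
for epoch in range(N_EPOCHS):
    epoch_loss = 0.0
    for sequence, target_index in zip(inputs, targets):
        epoch_loss += train_step(sequence, target_index)
    epoch_loss /= len(inputs)
    losses.append(epoch_loss)
    print(f"Epoch {epoch + 1}/{N_EPOCHS} - loss: {epoch_loss:.4f}")
```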

As a result of the training, we should see the loss decreasing over the epochs, like this:

Training loss decreases over the number of epochs

Sample Kernel Titles from the Model

Here comes the most exciting part. Now we can use our trained model to generate new kernel titles! All we need to do is to write a simple sampling procedure:

  1. Introduce the maximum number of words in the title (10 for example);
  2. Pass zero tensors to the model as the initial word and hidden state;
  3. Repeat the following steps until the end-of-title symbol is sampled or the maximum number of words in the title is exceeded:
  • Use the probabilities from the output of the model to sample the next word for the sequence;
  • Pass the sampled word as the next input to the model.

So let’s define the sampling function and sample some titles from the model:
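
A minimal sampling function following the steps above (sampling from the output distribution with `torch.multinomial`; the helper names come from the earlier sketches):

```python
def sample_title(max_words=10):
    """Sample a new kernel title word by word from the trained model."""
    model.eval()
    with torch.no_grad():
        # Start from a zero "word" and a zero hidden state
        word_tensor = torch.zeros(1, 1, vocab_size, device=device)
        hidden = (torch.zeros(1, 1, model.hidden_size, device=device),
                  torch.zeros(1, 1, model.hidden_size, device=device))
        title_words = []
        for _ in range(max_words):
            output, hidden = model(word_tensor, hidden)
            probabilities = output.exp().squeeze()            # log-probabilities -> probabilities
            index = torch.multinomial(probabilities, 1).item()
            word = index_to_word[index]
            if word == END_OF_TITLE:
                break
            title_words.append(word)
            word_tensor = word_to_tensor(word).unsqueeze(0).to(device)
        return " ".join(title_words)

for _ in range(5):
    print(sample_title())
```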

Conclusion

In this small project:

  • I loaded and preprocessed real text data.
  • I created a word-based sequence model, which can be used to generate new kernel titles.

You can see that the model doesn’t generate titles that make much sense, but there are still some funny results like these:

  • wealth bowl datamining
  • supplement approved databases
  • plane ignore population competition
  • projecting superfood prescribing survey
  • dinner lesson web screening
  • elasticnet playground

Such things happen when models run into real-life data, which contains abbreviations, nicknames, words in different languages, misspelled words, and a lot more. Of course, you can improve these results with better data preprocessing. I describe actions to improve the results below.

Further Improvement

Though I managed to get some exciting results, there is a lot I could do to improve:

  • Better data cleaning: many titles should be removed from the analysis because they are not in English or simply can’t be used (for example, ‘kernel123’).
  • Auto-correction of misspelled words: titles can be preprocessed with automatic correction of misspelled words (for example, with the PySpell package). This procedure takes ages to run, but it is still an option since data preprocessing happens only once before training.
  • Hyperparameter tuning: I suppose that learning rate and sequence length can be tuned to achieve even better results.
  • Use word embeddings instead of one-hot encoding for words.


A business analyst turned data scientist passionate about solving business problems with data. Connect with me: https://www.linkedin.com/in/aleksandra-deis-0912/