Spam Filtering System With Deep Learning

Exploring the powerful feature extraction capability of word embeddings


Deep learning is becoming very popular in many industries, and many interesting problems can be solved with deep learning technology. In this article, I will show you how to use a deep learning model to design a highly effective spam filtering system.

Not long ago, I wrote an article about spam filtering with traditional machine learning algorithms.

In that article, I covered everything from data exploration, data preprocessing and feature extraction to choosing the right scoring metric for your algorithm. If you are interested, you can read the article here!

Today, I am going to focus on the following two parts of building a spam filtering system:

  1. Word Embedding
  2. GRU + Bidirectional Deep Learning Model

What is word embedding?

Word embedding is a representation of text data in vectorized format. For a word like ‘dog’, the word embedding translates it into a vector of shape (1, x), where x is a value that you can configure.

Intuitively, you can see this vector as a way to describe the word. If the word is represented by a vector of shape (1, 300), it means there are 300 different features describing it.

What are those features? To be honest, we don’t know. Deep learning will identify those features during the training process.

The final outcome of the training process is a mapping from the word to a meaningful word vector.
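
As a tiny concrete sketch (w2v here is the gensim KeyedVectors model that we load later in this article), looking up a word simply returns its vector:

vec = w2v['dog']    # the 300-dimensional vector for the word 'dog'
print(vec.shape)    # (300,)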

Sounds like a cool magic trick, huh? What makes this even cooler is that the cosine distance between two word vectors actually signifies something important.

Words that have closer semantic meanings have a shorter cosine distance in the vector space.

For example, the vectors for ‘man’ and ‘woman’ are very close to each other in terms of cosine distance.

w2v.most_similar('man')
# Output:
# [('woman', 0.699), ('person', 0.644), ('him', 0.567)]
# 'woman' has the highest similarity score with 'man'
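
Under the hood, most_similar ranks words by cosine similarity. You can also compute it directly, either with gensim or manually from the raw vectors; a small sketch:

import numpy as np

# Cosine similarity via gensim
print(w2v.similarity('man', 'woman'))   # ~0.699 for this model

# The same value computed manually from the raw vectors
a, b = w2v['man'], w2v['woman']
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))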

Word Embedding with Pretrained Weights

There are open source word embeddings that have been trained on a large amount of text data, and their weights are available for the public to download.

By using an open source word embedding, you save the time needed to collect text data, as well as the time and computing resources needed to train the embedding yourself.

The downside of using a pre-trained word embedding is that it may have been trained on data from a different domain, which may not fit your use case.

For instance, a pre-trained word embedding trained on text from scientific journals may not be the best fit for a problem like detecting malicious tweets; the benefit from such an embedding will not be as significant.

Different variants of pre-trained word embedding

Most open source word embeddings list the sources they were trained on, so choose carefully based on the problem that you are solving.

Below are some examples:

  1. GloVe
  2. Wiki-News
  3. Google News Vectors

Experiment with Word Embedding

In this tutorial, we are using the GloVe word embedding. The format of the GloVe files is a little bit different from what the Python library gensim expects.

So after you download the word embedding from the official GloVe website, you need to do a simple transformation.

You can easily transform it into a format compatible with word2vec embeddings by running the script below.

Make sure you have the python gensim module installed.

python -m gensim.scripts.glove2word2vec -i glove.6B.300d.txt -o glove.6B.300d.word2vec.txt
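
If you prefer to do the conversion from inside Python instead of the command line, gensim exposes the same script as a function; a small sketch (and, in gensim 4.x, there is a no_header flag that lets you skip the conversion entirely):

from gensim.scripts.glove2word2vec import glove2word2vec

# Same conversion as the command-line call above
glove2word2vec('glove.6B.300d.txt', 'glove.6B.300d.word2vec.txt')

# In gensim 4.x you can instead load the GloVe file directly:
# w2v = KeyedVectors.load_word2vec_format('glove.6B.300d.txt', binary=False, no_header=True)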

After that, you can load it easily using the gensim library:

from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format('glove.6B.300d.word2vec.txt', binary=False)

We can also perform some operations on these vectors and get back some interesting results.

For example:

Example 1: King - Man + Woman = Queen

Example 2: Madrid - Spain + France = Paris

Let’s see how we actually do that in the code and what interesting results we might get.

w2v.most_similar(['king','woman'],negative=['man'],topn=1)
# Output: [('queen', 0.6713277101516724)]
w2v.most_similar(['madrid','france'],negative=['spain'],topn=1)
# Output: [('paris', 0.758114755153656)]

We can also ask the word embedding model to figure out which word is the outlier, given a list of words:

w2v.doesnt_match("england china vietnam laos".split())
#Output: england
w2v.doesnt_match("pig dog cat tree".split())
#Output : tree
w2v.doesnt_match("fish shark cat whale".split())
#Output : cat

I have included the relevant code in the notebook; you can try different inputs and see whether the results align with your expectations.

Now I think you have witnessed the powerful feature extraction capabilities of word embedding.

In the next section, we are going to see how we can combine this pre-trained word embedding with the Embedding Layer in Keras so that we can use it during training.

Embedding Layer

Keras is an awesome high-level deep learning library that helps you build deep learning models easily. It abstracts away a lot of low-level mathematical details and lets you build your model in a very intuitive way.

The Embedding Layer is one of the wrapper layers provided by Keras to let us train word embeddings easily.

First, we need to use a tokenizer to convert all the words into tokens/indexes.

The tokenizer maintains a dictionary that maps each word to an index. For example, dog -> 0, cat -> 1, and so on.

from keras.preprocessing.text import Tokenizer
import numpy as np

max_feature = 50000
tokenizer = Tokenizer(num_words=max_feature)
tokenizer.fit_on_texts(x_train)

# Converting x_train to integer tokens; a token is just an index number
# that can uniquely identify a particular word
x_train_features = np.array(tokenizer.texts_to_sequences(x_train))
x_test_features = np.array(tokenizer.texts_to_sequences(x_test))
Figure: internal working of the tokenization layer, which converts text into simple integer tokens
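
One detail to keep in mind: the token sequences produced by the tokenizer have varying lengths, while the model built later expects a fixed input length of max_len. A minimal padding sketch (the value of max_len is an arbitrary choice of mine and just has to match the model's Input shape):

from keras.preprocessing.sequence import pad_sequences

max_len = 100   # assumed fixed sequence length; must match the model's Input shape
x_train_features = pad_sequences(x_train_features, maxlen=max_len)
x_test_features = pad_sequences(x_test_features, maxlen=max_len)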

The Embedding Layer internally maintains a lookup table that maps each index/token to a vector, and that vector is what represents the word in a higher-dimensional space.
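
A tiny sketch of that lookup behaviour (the layer sizes are arbitrary; the point is that each integer index is replaced by a learned vector):

import numpy as np
from keras.layers import Input, Embedding
from keras.models import Model

# A toy embedding layer: vocabulary of 10 words, 4-dimensional vectors
inp = Input(shape=(3,))
out = Embedding(input_dim=10, output_dim=4)(inp)
toy_model = Model(inputs=inp, outputs=out)

# A "sentence" of three token indexes
tokens = np.array([[1, 5, 2]])
print(toy_model.predict(tokens).shape)   # (1, 3, 4): each index became a 4-dim vector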

Figure: overview of the whole transformation from raw text to word vectors

Embedding + Bidirectional + Gated Recurrent Unit (GRU)

GRU

GRU is a variant of the LSTM architecture, and many articles have done a great job of explaining the theory behind it.

For the sake of brevity, I will not explain much of the theory behind these models.

I recommend the blog post written by Chris Olah on this topic. It has some nice graphics explaining the internal workings of GRUs.

Bidirectional

The idea behind Bidirectional is simple yet powerful: it uses two LSTM networks instead of one.

The first LSTM network is fed the input sequence as normal. The second LSTM network is fed the input sequence in reverse. The outputs of the two networks are then merged and passed on to the next layer.

The intuition behind Bidirectional is that, in some sentences, the context information appears at the end of the sentence; without it, ambiguity can arise. For example:

1. Find me at the bank in that forest. (River bank)
2. Find me at the bank in the city center. (Financial Bank)

So reading the sentence in both directions helps the model determine the exact meaning of the word.
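
In Keras, Bidirectional is just a wrapper around a recurrent layer; by default the forward and backward outputs are concatenated, so the output dimension doubles. A minimal sketch of that behaviour (the layer sizes here are arbitrary, not the ones used in the final model):

from keras.layers import Input, Embedding, Bidirectional, GRU
from keras.models import Model

# Toy example: sequence length 100, vocabulary of 10000 words, 300-dim embeddings
inp = Input(shape=(100,))
x = Embedding(10000, 300)(inp)
x = Bidirectional(GRU(64, return_sequences=True))(x)  # forward + backward outputs concatenated
print(Model(inp, x).output_shape)  # (None, 100, 128): 64 units per direction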

I would recommend reading the original paper if you are interested in the theoretical knowledge.

Building the network

Now, let's start building our network in Keras. It will involve components such as the Embedding Layer, the Bidirectional wrapper and Gated Recurrent Units (GRU).

For the embedding, there are two options we can take:

  1. Train the Embedding Layer from scratch
  2. Use a pre-trained open source word embedding

At the end of this article, we are going to compare the results produced by the vanilla word embedding and the GloVe word embedding.

Vanilla Word Embedding

The code snippet below shows how to construct the Embedding Layer, the Bidirectional wrapper and the GRU easily in Keras.

from keras.layers import Input, Embedding, Bidirectional, CuDNNGRU, GlobalMaxPool1D, Dense, Dropout
from keras.models import Model

# max_len: padded sequence length; max_features: vocabulary size; embed_size: embedding dimension
inp = Input(shape=(max_len,))
x = Embedding(max_features, embed_size)(inp)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)

model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
Summary of the model
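
For completeness, here is a rough sketch of how such a model is typically trained; the batch size is an arbitrary choice of mine, and y_train / y_test are assumed to be the spam/ham labels from the dataset:

# Train on the padded token sequences produced by the tokenizer
# (y_train / y_test are the 0/1 spam labels; the batch size below is illustrative)
model.fit(x_train_features, y_train,
          batch_size=512,
          epochs=20,
          validation_data=(x_test_features, y_test))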

Pre-trained Glove Word Embedding

The GloVe word embeddings are open sourced by the Stanford NLP Group. We need to download the word embedding first; you can find the download links on the official website.

They have released several versions of the embeddings, trained on different sources of data. Feel free to experiment with the other variants listed there.

The code to transfer the weights from GloVe into the Keras Embedding Layer is rather lengthy, so here I will only show a short snippet to give you an idea of how the transfer is done. I will share the notebook code at the end of this article.

# convert_glove_to_index is a helper from the notebook that reads the GloVe
# file into a dictionary mapping each word to its vector
embeddings_index = convert_glove_to_index('glove.6B.300d.txt')

# Randomly initialize the embedding matrix
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))

# Transferring the weights
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
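
The snippet above assumes that emb_mean, emb_std and nb_words are already defined. A common way to compute them (this is a sketch of the usual pattern, not necessarily the exact notebook code):

# Statistics of the GloVe vectors, used to randomly initialize words that are
# not found in the pre-trained vocabulary
all_embs = np.stack(list(embeddings_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()

# Size of the embedding matrix, capped at the tokenizer's vocabulary limit
word_index = tokenizer.word_index
nb_words = min(max_feature, len(word_index) + 1)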

Now the embedding matrix contains all the weights from the GloVe embedding, and transferring them into the Keras layer takes just one extra line of code.

You need to specify the weights of the Embedding Layer when you construct the model. The rest of the code is the same as for the vanilla embedding layer.

x = Embedding(max_features, embed_size, weights = [embedding_matrix])(inp)
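
If you want to keep the GloVe weights frozen during training instead of fine-tuning them, Keras lets you mark the layer as non-trainable. Whether to freeze is a modelling choice, and I am not claiming it was done here; a sketch:

x = Embedding(max_features, embed_size,
              weights=[embedding_matrix],
              trainable=False)(inp)  # keep the pre-trained vectors fixed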

Performance of the model

After training for a number of epochs on the training set, let us compare the performance of the models in terms of accuracy, precision and recall.

If you are interested in why these metrics were chosen, you can read the previous article I wrote; it includes a detailed explanation of these different performance metrics.
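
If you want to reproduce these metrics yourself, scikit-learn makes it straightforward. A sketch, assuming y_test holds the true 0/1 labels and the model outputs a spam probability:

from sklearn.metrics import accuracy_score, precision_score, recall_score

# Turn the sigmoid outputs into hard 0/1 predictions
y_pred = (model.predict(x_test_features) > 0.5).astype(int).ravel()

print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))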

From this comparison, you can clearly tell that both LSTM-based models do far better than the Naive Bayes algorithm. The reasons may include:

  1. The TF-IDF vectorizer does not take the ordering of words in a sentence into account, so it loses a great deal of information.
  2. LSTM has been one of the best-performing architectures for sequence data (text, speech, time series) in recent years.

Comparison of performance between two Embedding Models

Precision & Recall

I only ran the models for 20 epochs and did not do any further fine-tuning.

The precision and recall of the two models do not differ by a large margin. Some possible reasons:

  1. The GloVe embedding was trained on sources that are very different from the data in this problem, so the benefit we gain from the pre-trained embedding is not significant.
  2. The text data in this spam filtering problem is not very complicated, so the vanilla word embedding is good enough to capture the patterns.

Comparing the accuracy graphs of these two models, you will discover something interesting:

Figures: accuracy curves of the GloVe word embedding model and the vanilla word embedding model

The accuracy of the vanilla word embedding model plateaus and remains flat for the first few epochs before it starts to improve.

The accuracy of the GloVe word embedding model, on the other hand, follows a smooth curve and improves gradually.

These observations suggest that the vanilla word embedding model is still learning and adjusting its weights during the earlier epochs of training.

Hence, a pre-trained word embedding is still helpful in speeding up the learning of the model. You might not notice the speed advantage in this problem, but when you are training on a large enough text dataset, the difference will be quite significant.

Conclusion

In this article, I have shown how to build a spam filtering system using word embedding and LSTM models with Keras.

Word embedding is definitely a good feature extraction tool for text data, and with an LSTM model we can build a spam filtering system with very decent performance.

You can find the code of this notebook in this Github repository or you can run it directly from Colab.

