Deduplication

Yup, this is the problem I want to help you solve: removing those dirty little duplicates that cause harm, hinder the efficiency of certain tasks, or even pollute our systems.

Abhijith C
Towards Data Science


deduplication
/diːˌdjuːplɪˈkeɪʃ(ə)n/

noun
the elimination of duplicate or redundant information, especially in computer data.

"deduplication removes the repetitive information before storing it"

As the definition says, the task is to remove duplicate texts, sentences and so on. This boils down to checking how similar one piece of text is to another. They can be exactly identical, like:
Deep Learning is Awesome! and Deep Learning is Awesome!.
Or, they could be quite similar to one another in terms of what the sentence tries to convey, like:
Deep Learning is Awesome! and Deep Learning is so cool.
We know that these two sentences convey the same thing, and this is what we want our machines to capture.

Such a task is referred to in the literature as Semantic Textual Similarity (STS). It deals with determining how similar two pieces of text are. This includes not just syntactic similarity, i.e. how similar the words used in the two sentences are, but also semantic similarity, which captures how close the two sentences are in what they convey; in other words, the meaning of the text plays an important role in deciding what is similar and what is not.

The problem

The problem. Yes, that’s our main goal. To solve the problem. Let me give you an example. Say you have to send really funny jokes (a joke could be a sentence or a bunch of sentences) to a group of people over e-mail (LOL!), and your boss asks you to make sure that people don’t receive the same kind of jokes. So you have to make sure that all the jokes you have are unique and that people don’t get bored with the content.
What a job, seriously?

You, being a kick-ass coder, decide to automate this task. You have this magical API that gives you a lot of jokes for free and you write a script to mail the jokes to that group of people your boss loves. But, we can’t really trust this magical API, can we? It’s magical. What if the API gives you similar jokes? You can’t risk upsetting your boss.
This is where you can use a deduplication engine that makes sure that no joke sent is similar to any that was sent in the past.

My primary aim here is not to talk at length about these models, but to help you use them for a practical task like the one stated above. I admit that sending jokes to people in order to impress your boss isn’t really practical.

In the space of STS…

Let’s break this down into two questions: how is the similarity measured, and between what two entities are we measuring it (the text literally as is, or something else)?

Firstly, talking about the similarity measurement, there are quite a few that can be used. Just for the sake of completeness, here are a few:
1. Jaccard Similarity
2. Cosine Similarity
3. Earth Mover Distance
4. Jensen-Shannon distance

But to cut to the chase, we’ll be using cosine similarity. Mathematically, cosine similarity is a measure of similarity between two non-zero vectors of an inner product space, defined as the cosine of the angle between them.

Two similar documents can be far apart in Euclidean space (for example, because one is much longer than the other) and yet point in nearly the same direction, so the angle between their vectors stays small. Cosine similarity captures exactly this, which is why it is advantageous here.
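As a quick illustrative sketch (not part of the pipeline we build later), here is how cosine similarity can be computed with scipy; note that scipy's cosine() returns the cosine distance, i.e. 1 minus the similarity:

import numpy as np
from scipy.spatial.distance import cosine

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, just twice as long

cos_sim = 1 - cosine(a, b)      # convert distance back to similarity
print(cos_sim)                  # ~1.0: same orientation despite different magnitudes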

Secondly, where do we use this cosine similarity? Between pairs of sentence strings? Nope! This is where we use the power of Natural Language Processing and Deep Learning. We use vectors.

A word/sentence vector is a row of real-valued numbers (as opposed to dummy numbers such as those in a one-hot encoding), where each point captures a dimension of the word’s/sentence’s meaning and where semantically similar words/sentences have similar vectors.

And, again, there are plenty of methods to get these vectors. To name a few:
Word Embedding: word2vec, GloVe, BERT word embeddings, ELMo and so on.
Sentence Embedding: BERT sentence embedding, Universal Sentence Encoder, etc.

I’ll jump straight into the methods that I personally experimented with and that worked beautifully for me.

               word2vec + Universal Sentence Encoder

Although this is intended to be a purely implementation-oriented article, I’ll try to explain, very briefly, what these models are.

word2vec

word2vec comes in two variants: Skip-Gram and Continuous Bag of Words (CBOW). There’s plenty of material on both variants if you are looking for a detailed explanation, so I’ll be very crisp here. The skip-gram model is a bit slower to train but usually does a better job with infrequent words, which is why it is the one used more often. We’ll talk briefly about it.

You’ll find this diagram in almost every word2vec (Skip-Gram model) blog and tutorial. In this architecture, the model uses the current word to predict the surrounding window of context words. It weighs the nearby context words more heavily than the distant context words.

Here, we take the center word and try to predict the words in a window of context around it (2 words on each side, in this case).

The Skip-gram model architecture (https://arxiv.org/pdf/1301.3781.pdf)

Consider w(t) to be the input word. The single hidden layer performs the usual dot product between the weight matrix and the one-hot input vector for w(t). The output layer then takes the dot product between the hidden layer’s output vector and its own weight matrix, and a softmax is applied to give the probabilities of the words that appear in the context of w(t).
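Purely to make that forward pass concrete, here is a minimal numpy sketch; the vocabulary size, embedding dimension, and matrix names are made up for illustration:

import numpy as np

V, D = 10000, 300                      # hypothetical vocabulary size and embedding dimension
W_in = np.random.randn(V, D) * 0.01    # input-to-hidden weights (rows become the word vectors)
W_out = np.random.randn(D, V) * 0.01   # hidden-to-output weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

center_id = 42                          # index of the input word w(t)
hidden = W_in[center_id]                # dot product with a one-hot vector = a row lookup
scores = hidden @ W_out                 # one score per vocabulary word
context_probs = softmax(scores)         # probability of each word appearing in the context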

It’s the hidden-layer weights (the rows of the input weight matrix) that become the vector representation of each word. But these are ‘word’ embeddings, and we have to find similar ‘sentences’. So, how do we get a vector representation of a sentence instead of just word embeddings?

One simple, trivial way (the one that I will show today) is to average the word embeddings of all the words in the sentence. Simple, isn’t it?

Now for the main part, let’s code this in.

import gensim
import numpy as np
from tqdm import tqdm
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# NLTK data needed once: nltk.download('punkt'); nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Load the pre-trained Google News word2vec model (300-dimensional vectors)
w2vmodel = gensim.models.KeyedVectors.load_word2vec_format(
    'models/GoogleNews-vectors-negative300.bin.gz', binary=True)

def sent2vec(s):
    '''
    Finding the word2vec vector representation of a sentence
    @param s : sentence
    '''
    words = str(s).lower()
    words = word_tokenize(words)
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]

    featureVec = np.zeros((300,), dtype="float32")
    nwords = 0

    for w in words:
        try:
            vec = w2vmodel[w]  # raises KeyError for out-of-vocabulary words
        except KeyError:
            continue
        featureVec = np.add(featureVec, vec)
        nwords = nwords + 1
    # averaging
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec

def get_w2v_vectors(list_text1, list_text2):
    '''
    Computing the word2vec vector representation of lists of sentences
    @param list_text1 : first list of sentences
    @param list_text2 : second list of sentences
    '''
    print("Computing first vectors...")
    text1_vectors = np.zeros((len(list_text1), 300))
    for i, q in tqdm(enumerate(list_text1)):
        text1_vectors[i, :] = sent2vec(q)
    text2_vectors = np.zeros((len(list_text2), 300))
    for i, q in tqdm(enumerate(list_text2)):
        text2_vectors[i, :] = sent2vec(q)
    return text1_vectors, text2_vectors

That’s it! 🤷‍♂ You have your sentence embedding using word2vec.
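To make the usage concrete, here is a hedged example of calling the function above on two made-up lists of jokes (the joke strings are, of course, placeholders):

jokes_new = ["Deep Learning is Awesome!", "Why did the neuron cross the road?"]
jokes_seen = ["Deep Learning is so cool.", "Knock knock. Who's there? A tensor."]

w2v_vecs_new, w2v_vecs_seen = get_w2v_vectors(jokes_new, jokes_seen)
print(w2v_vecs_new.shape)   # (2, 300): one 300-dimensional vector per sentence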

Universal Sentence Encoder

Google presented a series of models for encoding sentences into vectors. The authors specifically targeted them at downstream tasks, i.e. transfer learning tasks, and STS is one such task.

It comes in two variants:
1. One with a Transformer Encoder
2. One with a Deep Averaging Network

Each of them has a different design goal:
1. The Transformer encoder targets high accuracy at the cost of greater model complexity and resource consumption.
2. The Deep Averaging Network (DAN) targets efficient inference with slightly reduced accuracy.

I’ve hyperlinked both these architectures with my favorite blogs that explain each of them very well. Let’s focus more on how to implement them.

import math
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

BATCH_SIZE = 128  # batch size for encoding; tune it to your memory budget

usemodel = hub.Module('models/sentence_encoder')

def get_use_vectors(list_text1, list_text2):
    '''
    Computing the USE vector representation of lists of sentences
    @param list_text1 : first list of sentences
    @param list_text2 : second list of sentences
    '''
    print("Computing second vectors...")
    messages1 = list_text1
    messages2 = list_text2
    num_batches = math.ceil(len(messages1) / BATCH_SIZE)

    # Reduce logging output.
    tf.logging.set_verbosity(tf.logging.ERROR)

    message_embeddings1 = []
    message_embeddings2 = []
    with tf.Session() as session:
        session.run([tf.global_variables_initializer(),
                     tf.tables_initializer()])
        for batch in range(num_batches):
            print(batch * BATCH_SIZE, batch * BATCH_SIZE + BATCH_SIZE)
            batch_msgs1 = messages1[batch * BATCH_SIZE: batch * BATCH_SIZE + BATCH_SIZE]
            batch_msgs2 = messages2[batch * BATCH_SIZE: batch * BATCH_SIZE + BATCH_SIZE]
            # Encode both batches in a single session.run call
            message_embeddings1_temp, message_embeddings2_temp = session.run(
                [usemodel(batch_msgs1), usemodel(batch_msgs2)])

            message_embeddings1.append(message_embeddings1_temp)
            message_embeddings2.append(message_embeddings2_temp)

    all_embedding1 = np.concatenate(tuple(message_embeddings1))
    all_embedding2 = np.concatenate(tuple(message_embeddings2))
    return all_embedding1, all_embedding2

Again, that’s it! 🤷‍♂

Now we have sentence embeddings for your jokes from two different models.

Now we need the cosine similarities!

import numpy as np
from scipy.spatial.distance import cosine

def cosine_similarity(list_vec1, list_vec2):
    '''
    Computing the cosine similarity between two lists of vector representations
    @param list_vec1 : first list of sentence vectors
    @param list_vec2 : second list of sentence vectors
    '''
    # scipy's cosine() returns the cosine *distance*, so convert it back to similarity
    cosine_dist = [cosine(x, y)
                   for (x, y) in zip(np.nan_to_num(list_vec1), np.nan_to_num(list_vec2))]
    cosine_sim = [(1 - dist) for dist in cosine_dist]
    return cosine_sim

When I was doing this job (before you took over), I had a bunch of joke pairs that were non-duplicates and a few that were duplicates: specifically, 385K non-duplicate pairs and 10K duplicate pairs. I plotted the ROC curve for this task using only the word2vec model.

AUC-ROC curve for the word2vec model

Nice! The curve looks really good. (I’m omitting the confusion matrices on purpose.)

TPR/Recall/Sensitivity: 77%
FPR: 2.2%

Let’s see how the Universal Sentence Encoder fared.

AUC-ROC curve for the Universal Sentence Encoder

The area under the curve is slightly better, isn’t it?

TPR/Recall/Sensitivity: 77%
FPR: 2.2%

Let’s see what happens when we combine them. By combine, I mean averaging the cosine similarities from both approaches and checking the metrics: an averaging ensemble of the two models.
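A minimal sketch of that averaging ensemble, assuming the helper functions defined earlier and the illustrative joke lists from before:

# word2vec-based and USE-based similarities for the same sentence pairs
w2v_vecs1, w2v_vecs2 = get_w2v_vectors(jokes_new, jokes_seen)
use_vecs1, use_vecs2 = get_use_vectors(jokes_new, jokes_seen)

w2v_sims = cosine_similarity(w2v_vecs1, w2v_vecs2)
use_sims = cosine_similarity(use_vecs1, use_vecs2)

# Averaging ensemble: mean of the two similarity scores per pair
ensemble_sims = [(a + b) / 2 for a, b in zip(w2v_sims, use_sims)]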

Best one so far!

TPR/Recall/Sensitivity: 78.2%
FPR: 1.5%

Wohooo! 🎉🎉 We have a clear winner!

This is one quick way of finding duplicates. No training involved, just downloading existing brilliant models and using them for your STS task!

With this piece in place, you can easily build your deduplication engine. Just store all the jokes that you send to your boss’s group on the first day, and then, for every new incoming joke, pair it up with all the previously seen jokes and use this ensemble model to decide whether it is a duplicate. If it is, throw it away. A rough sketch of that loop follows below.
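This is only a sketch under stated assumptions: the threshold of 0.8 is a placeholder you would pick from your own ROC curve, and incoming_jokes and send_email stand in for your magical API and your mailing helper.

THRESHOLD = 0.8   # hypothetical cut-off; tune it on your own labelled pairs

def is_duplicate(new_joke, seen_jokes):
    '''Return True if new_joke is too similar to any previously sent joke.'''
    if not seen_jokes:
        return False
    candidates = [new_joke] * len(seen_jokes)   # pair the new joke with every seen joke
    w2v_a, w2v_b = get_w2v_vectors(candidates, seen_jokes)
    use_a, use_b = get_use_vectors(candidates, seen_jokes)
    w2v_sims = cosine_similarity(w2v_a, w2v_b)
    use_sims = cosine_similarity(use_a, use_b)
    ensemble = [(a + b) / 2 for a, b in zip(w2v_sims, use_sims)]
    return max(ensemble) >= THRESHOLD

seen_jokes = []
for joke in incoming_jokes:            # incoming_jokes: whatever your magical API returns
    if not is_duplicate(joke, seen_jokes):
        send_email(joke)               # your own mailing helper
        seen_jokes.append(joke)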

And in this way, keep your boss’s friends happy, keep your boss happy, get a nice paycheck, and keep yourself happy :)
