The Definitive Guide to BiDAF — part 4 of 4

Modeling and Output Layers in BiDAF — an Illustrated Guide with Minions

BiDAF is a popular machine learning model for question-answering tasks. This article explains the modeling and output layers of BiDAF with the help of some cute Minions.

Meraldo Antonio
Towards Data Science
11 min read · Sep 3, 2019


This article is the last in a series of four articles that aim to illustrate the working of Bi-Directional Attention Flow (BiDAF), a popular machine learning model for question answering (Q&A).

To recap, BiDAF is a closed-domain, extractive Q&A model. This means that to be able to answer a Query, BiDAF needs to consult an accompanying text that contains the information needed to answer the Query. This accompanying text is called the Context. BiDAF works by extracting a substring of the Context that best answers the Query — this is what we refer to as the Answer to the Query. I intentionally capitalize the words Query, Context and Answer to signal that I am using them in their specialized technical capacities.

An example of Context, Query and Answer

A quick summary of the three preceding articles is as follows:

  • Part 1 of the series provides a high-level overview of BiDAF.
  • Part 2 explains how BiDAF uses 3 embedding algorithms to get the vector representations of words in the Context and the Query.
  • Part 3 explores BiDAF’s attention mechanism that combines the information from the Context and the Query.

The output of the aforementioned attention step is a giant matrix called G. G is an 8d-by-T matrix that encodes the Query-aware representations of Context words. G is the input to the modeling layer, which will be the focus of this article.

What Have We Been Doing? What Does G Actually Represent? Minions to the Rescue!

Ok, so I know we’ve been through a lot of steps in the past three articles. It is extremely easy to get lost in the myriad of symbols and equations, especially considering that the choices of symbols in the BiDAF paper aren’t that “user friendly.” I mean, do you even remember what each of H, U, Ĥ and Ũ represents?

Hence, let’s now step back and try to get the intuition behind all these matrix operations we have been doing so far.

Practically, all the previous steps can be broken down into two collections of steps: the embedding steps and the attention steps. As I mentioned above, the result of all these steps is an 8d-by-T matrix called G.

An example of G can be seen below. Each column of G is an 8d-by-1 vector representation of a word in the Context.

An example of G. The width of the matrix, T, equals the number of words in the Context (9 in this example). Its height is 8d; d is a number that we preset in the word embedding and character embedding steps.
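To make these shapes concrete, here is a minimal numpy sketch of G. The value of d is an assumption for illustration, and the random numbers merely stand in for the real values that the attention layer would produce:

```python
import numpy as np

d = 100  # hidden size fixed in the embedding steps (assumed here for illustration)
T = 9    # number of words in our example Context

# In a real run, G comes out of the attention layer (step 9);
# here we stand it in with random numbers just to show the shapes.
G = np.random.randn(8 * d, T)
print(G.shape)  # (800, 9) — one 8d-by-1 column per Context word

# The column vector for "Singapore", the first Context word:
singapore_vector = G[:, 0]
print(singapore_vector.shape)  # (800,)
```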

Let’s now play a little game that (hopefully) can help you understand all the mathematical mumbo-jumbo in the previous articles. Specifically, let’s think of the words in the Context as an ordered bunch of Minions.

Think of our Context as a bunch of Minions, with each Context word corresponding to one Minion

Each of our Minions has a brain in which he can store some information. Right now, our Minions’ brains are already pretty cramped. The current brain content of each Minion is equivalent to the 8d-by-1 column vector of the Context word that the Minion represents. Here I present the brain scan of the “Singapore” Minion:

The Minions’ brains haven’t always been this full! In fact, when they came into being, their brains were pretty much empty. Let’s now go back in time and think about the “lessons” the Minions went through to acquire their current state of knowledge.

The first two lessons the Minions had were “Word Embedding” and “Character Embedding.” During these lessons, the Minions learned about their own identities. The teacher in the “Word Embedding” class, Prof. GloVe, taught the Minions basic information about their identities. The “Character Embedding” class, on the other hand, was an anatomy class in which the Minions gained an understanding of their body structure through repeated scans.

Minions in the “Character Embedding” class

Here is the brain scan of the “Singapore” Minion after these two lessons.

The “Singapore” Minion understands his identity after attending the “Word Embedding” and “Character Embedding” lessons

Right after, the Minions moved on and attended the “Contextual Embedding” lesson. This was a conversational lesson during which the Minions talked to one another through a messenger app called bi-LS™. The bi-LS™-facilitated convo allowed the Minions to learn each other’s identities — which they had picked up in the previous two lessons. Pretty neat, huh?

Two Minions having a fun conversation through bi-LS™, sharing information about themselves. Source: Giphy

I took another MRI scan of the “Singapore” Minion right after the “Contextual Embedding” class. As you can see, now our little guy knows a bit more stuff!

Now “Singapore” knows both his and his neighbors’ identities!

Our Minions were happily studying when suddenly a man barged into their school😱 It turns out that his name is Mr. Query and he is a journalist. Here he is:

The inquisitive Mr. Query. He has an urgent question: “Where is Singapore situated?” — and he knows some of our Minions hold relevant information for this question.

Mr. Query urgently needs to collect some information for an article he is writing. Specifically, he wants to know: “Where is Singapore situated?” Mr. Query knows that some of our Minions hold this information in their brains.

Our Minions, helpful as they are, want to help Mr. Query out. To do so, they will need to select several members of their team to meet with Mr. Query and deliver the information he’s seeking. This bunch of Minions that have relevant information for Mr. Query and will be dispatched to him is known as the Answer Gang.

The Answer Gang, which collectively holds the answer to Mr. Query’s question. Only relevant Minions can join the Answer Gang!

Now, our Minions have a task to do — they need to collectively decide who should and shouldn’t join the Answer Gang. They need to be careful when doing so! If they leave out of the Answer Gang too many Minions that should’ve been included, Mr. Query won’t get all the information he needs. This situation is called Low Recall and Mr. Query hates it.

On the other hand, if too many unnecessary Minions join the Answer Gang, Mr. Query will be inundated with superfluous information. He calls such a situation Low Precision and he doesn’t like that either! Mr. Query is known to have some anger management issues 👺 so it’s in our Minions’ best interest to supply him with just the right amount of information.

So how do the Minions know which of them should join the Answer Gang?

They figure this out by organizing several meet-up sessions that are collectively called “Attention.” During these sessions, each Minion gets to talk to Mr. Query separately and understand his needs. In other words, the Attention sessions allow the Minions to measure their importance to Mr. Query’s question.

This is the MRI scan of the “Singapore” Minion’s brain as he saunters away from the Attention sessions. This is equivalent to the first brain scan image I showed.

Singapore’s current brain content. He knows quite a bit — but he is still missing one thing!

As we can see, our Minions’ brains are now pretty full. With their current state of knowledge, are they now in a position to start choosing the members of the Answer Gang? Nope, not quite! They are still missing one key piece of information. Each of our Minions knows his own importance to Mr. Query. However, before they can make this important nomination, they also need to be aware of everyone else’s relative importance to Mr. Query.

As you might’ve guessed, this implies that the Minions have to talk to each other for the second time! And now you know that this conversation is done through the bi-LS™ app.

The Minions during the modeling step meeting. Here, they talk to each other through bi-LS™ and share their relative importance to Mr. Query. Source: Free PNG Logo

This bi-LS™-facilitated conversation is also known as the ‘modeling step’ and is the focus of our current article. Let’s now learn this step in detail!

Step 10. Modeling Layer

Okay, let’s leave our Minions for a while and get back to symbols and equations, shall we? It’s not that complicated, I promise!

The modeling layer is relatively simple. It consists of two layers of bi-LSTM. As mentioned above, the input to the modeling layer is G. The first bi-LSTM layer converts G into a 2d-by-T matrix called M1.

M1 then acts as an input to the second bi-LSTM layer, which converts it to another 2d-by-T matrix called M2.

The formation of M1 and M2 from G is illustrated below.

Step 10. In the modeling layer, G is passed through two bi-LSTM layers to form M1 and M2
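For readers who prefer code to diagrams, here is a minimal PyTorch sketch of the modeling layer, assuming d = 100 and our 9-word Context (the variable names are mine for illustration, not the official implementation’s):

```python
import torch
import torch.nn as nn

d, T = 100, 9
G = torch.randn(1, T, 8 * d)  # (batch, T, 8d): a stand-in for the attention layer's output

# First bi-LSTM: takes 8d features per word, returns 2d (d per direction)
lstm1 = nn.LSTM(input_size=8 * d, hidden_size=d, bidirectional=True, batch_first=True)
# Second bi-LSTM: takes M1's 2d features, returns another 2d
lstm2 = nn.LSTM(input_size=2 * d, hidden_size=d, bidirectional=True, batch_first=True)

M1, _ = lstm1(G)   # (1, T, 2d)
M2, _ = lstm2(M1)  # (1, T, 2d)
print(M1.shape, M2.shape)  # torch.Size([1, 9, 200]) torch.Size([1, 9, 200])
```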

M1 and M2 are yet another pair of matrix representations of Context words. The difference between them and the previous representations of Context words is that M1 and M2 have embedded in them information about the entire Context paragraph as well as the Query.

In Minion-speak, this means that our Minions now have all the information they need to make the decision about who should be in the Answer Gang.

The “Singapore” guy now has all he needs to decide if he should join the Answer Gang.

Step 11. Output Layer

Okay, now we’ve reached the finale! Just one more step and then we’re done!

For each word in the Context, we have at our disposal two numeric vectors that encode the word’s relevance to the Query. That’s awesome! The very last thing we need is to convert these numeric vectors into two probability values so that we can compare the Query-relevance of all Context words. And this is exactly what the output layer does.

In the output layer, M1 and M2 are first vertically concatenated with G to form [G; M1] and [G; M2]. Both [G; M1] and [G; M2] have a dimension of 10d-by-T.

We then obtain p1, the probability distribution of the start index over the entire Context, in two steps: [G; M1] is first multiplied by a trainable 10d-by-1 weight vector w(p1), giving one scalar per Context word, and a softmax then turns these T scalars into probabilities. In symbols: p1 = softmax(w(p1)ᵀ [G; M1]).

Similarly, we obtain p2, the probability distribution of the end index, from [G; M2] and its own trainable weight vector w(p2): p2 = softmax(w(p2)ᵀ [G; M2]).

The steps to get p1 and p2 are depicted in the diagram below:

Step 11. The output layer, which converts M1 and M2 into two vectors of probabilities, p1 and p2.
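Here is a minimal PyTorch sketch of the output layer, continuing the shapes from the previous snippet. The trainable weight vectors w(p1) and w(p2) are implemented as single-output Linear layers, which is one reasonable way to realize them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, T = 100, 9
G  = torch.randn(1, T, 8 * d)   # stand-in for the attention layer's output
M1 = torch.randn(1, T, 2 * d)   # stand-in for the first modeling bi-LSTM's output
M2 = torch.randn(1, T, 2 * d)   # stand-in for the second modeling bi-LSTM's output

# "Vertical" concatenation along the feature dimension: 8d + 2d = 10d per word
GM1 = torch.cat([G, M1], dim=-1)  # (1, T, 10d)
GM2 = torch.cat([G, M2], dim=-1)  # (1, T, 10d)

# The trainable 10d weight vectors, one per distribution
w_p1 = nn.Linear(10 * d, 1)
w_p2 = nn.Linear(10 * d, 1)

# One scalar per Context word, then a softmax over the T words
p1 = F.softmax(w_p1(GM1).squeeze(-1), dim=-1)  # (1, T): start-index distribution
p2 = F.softmax(w_p2(GM2).squeeze(-1), dim=-1)  # (1, T): end-index distribution
```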

p1 and p2 are then used to find the best Answer span. The best Answer span is simply the substring of the Context with the highest span score. The span score, in turn, is the product of the p1 value of the first word in that span and the p2 value of the last word in the span. We return the span with the highest span score as our Answer.

An example will make this clear. As you know, we are currently dealing with the following Query/Context pair:

  • Context: “Singapore is a small country located in Southeast Asia.” (T = 9)
  • Query: “Where is Singapore situated?” (J = 4)

After running this Query/Context pair through BiDAF, we obtain two probability vectors, p1 and p2.

Each word in the Context is associated with one p1 value and one p2 value. The p1 values indicate the probability of each word being the start word of the Answer span. Below are the p1 values for our example:

We see that the model thinks that the most probable start word for our Answer span is “Southeast.”

The p2 values indicate the probability of each word being the last word of the Answer span. Below are the p2 values for our example:

We see that our model is very sure, with almost 100% certainty, that the most probable end word for our Answer span is “Asia.”

If, in the original Context, the word with the highest p1 value comes before the word with the highest p2 value, then we have our best Answer span already — it will simply be the one that begins with the former and ends with the latter. This is the case in our example. As such, the Answer returned by the model will simply be “Southeast Asia.”

That’s it, ladies and gentlemen — finally after 11 long steps we obtain the Answer to our Query!

Here is Mr. Query with “Southeast” and “Asia”, both of whom have been selected to join the Answer Gang. It turns out that the information provided by “Southeast” and “Asia” is just what Mr. Query needs! Mr. Query is happy🎊

Okay, one caveat before we end this series. In the hypothetical case that the Context word with the highest p1 value comes after the Context word with the highest p2 value, we still have a bit of work to do. In this case, we’d need to generate all possible Answer spans (every substring whose start word comes no later than its end word) and calculate the span score for each of them. Here are some examples of possible Answer spans for our Query/Context pair:

  • Possible answer span: “Singapore” ; span score: 0.0000031
  • Possible answer span: “Singapore is” ; span score: 0.00000006
  • Possible answer span: “Singapore is a” ; span score: 0.0000000026
  • Possible answer span: “Singapore is a small” ; span score: 0.0000316

We then take the span with the highest span score to be our Answer.
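In code, the full span search might look like the sketch below. The p1 and p2 values here are made-up numbers chosen to mimic our example, not real model output; real implementations also usually cap the maximum span length:

```python
import numpy as np

context = ["Singapore", "is", "a", "small", "country",
           "located", "in", "Southeast", "Asia"]

# Made-up probabilities for illustration only
p1 = np.array([0.05, 0.01, 0.01, 0.02, 0.03, 0.02, 0.01, 0.80, 0.05])
p2 = np.array([0.001, 0.001, 0.001, 0.001, 0.002, 0.002, 0.002, 0.01, 0.98])

# Score every valid span (start <= end) and keep the best one
best_score, best_span = -1.0, (0, 0)
for start in range(len(context)):
    for end in range(start, len(context)):
        score = p1[start] * p2[end]
        if score > best_score:
            best_score, best_span = score, (start, end)

start, end = best_span
print(" ".join(context[start:end + 1]), best_score)  # Southeast Asia 0.784
```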

So that was it — a detailed illustration of each step in BiDAF, from start to finish (sprinkled with a healthy dose of Minion-joy). I hope that this series has helped you in understanding this fascinating NLP model!


If you have any questions/comments about the article or would like to reach out to me, feel free to do so either through LinkedIn or via email at meraldo.antonio AT gmail DOT com.

Glossary

  • Context: the accompanying text to a Query that contains an answer to that Query.
  • Query: the question to which the model is supposed to give an answer.
  • Answer: a substring of the Context that contains information that can answer the Query. This substring is to be extracted by the model.
  • Span score: the product of the p1 value of the first word in an Answer span and the p2 value of the last word in that span.
  • T: the number of words in the Context.
  • J: the number of words in the Query.
  • G: a big, 8d-by-T matrix that contains the Query-aware Context representations. G is the input to the modeling layer.
  • M1: a 2d-by-T matrix obtained by passing G through a bi-LSTM. M1 contains vector representations of Context words that carry information about the entire Context paragraph as well as the Query.
  • M2: a 2d-by-T matrix obtained by passing M1 through a bi-LSTM. M2, just like M1, contains vector representations of Context words that carry information about the entire Context paragraph as well as the Query.
  • p1: a probability vector of length T. Each Context word has its own p1 value, which indicates the probability of that word being the first word in the Answer span.
  • p2: a probability vector of length T. Each Context word has its own p2 value, which indicates the probability of that word being the last word in the Answer span.

