Deep Learning: Personal Notes Part 1 Lesson 4 — Structured Learning, Natural Language Processing, Collaborative Filtering, Dropout, Embeddings, Back Prop Through Time.

Gerald Muriuki
Towards Data Science
34 min read · Aug 25, 2018


This blog post series will be updated as I take a second pass through the fast.ai lessons. These are my personal notes; an attempt to understand things clearly and explain them well. Nothing new here, just writing things up.

Fast.ai takes the approach of “here is how to use the software to do something”, and then looks behind the scenes at the details.

Dropout

learn = ConvLearner.pretrained(arch, data, ps=0.5, precompute=True)

precompute=True precomputes the activations that come out of the last convolutional layer. An activation is a number calculated from weights (also known as parameters) that make up a kernel or filter, applied to the previous layer's activations, which could be the inputs themselves or the results of other calculations.

By typing the name of your learner object you can actually see the layers in it:

learn

Sequential(
(0): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True)
(1): Dropout(p=0.5)
(2): Linear(in_features=1024, out_features=512)
(3): ReLU()
(4): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True)
(5): Dropout(p=0.5)
(6): Linear(in_features=512, out_features=120)
(7): LogSoftmax()
)

Linear(in_features=1024, out_features=512): a linear layer is a matrix multiplication. In this case we have a matrix with 1024 rows and 512 columns. It takes in 1024 activations and spits out 512 activations.

ReLU replaces the negative values with zeros.

Linear(in_features=512, out_features=120): the second linear layer takes the first linear layer's 512 activations, puts them through a new 512 by 120 matrix multiplication, and outputs 120 activations.

Softmax — an activation function that returns numbers that add up to 1, each of them between 0 and 1. For minor numerical precision reasons, it turns out to be better to take the log of the softmax rather than the softmax directly. That is why, when we get predictions out of our models, we have to do np.exp(log_preds).
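For example, a minimal sketch of converting log predictions back to probabilities, assuming the learner above and fastai's predict() returning log-softmax outputs:

import numpy as np

log_preds = learn.predict()   # log-softmax outputs, shape (n_examples, n_classes)
probs = np.exp(log_preds)     # back to probabilities in [0, 1]
probs.sum(axis=1)             # each row sums to ~1.0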

What does Dropout(p=0.5) do?

Applying dropout means picking activations in a layer at random and deleting them. p is the probability of deleting an activation. The overall output activations do not change much.

Randomly throwing away half of the activations in a layer has an interesting effect: it forces the network not to overfit. If a particular activation had learned just that exact dog or cat, dropout forces the model to find a representation that still works when a random half of the activations are gone.

If you have done all the data augmentation you can and provided the model with as much data as possible, you are left with very few options; to a large degree you are stuck. Geoffrey Hinton et al. came up with dropout, which is loosely inspired by how the brain works.

With p=0.01 you are throwing away 1% of the activations. It will not change things much at all, so it is not going to prevent overfitting, i.e. it will not help the model generalize.

With p=0.99 you throw away 99% of the activations. It will not overfit, which is great for generalization, but it will kill your accuracy.

High p values generalize well but decrease your training accuracy; low p values generalize less well but give you better training accuracy.

Early in training, why is the validation loss better than the training loss, given that the validation set is data the model has not seen? This is because when we evaluate on the validation set we turn off dropout. When doing inference, i.e. predicting, we want to be using the best model we can. Dropout only happens during training.

Do you have to do anything to accommodate the fact that you are throwing away activations? We (fast.ai) do not. When we say p=0.5, behind the scenes PyTorch throws away half the activations and doubles the ones that remain, so the average activation does not change.
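A minimal sketch of this behaviour in plain PyTorch (the doubling is PyTorch's standard "inverted dropout" scaling):

import torch
import torch.nn as nn

x = torch.ones(10000)
drop = nn.Dropout(p=0.5)
drop.train()                # training mode: dropout is active
y = drop(x)                 # ~half the values are zeroed, survivors are scaled by 1/(1-p) = 2
print(x.mean(), y.mean())   # both ~1.0, so the average activation is unchanged
drop.eval()                 # inference mode: dropout is a no-op
print(drop(x).mean())       # exactly 1.0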

In Fast.ai, you can pass in ps which is the p value for all of the added layers. It will not change the dropout in the pre-trained network since it should have been already trained with some appropriate level of dropout:

learn = ConvLearner.pretrained(arch, data, ps=0.5, precompute=True)

We can remove dropout by setting ps=0. But after a couple of epochs we start to massively overfit: the training loss becomes much lower than the validation loss. With ps=0, dropout is not added to the model at all.

You may have noticed that it has been adding two linear layers. We do not have to do that. There is an xtra_fc (extra fully connected layers) parameter you can set, to which you pass a list of how big you want each of the extra fully connected layers to be. You need at least one linear layer, which takes the output of the convolutional part (1024 in this example) and turns it into the number of classes, 120 dog breeds, i.e. whatever your problem defines:

If xtra_fc is an empty list it means: don't add any additional linear layers, just the one we have to have, giving us a minimal model.
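For example, a hedged sketch of how xtra_fc might be passed (the layer sizes here are arbitrary, not from the lesson):

learn = ConvLearner.pretrained(arch, data, ps=0.5, precompute=True, xtra_fc=[700, 300])  # two extra linear layers
learn = ConvLearner.pretrained(arch, data, ps=0.5, precompute=True, xtra_fc=[])          # minimal model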

Is there a particular p value we should be using by default? For the first added layer we use p=0.25 and for the second p=0.5, which seems to work for most things. If you find it overfitting, keep increasing the dropout; if it is underfitting, decrease it.

ResNet34 has fewer parameters so it does not overfit as much, but for a bigger architecture like ResNet50 you often need to increase dropout.

Is there a particular way to determine whether the model is overfitting? Yes: the training loss is much lower than the validation loss. There is no rule for how much overfitting is too much, and zero overfitting is generally not optimal. The only thing you are trying to do is get the validation loss low, so you need to play around with a few things and see what makes the validation loss lowest.

Why does the average activation matter? If we deleted half of the activations without rescaling, the next layer, which takes them as input, would see its inputs halved, and so would everything after that. For example, if "fluffy ears" were detected when an activation was greater than 0.6, now they would only be detected above 0.3, which changes the meaning. The goal is to delete activations without changing the meaning.

Can we have different levels of dropout by layer? Yes, that is why it is called ps and we can pass an array: ps=[0.1, 0.2]

There is no rule of thumb for when earlier or later layers should have different amounts of dropout yet. If in doubt, use the same dropout for every fully connected layer. Often people only put dropout on the very last linear layer.

Why monitor loss and not accuracy? Loss is the only thing that we can see for both the validation set and the training set and we are able to compare them. As we will learn later, the loss is the thing that we are actually optimizing so it is easier to monitor and understand what that means.

By adding dropout we are adding random noise, which slows learning; do we need to adjust the learning rate? It does not seem to impact the learning rate enough to notice. In theory it might, but not enough to affect us in practice.

Structured and Time Series Data

There are two types of columns in the data:

Categorical — it has a number of “levels” e.g. StoreType, Assortment

Continuous — it has a number where the difference or ratio of that number has some kind of meaning. e.g CompetitionDistance

We will not cover data cleaning and feature engineering; we will assume that has been done. We need to convert the data into inputs compatible with a neural network.

This includes converting categorical variables into contiguous integers or one-hot encodings, normalizing continuous features to standard normal, etc…

Numbers like year, month and day could be treated as continuous, but we don't have to. When we decide to make something categorical, we are telling the neural net to treat every level differently. When we say it is continuous, we are telling it to come up with a smooth function to fit them.

So for things that actually are continuous but do not have many distinct levels (e.g. Year, DayOfWeek), it often works better to treat them as categorical, since each day may behave qualitatively differently.

We get to say which variables are categorical and which ones are continuous, this is a modeling decision you have to make.

If something is coded in your data as “a, b, c” you have to call it as a categorical, if it starts out as continuous you have to choose whether to treat that as categorical or continuous.

Note that the continuous variables are actual floating point numbers. Floating point numbers have many distinct values, i.e. high cardinality, whereas, for example, the cardinality of DayOfWeek is only 7.

Do you ever bin continuous variables? Not at the moment, but one thing we could do with, say, Max_Temperature is to group it into 0–10, 10–20, 20–30 and call that categorical. A group of researchers found that binning can sometimes be helpful.

If you are using year as a category, what happens when a model encounters a year it has never seen before? It will be treated as an unknown category. Pandas has a special category called unknown and if it sees a category it has not seen before, it gets treated as unknown.

If our training set did not contain an unknown category but our test set does, what will the model do? Will it still predict? It will treat it as the unknown category and still predict; behind the scenes the unknown gets the value 0. If there were unknowns in the training set, the model will have learned to predict with them; if not, the unknown will have a random embedding vector.

n = len(joined); n
844338

We have 844,338 rows; that is, one row per date per store.

Loop through cat_vars and turn the applicable data frame columns into categorical columns.

Loop through contin_vars and set them to float32 (32-bit floating point), because that is what PyTorch expects, for example Promo and SchoolHoliday.
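Roughly what those two loops look like, assuming the merged data frame is called joined as in the notebook:

for v in cat_vars:
    joined[v] = joined[v].astype('category').cat.as_ordered()   # categorical columns

for v in contin_vars:
    joined[v] = joined[v].fillna(0).astype('float32')           # continuous columns as float32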

Start with a sample

We tend to start with a small sample of the data set. For images that would mean resizing the images to 64 x 64 or 128 x 128, but with structured data we start with a random sample of rows.

Thus giving us 150000 rows of data to start with.
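A sketch of drawing that sample, assuming fastai's get_cv_idxs helper and a date-indexed data frame as in the notebook:

idxs = get_cv_idxs(n, val_pct=150000 / n)          # pick 150,000 random row indexes
joined_samp = joined.iloc[idxs].set_index("Date")
samp_size = len(joined_samp); samp_size            # 150000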

Looking at the data:

Even though we set some of the columns (e.g. ‘StoreType’, ‘Year’) as “category”, that is only stored internally; Pandas still displays them as strings in the notebook.

proc_df (process data frame) is a fast ai function that:

Takes a data frame, is told what your dependent variable is, puts that variable into a separate array, and deletes it from the original data frame. In other words, df does not have the Sales column, and y contains the Sales column.

do_scale Neural nets really like the input data to all be somewhere around zero with a standard deviation of somewhere around 1. To make this happen we take our data, subtract the mean, and divide by the standard deviation. It returns a special mapper object which keeps track of the mean and standard deviation it used for that normalization, so you can do the same to the test set later.

nas It also handles missing values. For a categorical variable, missing becomes ID 0 and the other categories become 1, 2, 3, and so on. For a continuous variable, it replaces the missing value with the median and creates a new boolean column that says whether it was missing or not.
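Putting it together, a sketch of the proc_df call as it appears in the notebook (the log transform of y is explained under the metric below):

df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True)
yl = np.log(y)   # train on log(Sales); see the RMSPE discussion below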

Before preprocessing
After preprocessing

Year 2014, for example, becomes 2, since categorical variables are replaced with contiguous integers starting at zero. The reason is that we are going to put them into a matrix later, and we would not want the matrix to be 2014 rows long when it only needs a few rows.

We now have a data frame which does not contain the dependent variable and where everything is a number. That is where we need to get to in order to do deep learning.

In this case, we need to predict the next two weeks of sales therefore we should create a validation set which is the last two weeks of our training set.

In time series data, cross-validation is not random. Instead, our holdout data is generally the most recent data, as it would be in real application. This issue is discussed in detail here. One approach is to take the last 25% of rows (sorted by date) as our validation set.

Putting together our model

It is important that you have a strong understanding of your metric, i.e. how you are going to be judged. In this competition, we are judged on Root Mean Square Percentage Error (RMSPE).

That means that for each prediction we take the actual sales minus the predicted sales, divide by the actual sales to get a percentage error, square it, average over all predictions, and take the square root.

When you take the log of the data, getting the root mean squared error will actually get you the root mean square percentage error:
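A sketch of the metric as defined in the notebook (predictions and targets are in log space, so we exponentiate first):

import math
import numpy as np

def inv_y(a): return np.exp(a)

def exp_rmspe(y_pred, targ):
    targ = inv_y(targ)                       # back from log space to actual sales
    pct_var = (targ - inv_y(y_pred)) / targ  # percentage error per prediction
    return math.sqrt((pct_var ** 2).mean())  # root mean square of the percentage errors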

We can create a ModelData object directly from out data frame:

md = ColumnarModelData.from_data_frame(PATH, val_idx, df, 
yl.astype(np.float32), cat_flds=cat_vars, bs=128,
test_df=df_test)

We will start by creating model data object md which has a validation set, training set, and optional test set built into it. From that, we will get a learner, we will then optionally call lr_find, then call learn.fit and so forth.

The difference here is that we are not using ImageClassifierData.from_csv or from_paths, we need a different kind of a model data called ColumnarModelData and we call from_data_frame.

PATH specifies where to store model files, etc.

val_idx list of indexes of rows we will put in the validation set.

df the data frame of independent variables.

yl dependent variable which is the log of y returned by proc_df i.e yl = np.log(y)

cat_flds specifies which columns you want treated as categorical. Everything has been turned into a number, so unless we specify this they will all be treated as continuous. We pass the list of names cat_vars.

bs batch size

Now we have a standard model data object that contains train_dl, val_dl , train_ds , val_ds , etc.

Creating a learner suitable for our model data:

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
0.04, 1, [1000,500], [0.001,0.01],
y_range=y_range)
  • 0.04 : how much dropout to use at the very start
  • [1000,500] : how many activations to have in each layer
  • [0.001,0.01] : how much dropout to use in each of those layers
  • emb_szs embeddings

Embeddings

Let’s set categorical variables aside for a moment and look at the continuous variables:

You never want to put ReLU in the last layer because softmax needs negatives to create low probabilities.

Simple view of a fully connected layer:

It takes an input as a rank one tensor, puts it through a linear layer (matrix product), then an activation (ReLU), another linear layer, then softmax, and finally produces an output. We can add more linear layers or dropout.

Skip the softmax for regression problems.
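A minimal sketch of that fully connected path in plain PyTorch, for the continuous variables only (layer sizes and n_cont are illustrative assumptions, not values from the lesson):

import torch.nn as nn

n_cont = 16                   # number of continuous variables (assumption)
model = nn.Sequential(
    nn.Linear(n_cont, 1000),  # linear layer (matrix product)
    nn.ReLU(),                # activation
    nn.Linear(1000, 500),
    nn.ReLU(),
    nn.Linear(500, 1),        # regression output: one number, no softmax
)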

Categorical variables

We create a new matrix of 7 rows and as many columns as we choose, 4 for example, and fill it with floating point numbers. To add “Sunday” to our rank 1 tensor of continuous variables, we do a lookup into this matrix, which returns 4 floating point numbers, and we use those as “Sunday”.

Initially, these numbers are random. But we can put them through a neural net and update them using gradient descent in a way that reduces the loss. This matrix is just another bunch of weights in our neural net called an embedding matrix

An embedding matrix is something where we start out with an integer between zero and maximum number of levels of that category. We index into the matrix to find a particular row, and we append it to all of our continuous variables, and everything after that is just the same as before (linear → ReLU → linear etc).
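A sketch of that lookup-and-append step in PyTorch (the index for Sunday and the number of continuous variables are made up for illustration):

import torch
import torch.nn as nn

day_emb = nn.Embedding(8, 4)                 # 8 levels (7 days + unknown), 4-dimensional embedding
sunday = torch.tensor([6])                   # hypothetical integer index for Sunday
emb_vec = day_emb(sunday)                    # lookup: a 1x4 tensor of learned floats
cont_vars = torch.randn(1, 16)               # the continuous variables for this row (dummy values)
x = torch.cat([emb_vec, cont_vars], dim=1)   # append, then on to linear -> ReLU -> linear ...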

What do those 4 numbers represent? They are just parameters that we are learning, which happen to end up giving us a good loss. We will discover later that these particular parameters are often human-interpretable and quite interesting, but that is a side effect. They are a set of 4 numbers, initialized randomly, that we learn.

Do you have good heuristics for the dimensionality of the embedding matrix? We use the cardinality of each variable (that is, its number of unique values) to decide how large to make its embeddings.

Above is a list of every categorical variable and its cardinality.

We have 7 days of the week, but looking at DayOfWeek its cardinality is 8; the extra level is added in case there is an unknown value in the test set. Divide that by 2 and you get the 4 numbers above. Even if there were no missing values in the original data, you should still set aside one level for unknown, just in case. There are 3 years in the data, but again we add one for the unknown, and so on.

The rule of thumb for determining the embedding size is the cardinality size divided by 2, but no bigger than 50.

Looking at Store, there are 1116 stores, and its lookup returns a rank one tensor of length 50 (the cap). We build an embedding matrix for every categorical variable.

We then pass the embedding sizes emb_szs to get_learner, which tells the learner, for every categorical variable, which embedding to use.

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
m.summary()

Is there a way to initialize embedding matrices besides random? The basic idea is if somebody else at Rossmann had already trained a neural network to predict cheese sales, you may as well start with their embedding matrix of stores to predict liquor sales. This is what happens, for example, at Pinterest and Instacart. Instacart uses this technique for routing their shoppers, and Pinterest uses it for deciding what to display on a webpage. They have embedding matrices of products/stores that get shared in the organization so people do not have to train new ones.

What is the advantage of using embedding matrices over one-hot encoding? For the day of week example above, instead of the 4 numbers we could just as easily have passed 7 numbers, e.g. [0, 1, 0, 0, 0, 0, 0] for Sunday. That is also a list of floats and would totally work; it is how categorical variables have generally been used in statistics for many years, under the name “dummy variables”. The problem is that the concept of Sunday could only ever be associated with a single floating point number, so it gets this kind of linear behavior: Sunday is only ever more or less of a single thing. With embeddings, Sunday is a concept in a four dimensional space, and these embedding vectors tend to pick up rich semantic structure. For example, if it turns out that weekends behave differently, you tend to see that Saturday and Sunday get higher values in some particular component, or that certain days of the week become associated with higher sales, such as liquor purchases on Fridays.

By having higher dimensionality vector rather than just a single number, it gives the deep learning network a chance to learn these rich representations.

The idea of an embedding is what is called a “distributed representation”, the most fundamental concept in neural networks. It is the idea that a concept in a neural network has a high dimensional representation which can be hard to interpret. The numbers in the vector do not even have to have a single meaning each: one component could mean one thing when another is low and something else when it is high, because the vector is passed through rich nonlinear functions. It is this rich representation that allows the network to learn such interesting relationships.

Are embeddings suitable for certain types of variables? Embeddings are suitable for any categorical variable. The only thing they cannot work well for is a variable with too high a cardinality: if you had 600,000 rows and a variable with 600,000 levels, that is just not a useful categorical variable. But in general, the third-place winners in this competition decided that everything that was not too high cardinality should be treated as categorical. A good rule of thumb is: if you can make a variable categorical, you may as well, because that way it can learn a rich distributed representation; whereas if you leave it as continuous, the most the model can do is try to find a single functional form that fits it well.

How matrix algebra works behind the scene:

Doing a lookup is identical to doing a matrix product between a one-hot encoded vector and the embedding matrix.

After the multiplication you are left with just one row of the embedding matrix. Modern libraries implement this as taking an integer and doing a lookup into an array.
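A quick check of that equivalence in PyTorch (index 3 is an arbitrary example):

import torch
import torch.nn as nn

emb = nn.Embedding(7, 4)              # 7 levels, 4-dimensional embedding
one_hot = torch.zeros(1, 7)
one_hot[0, 3] = 1.0                   # one-hot encoding of level 3
via_matmul = one_hot @ emb.weight     # matrix product with the embedding matrix
via_lookup = emb(torch.tensor([3]))   # direct integer lookup
assert torch.allclose(via_matmul, via_lookup)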

Could you touch on using dates and times as categorical and how that affects seasonality? The following extracts particular date fields from a complete datetime for the purpose of constructing categoricals. You should always consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can’t capture any trend/cyclical behavior as a function of time at any of these granularities. We’ll add to every table with a date field.

The add_datepart function takes a data frame and a column name. It optionally removes the column from the data frame and replaces it with lots of columns representing useful information about that date, such as day of week, day of month, month of year, is it the start of a quarter, the end of a quarter, and so on. We end up with a list of features:
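A sketch of how it is typically called in the notebook (drop=False keeps the original Date column; the exact set of added columns is roughly as commented):

add_datepart(joined, 'Date', drop=False)
# adds columns such as Year, Month, Week, Day, Dayofweek,
# Is_quarter_start, Is_quarter_end, Is_year_start, Elapsed, ...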

For example, DayOfWeek now becomes eight rows by four columns embedding matrix. Conceptually this allows our model to create some interesting time series models. If there is something that has a seven day period cycle that goes up on Mondays and down on Wednesdays but only for daily and only in Berlin, it can totally do that, it has all the information it needs.

This is a fantastic way to deal with time series. You just need to make sure that the cycle indicators in your time series exist as columns. If you did not have a column called day of week, it would be very difficult for the neural network to learn to divide modulo seven and look up the result in an embedding matrix. It is not impossible, but really hard.

For example, if you are predicting sales of beverages in San Francisco, you probably want a list of when the ball game is on at AT&T Park, because that is going to impact how many people are drinking beer in SoMa. So you need to make sure that the basic indicators of periodicity are in your data, and as long as they are there, the neural net is going to learn to use them.

Learner

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
m.summary()

emb_szs embedding size

len(df.columns)-len(cat_vars) number of continuous variables in the data frame.

0.04 embedding matrix has its own dropout and this is the dropout rate.

1 how many outputs we want to create i.e output of the last linear layer which is sales.

[1000, 500] number of activations in the first linear layer, and the second linear layer.

[0.001,0.01] dropout in the first linear layer, and the second linear layer.

We have a learner; let’s find the learning rate (lr = 1e-3):
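A sketch of the usual learning rate finder steps, as in earlier lessons:

m.lr_find()          # run the learning rate finder
m.sched.plot(100)    # plot loss vs. learning rate, skipping the first 100 points
lr = 1e-3            # pick a rate a bit before the minimum of the curve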

Fit

start with sample data:

metrics this is a custom metric which specifies a function exp_rmspe that is called at the end of every epoch and prints out a result.

fit all the data:

m.fit(lr, 1, metrics=[exp_rmspe], cycle_len=1)

[ 0.       0.00676  0.01041  0.09711]

By using all of the training data, we get an RMSPE of 0.09711. That would have put us near the top of the leaderboard.

So this is a technique for dealing with time series and structured data. Interestingly, compared to the team that used this technique (Entity Embeddings of Categorical Variables), the second-place winner did way more feature engineering. The winners of this competition were actually subject matter experts in logistics sales forecasting, so they had their own code to create lots and lots of features.

Folks at Pinterest who build a very similar model for recommendations also said that when they switched from gradient boosting machines to deep learning, they did way less feature engineering and it was much simpler model which requires less maintenance. So this is one of the big benefits of using this approach to deep learning, you can get state of the art results but with a lot less work.

Are we using any time series methods in any of this? Indirectly, yes. As we just saw, we have DayOfWeek, MonthOfYear, etc. in our columns, and most of them are being treated as categories, so we are building a distributed representation of January, Sunday, and so on. We are not using any classic time series techniques; all we are doing is two fully connected layers in a neural net. The embedding matrix is able to deal with things like day-of-week periodicity in a richer way than standard time series techniques.

What is the difference between image models and this model? There is a difference in the way we call get_learner. For images we just called ConvLearner.pretrained and passed in the data:

learn = ConvLearner.pretrained(arch, data, ps=0., precompute=True)

For these kinds of models, in fact for a lot of models, the model we build depends on the data. In this case, we need to know what embedding matrices we have. So here, the data object creates the learner:

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 0.04, 1,
[1000,500], [0.001,0.01], y_range=y_range)

Steps

Step 1. List categorical variable names, and list continuous variable names, and put them in a Pandas dataframe:

cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday', 'CompetitionMonthsOpen',
'Promo2Weeks', 'StoreType', 'Assortment', 'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear',
'State', 'Week', 'Events', 'Promo_fw', 'Promo_bw', 'StateHoliday_fw', 'StateHoliday_bw',
'SchoolHoliday_fw', 'SchoolHoliday_bw']
contin_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC',
'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Wind_SpeedKm_h',
'Mean_Wind_SpeedKm_h', 'CloudCover', 'trend', 'trend_DE',
'AfterStateHoliday', 'BeforeStateHoliday', 'Promo', 'SchoolHoliday']

Step 2. Create a list of which row indexes you want in your validation set:

val_idx = np.flatnonzero(
(df.index<=datetime.datetime(2014,9,17)) & (df.index>=datetime.datetime(2014,8,1)))

Step 3. Call this exact line of code:

md = ColumnarModelData.from_data_frame(PATH, val_idx, df, 
yl.astype(np.float32), cat_flds=cat_vars, bs=128,
test_df=df_test)

Step 4. Create a list of how big you want each embedding matrix to be:

cat_sz = [(c, len(joined_samp[c].cat.categories)+1) for c in cat_vars]  # cardinality of each categorical (+1 for unknown)
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]
emb_szs
[(1116, 50),
(8, 4),
(4, 2),
(13, 7),
(32, 16),
(3, 2),
(26, 13),
(27, 14),
(5, 3),
(4, 2),
(4, 2),
(24, 12),
(9, 5),
(13, 7),
(53, 27),
(22, 11),
(7, 4),
(7, 4),
(4, 2), ...

Step 5. Call get_learner — you can use these exact parameters to start with; if it overfits, you can play around with them.

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 0.04, 1,
[1000,500], [0.001,0.01], y_range=y_range)

Step 6. Call m.fit

m.fit(lr, 3, metrics=[exp_rmspe])

How do you use data augmentation for this type of data, and how does dropout work here? No idea. It probably has to be domain-specific, but he (Jeremy) has never seen any paper or anybody in industry doing data augmentation with structured data and deep learning. He thinks it can be done but has not seen it done. What dropout does here is exactly the same as before: it throws away a proportion of the activations of a rank one tensor, and that also applies to the embedding outputs.

What is the downside? Almost no one is using this. Why not? Basically the answer is that no one in academia is working on this, because it is not something that people publish on. As a result, there have not been really great examples people could look at and say “oh, here is a technique that works well, so let’s have our company implement it”. But perhaps equally importantly, until now with this Fast.ai library, there has not been any way to do it conveniently. If you wanted to implement one of these models, you had to write all the custom code yourself. Now it is a six step process. There are a lot of big commercial and scientific opportunities to use this and solve problems that previously haven’t been solved very well.

Natural Language Processing

This is the most up-and-coming area of deep learning, and it is two or three years behind computer vision. The state of the software and some of the concepts is much less mature than it is for computer vision.

One of the things you find in NLP is that there are particular problems you can solve, and they have particular names. One kind of problem in NLP is called “language modeling”: building a model that, given a few words of a sentence, predicts what the next word is going to be.

For example, when you are typing and you press space and the next word is suggested, that is a language model.

For this exercise we downloaded 18 months of papers from arXiv.org.

Data:

<cat> — category of the paper. CSNI is Computer Science and Networking.

<summ> — abstract of the paper.

Let’s look at the output of a trained language model: we pass the category, in this case csni (computer science and networking), and some priming text:

The model learned by reading arXiv papers that somebody who is writing about computer networking would talk like this. Remember, it started out not knowing English at all. It started out with an embedding matrix for every word in English that was random. By reading lots of arXiv papers, it learned what kind of words followed others.

The model not only learned how to write English pretty well, but also, after you say something like “convolutional neural network”, it uses parentheses to specify the acronym “(CNN)”.

We will try to create a pre-trained model which can then be used for other tasks. For example, given IMDB movie reviews, we will figure out whether they are positive or negative, which is a classification problem.

We would like to use a pre-trained network which at least knows how to read English. So we will train a model that predicts the next word of a sentence i.e. language model, and just like in computer vision, stick some new layers on the end and ask it to predict whether something is positive or negative.

Basically, create a language model and make it the pre-trained model for a classification model.

Why wouldn't training directly on the task you want work better? It just turns out it doesn't perform as well empirically, for several reasons. First of all, we know fine-tuning a pre-trained network is really powerful, so if we can get it to learn some related task first, we can use all of that information to help it on the second task. The other reason is that IMDB movie reviews are up to a thousand words long. After reading a thousand words (as integers), knowing nothing about how English is structured, or even what a word or punctuation is, all you get is a 1 or a 0 (positive or negative). Trying to learn the entire structure of English and then how it expresses positive and negative sentiment from a single number is just too much to expect. By building a language model first, we build a neural net that understands the English of movie reviews, and we hope that what it learns will be helpful in deciding whether a review is positive or negative.

Is this similar to Char-RNN by Karpathy? It is somewhat similar; Char-RNN predicts the next letter given a number of previous letters. Language models generally work at the word level, though they do not have to, and we will focus on word-level modeling.

To what extent are the generated words and sentences actual copies of what it found in the training dataset? The words are definitely words it has seen before, because the model is not working at the character level, so it can only produce words from its vocabulary. For the sentences, there are rigorous ways of checking, but the easiest is to look at examples like the one above: generate text for a couple of categories and compare; the samples express similar concepts but are not verbatim copies.

Most importantly, when we train the language model, we will have a validation set so that we are trying to predict the next word of something that has never seen before.

Examples of text classification:

  1. A hedge fund might identify things in articles or on Twitter that caused massive market drops in the past.
  2. Recognising customer service queries which tend to be associated with people who cancel their subscriptions in the next month.
  3. Classifying whether documents are part of legal discovery or not.

IMDB

Imports

torchtext PyTorch’s NLP library.

Data

The large movie review dataset contains a collection of 50,000 reviews from IMDB, with an even number of positive and negative reviews. The authors considered only highly polarized reviews: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included. The dataset is divided into training and test sets, each containing 25,000 labeled reviews.

The sentiment classification task consists of predicting the polarity (positive or negative) of a given text.

However, before we try to classify sentiment, we will simply try to create a language model; that is, a model that can predict the next word in a sentence. Why? Because our model first needs to understand the structure of English, before we can expect it to recognize positive vs negative sentiment.

So our plan of attack is the same as we used for Dogs v Cats: pretrain a model to do one thing (predict the next word), and fine tune it to do something else (classify sentiment).

Unfortunately, there are no good pre-trained language models available to download, so we need to create our own.

We do not have separate test and validation sets in this case. The training directory has a bunch of files in it representing movie reviews:

Checking out an example of a movie review:

Let’s check how many words are in the training dataset and the test dataset respectively:

Before we can analyze text, we must first tokenize it. This refers to the process of splitting a sentence into an array of words or more generally, into an array of tokens.

A good tokenizer will do a good job of recognizing the pieces in your sentence: each piece of punctuation will be separated, and each part of a multi-part word will be separated as appropriate. Below is an example of tokenization:
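For instance, a rough sketch of what the spacy tokenizer does (the 'en' model name reflects the old spacy API used at the time; this is an assumption):

import spacy

spacy_tok = spacy.load('en')   # English model, old spacy API (assumption)
doc = spacy_tok("I don't like Mondays, but I'll survive.")
print(' '.join(tok.text for tok in doc))
# I do n't like Mondays , but I 'll survive .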

We use Pytorch’s torchtext library to preprocess our data, telling it to use the spacy library to handle tokenization.

Create a field

First, we create a torchtext field, which describes how to preprocess a piece of text — in this case, we tell torchtext to make everything lowercase, and tokenize it with spacy.

TEXT = data.Field(lower=True, tokenize="spacy")

lower=True — lowercase the text

tokenize="spacy" — tokenize with the spacy tokenizer

We create a ModelData object for language modeling by taking advantage of LanguageModelData, passing it our torchtext field object, and the paths to our training, test, and validation sets. In this case, we don't have a separate test set, so we'll just use VAL_PATH for that too.

bs=64; bptt=70
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

PATH where the data is, where to save models, etc.

TEXT torchtext’s Field definition of how to preprocess text.

**FILES list of all of the files we have: training, validation, and test

bs batch size

bptt Back Prop Through Time. It means how long a stretch of text we stick on the GPU at once.

min_freq=10 : In a moment, we are going to be replacing words with integers i.e a unique index for every word. If there are any words that occur less than 10 times, just call them unknown.

After building our ModelData object, it automatically fills the TEXT object with a very important attribute: TEXT.vocab. This is a vocabulary, which stores which words (or tokens) have been seen in the text, and how each word will be mapped to a unique integer id. We'll need to use this information again later, so we save it.

(Technical note: python’s standard Pickle library can't handle this correctly, so at the top of this notebook we used the dill library instead and imported it as pickle).

This is the start of the mapping from integer IDs to unique tokens:

vocab lets us take a word and map it to an integer, and an integer back to a word. We will be working with integers.
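A sketch of the two mappings, assuming the TEXT.vocab built above:

TEXT.vocab.itos[:12]     # integer-to-string: a list, e.g. ['<unk>', '<pad>', 'the', ',', '.', ...]
TEXT.vocab.stoi['the']   # string-to-integer: a dict-like lookup, e.g. 2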

Is it common to do any stemming or lemmatizing? Not really, no. Generally, tokenization is what we want. To keep it as general as possible, we want to know what is coming next, so whether it is future tense or past tense, plural or singular, we don't really know which things are going to be interesting and which are not; it seems best to leave the text alone as much as possible.

When dealing with natural language, isn’t context important? Why are we tokenizing and looking at individual words? No, we are not looking at individual words; they are still in order. Just because we replaced “I” with the number 12 does not change the order.

There is a different way of dealing with natural language called “bag of words”, which does throw away the order and context. In the Machine Learning course, we will learn about working with bag-of-words representations, but they are no longer useful, or on the verge of becoming no longer useful. We are starting to learn how to use deep learning to use context properly.

Batch size and Back Prop Through Time (BPTT)

What happens in a language model is even though we have lots of movie reviews, they all get concatenated together into one big block of text. So we predict the next word in this huge long thing which is all of the IMDB movie reviews concatenated together.

We start with a big block of text of, for example, 64 million words. We split the concatenated reviews into 64 sections, stack each section underneath the previous one, and transpose, ending up with a matrix which is 1 million by 64. We then grab a little chunk at a time, with chunk lengths approximately equal to bptt (this is randomized a little each time). We grab a roughly 70-long section, and that is the first thing we chuck into our GPU, i.e. the batch.

Our LanguageModelData object will create batches with 64 columns (that's our batch size), and varying sequence lengths of around 70 tokens (that's our bptt parameter - backprop through time).

Each batch also contains the exact same data as labels, but one word later in the text — since we’re trying to always predict the next word. The labels are flattened into a 1d array.

We grab our first training batch by wrapping the data loader (md.trn_dl) with iter and then calling next.
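A sketch of that call (the exact sizes vary from run to run because bptt is randomized):

x, y = next(iter(md.trn_dl))
x.size(), y.size()   # e.g. (torch.Size([77, 64]), torch.Size([4928])); 4928 = 77 * 64 flattened labels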

We get back a 77 by 64 tensor: approximately 70 rows, but not exactly.

Torchtext randomly changes the bptt number every time, so each epoch it gets slightly different chunks of text — similar to shuffling images in computer vision. We cannot randomly shuffle the words because they need to stay in the right order, so instead we randomly move the breakpoints a little bit.

77 represents the first 77 words of the first movie review.

Why not split by sentence? Remember, we are using columns, and each of our columns is about 1 million tokens long. Although it is true that those columns do not always end exactly on a full stop, they are so long that we do not care; each column contains many sentences.

Creating a model

Now that we have a model data object that can feed us batches, we can create a model. First, we are going to create an embedding matrix.

4602 : the number of batches

34945 : the number of unique tokens in the vocab (a word must appear at least 10 times, min_freq=10, otherwise it is replaced with the unknown token)

1 : the length of the dataset, i.e. the whole corpus is one long piece of text

20621966 : the number of words in the corpus.
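The numbers above come from a line like this in the notebook:

len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)
# (4602, 34945, 1, 20621966)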

34945 is used to create an embedding matrix:

Each word gets an embedding vector. A word is just a categorical variable with very high cardinality: here a categorical variable with 34,945 levels, which is why we create an embedding matrix for it.

We have a number of parameters to set:

em_sz = 200  # size of each embedding vector
nh = 500 # number of hidden activations per layer
nl = 3 # number of layers

The embedding size is 200 which is much bigger than our previous embedding vectors. Not surprising because a word has a lot more nuance to it than the concept of Sunday.

Generally, an embedding size for a word will be somewhere between 50 and 600.

Researchers have found that large amounts of momentum don’t work well with these kinds of RNN (recurrent neural network) models, so we create a version of the Adam optimizer with less momentum than its default of 0.9.

opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

Fastai uses a variant of the state of the art AWD LSTM Language Model developed by Stephen Merity. A key feature of this model is that it provides excellent regularization through Dropout. There is no simple way known (yet!) to find the best values of the dropout parameters below — you just have to experiment…

However, the other parameters (alpha, beta, and clip) shouldn't generally need tuning

If you try to build an NLP model and you are under-fitting, then decrease all these dropouts, if overfitting, then increase all these dropouts in roughly this ratio.
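Approximately how the learner is created in the notebook, with the AWD LSTM dropout parameters (treat the exact values as starting points to experiment with):

learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=0.05,  # dropout on the input/embedding output
                       dropout=0.05,   # dropout on the final layer's output
                       wdrop=0.1,      # weight dropout inside the LSTM
                       dropoute=0.02,  # dropout on the embedding matrix itself
                       dropouth=0.05)  # dropout between LSTM layers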

There are other kind of ways we can avoid overfitting. For now, learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1) works reliably so all of your NLP models probably want this particular line.

learner.clip=0.3 when you look at your gradients and you multiply them by the learning rate and decide how much to update your weights by, this will not let them be more than 0.3. Prevents us from taking too big steps (high learning rate).

There are word embeddings out there such as Word2vec or GloVe. How are they different from this? And why not initialize the weights with them? People have pre-trained these embedding matrices to do various other tasks. They are not pre-trained models; they are just a pre-trained embedding matrix that you can download. There is no reason we could not download them, but building a whole pre-trained model this way did not seem to benefit much, if at all, from using pre-trained word vectors, whereas using a whole pre-trained language model made a much bigger difference. Maybe we can combine both to make things a little better still.

What is the architecture of the model? It is a recurrent neural network using something called LSTM (Long Short Term Memory).

Fitting the model

learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)
learner.save_encoder('adam1_enc')
learner.load_encoder('adam1_enc')
learner.fit(3e-3, 1, wds=1e-6, cycle_len=10)

In the sentiment analysis section, we’ll just need half of the language model — the encoder, so we save that part:

learner.save_encoder('adam3_10_enc')
learner.load_encoder('adam3_10_enc')

Language modeling accuracy is generally measured using the metric perplexity, which is simply exp() of the loss function we used.

math.exp(4.165)
64.3926824434624

pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))

Testing

We can play around with our language model a bit to check it seems to be working OK. First, let’s create a short bit of text to ‘prime’ a set of predictions. We’ll use our torchtext field to numericalize it so we can feed it to our language model.

We haven’t yet added methods to make it easy to test a language model, so we’ll need to manually go through the steps:
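Roughly the manual steps from the notebook (the priming sentence is just an example):

m = learner.model
ss = ". So, it wasn't quite what I was expecting, but I really liked it anyway! The best"
s = [TEXT.preprocess(ss)]   # tokenize the priming text
t = TEXT.numericalize(s)    # map tokens to integer IDs

m[0].bs = 1       # set batch size to 1
m.eval()          # turn off dropout
m.reset()         # reset the hidden state
res, *_ = m(t)    # forward pass: predictions at each position
m[0].bs = bs      # put the batch size back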

Let’s see what the top 10 predictions were for the next word after our short text:
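For instance, something like this (to_np is fastai's tensor-to-numpy helper):

nexts = torch.topk(res[-1], 10)[1]           # indices of the 10 highest-scoring next tokens
[TEXT.vocab.itos[o] for o in to_np(nexts)]   # map them back to words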

Let’s see if our model can generate a bit more text all by itself:

Sentiment Analysis

We have a pre-trained language model, and now we want to fine-tune it to do sentiment classification. We’ll need the saved vocab from the language model, since we need to ensure the same words map to the same IDs.

TEXT = pickle.load(open(f'{PATH}models/TEXT.pkl','rb'))

sequential=False tells torchtext that this field should not be tokenized (in this case, we just want to store the single 'positive' or 'negative' label).
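That label field is defined with a single line, as in the notebook:

IMDB_LABEL = data.Field(sequential=False)   # a single label per example, not a sequence of tokens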

This time we must not treat the whole thing as one big piece of text; every review is kept separate, because each one has a different sentiment attached to it.

splits is a torchtext method that creates train, test, and validation sets. The IMDB dataset is built into torchtext, so we can take advantage of that. Take a look at lang_model-arxiv.ipynb to see how to define your own fastai/torchtext datasets.

splits allows us to look at a single example's label t.label and some of its text t.text.
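A sketch of building the splits and peeking at one example, roughly as the notebook does:

splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')
t = splits[0].examples[0]
t.label, ' '.join(t.text[:16])   # e.g. ('pos', <the first 16 tokens of the review>)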

fastai can create a ModelData object md2 directly from the torchtext splits, which we can then train on:
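As in the notebook:

md2 = TextData.from_splits(PATH, splits, bs)   # a ModelData object built from the torchtext splits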

get_model gets us our learner; then we load the pre-trained language model into it with m3.load_encoder('adam3_10_enc').

Because we’re fine-tuning a pre-trained model, we’ll use differential learning rates, and also increase the max gradient for clipping, to allow the SGDR to work better.

We make sure all except the last layer is frozen. Then we train a bit, unfreeze it, and train a bit more. The nice thing is that once you have a pre-trained language model, it actually trains really fast.
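A rough sketch of that fine-tuning loop, assuming hyperparameters along the lines of the notebook (treat the exact values and signature details as illustrative assumptions):

m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl,
                   dropouti=0.4, wdrop=0.5, dropoute=0.05, dropouth=0.3)
m3.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
m3.load_encoder('adam3_10_enc')                  # load the pre-trained language model encoder
m3.clip = 25.                                    # larger gradient clip than before
lrs = np.array([1e-4, 1e-4, 1e-4, 1e-3, 1e-2])   # differential learning rates

m3.freeze_to(-1)                                 # train only the final layer first
m3.fit(lrs/2, 1, metrics=[accuracy])
m3.unfreeze()                                    # then fine-tune the whole network
m3.fit(lrs, 1, metrics=[accuracy], cycle_len=1)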

A recent paper from Bradbury et al, Learned in translation: contextualized word vectors, has a handy summary of the latest academic research in solving this IMDB sentiment analysis problem. Many of the latest algorithms shown are tuned for this specific problem.

As you see, we just got a new state of the art result in sentiment analysis, decreasing the error from 5.9% to 5.5%! You should be able to get similarly world-class results on other NLP classification problems using the same basic steps.

There are many opportunities to further improve this:

For example we could start training language models that look at lots of medical journals and then make a downloadable medical language model that then anybody could use to fine-tune on a prostate cancer subset of medical literature.

We could also combine this with pre-trained word vectors.

We could have pre-trained a Wikipedia corpus language model and then fine-tuned it into an IMDB language model, and then fine-tune that into an IMDB sentiment analysis model and we would have gotten something better than this.

Collaborative Filtering

Data:

MovieLens data

http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

path='data/ml-latest-small/'

It contains a userId, the movie (movieId), the rating, and a timestamp.
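A sketch of loading the ratings, assuming the standard ml-latest-small file layout:

import pandas as pd

ratings = pd.read_csv(path + 'ratings.csv')
ratings.head()   # columns: userId, movieId, rating, timestamp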

Our goal will be, for some user-movie combination we have not seen before, to predict whether the user will like the movie. This is how recommendation engines are built.

Let’s read the movie names too:
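Similarly:

movies = pd.read_csv(path + 'movies.csv')
movies.head()    # columns: movieId, title, genres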

We will create(in the next lesson) a kind of cross tab of users by movies:

Thanks for reading! Follow @itsmuriuki.

Back to learning!
