Pre-trained language model in any language

Rrohan.Arrora
AI n U

In this post, I will create a pre-trained language model for Vietnamese, but I assure you that you will be able to do the same in any language. If you want to learn about transfer learning and pre-trained models, read this post. I would also suggest exploring fastai. So, let us get started.


To analyze sentiment in any language using transfer learning, we need a pre-trained model in that language. So, we will start by building a pre-trained model that can later be used for sentiment analysis. For what transfer learning is and why we need a pre-trained model, I suggest you go through the link mentioned above. I am using Google Colab for the practical work.

Let us import the fastai libraries:

from fastai import *
from fastai.text import *

bs = 128                        # batch size
data_path = Config.data_path()  # fastai's default data directory (usually ~/.fastai/data)

Now, I will create a folder to store the Wikipedia content required to create the language model.

lang = 'vi'
name = f'{lang}wiki'
path = data_path/name
path.mkdir(exist_ok=True, parents=True)
lm_fns = [f'{lang}_wt', f'{lang}_wt_vocab']  # file names for the model weights and vocab

This will create a viwiki folder containing a viwiki text file with the Wikipedia contents. lm_fns defines the two file names under which we will later save both the language model weights and the vocabulary.

I have mentioned “vi” because I want to download the Vietnamese Wikipedia. If you want to download any other language, mention that language’s code as per the Wikipedia standards, like “en” for English or “es” for Spanish.

Now, fastai provides the nlputils module, which offers the essential functions described below. Let us understand them.

Thus, I will use the get_wiki function to download the Wikipedia content for the Vietnamese language.
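Under the hood, get_wiki boils down to fetching the official dump and unpacking it. Here is a minimal sketch of that idea, assuming a pathlib-style path (not the exact nlputils source, which additionally runs WikiExtractor to strip the XML and wiki markup into plain text):

import bz2, shutil
from urllib.request import urlretrieve

def get_wiki_sketch(path, lang):
    # Download and decompress the latest Wikipedia dump for `lang`.
    dump = f'{lang}wiki-latest-pages-articles.xml.bz2'
    url = f'https://dumps.wikimedia.org/{lang}wiki/latest/{dump}'
    urlretrieve(url, path/dump)                # fetch the compressed dump
    with bz2.open(path/dump) as src, open(path/dump[:-4], 'wb') as dst:
        shutil.copyfileobj(src, dst)           # decompress the .bz2 archive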

Now, once the data is downloaded, it is a single text file with all the articles placed one after the other. So, we need to split it into separate, properly named articles. The fastai nlputils module provides one more function for this, split_wiki.
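Conceptually, split_wiki scans that single file for the <doc ... title="..."> headers that WikiExtractor leaves in place and writes each article out to its own file. A minimal sketch of the idea (not the exact nlputils source, which works along the same lines with extra housekeeping):

import re

def split_wiki_sketch(path, lang):
    # Split the one big extracted file into one file per article.
    dest = path/'docs'
    dest.mkdir(exist_ok=True)
    title_re = re.compile(r'title="([^"]+)"')
    out = None
    for line in (path/f'{lang}wiki').open(encoding='utf-8'):
        if line.startswith('<doc id="'):             # header: a new article begins
            if out: out.close()
            title = title_re.search(line).group(1).replace('/', '_')
            out = (dest/f'{title}.txt').open('w', encoding='utf-8')
        elif out and not line.startswith('</doc>'):  # article body
            out.write(line)
    if out: out.close()
    return dest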

You know what the best part of fastai is? It cares for the end user, and therefore it provides most of the things required for pre-processing.

get_wiki(path, lang)           # download and extract the Vietnamese Wikipedia
dest = split_wiki(path, lang)  # split it into one file per article
dest.ls()[:5]                  # list a few of the resulting files

The listing above shows a few of the files, one for each Vietnamese Wikipedia article. Using the above procedure, you can download content in any language, believe me, any language, as long as Wikipedia supports it.
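If you want a quick look at what landed on disk, you can peek at one of the article files (illustrative; file names and contents will differ on your machine):

doc = dest.ls()[0]                              # pick any article file
print(doc.open(encoding='utf-8').read()[:300])  # its first few hundred characters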

Now, creating a pre-trained model in Vietnamese is very similar to creating one for IMDb, and I will use the fastai data block API.

data = (TextList.from_folder(dest)
        .split_by_rand_pct(0.1, seed=42)
        .label_for_lm()
        .databunch(bs=bs, num_workers=1))
  • Above, we create a databunch of the TextList type.
  • The path passed to from_folder (dest) was defined above.
  • We randomly hold out 10% of the data as a validation set.
  • Lastly, we create the databunch with the batch size and number of workers defined by the user; a quick sanity check follows below.
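Before training, it is worth a quick sanity check on what was built (illustrative; the exact numbers depend on your corpus):

print(len(data.vocab.itos))  # vocabulary size; fastai caps it at 60,000 tokens by default
data.show_batch()            # a few tokenized training sequences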

Finally, we create a language model.

learn = language_model_learner(data, AWD_LSTM, drop_mult=0.5, pretrained=False)  # train from scratch
  • You may have noticed something different: we have set pretrained=False.
  • We do so because we do not want to start from a model pre-trained on English. In the IMDb case, we start from a language model pre-trained on English WikiText; that is transfer learning, and we do not need it this time. This is a catch we need to understand and keep in mind.
  • Since we are starting with random weights, with no transfer learning, there are no pre-trained layers worth keeping frozen. So, we will directly unfreeze the model and train it, as shown below.
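The learning rate used below is simply what worked for me; since the best value varies by corpus, you can pick your own with fastai's learning-rate finder first (a quick sketch):

learn.lr_find()        # sweep a range of learning rates over a few mini-batches
learn.recorder.plot()  # choose a value where the loss is still falling steeply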
lr = 1e-3  # varies depending on the model and corpus
learn.unfreeze()
learn.fit_one_cycle(10, lr, moms=(0.8,0.7))
  • We end up with a model that correctly predicts the next word about 42% of the time.
  • Now, finally, we need to save the language model.
mdl_path = path/'models'
mdl_path.mkdir(exist_ok=True)
learn.save(mdl_path/lm_fns[0], with_opt=False)        # model weights, without optimizer state
learn.data.vocab.save(mdl_path/(lm_fns[1] + '.pkl'))  # the matching vocabulary
  • A language model's weights include an embedding for every word in its vocabulary, so the weights only make sense together with that vocab.
  • Therefore, we store the vocab along with the model.
  • When we download a pre-trained model in fastai, it comes with both the language model weights and the vocab, so we store both here as well.
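To reuse the result later, for example when fine-tuning on a Vietnamese sentiment corpus, language_model_learner can load these files through its pretrained_fnames parameter. A sketch, where data_lm stands for a hypothetical databunch built from your target corpus and the saved files are assumed to be reachable from that learner's models directory:

learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5,
                                  pretrained_fnames=lm_fns)  # loads vi_wt + vi_wt_vocab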

That’s all; you now have a pre-trained model that is good enough to suggest the next word, and you can use this language model for sentiment analysis. The best part is that you can now do it in any language.
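As a quick smoke test, a fastai language learner can generate a continuation directly. The prompt below is just an example, and the generated words will vary from run to run:

print(learn.predict('Hà Nội là', n_words=8))  # continue a Vietnamese prompt with 8 words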
