A Gentle Introduction to Transfer Learning in NLP

Transfer Learning is one of the hottest topics in NLP — learn what it is and how you can apply it to your own projects today

Neil Sinclair
Towards Data Science


Photo by Lisa Fotios from Pexels

This article gives a brief overview of what Transfer Learning is in Natural Language Processing (NLP) and why it’s the greatest thing since sliced bread. If you want to experience what it’s like to play around with these awesome pretrained models, check out the link to the code here and at the end of the article; you can do some really advanced things in NLP with almost zero programming experience.

As a brief primer for this article, Natural Language Processing, or NLP, refers to using machine learning to process natural text, with ‘natural’ referring to the kind of text we find in books and newspapers, as opposed to, for example, computer programming code (okay, some models are learning to code, but we’ll stick to talking more generally about ‘natural’ language here). This technology is driving amazing things, from automatic article summarisation to responsive chatbots and even creative writing generation.

Training a computer vision or natural language model can be expensive. It requires lots of data, which in the case of computer vision needs to be labelled by humans, and it can take a lot of time to train on expensive hardware. GPT-2, a benchmark-setting language model released in 2019 by OpenAI, is estimated to have cost around $1.6m to train. With costs like that, how is anybody going to compete with OpenAI or any of the other organisations doing research in the NLP space?

Here’s the good news, though: you don’t need to compete with them. You can download these models for free on the internet, pre-trained on enormous datasets and ready to go. What’s even better is that you can fine-tune these models really quickly to work on the nuances of your specific dataset. To give you a sense of the difference, fine-tuning is like taking your car to the mechanic and getting new spark plugs, whereas training/pre-training is like getting a whole new engine.

As a concrete example, let’s say you run a large food delivery company and you want to get the sentiment (positive or negative feeling) of tweets people make about your company so you can quickly attend to the negative ones. Provided you have enough tweets to feed into the model, you could fine-tune it in a really short time to predict whether the tweets are positive or negative. I recently fine-tuned a classification model on food reviews for my master’s research that achieved 98% classification accuracy (predicting whether a review was positive or negative) with only about 40 minutes of training. If you were to train a model from scratch, it would likely take a few hours or longer, need an enormous amount of data that you’d have to collect and pre-process yourself, and would be less accurate than the fine-tuned model. What a win!
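To make that a little more concrete, here is a minimal sketch of what fine-tuning a sentiment classifier can look like using the Hugging Face libraries (more on those later in the article). The model name, the tweets.csv file and its column names are illustrative assumptions, not the setup from my research:

```python
# A minimal fine-tuning sketch. Assumes a file "tweets.csv" with columns
# "text" and "label" (0 = negative, 1 = positive); these names are placeholders.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # a small, general-purpose pretrained model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Load the labelled tweets and convert the text into numbers the model understands.
dataset = load_dataset("csv", data_files="tweets.csv")["train"].train_test_split(test_size=0.1)
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True)

# Fine-tune: a couple of epochs is usually enough because the model is already pretrained.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sentiment-model", num_train_epochs=2,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"])
trainer.train()
```

On a modest dataset and a single GPU, a run like this sits in the same ballpark as the 40-minute fine-tune mentioned above.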

This process of taking a model that’s already been trained on one task and then fine-tuning it to work on a related but different task is the essence of what’s called Transfer Learning. It’s been used in computer vision for just over half a decade and has been making waves in NLP for about three years. To understand how it works, we need to take a short diversion into a high-level view of how a language model works.

Image by Author

First, words are converted into a numerical form that machine learning models can understand, and these numbers are fed into the main part of the model, which is (most often) a deep, multi-layered neural network. The most popular architecture at the moment, the Transformer, builds a very deep set of relationships between every word in the sentence, creating an incredibly rich (numerical) representation of the sentence. This rich representation is then fed into the last layer of the model which, given part of a sentence, is trained to predict the next word. It does this by assigning a confidence to every word in its vocabulary for what the next word should be. In the example in the image above, the model is most confident that the next word is ‘oceans’.
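If you’d like to see that ‘confidence over every word in the vocabulary’ idea in action, here is a small sketch using a pretrained GPT-2 from the Hugging Face transformers library; the prompt is just an illustrative sentence of my own:

```python
# Ask a pretrained GPT-2 which words it thinks are most likely to come next.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The ship sailed across the"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # one score per vocabulary word, at every position

# Confidence over the whole vocabulary for the word that follows the prompt.
next_word_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_word_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)).strip():>10}  {prob:.3f}")
```

The exact words and scores you see will depend on the model, but the output is exactly the picture described above: a short list of candidate next words, each with a confidence.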

One of the fascinating discoveries in Transfer Learning for NLP is that when you train a model to predict the next word, you can take the trained model, chop off the layer that predicts the next word, put on a new layer, and train just that last layer, very quickly, to predict the sentiment of a sentence. Now, keep in mind that this is not what the model was trained to do; it was trained to predict the next word in the sentence. However, it appears to capture a lot of pertinent information about a sentence when it transforms it into the rich representation that is fed into that last layer. And you can do more than just fine-tune the model to predict the sentiment of a sentence: you can fine-tune it to be a conversational AI really quickly (these guys did it in two hours), or you can fine-tune it to take in questions and pull the answers out of a piece of text, again relatively quickly.
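As a rough sketch of the ‘chop off the old layer and put on a new one’ idea, again using the Hugging Face library purely for illustration: loading a pretrained model for sequence classification keeps the pretrained body of the network and bolts a fresh, randomly initialised classification layer on top. Freezing the body, as done here, means only that new last layer is trained:

```python
from transformers import AutoModelForSequenceClassification

# The pretrained body of the network is reused; the classification "head" on top is brand new.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

# Freeze the pretrained body so that only the new last layer learns during fine-tuning.
# (In practice you often let the whole model update as well, which is usually more accurate.)
for param in model.base_model.parameters():
    param.requires_grad = False
```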

So why should you care? Previously, language models were generally trained from scratch for very specific tasks, and believe me, training a language model from scratch can be a pain! As a researcher, this means you can test out new ideas really quickly by downloading these pretrained models and tweaking them for your use case. For example, I’m doing something in unsupervised style transfer for my master’s thesis, and being able to use pretrained models for these tasks means I can iterate on ideas incredibly quickly. As a practitioner, it means you can start incorporating NLP into your products or services relatively easily: you need less data to fine-tune the model, you can prototype ideas that use cutting-edge technology, and you can start to get really creative with it. One caveat, though: getting a model that’s between 250MB and 1.5GB in size to run quickly enough for customers’ needs is no mean feat of cloud engineering.

The team over at HuggingFace have done an incredible job of making these models easy to use. You can load up a pre-trained model with just a couple of lines of code and start experimenting with the possibilities. If you want to check it out, I’ve put together this online notebook that you can play with. You hardly need any programming experience to play around; just read the instructions and start to imagine how you could incorporate this amazing technology into your products or services. For reference, if you’re not familiar with Python, a notebook is basically a simple way of running Python scripts.
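‘A couple of lines of code’ is barely an exaggeration. As a taste of the kind of thing the notebook does, here is a sketch using the library’s pipeline helper, with an example sentence of my own:

```python
from transformers import pipeline

# Downloads a sensible default pretrained sentiment model and handles the rest.
classifier = pipeline("sentiment-analysis")
print(classifier("The delivery was quick and the food was delicious!"))
# Prints something along the lines of: [{'label': 'POSITIVE', 'score': 0.999}]
```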

For something quite light-hearted, head over here to play around with the GPT-2 model referenced in this article. Clear the text, type in a sentence or two of your own, press tab and see what the AI comes up with. Playing around with this makes me think about what the future of human-AI artistic interaction will look like.
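If you’d rather try the same thing in code than on the website, a rough equivalent using the text-generation pipeline looks like this (the prompt is just an example to replace with your own):

```python
from transformers import pipeline

# Give GPT-2 the start of a story and let it write the continuation.
generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time, in a city by the sea,", max_length=50)
print(result[0]["generated_text"])
```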

If you want a more in-depth, technical overview of transfer learning, check out this incredible article, and if you want to know how transfer learning applies to other domains like computer vision, check out this article I wrote previously.
