Transfer learning: leveraging insights from large data sets

Lars Hulstaert
Towards Data Science
Jan 22, 2018


In this blog post, you’ll learn what transfer learning is, what some of its applications are, and why it is a critical skill for a data scientist.

Transfer learning is not a machine learning model or technique; it is rather a ‘design methodology’ within machine learning. Another type of ‘design methodology’ is, for example, active learning.

Originally published at https://www.datacamp.com/community/tutorials/transfer-learning

This blog post is the first in a series on transfer learning. You can find the second blog post, which discusses two applications of transfer learning, here.

In a follow-up blog post I will explain how you can use active learning in conjunction with transfer learning to optimally leverage existing (and new) data. In a broad sense, machine learning applications use transfer learning when they leverage external information to improve their performance or generalisation capabilities.

Transfer Learning: A Definition

The general idea of transfer learning is to use knowledge learned from tasks for which a lot of labelled data is available in settings where little labelled data is available. Creating labelled data is expensive, so optimally leveraging existing datasets is key.

In a traditional machine learning model, the primary goal is to generalise to unseen data based on patterns learned from the training data. With transfer learning, you attempt to kickstart this generalisation process by starting from patterns that have been learned for a different task. Essentially, instead of starting the learning process from an (often randomly initialised) blank sheet, you start from patterns that have been learned to solve a different task.
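To make this concrete, here is a minimal sketch (assuming PyTorch and torchvision with the newer weights API, and a hypothetical 10-class target task) contrasting a randomly initialised network with one that starts from patterns learned on ImageNet:

```python
import torch.nn as nn
import torchvision.models as models

# Blank sheet: a ResNet-18 with randomly initialised weights.
scratch_model = models.resnet18(weights=None)

# Transfer learning: start from patterns learned on ImageNet and
# replace only the task-specific output layer (hypothetical 10 classes).
pretrained_model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
pretrained_model.fc = nn.Linear(pretrained_model.fc.in_features, 10)
```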

Being able to distinguish lines and shapes (left) in an image makes it easier to determine whether something is a 'car' than having to start from the raw pixel values. Transfer learning allows you to leverage learned patterns from other computer vision models.

Different approaches exist to represent words in NLP (a word-embedding-like representation on the left and a BoW-like representation on the right). With transfer learning, a machine learning model can leverage the relationships that exist between different words.

Transfer of knowledge and patterns is possible in a wide variety of domains. Today’s post will illustrate transfer learning by looking at several examples from different domains. The goal is to incentivise data scientists to experiment with transfer learning in their machine learning projects and to make them aware of its advantages and disadvantages.

There are three reasons why I believe a good understanding of transfer learning is a critical skill for a data scientist:

  • Transfer learning is essential in any kind of learning. Humans are not taught every single task or problem in order to be successful at it. Everyone gets into situations they have never encountered before, and we still manage to solve problems in an ad-hoc manner. The ability to learn from a large number of experiences and to export ‘knowledge’ into new environments is exactly what transfer learning is all about. From this perspective, transfer learning and generalisation are highly similar on a conceptual level. The main distinction is that transfer learning is often used for ‘transferring knowledge across tasks, instead of generalising within a specific task’. Transfer learning is thus intrinsically connected to the idea of generalisation that is necessary in all machine learning models.
  • Transfer learning is key to ensuring the breakthrough of deep learning techniques in a large number of small-data settings. Deep learning is pretty much everywhere in research, but many real-life scenarios simply do not have millions of labelled data points to train a model. Deep learning techniques require massive amounts of data in order to tune the millions of parameters in a neural network. Especially in the case of supervised learning, this means that you need a lot of (highly expensive) labelled data. Labelling images sounds trivial, but in Natural Language Processing (NLP), for example, expert knowledge is required to create a large labelled dataset. The Penn Treebank, a part-of-speech tagging corpus, was seven years in the making and required the close cooperation of many trained linguists. Transfer learning is one way of reducing the dataset size required for neural networks to be a viable option. Other viable options are moving towards more probabilistically inspired models, which are typically better suited to dealing with limited data sets.
  • Transfer learning has significant advantages as well as drawbacks. Understanding these drawbacks is vital for successful machine learning applications. Transfer of knowledge is only possible when it is ‘appropriate’. Exactly defining what appropriate means in this context is not easy, and experimentation is typically required. You should not trust a toddler that drives around in a toy car to be able to drive a Ferrari. The same principle holds for transfer learning: although hard to quantify, there is an upper limit to what transfer learning can achieve. It is not a solution that fits all problem cases.

General Concepts in Transfer Learning

The Requirements of Transfer Learning

Transfer learning, as the name states, requires the ability to transfer knowledge from one domain to another. Transfer learning can be interpreted on a high level. One example is that architectures from NLP can be re-used in sequence prediction problems, since a lot of NLP problems can inherently be reduced to sequence prediction problems. Transfer learning can also be interpreted on a low level, where you actually reuse parameters from one model in a different model (skip-gram, continuous bag-of-words, etc.). The requirements of transfer learning are on the one hand problem-specific and on the other hand model-specific. The next two sections discuss, respectively, a high-level and a low-level approach to transfer learning. Although you will typically find these concepts under different names in the literature, the overarching concept of transfer learning is still present.
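As an illustration of the low-level variant, the sketch below (a non-authoritative example, assuming gensim and PyTorch are installed; the vocabulary is hypothetical) reuses pre-trained word-embedding parameters inside a new model:

```python
import gensim.downloader as api
import numpy as np
import torch
import torch.nn as nn

# Low-level transfer: reuse parameters learned by a word-embedding model.
# Here, pre-trained GloVe vectors are fetched via gensim's downloader.
glove = api.load("glove-wiki-gigaword-100")      # 100-dimensional vectors

vocab = ["cat", "dog", "car", "truck"]           # hypothetical task vocabulary
weights = np.stack([glove[word] for word in vocab])

# Initialise an embedding layer with the transferred parameters instead of
# random values; set freeze=False to fine-tune them on the new task.
embedding = nn.Embedding.from_pretrained(
    torch.tensor(weights, dtype=torch.float32), freeze=True
)
```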

Multi-task Learning

In multi-task learning, you train a model on different tasks at the same time. Typically, deep learning models are used as they can be adapted flexibly.

The network architecture is adapted in such a way that the first layers are shared across the different tasks, followed by task-specific layers and outputs for each task. The general idea is that by training a network on different tasks, the network will generalise better, as the model is required to perform well on tasks that call for similar ‘knowledge’ or ‘processing’.

An example in the case of Natural Language Processing is a model for which the end goal is to perform entity recognition. Instead of training the model purely on the entity recognition task, you also use it to perform part-of-speech classification, next word prediction, … As such, the model will benefit from the structure learned from those tasks and the different datasets. I highly recommend Sebastian Ruder’s blogs on multi-task learning to learn more about the topic.
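A minimal multi-task sketch, assuming PyTorch is available; the layer sizes, tag counts and dummy batch are hypothetical. The embedding and LSTM trunk are shared, each task gets its own output head, and the per-task losses are summed so both tasks shape the shared parameters:

```python
import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    """Shared trunk with one output head per task (hypothetical sizes)."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256,
                 n_entity_tags=9, n_pos_tags=45):
        super().__init__()
        # Shared layers, reused across all tasks.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Task-specific heads.
        self.entity_head = nn.Linear(hidden_dim, n_entity_tags)  # entity recognition
        self.pos_head = nn.Linear(hidden_dim, n_pos_tags)        # part-of-speech tagging

    def forward(self, token_ids):
        shared, _ = self.encoder(self.embedding(token_ids))
        return self.entity_head(shared), self.pos_head(shared)

# One training step on a dummy batch: summing the per-task losses lets the
# gradients of both tasks flow through the shared trunk.
model = MultiTaskTagger()
criterion = nn.CrossEntropyLoss()
tokens = torch.randint(0, 10000, (4, 20))        # 4 sentences of 20 tokens
entity_labels = torch.randint(0, 9, (4, 20))
pos_labels = torch.randint(0, 45, (4, 20))

entity_logits, pos_logits = model(tokens)
loss = (criterion(entity_logits.reshape(-1, 9), entity_labels.reshape(-1))
        + criterion(pos_logits.reshape(-1, 45), pos_labels.reshape(-1)))
loss.backward()
```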

Featuriser

One of the great advantages of a deep learning model is that feature extraction is ‘automatic’. Based on the labelled data and backpropagation, the network is able to determine the useful features for a task. The network ‘figures out’ what part of the input is important in order to, for example, classify an image. This means that the manual job of feature definition is abstracted away. Deep learning networks can be reused for other problems, as the types of features that are extracted are often useful for other problems as well. Essentially, in a featuriser you use the first layers of the network to determine the useful features, but you don’t use the output of the network, as it is too task-specific.

Given that deep learning systems are good at feature extraction, how can you reuse existing networks to perform feature extraction for other tasks? It is possible to feed a data sample into the network and take one of the intermediate layers of the network as output. This intermediate layer can be interpreted as a fixed-length, processed representation of the raw data. Typically, the concept of a featuriser is used in the context of computer vision. Images are fed into a pre-trained network (for example, VGG or AlexNet) and a different machine learning method is used on the new data representation. Extracting an intermediate layer as a representation of the image significantly reduces the original data size, making it more amenable to traditional machine learning techniques (for example, logistic regression or SVMs work better with a small representation of an image, such as 128 dimensions, compared to the original, for example, 128x128 = 16,384 dimensions).
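A minimal featuriser sketch, assuming PyTorch, torchvision (newer weights API) and scikit-learn; the extract_features helper and the training images are hypothetical. A pre-trained VGG16 trunk maps each image to a compact 512-dimensional vector, on which a traditional classifier can then be trained:

```python
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms

# Pre-trained VGG16; drop the classifier head and keep the convolutional
# trunk plus a global average pool as a frozen featuriser (512-d output).
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.eval()
featuriser = nn.Sequential(vgg.features, nn.AdaptiveAvgPool2d(1), nn.Flatten())

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_features(pil_images):
    """Map raw (PIL) images to fixed-length vectors with the frozen trunk."""
    batch = torch.stack([preprocess(img) for img in pil_images])
    with torch.no_grad():
        return featuriser(batch).numpy()

# A traditional model can then be trained on the compact representation:
#   from sklearn.linear_model import LogisticRegression
#   X_train = extract_features(train_images)     # hypothetical PIL images
#   clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
```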

In the next blog post, I will discuss two applications of transfer learning in more depth, both with concrete examples!

If you have any questions, I’ll be happy to read them in the comments. Follow me on Medium or Twitter if you want to receive updates on my blog posts!
