The International Conference on Machine Learning (ICML) took place last July in Stockholm and showcased many interesting trends and directions in machine learning. Since ICML was such a huge conference, I will focus my attention on a few of the many interesting strands running through it.
Specifically, this year's ICML split the oral talks into several different "tracks/sessions." I was happy to see three of these sessions focused on "transfer and multitask learning," as this has long been an area of interest of mine. A large number of posters, as well as several orals from other tracks, dealt with these concepts too.
A lack of large amounts of clean, labeled data remains a barrier to the potential impact of deep learning. For many tasks there is an overall lack of data points (e.g., forecasting elections, diagnosing rare diseases, translating into rare or extinct languages …). In other cases the data exists but is noisy or poorly labeled (e.g., images scraped from Google under a specific keyword, medical cases assigned labels through NLP, a text corpus only partially annotated). Whatever the reason, there is a tangible benefit to finding methods that learn from limited or noisy (semi-related) data.
Three approaches to this problem are transfer learning, multi-task learning (technically a subcategory of transfer learning, like domain adaptation, but for this article I will treat them as separate entities), and semi-supervised learning. There are other approaches (active learning, meta-learning, entirely unsupervised learning), but this article will focus on ICML papers related to the first three (especially the first two). As the boundaries between these areas aren't always clear, we might venture into some of the others as well. For readers needing a review, here is a brief overview; for a more detailed treatment, see Sebastian Ruder's excellent blog post on transfer learning and multi-task learning.
I have always found transfer learning and multi-task learning to be very important tools regardless of industry or domain. Whether you work in medicine, finance, travel, or recreation, and whether you work with images, text, audio, or time series data, chances are that you can benefit from taking general pre-trained models and fine-tuning them to your specific domain. Depending on your data, it is also highly likely that there are multiple related tasks that you can train your neural network to solve jointly, increasing overall performance.
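To make this concrete, here is a minimal PyTorch sketch of the standard fine-tuning recipe (the backbone choice, the five-class head, and the learning rate are all illustrative, not tied to any paper discussed here):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet pre-trained backbone.
model = models.resnet18(pretrained=True)

# Freeze the pre-trained weights so only the new head trains at first.
for param in model.parameters():
    param.requires_grad = False

# Swap in a new classifier head for a hypothetical 5-class target domain.
model.fc = nn.Linear(model.fc.in_features, 5)

# Optimize only the parameters that still require gradients.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```

A common refinement is to later unfreeze the deeper layers and continue training the whole network at a much smaller learning rate.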
Of particular interest to those who focus on deep learning for medicine (but useful for others as well) was a paper titled "Not to Cry Wolf: Distantly Supervised Multitask Learning in Critical Care." ICU wards suffer from so many false alarms that nurses and doctors become desensitized to them. This paper focused on detecting the actual life-threatening ICU events, rather than the false alarms, using multi-task and semi-supervised learning. First, the authors used multi-task learning with auxiliary tasks to improve the performance of the model without requiring a lot of annotation time. Specifically, their model "incorporates a large amount of distantly supervised auxiliary tasks in order to significantly reduce the number of expensive labels required for training." Second, they developed a new approach "to distantly supervised multitask learning that automatically identifies a large set of related auxiliary tasks from multivariate time series to jointly learn from labelled and unlabelled data." The video of the talk is available on YouTube.
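The paper's actual architecture is more involved, but the core pattern, a shared encoder feeding one scarcely labeled main head plus many cheaply labeled auxiliary heads, can be sketched roughly as follows. The layer sizes, task count, and loss weighting below are my own illustrative choices, not the authors':

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared encoder with a main head plus auxiliary heads."""

    def __init__(self, n_features, n_aux_tasks=8):
        super().__init__()
        # Encoder shared by the main task and every auxiliary task.
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
        )
        self.main_head = nn.Linear(32, 1)  # e.g., real alarm vs. false alarm
        self.aux_heads = nn.ModuleList(
            nn.Linear(32, 1) for _ in range(n_aux_tasks)
        )

    def forward(self, x):
        h = self.encoder(x)
        return self.main_head(h), [head(h) for head in self.aux_heads]

def total_loss(main_pred, main_y, aux_preds, aux_ys, aux_weight=0.1):
    """Mix the scarce main labels with plentiful auxiliary labels."""
    bce = nn.BCEWithLogitsLoss()
    loss = bce(main_pred, main_y)
    for pred, y in zip(aux_preds, aux_ys):
        loss = loss + aux_weight * bce(pred, y)
    return loss
```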
What if you want the benefits of multitask learning but only have one task? The paper "Pseudo-task Augmentation: From Deep Multitask Learning to Intratask Sharing – and Back" deals with this issue. The authors propose utilizing pseudo-tasks to increase the performance of the main task. This works because, at a basic level, multitask learning shares features across intermediate and upper layers while learning task-specific decoders for each task. Hence, training a model with multiple decoders should yield the same benefits even if the decoders are all for the same task, because each decoder learns the task in a different way; these additional decoders are called "pseudo-tasks." The authors achieve SOTA results on the CelebA dataset. I was pleased to see they also tested on the IMDB sentiment dataset, where they took a baseline model and showed significant improvements by training with their technique. This suggests the technique can work with multiple different neural network architectures.
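In spirit, this amounts to attaching several independently initialized decoders to one shared encoder and training them all on the same labels. Here is my own simplified sketch (not the authors' code; sizes and decoder count are illustrative):

```python
import torch
import torch.nn as nn

class PseudoTaskNet(nn.Module):
    """One shared encoder, several decoders for the *same* task."""

    def __init__(self, n_features, n_classes, n_decoders=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        # Each decoder starts from a different random initialization,
        # so each pulls the shared features in a different direction.
        self.decoders = nn.ModuleList(
            nn.Linear(64, n_classes) for _ in range(n_decoders)
        )

    def forward(self, x):
        h = self.encoder(x)
        return [decoder(h) for decoder in self.decoders]

def pseudo_task_loss(outputs, y):
    """Average the loss over all decoders; every pseudo-task
    shares the same ground-truth labels."""
    ce = nn.CrossEntropyLoss()
    return sum(ce(out, y) for out in outputs) / len(outputs)
```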
GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks
This paper describes a new normalization technique for multi-task neural networks that helps them converge faster and increases overall performance. It also reduces the number of hyperparameters that need tuning to just one. Using GradNorm, the authors achieve SOTA on the NYUv2 dataset. Overall, this is a solid paper that can help reduce the complexity and difficulty of training multi-task learning (MTL) models. Finally, the authors make the interesting observation that "GradNorm may have applications beyond multitask learning. We hope to extend the GradNorm approach to work with class-balancing and sequence-to-sequence models, all situations where problems with conflicting gradient signals can degrade model performance."
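A compressed sketch of the core update, adapted from the paper's description (the variable names and exact structure are mine), looks roughly like this:

```python
import torch

def gradnorm_step(losses, initial_losses, task_weights, shared_layer, alpha=1.5):
    """One GradNorm weight update (simplified sketch, not the authors' code).

    losses:         list of current per-task loss tensors (with grad)
    initial_losses: per-task loss values recorded at step 0
    task_weights:   1-D learnable tensor of task weights (requires_grad=True)
    shared_layer:   weight tensor of the last layer shared by all tasks
    alpha:          the paper's single balancing hyperparameter
    """
    # Norm of the gradient of each weighted task loss w.r.t. the shared layer.
    norms = []
    for w_i, loss_i in zip(task_weights, losses):
        grad = torch.autograd.grad(w_i * loss_i, shared_layer,
                                   retain_graph=True, create_graph=True)[0]
        norms.append(grad.norm())
    norms = torch.stack(norms)

    # Relative inverse training rate: tasks that have made less progress
    # get a larger target gradient norm. The target is treated as constant.
    with torch.no_grad():
        ratios = torch.tensor(
            [l.item() / l0 for l, l0 in zip(losses, initial_losses)])
        target = norms.mean() * (ratios / ratios.mean()) ** alpha

    # L1 distance between actual and target norms; backpropagate this
    # loss into `task_weights` only, to rebalance the tasks.
    return (norms - target).abs().sum()
```

The paper additionally renormalizes the task weights after each update so that they always sum to the number of tasks.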
Transfer Learning via Learning to Transfer

Up to this point, most transfer learning papers have only studied transferring knowledge from a source domain to a target domain, either by pre-initializing the weights and freezing layers or by decaying the learning rate. This paper can best be described as "meta-transfer learning," or learning how to best perform transfer learning (L2T). The authors explain:
Unlike L2T, all existing transfer learning studies transfer from scratch, i.e., only considering the pair of domains of interest but ignoring previous transfer learning experiences. Better yet, L2T can even collect all algorithms’ wisdom together, considering that any algorithm mentioned above can be applied in a transfer learning experience
This naturally leads to the question of how L2T differs from "meta-learning." In reality, L2T can be seen as a special type of meta-learning: like meta-learning, it uses past histories to improve how it learns. In this context, however, a history refers to a past transfer learning experience from a source to a target domain.

The authors evaluate their L2T framework on the Caltech-256 and Sketches datasets. The model improves on previous SOTA results, particularly in cases where there are few labeled examples.
I was happy to see "Explicit Inductive Bias for Transfer Learning with Convolutional Networks" get into ICML after (in my opinion unfairly) being rejected from ICLR. This paper describes a way of applying regularization to effectively perform transfer learning, instead of modifying the learning rate. The authors propose several new regularization methods that apply different penalties based on the weights of the pre-trained model. They achieve good experimental results, and I'm currently working on applying the technique to several of my medical imaging models.
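The simplest penalty of this family replaces ordinary weight decay toward zero with decay toward the pre-trained weights. A minimal sketch (the coefficients are illustrative, and `pretrained_state` is assumed to be a dict of detached pre-trained tensors):

```python
import torch

def decay_to_pretrained_penalty(model, pretrained_state, alpha=0.01, beta=0.01):
    """Penalize distance from the pre-trained starting point.

    Shared layers decay toward their pre-trained values rather than
    toward zero; layers without a pre-trained counterpart (e.g., a
    newly added classifier head) get ordinary L2 decay.
    """
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in pretrained_state:
            # Decay toward the pre-trained weights, not toward zero.
            penalty = penalty + alpha * (param - pretrained_state[name]).pow(2).sum()
        else:
            penalty = penalty + beta * param.pow(2).sum()
    return penalty

# usage: loss = task_loss + decay_to_pretrained_penalty(model, pretrained_state)
```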
"Curriculum Learning by Transfer Learning: Theory and Experiments with Deep Networks" is primarily a theoretical paper that investigates "curriculum learning," an idiom of learning borrowed from education/psychology that aims to learn more difficult concepts in a progressive and organized fashion. Specifically, the paper looks at the relationship between transfer learning and curriculum learning as well as the relationship between curriculum learning and the order of examples presented for training and its impact on stochastic gradient descent. It is important to note here that this type of transfer is not the same as the other types discussed so far. In this context transfer learning refers to investigating "the transfer of knowledge from one classifier to another, as in teacher classifier to student classifier." Hence, with this type of transfer learning, "it is not the instance representation which is being transferred but rather the ranking of training examples." The authors conclude that the learning rate is always faster with curriculum learning and that sometimes final generalization is improved particularly with respect to hard tasks.
Learning Semantic Representations for Unsupervised Domain Adaptation

One problem in unsupervised domain adaptation (a type of transfer learning) is aligning the target and source distributions. The authors develop a semantic transfer network that learns representations "for unlabeled target samples by aligning labeled source centroid and pseudo-labeled target centroid." More simply, their method aims to align the source and target distributions by minimizing the mapping discrepancy between the two domains via a semantic loss function. Results include SOTA performance on both the ImageCLEF-DA and Office-31 datasets. Their code is available online.
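Stripped to its essence, the semantic loss pulls each class's source centroid toward the centroid of the target samples pseudo-labeled with that class. Here is a simplified per-batch sketch (the paper actually maintains moving-average centroids across batches, which this version omits):

```python
import torch

def semantic_alignment_loss(src_feats, src_labels, tgt_feats, tgt_pseudo, n_classes):
    """Squared distance between per-class source and target centroids."""
    loss = 0.0
    for c in range(n_classes):
        src_c = src_feats[src_labels == c]
        tgt_c = tgt_feats[tgt_pseudo == c]
        if len(src_c) == 0 or len(tgt_c) == 0:
            continue  # class absent from this mini-batch
        loss = loss + (src_c.mean(dim=0) - tgt_c.mean(dim=0)).pow(2).sum()
    return loss / n_classes
```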
"Detecting and Correcting for Label Shift with Blackbox Predictors" is another interesting paper related to domain adaptation. It focuses on detecting changes in the label distribution p(y) between training and testing, which can be particularly useful in medicine if an epidemic or disease outbreak greatly shifts that distribution.
Faced with distribution shift between training and test set, we wish to detect and quantify the shift, and to correct our classifiers without test set labels
The specific topic of the paper is label shift (as opposed to covariate shift). The authors develop several interesting label shift simulations, which they then apply to the CIFAR-10 and MNIST datasets. Their methods greatly increase accuracy compared to the uncorrected model.
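The heart of the black-box approach can be sketched in a few lines of NumPy (the variable names are mine): estimate the predictor's confusion matrix on held-out source data, measure its output distribution on the unlabeled target data, and solve a linear system for the importance weights w(y) = q(y)/p(y):

```python
import numpy as np

def estimate_label_shift_weights(val_preds, val_labels, target_preds, n_classes):
    """Black-box label-shift estimation (in the spirit of the paper).

    val_preds/val_labels: predictions and true labels on held-out source
        data; target_preds: predictions on unlabeled target data.
    Returns estimated importance weights w[y] = q(y) / p(y).
    """
    # C[i, j] = joint frequency of (predicted i, true label j) on source.
    C = np.zeros((n_classes, n_classes))
    for pred, label in zip(val_preds, val_labels):
        C[pred, label] += 1
    C /= len(val_labels)

    # mu[i] = frequency of prediction i on the target data.
    mu = np.bincount(target_preds, minlength=n_classes) / len(target_preds)

    # Solve C w = mu (in practice, check that C is well-conditioned).
    # The weights then reweight the source loss when retraining.
    return np.linalg.solve(C, mu)
```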
Rectify Heterogeneous Models with Semantic Mapping

I found this paper interesting for its use of optimal transport (OT) to align distributions.
Optimal Transport (OT) becomes the main tool in REFORM, which has the ability to align distributions
Altogether, this paper presents original ideas and obtains good results on both synthetic and real-world datasets, including the Amazon User Click dataset and the Academic paper classification dataset.
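For readers unfamiliar with OT, here is a toy illustration, unrelated to the paper's actual pipeline: with uniform weights and equal-size point clouds, optimal transport reduces to an assignment problem that SciPy can solve directly:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Two toy "distributions": equal-size point clouds in feature space.
rng = np.random.default_rng(0)
source = rng.normal(loc=0.0, size=(50, 2))
target = rng.normal(loc=3.0, size=(50, 2))

# Cost matrix of pairwise squared distances.
cost = cdist(source, target, metric="sqeuclidean")

# Match each source point to one target point at minimum total cost;
# the matching tells us how to move one distribution onto the other.
row, col = linear_sum_assignment(cost)
print(f"optimal transport cost: {cost[row, col].mean():.3f}")
```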
These were just a few of the interesting papers from ICML 2018; there are many other great ones. I do hope at some point to summarize the meta-learning papers and the rest of the semi-supervised learning papers, which I found fascinating as well.
Announcements
I'm still finishing up the next article in my series on deploying machine learning models to production. In that article I will discuss using Seldon Core and Kubeflow to deploy machine learning models in a scalable way.
Northern New England Data and Analytics is hosting a data meetup on August 15th, where we will walk through deploying a recent NLP model with Seldon Core and Kubeflow and using it in a chatbot. The meetup will be streamed on Zoom.