Introduction
In this post, we introduce an interactive visualization tool that helps users better understand how and why traditional Active Learning works. This work is based on a paper we recently published at the NeurIPS 2021 Human-Centered AI workshop.
Human-Centric Machine Learning
Over the past decades, Machine Learning has become a buzzword, and many recent advances have focused on making models more accurate and efficient. Meanwhile, it is crucially important for humans to be able to explain why these models work, especially in sensitive domains such as autonomous driving, financial investing, and healthcare management. If we cannot convince ourselves why these intelligent machines work well, domain experts will not fully trust their predictions.
Unfortunately, there are still many Machine Learning applications where state-of-the-art models achieve strong predictive power, but the gain in accuracy comes at the cost of transparency, and the decisions reached lack interpretability. Only recently have we begun to see a shift towards making advanced algorithms interactive and explainable. In other words, we would like to amplify and augment human abilities, aiming to preserve human control and make Machine Learning more productive, enjoyable, and fair. This idea is supported by several well-known institutions, such as the Stanford Human-Centered AI Institute and Google's People + AI Research.
Active Learning

If you have read my previous post about Active Learning in R, you will know that my research interest is to promote the interpretability of Active Learning. Here is a quick refresher on what Active Learning is: an interactive approach that queries an oracle (usually a human annotator) to label the most "valuable" data, in order to train an accurate Machine Learning model with the fewest labeled examples.
The majority of theoretical work in Active Learning has been done on classification problems, with many fundamental and successful algorithms developed, such as uncertainty sampling and query by committee. To evaluate Active Learning, the most common approach is to plot accuracy against the number of queries, where we expect accuracy to improve as more samples are queried. Not limited to classification, recent studies have also adapted similar algorithms for regression, where the standard performance measure is Mean Squared Error (MSE).
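To make this concrete, here is a minimal sketch of a pool-based Active Learning loop with uncertainty sampling. The dataset, classifier, and query schedule are illustrative placeholders, not the setup from any particular paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy pool-based setup: a synthetic dataset and a simple classifier.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.RandomState(0)
labeled = list(rng.choice(len(X_pool), size=20, replace=False))   # small seed set
unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

accuracies = []
for _ in range(50):                                               # 50 single queries
    model = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
    accuracies.append(accuracy_score(y_test, model.predict(X_test)))

    # Uncertainty sampling: query the unlabeled point the model is least confident about.
    confidence = model.predict_proba(X_pool[unlabeled]).max(axis=1)
    query = unlabeled[int(np.argmin(confidence))]
    labeled.append(query)
    unlabeled.remove(query)

# Plotting `accuracies` against the number of queries gives the usual learning curve.
```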
Issue: many Active Learning papers present highly similar accuracy/MSE curves, yet report contradictory findings. The information conveyed by accuracy/MSE plots is both limited and potentially misleading to users.

For example, the CIFAR-10 experiment (first graph) from a recent CVPR paper shows almost identical performance for all querying methods (except "Random"). It is a fair hypothesis that even though the curves overlap, these Active Learning methods do not improve the classification model identically. One piece of supporting evidence is that they perform quite differently in the other experiments. But the question is: how could we test this and actually see whether different algorithms work differently or not?
Our Method & Experiment
In the paper we published at the NeurIPS 2021 Human-Centered AI Workshop, we proposed a novel interactive visualization tool that helps people better understand why and how Active Learning works on certain classification and regression tasks.
In this section, we briefly share how the insights are built, using a vivid example (don't worry, there won't be any overly technical or intricate definitions!).
PCA for Test Dataset
In a machine learning experiment, accuracy or MSE is computed on the initially held-out test set. For example, "accuracy = (# of correctly predicted samples / # of all samples in the test set)." Hence, the first task is to lay out all test samples and create a 2-D feature embedding for visualization. Dimension reduction techniques like PCA can help.
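As a rough sketch, assuming the test features live in a NumPy array X_test, the 2-D embedding can be produced with scikit-learn's PCA:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first so no single feature dominates the principal components.
X_scaled = StandardScaler().fit_transform(X_test)
embedding = PCA(n_components=2).fit_transform(X_scaled)   # shape: (n_test_samples, 2)

plt.scatter(embedding[:, 0], embedding[:, 1], s=5, alpha=0.5)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("2-D PCA embedding of the test set")
plt.show()
```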
Experimental Settings
Dataset: In our paper, we conducted an Active Learning regression experiment with a real-world dataset: CASP. It originally contains 45,730 instances, and we randomly selected 9,730 (21%) of them as the test set.
Machine Learning Model: the regressor is a purely random tree model, specifically Mondrian Trees, from a paper at NeurIPS 2019.
Active Learning Process: we started with an empty training set, queried a batch of 500 samples at each iteration, and stopped after 15 batches (7,500 samples in total). Three basic querying strategies were considered: a tree-based active algorithm (al), uncertainty sampling (uc), and random sampling (rn). A rough sketch of this querying loop follows below.
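The sketch below is hedged and is not the exact implementation from our paper: it substitutes a standard scikit-learn random forest for the Mondrian tree regressor, uses per-tree prediction variance as the uncertainty score for "uc", and assumes X_pool, y_pool, X_test, y_test come from the CASP split described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

BATCH, N_BATCHES = 500, 15
rng = np.random.RandomState(0)

labeled = np.array([], dtype=int)
unlabeled = np.arange(len(X_pool))
mse_per_batch, predictions_per_batch = [], []

for t in range(N_BATCHES):
    if t == 0:
        # The training set starts empty, so the first batch must be drawn at random.
        query = rng.choice(unlabeled, size=BATCH, replace=False)
    else:
        # "uc"-style querying: variance of per-tree predictions as an uncertainty proxy.
        per_tree = np.stack([tree.predict(X_pool[unlabeled]) for tree in model.estimators_])
        query = unlabeled[np.argsort(per_tree.var(axis=0))[-BATCH:]]

    labeled = np.concatenate([labeled, query])
    unlabeled = np.setdiff1d(unlabeled, query)

    model = RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=-1)
    model.fit(X_pool[labeled], y_pool[labeled])

    mse_per_batch.append(mean_squared_error(y_test, model.predict(X_test)))
    predictions_per_batch.append(model.predict(X_test))   # reused for the mesh-grid plots below
```

Plotting mse_per_batch against the number of labeled samples reproduces the usual MSE-query curve.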

The PCA plot shows obvious clustering patterns among the test samples. More importantly, users can select a subset (group) of interesting test data points directly from the PCA plot.
The MSE-query plot on the left shows an ideal decreasing trend. The active sampling strategy ("al") performs best, but uncertainty sampling becomes competitive once more than 3,500 samples have been labeled.
Interactive Visualization Panel
In our case, we would like to answer the question: is the active sampling strategy as effective as uncertainty sampling? This question can be answered directly by examining how prediction values change at each querying iteration, which motivated us to build an interactive visualization tool.
We arrange the prediction values into a 2-D mesh-grid plot: the x-axis represents the querying process (the unit can be single queries or batches); the y-axis represents the indices of the selected test samples; and each pixel shows a prediction difference according to one of three criteria, sketched in code after the list:
1. Current model vs. original model: if the model is progressively improving and learning from the queried instances, we expect its predictions to differ more and more from the original model's. This is reflected by increasingly colorful mesh-grid graphs.
2. Current model vs. previous model: for some strategies, we expect later queried samples to have less effect on the model. This is reflected by lighter colors on the plot.
3. Current model vs. ground truth: as more samples are queried, we expect the model to make predictions closer to the ground truth. This is also reflected by a lighter plot.
(These criteria have rigorous definitions. Please view our paper if you are interested in learning more!)
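As a minimal sketch of how such a mesh-grid panel could be assembled, assume predictions_per_batch holds the test-set predictions after each query batch (as in the loop above) and selected_idx holds the indices of the test points brushed on the PCA plot; the exact difference definitions in the paper are more rigorous than the raw signed differences used here.

```python
import numpy as np
import matplotlib.pyplot as plt

preds = np.stack(predictions_per_batch)         # shape: (n_batches, n_test_samples)
sub = preds[:, selected_idx]                    # keep only the selected test samples

diff_vs_original = (sub - sub[0]).T             # criterion 1: current vs. original model
diff_vs_previous = np.diff(sub, axis=0).T       # criterion 2: current vs. previous model
diff_vs_truth = (sub - y_test[selected_idx]).T  # criterion 3: current vs. ground truth

fig, axes = plt.subplots(1, 3, figsize=(13, 4))
for ax, diff, title in zip(axes,
                           [diff_vs_original, diff_vs_previous, diff_vs_truth],
                           ["vs. original model", "vs. previous model", "vs. ground truth"]):
    im = ax.imshow(diff, aspect="auto", cmap="coolwarm")   # x: query batches, y: test samples
    ax.set(title=title, xlabel="query batch", ylabel="selected test sample")
    fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()
```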

Suppose we select a group of small-value points from the PCA plot (marked in red). A 3 × 3 panel of prediction-change plots is then provided, covering the three strategies ("al", "uc", and "rn") and the three criteria above.
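For illustration, one could mimic this selection programmatically; in the actual tool the group is brushed interactively on the PCA plot, and the 10% threshold below is an arbitrary assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for interactive brushing: take the test points whose
# ground-truth value falls in the lowest 10% as the "small value" group.
threshold = np.quantile(y_test, 0.10)
selected_idx = np.where(y_test <= threshold)[0]

plt.scatter(embedding[:, 0], embedding[:, 1], s=5, c="lightgray")
plt.scatter(embedding[selected_idx, 0], embedding[selected_idx, 1],
            s=10, c="red", label="selected low-value group")
plt.legend()
plt.show()
```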
Observations
Looking at the first prediction-change plots across the three querying algorithms, it is fairly clear that they change the regressor model differently, even though they have similar MSE curves in the previous plot. "al" behaves more regularly than "uc" or "rn", with consistently larger values (18 out of 20 purely red rows) or smaller values (1 out of 20 purely blue rows) compared to the initial model.
In the second prediction-change plot of "uc", we see dark pixels at first and much lighter, nearly white pixels after 5 query batches. This suggests that "uc" tends to change the model dramatically in the early stage and has less effect in later queries.
The third prediction-change plots consistently show red pixels, which intuitively makes sense because the dark points we selected have small ground-truth values near 0: the tree regressor tends to fit this region with predictions larger than the truth.
However, unlike this diverse behavior on the low-value group, the three algorithms show almost uniform performance on the large-value group.

Discussion
This is just a toy example from our initially released paper. We demonstrated how the visualization panel works and provided some interesting insights based on the CASP experiment. So far, we have only run a few empirical experiments; in fact, there are many more questions this tool could help answer, for example:
- Does a certain Active Learning method always work well on some data subgroups?
- Does the tree model learn well from queried samples when fitting large-value points (or, data points from a sparse area)?
……
Although we have already started using our tool on other datasets, more work is needed to answer these questions rigorously. Additionally, more in-depth analyses of the properties of different Active Learning strategies are required. These properties are associated with a few essential concepts in Active Learning: informativeness, representativeness, and diversity. In the future, we hope to expand this project and publish more substantial work at major conferences.
Limitations
There are also several issues, some intrinsic and some pointed out by reviewers and users. The main one we noticed is that the usefulness of the visualization tool relies on some clear structure among the test samples in the PCA plot: it is easier to draw conclusions if clear clusters are shown. If the test samples do not show any clear association or cluster groups in the dimension-reduction plot, the Active Learning task itself becomes harder because there are no clear decision boundaries, which in turn makes our plots less distinguishable.
Summary
Nevertheless, we are pleased to see that Human-Centered Artificial Intelligence and Machine Learning are receiving increased attention from researchers. We hope that this visualization tool is approachable, usable, and helpful for users to apply alongside traditional accuracy/MSE plots. Not only are its definitions straightforward, but it is also flexible enough to be adapted to almost all classification and regression problems.
Acknowledgment
I want to thank my undergraduate honours supervisor, Dr. Martin Ester, for his guidance in completing this part-time project. I am also greatly thankful to Jialin Lu and Oliver Snow for their consistent help during the past eight months at Simon Fraser University.
Sources
A GitHub demo of this tool is available at: https://github.com/AndyWangSFU/Active_Learning_Visualization_Demo. If you like this post, please leave a star; it is greatly appreciated!