
The Metrics of Continual Learning

These three metrics are commonly used

Continual learning is a subfield of Machine Learning that deals with incrementally training neural networks on continually arriving data. Crucially, the data cannot be stored in its entirety, and oftentimes no samples at all can be carried over from old tasks. Because the network is only optimized on the currently available data, it overwrites the parameters learned on previous tasks. In overwriting them, old knowledge usually is destroyed, i.e., forgotten.

Photo by Reid Zura on Unsplash

To benchmark continual learning methods and quantify catastrophic forgetting, several evaluation metrics are used in continual learning research. In this article, I’ll detail the three most commonly used ones. While I’ll be using classification as an example, the metrics equally apply to other problems, e.g. regression. In case you are new to continual learning, I recommend reading my previous two articles first to get a deeper understanding of the topic. As I’ve done before, I’ll provide reading recommendations to explore the topic further at the end of the article.

Average Accuracy

The first commonly used metric is average accuracy, often abbreviated as ACC. As the name indicates, it measures the (test-set) accuracy of each task and then computes the average over the task-specific accuracies. Formally, it is defined as [1]

ACC = \frac{1}{k} \sum_{j=1}^{k} a_{k,j}

where k is the current task and a_{k,j} denotes the test accuracy on the previous task j (j ≤ k) after the training on task k.

The following example should make this clearer: assume we are training a network on three tasks, 1, 2, 3. We first train on task 1 and then test on all previous tasks. Because there are none, we only test on task 1. Next, we train on the data of task 2 and evaluate on all old tasks; now, task 1 is considered a previous task, so we test our network on it. Finally, after training on task 3, we evaluate on tasks 1 to 3. In this last case, the equation above becomes the following sum:

ACC = \frac{1}{3} \left( a_{3,1} + a_{3,2} + a_{3,3} \right)
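
In code, this bookkeeping reduces to storing all accuracies in a matrix and averaging over one row. Below is a minimal sketch; the accuracy values and the helper name average_accuracy are made up for illustration:

```python
import numpy as np

# Hypothetical accuracy matrix: acc[k, j] holds the test accuracy on
# task j + 1 after training on task k + 1 (the upper triangle is unused).
acc = np.array([
    [0.95, 0.00, 0.00],  # after training on task 1
    [0.80, 0.93, 0.00],  # after training on task 2
    [0.70, 0.85, 0.92],  # after training on task 3
])

def average_accuracy(acc: np.ndarray, k: int) -> float:
    """ACC after training on task k: the mean of a_{k,j} for j = 1..k."""
    return float(acc[k - 1, :k].mean())

print(average_accuracy(acc, k=3))  # (0.70 + 0.85 + 0.92) / 3 ≈ 0.823
```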

Backward Transfer

Whereas ACC is used to measure overall performance, backward transfer (BWT) is concerned with the performance change caused by continual learning, i.e., catastrophic forgetting. It measures the difference in test-set performance between directly training on a task and after training on subsequent tasks. Formally, it is defined as [1]

BWT = \frac{1}{k-1} \sum_{j=1}^{k-1} \left( a_{k,j} - a_{j,j} \right)

where the bracketed term denotes the performance difference: a_{j,j} is the accuracy on task j directly after training on it, and a_{k,j} is the accuracy on task j after the later task k. In most cases and most research, this metric will be negative. Negative values indicate forgetting: the original performance on a task was better than after subsequent tasks were trained.

The following example should make it clearer: say we are training on task 1 and directly evaluate on its test-set afterwards, reaching 90% accuracy. After training on subsequent tasks, we again evaluate our continually trained network on task 1’s test-set, this time reaching only 70% accuracy. Computing BWT is now simply 70% − 90%, equaling −20 percentage points. Here, continually training our network led to catastrophic forgetting.

Note that a BWT of 0, meaning no performance difference, is possible. However, positive BWT, indicating retrospective improvement on old tasks (say, from 90% to 91%), is extremely challenging to achieve, especially without any access to the old data points.
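
Reusing the hypothetical accuracy matrix from the ACC sketch above, BWT compares the last row against the diagonal. Again a minimal sketch with a made-up helper name:

```python
# Continuing from the ACC sketch above (needs numpy and the acc matrix).
def backward_transfer(acc: np.ndarray, k: int) -> float:
    """BWT after training on task k: the average change on tasks 1..k-1
    relative to their accuracy directly after they were trained."""
    return float(np.mean([acc[k - 1, j] - acc[j, j] for j in range(k - 1)]))

print(backward_transfer(acc, k=3))
# ((0.70 - 0.95) + (0.85 - 0.93)) / 2 = -0.165 -> forgetting
```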

Forward Transfer

Both previously introduced metrics measure performance within a continual setup. To quantify whether the continual training itself is beneficial for learning new tasks, one can use the forward transfer measure, FWT. Formally, FWT is defined as [1]

FWT = \frac{1}{k-1} \sum_{j=2}^{k} \left( a_{j,j} - \hat{a}_j \right)

where \hat{a}_j is the accuracy of a reference model trained solely on task j. Negative FWT values indicate that the sequential training on previous tasks has not led to better-than-from-scratch performance.

Example: after training on some previous tasks and then on task j, we reach a test accuracy of 90% on task j. A separate, randomly initialized model trained solely on task j’s data reaches 80% accuracy. The forward transfer would then be +10 percentage points, indicating that the continual training has been beneficial. Generally, forward transfer is used sparingly in the literature; ACC and BWT are the main metrics.
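
Following the definition above, the sketch below compares the diagonal of the continual accuracy matrix against from-scratch reference accuracies; the ref values and the helper name are again made up for illustration:

```python
# Continuing from the sketches above. Hypothetical accuracies of reference
# models trained from scratch, one per task: ref[j] corresponds to task j + 1.
ref = np.array([0.95, 0.90, 0.80])

def forward_transfer(acc: np.ndarray, ref: np.ndarray, k: int) -> float:
    """FWT after training on task k: the average gap between the continually
    trained model and a from-scratch reference model on tasks 2..k."""
    return float(np.mean([acc[j, j] - ref[j] for j in range(1, k)]))

print(forward_transfer(acc, ref, k=3))
# ((0.93 - 0.90) + (0.92 - 0.80)) / 2 = 0.075 -> positive transfer
```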

Conclusion

In this article, I described the three commonly used metrics in continual learning. Average accuracy (ACC) measures the test performance, backward transfer (BWT) measures catastrophic forgetting, and forward transfer (FWT) evaluates the effectiveness of continual training compared to task-specific training from scratch. ACC and BWT are commonly used in the literature, while FWT is only used sparingly. Throughout, I focused on classification as the underlying problem, but the metrics are also applicable to regression or object detection.

To explore the topic further, I recommend the following papers:

  1. "Gradient Episodic Memory for Continual Learning
  2. "Forget-free Continual Learning with Winning Subnetworks"
  3. "Three scenarios for continual learning"

References

[1] Lopez-Paz, David, and Marc’Aurelio Ranzato. "Gradient episodic memory for continual learning." Advances in Neural Information Processing Systems 30 (2017).

