research-article
Open Access

Continual Recognition with Adaptive Memory Update

Published: 24 February 2023

Abstract

Class incremental continual learning aims to improve the ability of modern classification models to continually recognize new classes without forgetting previous ones. Prior art in the field has largely relied on a replay buffer. In this article, we start from the observation that existing replay-based methods fail when the stored exemplars are not hard enough to yield a good decision boundary between a previously learned class and a new class. To address this situation, we propose, for the first time, a method that remedies the model after forgetting has occurred. In the proposed method, a set of exemplars is preserved as a working memory, which helps to recognize new classes. When the working memory is insufficient to distinguish between new classes, more discriminative samples are adaptively swapped in from a long-term memory, which is built up during the earlier training process. Our continual recognition model with adaptive memory update is capable of overcoming the problem of catastrophic forgetting when various new classes arrive in sequence, especially for similar but different classes. Extensive experiments on different real-world datasets demonstrate that the proposed model is superior to existing state-of-the-art algorithms. Moreover, our model can be used as a general plugin for any replay-based continual learning algorithm to further improve its performance.

1 INTRODUCTION

The ultimate goal of Artificial Intelligence, particularly machine learning, is to empower machines with the abilities of humans. This work focuses on imitating humans’ ability to continuously learn from experience—that is, lifelong learning. Humans can utilize the previously learned skills to accomplish new tasks and memorize the acquired knowledge as the basis for future learning without forgetting previous skills. However, current machine learning algorithms perform far less competitively than humans in this aspect, which has prompted fast-growing research on continual recognition. Continual learning [8] aims to learn several tasks sequentially while maintaining good performances on all the tasks learned so far, which is quite challenging. This article focuses on class incremental continual recognition, where each task is classification on several classes, and task ID is unavailable during inference.

Some continual learning methods merge the data from new tasks and existing tasks and then train the model on them, which is quite time consuming in most cases. Others fine-tune the trained model on the new task, which suffers from the fact that knowledge obtained from old tasks will be overwritten upon training on a new task, resulting in poor performance on old tasks. This phenomenon is referred to as catastrophic forgetting. Existing works to solve the catastrophic forgetting problem in continual learning can be divided into three categories: (i) parameter isolation, dedicating different model parameters to each task, which is restricted to the task incremental setting; (ii) regularization, adding certain restrictions during model updates to prevent the model from “forgetting” the learned knowledge of old tasks, which, although it can mitigate the catastrophic forgetting problem to some extent, suffers from a performance drop as the number of tasks continuously increases; and (iii) replay, partially memorizing data from old tasks as exemplars and mixing them with the new data for training, which has the best performance in the existing literature but still inevitably “forgets” knowledge obtained previously. Since replay methods achieve the state of the art, we choose to design our own model based on replay.

Most of the preceding models focus on the training stage and design different methods to prevent forgetting. However, forgetting always happens, even for humans. The Ebbinghaus forgetting curve [26] shows that learners will forget about 90% of what they have learned within the first month. Because forgetting is inevitable, how to fix the model after forgetting has happened needs to be considered. To the best of our knowledge, we are the first to remedy a machine learning model after forgetting has happened. As Figure 1 shows, one major reason for forgetting in replay methods is that the stored exemplars are not sufficient to help discriminate between the incoming classes and previously learned classes. As no information is provided about future classes, we can only tackle this issue after we meet the new classes. Inspired by the baby who is born with the ability to learn and recognize continually [35], we assume that most learned samples are stored in long-term memory. When forgetting or interference happens, they are retrieved efficiently to solve the issue.

Fig. 1. Motivation of adaptive memory update. The triangles represent the samples from a previously learned class (alpaca), and the rectangles represent a new class (sheep). The dashed line is the learned decision boundary. (a) The stored exemplars (dark triangles) are not hard enough to obtain a good decision boundary, and thus some alpacas will be misclassified as sheep. (b) After some exemplars are replaced with samples more similar to sheep (harder exemplars), we can obtain a more discriminative decision boundary.

Based on the preceding observation and assumption, we propose a continual learning method with adaptive memory update for class incremental continual learning. The proposed method only preserves a small number of exemplars in working memory for each class learned. By conducting half-and-half validation on exemplars, our method can adaptively determine which classes are experiencing severe forgetting or interference. By exchanging data points between the current exemplar set and long-term memory, our method can adaptively maintain a group of exemplars that best helps to memorize the forgotten classes. Meanwhile, the computational complexity of our model for a single task is not increased. Moreover, our proposed method can serve as a general plugin for any replay-based approach to further improve its performance. We report our experimental results on comparisons between our method and several baselines on CIFAR-100 and ImageNet-Subset, showing that our method can significantly outperform state-of-the-art methods. Figure 2 shows the process of our proposed continual learning method.

Fig. 2. Illustration of the learning process. After learning a task, our model conducts an adaptive forgetting check for each learned class. If the model performs well and no classes experience severe forgetting, the model can continue to learn the next task (red dashed line). If the model performs poorly and some classes are forgotten, the model needs to conduct a memory update on the forgotten classes before learning the next task (gray dashed line).

Our contributions are summarized as follows:

  • We propose using adaptive memory update to remedy the machine learning model after forgetting, which is, to the best of our knowledge, the first work with such a mechanism.

  • We propose a novel continual recognition model, which is capable of adaptively discovering classes being severely forgotten or interfered with, and then conducting a memory update on these classes by exchanging data points between the current exemplar set and long-term memory.

  • We conduct extensive experiments on several real-world datasets to validate our proposed method’s superiority over existing baseline approaches.

The rest of the article is organized as follows. We discuss related works in Section 2 and explain the details of our proposed model in Section 3. In Sections 4 and 5, we conduct extensive experiments on various real-world datasets to show the proposed model’s advantages against existing state-of-the-art approaches. We finally conclude the article and point out future research directions in Section 6.

2 RELATED WORK

Continual learning, also known as lifelong learning, is continuously acquiring, modifying, and transferring knowledge and skills. It remains a considerable challenge for machine learning and neural networks. In recent years, much work [4, 21, 27, 30, 36] has been proposed to address the catastrophic forgetting problem. The existing methodologies for continual learning can be divided into three categories: parameter isolation-based methods [1, 15, 22, 24], regularization-based methods [10, 18, 23], and replay-based methods [5, 7, 14, 31, 39]. Our work is based on replay.

2.1 Parameter Isolation Based Methods

Parameter isolation based methods [22, 24] dedicate different subsets of the model parameters to each task. When a new task arrives, this kind of method trains a new branch of network with parameters from the previous task (or sometimes entirely new). After the training of this task, the branch is determined to be used for the prediction of this task and can be reused in a layer-wise or neuron-wise way for future tasks. Some methods select a new branch from the original architecture [24], whereas others increase the model capacity to grow a new branch [22].

Most of these works require a task identifier to activate the corresponding branch during prediction, and it impairs the performance of the model significantly. The catastrophic forgetting problem still exists when the model capacity is limited.

2.2 Regularization-Based Methods

Regularization-based methods limit how far the parameters can move from values that were optimal for previous tasks. This is usually implemented via additional terms in the loss function. Some methods discourage the updating of essential parameters for past tasks [18]. They determine essential parameters in the current task first and then penalize changes to these parameters in future training on new tasks. Others [23] record the output probabilities for each image on the old tasks as soft labels when training on a new task, and force the updated model’s output to stay close to these soft labels.

To some extent, regularization-based methods are a simple way to mitigate the problem of catastrophic forgetting. However, the constraint is still insufficient to counter the accumulation of errors in old tasks, and the drop in performance is inevitable as the number of classes increases.

2.3 Replay-Based Continual Learning

Replay-based continual learning methods, also known as rehearsal methods, need a memory component to store samples from previous tasks. The stored samples are often in their raw format, known as exemplars. These samples are replayed while learning a new task to alleviate forgetting. This memory component plays a role like that of the hippocampus in the complementary learning systems theory [25].

With the stored samples, the replay-based methods perform pretty well in continual learning. Nevertheless, which samples to store remains a challenge. Several strategies [16, 31] have been proposed for selecting the samples.

Recently, some works have tried to use generative models to generate high-quality samples instead of storing them [34, 38]. However, this also leads to a challenge of training the generative model continually.

Most replay-based methods pick samples randomly from the exemplar set while training, which is not optimal. Aljundi et al. [2] use a selective replay technique that retrieves the maximally interfered samples from the exemplar set each time. Shim et al. [33] try to retrieve those exemplars that would be most helpful for learning. Inspired by game theory, they use the Shapley value to measure the extent to which the samples contribute to the learning.

How to store the exemplar set more efficiently is of equal interest. Riemer et al. [32] use a discrete variational autoencoder to compress the stored exemplar set and save storage costs. Caccia et al. [6] use a variational autoencoder with adaptive vector quantization to compress the exemplar set.

2.4 Class Incremental Scenario

There are different settings of continual learning [37]. In this article, we focus on class incremental continual learning, where the model does not have access to the task-ID at inference time and therefore must be able to distinguish between all classes from all tasks. It is a much more difficult scenario.

3 PROPOSED METHOD

In this section, we first formulate the class incremental continual learning problem in Section 3.1, and then we describe our main components in Sections 3.2 through 3.5.

3.1 Problem Formulation

In class incremental continual learning, a model experiences a sequence of classification tasks denoted by \(\mathcal {T} = [(C_1,D_1), (C_2,D_2), \dots , (C_T,D_T)]\), where each task t is represented by a set of classes \(C_t = \lbrace c^1_t,c^2_t,\dots \rbrace\) and training data \(D_t\). T is the total number of tasks and is not known a priori. Each training set \(D_t\) contains a number of input-target pairs \((x_i^t, y_i^t)\), which are independently and identically drawn from an unknown distribution. Here, \(x_i^t\) represents the i-th input example in task t and \(y_i^t \in C_t\) represents its class label. We use \(N_t\) to represent the set of all classes in all tasks up to and including task t: \(N_t = \mathop {\cup }\nolimits _{i=1}^t C_i\). Usually, different tasks contain different classes, and hence the model needs to recognize more and more classes during the training phase.
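The formulation above can be illustrated with a minimal sketch that builds a task sequence from a list of class labels and computes \(N_t\). The function names `split_into_tasks` and `seen_classes`, the seed, and the equal-size split are assumptions made for illustration; they are not taken from the authors' code.

```python
import random

def split_into_tasks(classes, classes_per_task, seed=0):
    """Shuffle class labels with a fixed seed and cut them into tasks C_1..C_T."""
    order = list(classes)
    random.Random(seed).shuffle(order)  # random but reproducible class order
    return [order[i:i + classes_per_task]
            for i in range(0, len(order), classes_per_task)]

def seen_classes(tasks, t):
    """N_t: union of the class sets of tasks 1..t (t is 1-indexed)."""
    seen = set()
    for c_i in tasks[:t]:
        seen.update(c_i)
    return seen

# 100 classes split into 10 tasks of 10 classes each.
tasks = split_into_tasks(range(100), classes_per_task=10)
print(len(tasks))                    # 10
print(len(seen_classes(tasks, 3)))   # 30
```

Because the tasks are disjoint here, the size of \(N_t\) grows by one task's worth of classes at every step, which is why the classifier has to keep expanding.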

We denote our model as \(M_\theta : \mathcal {X} \rightarrow \mathcal {Y}\), composed of a feature extractor \(f: \mathcal {X} \rightarrow \mathbb {R}^d\) and a classifier \(g_\mathcal {W} : \mathbb {R}^d \rightarrow \mathcal {Y}\). Here, f can be any convolutional neural network, depending on the complexity of the dataset. The parameters \(\mathcal {W}\) of g are a set of weight vectors \(\lbrace w_1, w_2, \dots , w_k\rbrace\), where k is the number of classes learned so far. When our model finishes training on one task and is ready to learn a new task, f and \(w_i\) are temporarily saved as \(\hat{f}\) and \(\hat{w_i}\) for the distillation loss. Meanwhile, the classifier parameters \(\mathcal {W}\) expand: several new weight vectors are added, corresponding to the classes in the new task.
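As a rough illustration of how the classifier grows, the sketch below keeps one weight vector per learned class, takes a frozen copy before each new task, and then appends new vectors. The class name `ExpandableClassifier`, the Gaussian initialization, and the feature dimension are all assumptions; the article does not specify them.

```python
import random

class ExpandableClassifier:
    """Holds one weight vector w_j per class learned so far (hypothetical helper)."""

    def __init__(self, feature_dim, seed=0):
        self.feature_dim = feature_dim
        self.weights = []                 # W = {w_1, ..., w_k}
        self._rng = random.Random(seed)

    def expand(self, num_new_classes):
        """Append one new weight vector per class of the incoming task."""
        for _ in range(num_new_classes):
            # Small random initialization; an assumption, not from the article.
            self.weights.append(
                [self._rng.gauss(0.0, 0.01) for _ in range(self.feature_dim)]
            )

    def snapshot(self):
        """Deep copy of the current weights, playing the role of the saved w-hat."""
        return [list(w) for w in self.weights]

clf = ExpandableClassifier(feature_dim=4)
clf.expand(10)                 # first task: 10 classes
old_weights = clf.snapshot()   # saved before training on the next task
clf.expand(10)                 # second task adds 10 more classes
print(len(old_weights), len(clf.weights))  # 10 20
```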

The model learns the tasks sequentially. It is worth noting that the task to which a sample belongs is not provided in the inference phase.

3.2 Long-Term Memory

When forgetting happens and we want to do something to remedy it, the first thing to determine is what has been forgotten. There are usually two situations for humans. One is that humans could tell what they have forgotten through their own perception. The other one is that humans become confused or make mistakes. The latter situation is quite similar to the catastrophic forgetting of our machine learning model. After determining what has been forgotten or interfered with, humans could review them from some learning resources like the library or the Internet, or even their memory if they have a good one. Since there are no such resources for our machine learning model, we choose to store the learning material, i.e., the training data, into a long-term memory component.

Someone may argue that most existing works follow a rule that the training data for previously learned tasks is unavailable. There are two main reasons for this rule: one is the privacy issue, and the other is storage constraints. However, Knoblauch et al. [19] claim that the optimal continual learning model needs perfect memory. Bartol et al. [3] estimate that the storage capacity per synapse is roughly 4.7 bits of information, and this implies that the total memory capacity of the brain, with its many trillions of synapses, is much larger than the size of our continual learning model.

These findings make us wonder whether this rule is still necessary in today’s world of reliable encryption and anonymity technology, and low storage costs. In this article, we relax the rule in a way that the training data is stored in a long-term memory component and can be accessed depending on demand.

3.3 Exemplar Management

Our model maintains a collection of exemplars \(\mathcal {E} = \lbrace e_1, e_2, \dots , e_k\rbrace\) during training as working memory. When the model finishes training for a task, a few exemplars of each new class will be selected and added to the exemplar set. The data from the collection will be involved in the training on future tasks later.

Some methods use an exemplar set of a fixed size. Every time new exemplars are added, the number of old class exemplars needs to be reduced to meet the fixed size limit. Since the number of tasks is unknown, it is difficult to determine a proper size for the exemplar set at the very beginning. Our model maintains the exemplar set without the constraint on fixed memory size and uses a constant number of exemplars for each class instead. In this way, the exemplar set’s size will increase linearly as the number of classes increases. Since the number of exemplars for each class is a small constant, the problem of increasing size is affordable compared to the improvement brought by the exemplar set.
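To make the trade-off concrete, the following sketch compares the two policies. The numbers 2,000 (fixed capacity) and 50 (exemplars per class) are the values used in the experiments of this article; the function names and the use of floor division for the fixed-capacity case are assumptions.

```python
def per_class_budget_fixed(total_capacity, num_classes):
    """Exemplars available per class when the whole set has a fixed capacity."""
    return total_capacity // num_classes

def total_size_constant(per_class, num_classes):
    """Total exemplar-set size when every class keeps a constant number."""
    return per_class * num_classes

for k in (10, 50, 100):
    print(k, per_class_budget_fixed(2000, k), total_size_constant(50, k))
# 10 200 500
# 50 40 2500
# 100 20 5000
```

With a fixed capacity, the per-class budget shrinks as classes accumulate; with a constant per-class budget, the total grows linearly with the number of classes.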

We use herding selection to select exemplars. Herding selection [31] is a greedy algorithm for selecting new exemplars for one class. This algorithm iteratively selects exemplars from the training data so that the mean feature vector of the exemplars stays close to the mean feature vector of the training data, until the target number of exemplars is reached.
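A minimal sketch of greedy herding selection for one class is given below, assuming the features have already been extracted as plain lists of floats. The function name `herding_selection` is hypothetical, and details such as feature normalization, which some implementations apply, are omitted.

```python
def herding_selection(features, m):
    """Greedily pick up to m indices whose running mean tracks the class mean."""
    n, d = len(features), len(features[0])
    class_mean = [sum(f[k] for f in features) / n for k in range(d)]
    selected = []
    running_sum = [0.0] * d
    for step in range(1, min(m, n) + 1):
        best_idx, best_dist = None, float("inf")
        for i, f in enumerate(features):
            if i in selected:
                continue
            # Squared distance between the class mean and the mean of the
            # exemplars chosen so far plus this candidate.
            dist = sum((class_mean[k] - (running_sum[k] + f[k]) / step) ** 2
                       for k in range(d))
            if dist < best_dist:
                best_idx, best_dist = i, dist
        selected.append(best_idx)
        running_sum = [running_sum[k] + features[best_idx][k] for k in range(d)]
    return selected

# Toy one-dimensional features with class mean 3.25.
toy_features = [[0.0], [1.0], [2.0], [10.0]]
print(herding_selection(toy_features, 2))  # [2, 1]
```

In the toy example the first pick is the sample closest to the mean, and the second pick is the one that keeps the mean of the pair closest to 3.25.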

3.4 Adaptive Memory Update

To conduct a memory update, there are two fundamental problems to be solved. One is when to update, and the other is how to update.

As we will see later, the drop in model performance varies significantly from class to class, telling us that the probability of a class being forgotten is quite different. Here we let the unit for memory update be one single class rather than all classes in one task. In the following discussion, we are going to focus on a particular class c.

When to Update. Our model should update when it finds itself unable to distinguish a class from other classes, that is, when it performs poorly on the data from this class. However, it is difficult for a model to evaluate its own performance autonomously. We tackle this issue by conducting a half-and-half validation procedure on our exemplar set.

Before training a new task, the data from each class’s exemplar set is randomly divided into two parts. One part is then used as training data to participate in model training together with the data from the new task. After the training is completed, our model is tested on the other part. We call this procedure half-and-half validation.

If our model performs poorly on the other part for a particular class, our model will update the exemplar set for this class. The criterion we adopt for poor performance is whether the recall score is lower than a threshold \(\lambda\). Any other reasonable criterion can also be used.
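The check described above can be sketched as follows, assuming a `predict` callable that maps one sample to a class label. The helper names `split_half` and `classes_needing_update`, the toy samples, and the toy predictor are hypothetical; only the recall criterion and the threshold of 0.7 come from the article.

```python
import random

def split_half(exemplars, rng):
    """Randomly divide one class's exemplars into a training half and a held-out half."""
    shuffled = list(exemplars)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def classes_needing_update(held_out_by_class, predict, threshold=0.7):
    """Return the classes whose recall on the held-out half is below the threshold."""
    flagged = []
    for c, samples in held_out_by_class.items():
        if not samples:
            continue
        recall = sum(1 for x in samples if predict(x) == c) / len(samples)
        if recall < threshold:
            flagged.append(c)
    return flagged

rng = random.Random(0)
train_half, held_out_half = split_half(range(50), rng)

# Toy model: half of the held-out alpaca samples are mistaken for sheep.
held_out = {"alpaca": [1, 2, 3, 4], "sheep": [5, 6, 7, 8]}
toy_predict = lambda x: "sheep" if x >= 3 else "alpaca"
print(classes_needing_update(held_out, toy_predict))  # ['alpaca']
```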

Based on the preceding mechanism, our model could determine whether a class c needs a memory update.

How to Update. After training a task, we assume that all the training data is not discarded but instead is stored in a long-term memory component. A simple and brute-force update approach is to take all the data from class c into the training procedure. This method, however, requires a lot of access operations on long-term memory, and the training complexity is greatly increased, especially when there are many classes that need an update.

To address this difficulty, we propose a simple and elegant update approach. We replace the exemplar set with the same number of data samples that are randomly selected from long-term memory. In this way, the long-term memory component only needs to support random access operation by class. Besides, the training complexity with the memory update is not increased at all. Figure 3 illustrates the process of the adaptive memory update.
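A minimal sketch of this random replacement is shown below, with both memories modeled as dictionaries from class label to a list of samples. The function name `update_exemplars` is hypothetical, and returning the replaced exemplars to long-term memory is an assumption based on the description of the update as an exchange; the sketch also assumes long-term memory holds at least as many samples of the class as the exemplar set.

```python
import random

def update_exemplars(exemplars, long_term, flagged_classes, rng):
    """Swap each flagged class's exemplars for random long-term-memory samples."""
    for c in flagged_classes:
        k = len(exemplars[c])
        pool = long_term[c]
        picked = set(rng.sample(range(len(pool)), k))  # random access by class
        new_exemplars = [pool[i] for i in sorted(picked)]
        remaining = [pool[i] for i in range(len(pool)) if i not in picked]
        # Assumption: the replaced exemplars go back into long-term memory.
        long_term[c] = remaining + exemplars[c]
        exemplars[c] = new_exemplars

exemplars = {"alpaca": ["a0", "a1"], "sheep": ["s0", "s1"]}
long_term = {"alpaca": ["a2", "a3", "a4", "a5"], "sheep": ["s2", "s3"]}
update_exemplars(exemplars, long_term, ["alpaca"], random.Random(0))
print(len(exemplars["alpaca"]), exemplars["sheep"])  # 2 ['s0', 's1']
```

The exemplar set keeps the same size per class, so the amount of replayed data per task is unchanged by the update.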

Fig. 3. Adaptive memory update (AMU). After training on task i, some samples will be stored in the exemplar set (dashed line), whereas others will be stored in long-term memory (dotted line). In the memory update stage, the same number of samples as in the exemplar set is taken from long-term memory to replace the original exemplars.

3.5 Loss Function

Our loss function contains two terms: distillation loss \(L_{distill}\) and classification loss \(L_{clf}\).

Distillation Loss. Distillation loss was originally proposed to transfer knowledge between different neural networks [13]. Here we use it to maintain the output of our model on the old tasks while training the model on a new task. When our model is training on task t, the distillation loss \(\mathcal {L}_{distill}\) is computed as follows: \(\begin{equation*} \mathcal {L}_{distill} = -\sum _{(x_i, y_i) \in D_t} \sum _{j \in N_t} \left[ q_{ij} \log p_{ij} + (1-q_{ij}) \log (1-p_{ij}) \right] \end{equation*}\) The target \(q_{ij}\) is computed as: \(\begin{equation*} q_{ij} = {\left\lbrace \begin{array}{ll} h_j(x_i) & j \in N_{t-1} \\ \delta _{y_i=j} & j \in N_{t} \setminus N_{t-1} \end{array}\right.} \end{equation*}\) where \(\delta _{y_i=j}\) equals 1 if \(y_i = j\) and 0 otherwise. Here, \(p_{ij}\) represents the probability of the i-th sample belonging to class j, and \(h_j(x_i)\) is the output from the old model before training on task t: \(\begin{eqnarray*} p_{ij} = \mathrm {sigmoid}\left(\frac{w_j^\top }{\Vert w_j\Vert _2} f(x_i)\right)\\ h_j(x_i) = \mathrm {sigmoid}\left(\frac{\hat{w}_j^\top }{\Vert \hat{w}_j\Vert _2} \hat{f}(x_i)\right) \end{eqnarray*}\)

Here we adopt an \(\ell _2\)-normalized form of the weight vector to produce logits, which is useful and practical for mitigating the class imbalance issue in continual learning [14].
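The distillation term for a single sample can be sketched in plain Python as follows. The sketch assumes classes are indexed from 0 with the old classes first, that the target for a new class is the one-hot indicator of the label, and that features are already extracted; the function names and the small `eps` added inside the logarithm for numerical safety are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def normalized_probs(weights, feature):
    """p_j = sigmoid(w_j . f / ||w_j||_2) for every class weight vector w_j."""
    probs = []
    for w in weights:
        norm = math.sqrt(sum(v * v for v in w))
        probs.append(sigmoid(sum(a * b for a, b in zip(w, feature)) / norm))
    return probs

def bce(target, prob, eps=1e-12):
    """Binary cross-entropy for one (target, probability) pair."""
    return -(target * math.log(prob + eps)
             + (1.0 - target) * math.log(1.0 - prob + eps))

def distillation_loss(p, h, label, num_old):
    """Loss for one sample: p over all classes so far, h from the old model."""
    total = 0.0
    for j, p_j in enumerate(p):
        # Old classes use the old model's output as a soft target;
        # new classes use the one-hot indicator of the label.
        q_j = h[j] if j < num_old else (1.0 if j == label else 0.0)
        total += bce(q_j, p_j)
    return total

# One old class and one new class; the sample belongs to the new class (index 1).
loss = distillation_loss(p=[0.5, 0.5], h=[0.5], label=1, num_old=1)
print(round(loss, 4))  # 1.3863
```

In the toy call both terms equal ln 2, so the loss is 2 ln 2, roughly 1.3863.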

Classification Loss. In the traditional multi-class classification task, cross-entropy loss with softmax activation is the most commonly used loss. When considering distillation loss, we find that binary cross-entropy with sigmoid is a better choice than cross-entropy with softmax, because the soft label in distillation loss is more analogous to a multi-label target than a multi-class one. As we can see, the form of distillation loss is binary cross-entropy with sigmoid activation. For consistency, the binary cross-entropy loss is directly performed on the exemplar set: \(\begin{equation*} \mathcal {L}_{clf} = -\sum _{(x_i, y_i) \in \mathcal {E}} \sum _{j \in N_{t}} \left[ \delta _{y_i=j} \log p_{ij} + \delta _{y_i\ne j} \log (1-p_{ij}) \right] \end{equation*}\) where \(\delta\) is an indicator function. Finally, our loss function is calculated as: \(\begin{equation*} \mathcal {L} = \mathcal {L}_{distill} + \mathcal {L}_{clf} \end{equation*}\)
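The classification term and the final sum can be sketched in the same style, assuming the predicted probabilities for each exemplar are already available as a list indexed by class. The function names, the `eps` constant, and the placeholder value for the distillation term are illustrative assumptions.

```python
import math

def bce(target, prob, eps=1e-12):
    """Binary cross-entropy for one (target, probability) pair."""
    return -(target * math.log(prob + eps)
             + (1.0 - target) * math.log(1.0 - prob + eps))

def classification_loss(exemplar_batch, num_classes):
    """Sum of binary cross-entropy over exemplars and over all classes so far.

    exemplar_batch: list of (probs, label) pairs, where probs[j] is p_ij.
    """
    total = 0.0
    for probs, label in exemplar_batch:
        for j in range(num_classes):
            target = 1.0 if j == label else 0.0   # indicator delta_{y_i = j}
            total += bce(target, probs[j])
    return total

def total_loss(distill, clf):
    """Unweighted sum of the two terms."""
    return distill + clf

batch = [([0.5, 0.5], 0)]                 # one exemplar of class 0, two classes
clf_term = classification_loss(batch, num_classes=2)
print(round(clf_term, 4))                 # 1.3863
print(round(total_loss(1.0, clf_term), 4))  # 2.3863 (1.0 is a placeholder distillation value)
```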

The overall incremental training process with the proposed method is presented in Algorithm 1.

4 EXPERIMENTS

4.1 Datasets

We compare the proposed method with several baselines on two widely used image datasets for class incremental continual learning: CIFAR-100 and ImageNet-Subset.

CIFAR-100 [20] consists of 60,000 samples of 32 \(\times\) 32 color images in 100 classes, with 600 images per class. The official training set is used as our training data, and the rest is for testing. In this way, each class has 500 training and 100 test samples.

ImageNet-Subset is a subset of ImageNet [9] with only 100 classes, randomly sampled from the original 1,000. We also use the official split of training and test data, where each class has about 1,300 training and 50 test samples of 224 \(\times\) 224 color images.

4.2 Baseline Methods

For a fair comparison, we choose two state-of-the-art methods that use exemplars as baselines. iCaRL [31] is a widely used baseline that uses a distillation loss and a nearest-mean-of-exemplars classification strategy. BiC [39] aims to solve the imbalance problem and proposes a bias correction layer. In addition, we report the performance of the Base method, which only uses exemplar sets and cross-entropy loss. More specifically, for each method we report the results of both CNN predictions and nearest-mean-of-exemplars classification, denoted by the suffixes CNN and NME, respectively. Some papers use the name NCM (nearest-class-mean) instead of NME.
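For readers unfamiliar with NME inference, the sketch below assigns a sample to the class whose exemplar mean is closest in feature space. It assumes pre-extracted features stored as lists of floats and uses squared Euclidean distance; the function names and toy features are hypothetical, and feature normalization, which some implementations apply, is omitted.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def nme_predict(feature, exemplar_features):
    """Return the class whose mean exemplar feature is nearest to `feature`."""
    best_class, best_dist = None, float("inf")
    for c, feats in exemplar_features.items():
        mu = mean_vector(feats)
        dist = sum((a - b) ** 2 for a, b in zip(feature, mu))
        if dist < best_dist:
            best_class, best_dist = c, dist
    return best_class

toy_exemplars = {
    "alpaca": [[0.0, 0.0], [2.0, 0.0]],      # mean (1, 0)
    "sheep": [[10.0, 10.0], [12.0, 10.0]],   # mean (11, 10)
}
print(nme_predict([2.0, 1.0], toy_exemplars))  # alpaca
print(nme_predict([9.0, 9.0], toy_exemplars))  # sheep
```

Since NME depends directly on the exemplar means, the choice of exemplars affects it more directly than it affects CNN inference.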

4.3 Protocol

Some work starts with a network trained on a large number of classes and then learns several classes per task incrementally. This setting might give an added advantage to scaling/bias correction methods [29]. So in our article, we divide the classes into several tasks of equal size.

For CIFAR-100, we preserve 50 exemplars for each class, and for ImageNet-Subset, we preserve 100 exemplars per class. The class order plays an important role, and hence we run each experiment five times, each time with a different random but fixed class order. After each task, the resulting classifier is evaluated on the test data of the dataset, considering only classes that have already been trained. The result of the evaluation is a curve of the classification accuracies after each task. The average of these accuracies is also reported as incremental average accuracy.
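The metric can be written down in a few lines. The accuracy values below are made-up numbers used only to show the arithmetic, and the function names are hypothetical.

```python
def incremental_average_accuracy(accuracy_after_each_task):
    """Mean of the accuracies measured after each task in one run."""
    return sum(accuracy_after_each_task) / len(accuracy_after_each_task)

def mean_over_runs(runs):
    """Mean incremental average accuracy over several class orders."""
    return sum(incremental_average_accuracy(r) for r in runs) / len(runs)

# Hypothetical accuracy curves (in %) for two class orders with three tasks each.
run_a = [90.0, 80.0, 70.0]
run_b = [80.0, 70.0, 60.0]
print(incremental_average_accuracy(run_a))  # 80.0
print(mean_over_runs([run_a, run_b]))       # 75.0
```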

4.4 Implementation Details

All compared models are implemented with PyTorch [28] and trained on TITAN-X GPUs. We adopt ResNet [12] as the convolutional network backbone to extract features for all models. The training images are randomly flipped and cropped as data augmentation. The threshold \(\lambda\) we set for the update is 0.7. For CIFAR-100 and ImageNet-Subset, our model is trained with Adam [17] for 70 epochs with a batch size of 128 and an initial learning rate of 0.001, while the baselines are tuned to reach their best performance. It is worth mentioning that, for reproducibility, all results are from our own implementation.

5 RESULTS AND ANALYSIS

We run our experiments on 5, 10, and 20 tasks with 20, 10, and 5 classes per task, respectively. Five random class orders are used, and the mean of the incremental average accuracy is reported.

Figures 4 and 5 show the results on CIFAR-100 using CNN inference and NME inference, respectively, and Figures 6 and 7 show the corresponding results on ImageNet-Subset. One can see that our proposed method significantly beats the other approaches, and the larger the number of tasks, the more ours outperforms the baselines, indicating that our model is better suited to real lifelong learning.

Fig. 4. Evaluation on CIFAR-100 using CNN inference.

Fig. 5. Evaluation on CIFAR-100 using NME inference.

Fig. 6. Evaluation on ImageNet-Subset using CNN inference.

Fig. 7. Evaluation on ImageNet-Subset using NME inference.

General Plugin. Table 3 shows the effect of using the adaptive memory update as a plugin for different models. Our model benefits from this update, and its gain is largest under NME inference (67.00 to 68.22). However, iCaRL-NME’s performance drops slightly after adding the plugin (64.73 to 64.54). This is because iCaRL-NME’s distillation loss uses the model’s previous output as targets for the exemplar set, resulting in a steady accumulation of old classes’ errors and the failure of the plugin. The other methods all improve to some degree with the plugin.

Memory Size. Table 2 shows the effect of different memory sizes. Fixed 2000 means using an exemplar set with a fixed capacity, so the more classes are stored, the fewer exemplars are reserved for each old class. Most methods show a clear increase in incremental average accuracy with a larger exemplar set. Ours consistently performs best under all memory sizes.

Criterion for Update. Figure 8 shows that the drop in model performance varies significantly from class to class, indicating that the probability of forgetting differs between classes. In our proposed model, a class needs an update when the model’s recall for that class is below 0.7. Figure 9 (left) shows how this threshold influences the performance. We conduct experiments on CIFAR-100 for 10 tasks with 50 exemplars for each class. We can see that the curve slopes upward, indicating that as the number of updates increases (a larger threshold increases the probability that a class needs an update), the model’s gain from the update increases. If our access to the long-term memory is fast enough, we can update all the classes we have learned to achieve better performance.

Fig. 8. The recall of different classes after each task. Each small rectangle located at \((i,j)\) in the figure represents the recall of class j after finishing training on task i, and a darker color indicates a lower recall value. A red box located at \((i,j)\) indicates that class j needs an update after task i. The drop in recall varies significantly from class to class as the model is trained on more and more tasks. Some classes, such as class 5, retain high recall from the beginning, whereas the recalls of other classes, such as classes 42 and 52, drop rapidly to low levels after one or two tasks. We observe that the recalls of some classes, including class 20, return to a high level after a memory update, whereas other classes, such as class 52, may require repeated updates. This figure is generated from the experiment with 10 classes per task on CIFAR-100.

Fig. 9. The effect of threshold.

Access Overheads. However, there is always an overhead for accessing the long-term memory. Sometimes this becomes a non-negligible constraint on our method. Figure 9 (right) shows the performance when we limit the access times. The access limit means how many classes we can update after one task. As we can see, the performance first rises rapidly and then slowly as the access limit increases, which shows that it is enough to update a few classes for impressive improvement.
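One way to respect an access limit is sketched below. The article does not state how classes are prioritized when more classes fall below the threshold than the limit allows, so updating the lowest-recall classes first is purely an assumption, as are the function name and the recall values in the example.

```python
def select_classes_under_limit(recall_by_class, threshold, access_limit):
    """Pick at most `access_limit` classes to update after a task.

    Assumption (not from the article): classes with the lowest recall go first.
    """
    flagged = [(recall, c) for c, recall in recall_by_class.items()
               if recall < threshold]
    flagged.sort()  # ascending recall, ties broken by class label
    return [c for _, c in flagged[:access_limit]]

# Hypothetical recall values for four classes after some task.
recalls = {5: 0.95, 20: 0.60, 42: 0.20, 52: 0.10}
print(select_classes_under_limit(recalls, threshold=0.7, access_limit=2))  # [52, 42]
```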

Update Strategy. In this work, we use a simple and elegant memory update strategy: we replace exemplars with randomly selected samples from the long-term memory. In addition to this method, we tried other strategies based on different sample selection methods. The results are shown in Table 1. Random is our proposed strategy. The Nearest strategy selects those samples that are close to the misclassified exemplars in feature space. The Herding strategy uses herding selection to select samples. The Testing strategy selects those samples in the long-term memory that the model misclassifies. To select the most representative samples, the Kmeans strategy selects the cluster centers after running the standard Kmeans clustering algorithm. The GNG strategy selects samples via the growing neural gas algorithm [11], which helps maximize coverage of the feature space.
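As an example of one of the alternative strategies, the sketch below gives a possible reading of the Nearest strategy: rank long-term-memory candidates by their distance to the closest misclassified exemplar and keep the k nearest. The function name, the use of squared Euclidean distance, and the toy features are assumptions; the article does not give these details.

```python
def nearest_strategy(candidate_features, misclassified_features, k):
    """Return indices of the k candidates closest to any misclassified exemplar."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    ranked = sorted(
        range(len(candidate_features)),
        key=lambda i: min(sq_dist(candidate_features[i], m)
                          for m in misclassified_features),
    )
    return ranked[:k]

# Toy two-dimensional features.
candidates = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
misclassified = [[1.0, 1.0]]
print(nearest_strategy(candidates, misclassified, k=2))  # [1, 3]
```

Unlike random replacement, this strategy needs features for every candidate in long-term memory, which is the kind of extra computation the comparison in Table 1 weighs against accuracy.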

Table 1. Using Different Update Strategies

Strategy    CNN      NME
Herding     68.27    68.65
Random      67.95    68.47
Testing     67.37    66.40
Kmeans      67.58    66.34
GNG         67.81    66.21
Nearest     66.89    65.73

Table 2. Effect of the Memory Size

Method        Fixed 2000    20       50       100      150
Ours-NME      67.73         64.93    68.47    70.65    72.74
Ours-CNN      67.03         64.00    67.94    70.47    72.45
iCaRL-NME     65.20         62.51    64.99    67.02    68.88
iCaRL-CNN     54.45         49.90    62.51    58.53    60.25
BiC-NME       58.84         50.62    56.55    59.51    60.60
BiC-CNN       57.21         47.29    58.29    60.24    61.80
Base-NME      54.74         47.41    60.40    60.20    61.89
Base-CNN      53.29         44.29    53.68    59.40    61.74

  • Fixed 2000 in the header means using an exemplar set with a fixed capacity of 2,000 samples for all classes. The other headers give the number of samples the model stores for each class.

Table 3. Effect of Using Adaptive Memory Update (AMU) as a Plugin

Method        w/o AMU    w/ AMU
Ours-NME      67.00      68.22
Ours-CNN      67.19      67.76
iCaRL-NME     64.73      64.54
iCaRL-CNN     61.56      62.15
BiC-NME       56.55      58.80
BiC-CNN       58.29      59.28

Exemplars play a crucial role in replay-based methods. The model needs the exemplars to find the most distinguishing features for the classes learned before. Furthermore, when the model uses NME for inference, the exemplars need to approximate the class means in feature space. That is why Herding performs best, followed by the Random strategy by a narrow margin. Nearest performs worst in both settings, and Testing performs poorly using NME but reasonably well using CNN. Note that all of these strategies except Random require additional computation to obtain the samples’ features, classification results, or cluster centers, yet they end up with comparable or worse results. We therefore choose Random as our update strategy for efficiency.

6 CONCLUSION

In this article, we proposed a novel method that uses an adaptive memory update mechanism and a novel loss to tackle catastrophic forgetting in class incremental continual learning. To the best of our knowledge, this is the first time that the concept of remedy after forgetting has been brought into the field of continual learning. Experimental results on CIFAR-100 and ImageNet-Subset demonstrate that the proposed method achieves better performance than existing state-of-the-art continual learning algorithms. This work also opens a discussion on borrowing human mnemonic methods and designing human-like models to address catastrophic forgetting in continual learning.

REFERENCES

  [1] Abati Davide, Tomczak Jakub, Blankevoort Tijmen, Calderara Simone, Cucchiara Rita, and Bejnordi Babak Ehteshami. 2020. Conditional channel gated networks for task-aware continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3931–3940.
  [2] Aljundi Rahaf, Belilovsky Eugene, Tuytelaars Tinne, Charlin Laurent, Caccia Massimo, Lin Min, and Page-Caccia Lucas. 2019. Online continual learning with maximal interfered retrieval. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (NIPS’19). 11872–11883.
  [3] Jr. Thomas M. Bartol, Bromer Cailey, Kinney Justin, Chirillo Michael A., Bourne Jennifer N., Harris Kristen M., and Sejnowski Terrence J. 2015. Nanoconnectomic upper bound on the variability of synaptic plasticity. eLife 4 (2015), e10778.
  [4] Belouadah Eden and Popescu Adrian. 2018. DeeSIL: Deep-shallow incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV’18).
  [5] Belouadah Eden and Popescu Adrian. 2019. IL2M: Class incremental learning with dual memory. In Proceedings of the IEEE International Conference on Computer Vision. 583–592.
  [6] Caccia Lucas, Belilovsky Eugene, Caccia Massimo, and Pineau Joelle. 2020. Online learned continual compression with adaptive quantization modules. In Proceedings of the International Conference on Machine Learning. 1240–1250.
  [7] Castro F. M., Marín-Jiménez M., Mata Nicolás Guil, Schmid C., and Karteek Alahari. 2018. End-to-end incremental learning. In Computer Vision – ECCV 2018. Lecture Notes in Computer Science, Vol. 11216. Springer, 241–257.
  [8] Chen Zhiyuan and Liu Bing. 2018. Lifelong Machine Learning (2nd ed.). Synthesis Lectures on Artificial Intelligence and Machine Learning.
  [9] Deng J., Dong W., Socher R., Li L.-J., Li K., and Fei-Fei L. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’09).
  [10] Dhar Prithviraj, Singh Rajat Vikram, Peng Kuan-Chuan, Wu Ziyan, and Chellappa Rama. 2019. Learning without memorizing. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’19). 5138–5146.
  [11] Fritzke Bernd et al. 1994. A growing neural gas network learns topologies. In Proceedings of the 7th International Conference on Neural Information Processing Systems (NIPS’94). 625–632.
  [12] He Kaiming, Zhang X., Ren Shaoqing, and Sun Jian. 2016. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16). 770–778.
  [13] Hinton Geoffrey, Vinyals Oriol, and Dean Jeffrey. 2015. Distilling the knowledge in a neural network. In Proceedings of the NIPS Deep Learning and Representation Learning Workshop.
  [14] Hou Saihui, Pan Xinyu, Loy Chen Change, Wang Zilei, and Lin Dahua. 2019. Learning a unified classifier incrementally via rebalancing. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’19).
  [15] Hung Ching-Yi, Tu Cheng-Hao, Wu Cheng-En, Chen Chien-Hung, Chan Yi-Ming, and Chen Chu-Song. 2019. Compacting, picking and growing for unforgetting continual learning. In Advances in Neural Information Processing Systems. 13669–13679.
  [16] Isele David and Cosgun Akansel. 2018. Selective experience replay for lifelong learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
  [17] Kingma Diederik P. and Ba Jimmy. 2015. Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2015).
  [18] Kirkpatrick James, Pascanu Razvan, Rabinowitz Neil, Veness Joel, Desjardins Guillaume, Rusu Andrei A., Milan Kieran, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, 13 (2017), 3521–3526.
  [19] Knoblauch Jeremias, Husain Hisham, and Diethe Tom. 2020. Optimal continual learning has perfect memory and is NP-hard. In Proceedings of the International Conference on Machine Learning.
  [20] Krizhevsky Alex. 2009. Learning Multiple Layers of Features from Tiny Images. Technical Report. University of Toronto.
  [21] Lee Kibok, Lee Kimin, Shin Jinwoo, and Lee Honglak. 2019. Overcoming catastrophic forgetting with unlabeled data in the wild. In Proceedings of the IEEE International Conference on Computer Vision. 312–321.
  [22] Li Xilai, Zhou Yingbo, Wu Tianfu, Socher Richard, and Xiong Caiming. 2019. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. In Proceedings of the International Conference on Machine Learning.
  [23] Li Zhizhong and Hoiem Derek. 2017. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 12 (2017), 2935–2947.
  [24] Mallya Arun and Lazebnik S. 2018. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7765–7773.
  [25] McClelland James L., McNaughton Bruce L., and O’Reilly Randall C. 1995. Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review 102, 3 (1995), 419.
  [26] Murre Jaap M. J. and Dros Joeri. 2015. Replication and analysis of Ebbinghaus’ forgetting curve. PloS One 10, 7 (2015), e0120644.
  [27] Ostapenko Oleksiy, Puscas Mihai, Klein Tassilo, Jahnichen Patrick, and Nabi Moin. 2019. Learning to remember: A synaptic plasticity driven framework for continual learning. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’19). 11321–11329.
  [28] Paszke Adam, Gross Sam, Massa Francisco, Lerer Adam, Bradbury James, Chanan Gregory, Killeen Trevor, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (NIPS’19). 8026–8037.
  [29] Prabhu Ameya, Torr Philip, and Dokania Puneet. 2020. GDumb: A simple approach that questions our progress in continual learning. In Proceedings of the European Conference on Computer Vision (ECCV’20).
  [30] Rajasegaran Jathushan, Khan Salman, Hayat Munawar, Khan Fahad Shahbaz, and Shah Mubarak. 2020. iTAML: An incremental task-agnostic meta-learning approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13588–13597.
  [31] Rebuffi Sylvestre-Alvise, Kolesnikov Alexander, Sperl Georg, and Lampert Christoph. 2017. iCaRL: Incremental classifier and representation learning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). 5533–5542.
  [32] Riemer Matthew, Klinger Tim, Bouneffouf Djallel, and Franceschini Michele. 2019. Scalable recollections for continual lifelong learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 1352–1359.
  [33] Shim Dongsub, Mai Zheda, Jeong Jihwan, Sanner Scott, Kim Hyunwoo, and Jang Jongseong. 2021. Online class-incremental continual learning with adversarial Shapley value. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 9630–9638.
  [34] Shin Hanul, Lee Jung Kwon, Kim Jaehong, and Kim Jiwon. 2017. Continual learning with deep generative replay. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17). 2990–2999.
  [35] Smith Linda and Gasser Michael. 2005. The development of embodied cognition: Six lessons from babies. Artificial Life 11, 1-2 (2005), 13–29.
  [36] Tao Xiaoyu, Chang Xinyuan, Hong Xiaopeng, Wei Xing, and Gong Yihong. 2020. Topology-preserving class-incremental learning. In Computer Vision – ECCV 2020. Lecture Notes in Computer Science, Vol. 12364. Springer, 254–270.
  [37] Ven Gido M. van de and Tolias Andreas S. 2019. Three scenarios for continual learning. arXiv preprint arXiv:1904.07734 (2019).
  [38] Wu Chenshen, Herranz Luis, Liu Xialei, Weijer Joost van de, and Bogdan Raducanu. 2018. Memory replay GANs: Learning to generate new categories without forgetting. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS’18). 5962–5972.
  [39] Wu Y., Chen Yan-Jia, Wang Lijuan, Ye Yuancheng, Liu Zicheng, Guo Yandong, and Fu Yun. 2019. Large scale incremental learning. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’19). 374–382.


        • Published in

          ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 19, Issue 3s
          June 2023
          270 pages
          ISSN:1551-6857
          EISSN:1551-6865
          DOI:10.1145/3582887
          • Editor:
          • Abdulmotaleb El Saddik

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 24 February 2023
          • Online AM: 5 December 2022
          • Accepted: 20 November 2022
          • Revised: 17 October 2022
          • Received: 10 May 2022