Abstract
In this article, we propose a method to partially mimic natural intelligence for the problem of lifelong learning representations that are compatible. We take the perspective of a learning agent that is interested in recognizing object instances in an open dynamic universe in a way in which any update to its internal feature representation does not render the features in the gallery unusable for visual search. We refer to this learning problem as Compatible Lifelong Learning Representations (CL2R), as it considers compatible representation learning within the lifelong learning paradigm. We identify stationarity as the property that the feature representation is required to hold to achieve compatibility and propose a novel training procedure that encourages local and global stationarity on the learned representation. Due to stationarity, the statistical properties of the learned features do not change over time, making them interoperable with previously learned features. Extensive experiments on standard benchmark datasets show that our CL2R training procedure outperforms alternative baselines and state-of-the-art methods. We also provide novel metrics to specifically evaluate compatible representation learning under catastrophic forgetting in various sequential learning tasks. Code is available at https://github.com/NiccoBiondi/CompatibleLifelongRepresentation.
1 INTRODUCTION
The universe is dynamic, and the emergence of novel data and new knowledge is unavoidable. The unique capacity of natural intelligence for lifelong learning depends heavily on memory and knowledge representation [18]. Through memory and knowledge representation, natural intelligent systems continually search, recognize, and learn new objects in an open universe after exposure to one or a few samples. Memory is essentially a cognitive function that encodes, stores, and retrieves knowledge. Artificial representations learned by Deep Convolutional Neural Network (DCNN) models [3, 61, 63, 64, 76] stored in a memory bank (i.e., the gallery-set) have been shown to be quite effective in searching and recognizing objects in an open-set/open-world learning context. Successful examples are face recognition [10, 14, 59], person re-identification [78, 79, 80], and image retrieval [19, 65, 73].
These approaches rely on learning feature representations from static datasets in which all images are accessible at training time. However, dynamic assimilation of new data for lifelong learning suffers from catastrophic forgetting: the tendency of neural networks to abruptly forget previously learned information [37, 52].
In the case of visual search, even when catastrophic forgetting is avoided by repeatedly training DCNN models on both old and new data, the feature representation still irreversibly changes [31]. Thus, to benefit from the newly learned model, features stored in the gallery must be reprocessed and the “old” features replaced with the “new” ones. Reprocessing requires not only the storage of the original images (a noticeable leap from natural intelligence) but also authorization to access them [66]. More importantly, extracting new features at each update of the model is computationally expensive or infeasible in the case of large gallery-sets. The speed at which the representation is updated to benefit from the newly learned data may impose time constraints on the re-indexing process. These constraints range from timescales on the order of weeks or months, as in retrieval systems or social networks [62], to within seconds, as in autonomous robotics or real-time surveillance [43, 48]. Recently, in the work of Shen et al. [62], a novel training procedure was proposed to avoid re-indexing the gallery-set. The representation obtained in this manner is said to be compatible, as the features before and after the learning upgrade can be directly compared. Training takes advantage of all data from previous tasks (i.e., no lifelong learning), guaranteeing the absence of catastrophic forgetting. The advantage of considering compatible representation learning within the lifelong learning paradigm, as in this work, is that compatible representation allows visual search systems not only to distribute the computation over time but also to avoid or possibly limit the storage of images on private servers for gallery data. This can have important implications for the societal debate related to the privacy, ethics, and sustainability (e.g., carbon footprint) of modern AI systems [11, 49, 60, 66].
We identify stationarity as the key requirement for feature representation to be compatible during lifelong learning. Stationary features have been shown to be biologically plausible in many studies of working memory in the prefrontal cortex of macaques [33, 39, 40]. Some works [39, 40] decoded the information from the neural activity of the working memory using a classifier with a single fixed set of weights. They noted that a non-stationary feature representation seems to be biologically problematic since it would imply that the synaptic weights would have to change continuously for the information to be continuously available in memory.
Inspired by this, in this article, we formalize the problem of Compatible Lifelong Learning Representations (\(\textbf {CL}^{\bf 2}{\bf R}\)) in relation to the relevant areas of compatible learning and lifelong (continual) learning. We refer to any training procedure that aims to obtain compatible features and minimize catastrophic forgetting as CL2R training, and we propose (1) a novel set of metrics to properly evaluate CL2R training procedures, and (2) a training procedure based on rehearsal [52, 54] and feature stationarity [46, 47] to jointly address catastrophic forgetting and feature compatibility. Figure 1 provides an overview of the problem and the training procedure. Specifically, our CL2R training procedure is achieved by encouraging global and local stationarity in the learned features.
Fig. 1. Overview of the Compatible Lifelong Learning Representations (CL2R) problem and proposed training procedure. The learning agent searches object instances from query images \(I_\mathcal {Q}\) without re-indexing the gallery-set. Any update to the internal feature representation \(\phi\) does not render the features in the gallery-set unusable (i.e., no images are stored). Compatible feature representation under catastrophic forgetting is learned by imposing stationarity on features learned from the class-incremental learning surrogate task. Training is based on rehearsal with the episodic memory \(\mathcal {M}_t\).
The rest of the article is organized as follows. In Section 2, we discuss related work, and in Section 3, we highlight our contributions. Section 4 presents the formulation of CL2R, Section 5 proposes new metrics to evaluate compatibility, and Section 6 describes a new training procedure. In Section 7, we compare our results with adapted state-of-the-art methods. Section 8 presents the ablation study. We conclude in Section 9.
2 RELATED WORK
Compatible learning. The work proposed by Shen et al. [62], called Backward-Compatible Training (BCT), first formalizes the problem of learning compatible representation to avoid re-indexing. The method takes advantage of an influence loss that encourages the feature representation toward one that can be used by the old classifier. The old classifier is fixed while learning with the novel data (i.e., its parameters are no longer updated by back-propagation) and cooperates with the new representation model. Cooperation is achieved by aligning the prototypes of the new classifier with the prototypes of the old fixed one. The underlying assumption is that the upgraded feature representation follows the representation learned by the old classifier. BCT has been evaluated in scenarios without the effects of catastrophic forgetting by repeatedly training DCNN models on both old and new images (i.e., jointly re-training from scratch at each upgrade). To compare with this learning strategy in a lifelong learning scenario instead of starting from scratch every time, we have added to BCT the capability of learning by fine-tuning the previously learned model according to a memory-based rehearsal strategy [52, 54].
Compatibility under catastrophic forgetting has been implicitly studied in the work of Iscen et al. [25] (FAN), in which the authors presented a method for storing features instead of images in Class-incremental Learning (CiL). They introduce a feature adaptation function to update the preserved features as the network learns novel classes. We compared to this method by storing the updated-preserved features obtained at each task. Although designed to improve classification accuracy, the work can be considered close to a lifelong learning approach with compatible representation in which the feature adaptation function they defined implicitly addresses the problem of feature compatibility as in other works [6, 23, 38, 68]. Differently from BCT, these methods do not completely avoid the cost of re-indexing, since the learned mappings must be evaluated every time the dataset is upgraded; they are therefore not suited to lifelong learning and/or large gallery-sets. For example, the ResNet-101 architecture is one order of magnitude slower than the mapping proposed in the work of Chen et al. [6]; therefore, when the size of the gallery increases by an order of magnitude, it is equivalent to re-indexing the images. The method described in the work of Ramanujan et al. [51], in addition to the current feature model, trains from the same data an auxiliary model in a different way (i.e., using self-supervised learning). The auxiliary feature model will then be used with future learned models to learn a mapping model to obtain compatible representations as in other works [25, 38, 68]. The underlying assumption is that as the auxiliary feature model is trained with a different strategy, it encodes different knowledge that may facilitate learning the mapping between the representation spaces.
Compatibility of the representation in a more general sense has been considered in the work of Li et al. [31] and Wang et al. [70], where similarity between features extracted from identical architectures trained from different initializations has been extensively evaluated. The work of Budnik and Avrithis [5] avoids re-indexing the gallery, although the new model used for queries is not trained on more data. Their work is motivated by the scenario where the gallery is indexed by a large model and the queries are captured from mobile devices in which the use of small models is the only viable solution.
Lifelong learning. Lifelong learning or continual learning studies the problem of learning from a non-i.i.d. stream of data with the goal of assimilating new knowledge while preventing catastrophic forgetting [9, 37]. Methods for preventing catastrophic forgetting have been explored primarily in the classification task, where catastrophic forgetting often manifests itself as a significant drop in classification accuracy [2, 13, 35, 41, 67]. The key aspects that distinguish lifelong feature learning for visual search from classification are the following: (i) categorical data often have coarser granularity than visual search data, (ii) evaluation metrics do not involve classification accuracy, and (iii) class labels are not required to be explicitly learned. These differences may suggest that catastrophic forgetting in the two settings has different origins. In this context, recent works have discussed the importance of the specific task in assessing catastrophic forgetting of learned representations [1, 7, 8, 12, 47, 50]. Among others, empirical evidence presented in the work of Davari and Belilovsky [12] suggests that feature forgetting is not as catastrophic as classification forgetting and that many approaches that address the problem of catastrophic forgetting do not improve feature forgetting in terms of the usefulness of the representation. We argue that such evidence is relevant in visual search and that it can be exploited with techniques that further encourage learning compatible feature representation. Accordingly, we consider CiL as the basic building block for the general purpose of learning feature representation incrementally.
In this article, the focus is on CiL methods based on Knowledge Distillation (KD) [21] and rehearsal [55], which are known to be versatile, effective, and widely applicable to reduce catastrophic forgetting. We leverage the classification task in CiL as a surrogate task to learn feature representation as typically performed in face/body identification and retrieval [14, 65, 79]. The work of Li and Hoiem [32] first introduces KD in lifelong learning as an effective way to preserve the knowledge previously acquired from old tasks. In iCaRL [53], KD is combined with rehearsal, which reserves exemplars in an episodic memory for classes already seen. The BiC work, proposed by Wu et al. [71], extends the work of Rebuffi et al. [53] by developing a bias correction layer to recalibrate the output probabilities by learning an additional linear layer on a small set of data. Along a similar vein, in the work of Zhao et al. [77], the bias correction is performed by aligning the norms of the weight vectors of the classifier for new classes to those for old classes without using additional model parameters or reserved data. The work of Romero et al. [56] introduces Feature Distillation (FD), a distillation loss evaluated on the feature vectors instead of on the classifier outputs. FD has recently been successfully applied by Hou et al. [22] (LUCIR) and Douillard et al. [16] (PODNet) to reduce catastrophic forgetting. Differently from LUCIR, PODNet uses a spatial-based distillation loss to constrain the statistics of intermediate features after each residual block. Similar to LUCIR, PODNet, and many other works on continual/lifelong learning in the literature, our problem formulation takes advantage of the general concept of KD. Differently from these works, our approach is novel in that it considers FD for the dual purpose of learning feature compatibility and mitigating feature forgetting.
The work of Iscen et al. [25] (FAN), also discussed in the previous paragraph, combines strategies from other works [22, 32, 53] to learn and preserve previous features. Although the work does not consider the compatibility problem, it is the closest work to our approach. Recently, Yan et al. [72] (DER) showed an interesting performance improvement in CiL by freezing the previously learned representation and expanding its dimension from a new learnable feature extractor. Despite the clear improvements in classification performance, this approach is not trivially applicable to compatible training, as the varying dimensions across tasks do not allow direct application of nearest-neighbor search between models. Features with different dimensions typically need to be projected into a single common space to allow nearest-neighbor search to be applied. The FOSTER method [69] improves upon DER by addressing this specific problem: a trainable linear layer maps the growing feature vector into a fixed dimension. More generally, CiL methods addressing catastrophic forgetting are in a certain sense related to compatible representation, since forgetting is the change in the feature representation of classifiers that will be learned in the future. We evaluate these methods as baselines to quantify the level of lifelong-compatible representation they intrinsically may have.
3 MAIN CONTRIBUTIONS
(1) We consider compatible representation learning within the lifelong learning paradigm. We refer to this general learning problem as CL2R.
(2) We define a novel set of metrics to properly evaluate CL2R training procedures.
(3) We propose a CL2R training procedure that imposes global and local stationarity on the learned features to achieve compatibility between representations under catastrophic forgetting. Global and local interactions show a significant performance improvement when local stationarity is promoted only from already observed samples in the episodic memory.
(4) We empirically assess the effectiveness of our approach in several benchmarks showing improvements over baselines and adapted state-of-the-art methods.
4 CL2R PROBLEM FORMULATION
In a CL2R setting, a sequence of representation models, \(\lbrace \phi _t \rbrace _{t=1}^{T}\), is learned incrementally with a sequence of T tasks, \(\lbrace (\mathcal {D}_t, K_t) \rbrace _{t=1}^T\), where \(\mathcal {D}_t\) are the images of the t-th task represented by \(K_t\) different classes. Specifically, each task is disjoint from the others: \(K_k \cap K_t= \emptyset\) with \(t \ne k\). The learned representation model \(\phi _t\) is used to transform the query images into feature vectors that are used to retrieve the images most similar to a set of given gallery images transformed with a previous model \(\phi _k\). Specifically, we indicate with the couple \(\mathcal {G}=(I_\mathcal {G},F_\mathcal {G})\) the gallery-set, where \({I}_\mathcal {G}=\lbrace \mathbf {x}_i\rbrace _{i=1}^N\) is the image collection from which the features \(F_\mathcal {G}=\lbrace \mathbf {f}_i \rbrace _{i=1}^N\) are extracted, and N is the number of elements of the two sets. Without loss of generality, we assume that the features in \(F_\mathcal {G}\) are extracted using the representation model \(\phi _{ k}:{\mathbb {R}}^D \rightarrow {\mathbb {R}}^d\) that transforms an image \(\mathbf {x} \in {\mathbb {R}}^D\) into a feature vector \(\mathbf {f} \in {\mathbb {R}}^d\), where d and D are the dimensionality of the feature and the image space, respectively. Analogously, we will refer to \(\mathcal {Q}=(I_\mathcal {Q},F_\mathcal {Q})\) as the query-set, where \(I_\mathcal {Q}\) and \(F_\mathcal {Q}\) are the corresponding image-set and the feature-set, respectively. As the t-th task becomes available, the model \(\phi _{t}\) is incrementally learned from the previous one along with the new task data \(\mathcal {D}_t\). 
Our goal is to design a training procedure to learn the model \(\phi _{t}\) so that any query image transformed with it can be used to perform visual search through some distance \({\rm dist}:{\mathbb {R}}^d \times {\mathbb {R}}^d \rightarrow \mathbb {R}_+\) to identify the closest features \({F}_\mathcal {G}\) to the query features \({F}_\mathcal {Q}\) without forgetting the previous representation and without computing \(F_\mathcal {G}=\lbrace \mathbf {f} \in \mathbb {R}^d \, | \, \mathbf {f} = \phi _{t}(\mathbf {x}) \, \forall \mathbf {x} \in I_\mathcal {G}\rbrace\) (i.e., re-indexing). If this holds, then the resulting representation \(\phi _{t}\) is said to be lifelong compatible with \(\phi _{k}\).
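The search setting described above can be illustrated with a minimal sketch. It assumes cosine distance as \({\rm dist}\) and uses two toy functions, `phi_k` and `phi_t`, as hypothetical stand-ins for the old and upgraded representation models; none of these names come from the released code. The point is only that the gallery features are computed once with the old model and are never recomputed when the query model changes.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def search(query_feature, gallery_features, dist=cosine_distance):
    """Return the index of the gallery feature closest to the query feature."""
    return min(range(len(gallery_features)),
               key=lambda i: dist(query_feature, gallery_features[i]))


# Hypothetical stand-ins for two models: phi_k indexed the gallery earlier,
# phi_t is the upgraded model and is applied to the query only.
def phi_k(x):
    return [x[0], x[1]]


def phi_t(x):
    # A slightly different mapping that stays aligned with phi_k.
    return [1.1 * x[0], 0.9 * x[1]]


gallery_images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Gallery features are extracted once with the old model (no re-indexing).
gallery_features = [phi_k(x) for x in gallery_images]

query_image = [0.1, 2.0]
best = search(phi_t(query_image), gallery_features)
```

In this toy case the query is matched to the second gallery item even though query and gallery features come from different models, which is the behavior a lifelong-compatible representation is meant to preserve.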
The main challenge of the CL2R problem is to jointly alleviate catastrophic forgetting and learn a compatible representation between the previously learned models. In Figure 1, we illustrate the complete CL2R training example using rehearsal to alleviate the effects of catastrophic forgetting.
5 COMPATIBILITY EVALUATION
A representation model \(\phi _{\rm new}\) upgraded with new data is said to be compatible with an old representation model \(\phi _{\rm old}\) when it holds [62]: (1) \(\begin{equation} M\big (\phi _{\rm new}^{\mathcal {Q}}, \phi _{\rm old}^{\mathcal {G}} \big) \gt {M} \big (\phi _{\rm old}^{\mathcal {Q}}, \phi _{\rm old}^{\mathcal {G}} \big). \end{equation}\)
Equation (1) represents the Empirical Compatibility Criterion (ECC), where \({M}\) is an evaluation metric specific to the given visual search problem. Notable examples of the metric M can be found in face verification accuracy [24, 30], face verification/identification accuracy in terms of true acceptance rate and false acceptance rate (TAR\(@\)FAR) [27], and person re-identification mean average precision (mAP) [74]. The intuition of these metrics is based on the observation that they can be instantiated with two different representation models \(\phi _{\rm new}\) and \(\phi _{\rm old}\) when considering the query-gallery pair. The specific notation \({M} (\phi _{\rm new}^{\mathcal {Q}}, \phi _{\rm old}^{\mathcal {G}})\) defines the cross-test between the new and the old model, and it represents the case in which \(\phi _{\rm new}\) is used to extract the features of the query-set, \(F_\mathcal {Q}\), whereas \(\phi _{\rm old}\) is used to extract the gallery-set ones, \(F_\mathcal {G}\). \({M} (\phi _{\rm old}^{\mathcal {Q}}, \phi _{\rm old}^\mathcal {G})\) is the self-test, and it represents the case in which both query and gallery features are extracted with \(\phi _{\rm old}\). When the model is trained incrementally on T tasks, Equation (1) is evaluated according to the multi-model ECC introduced by Biondi et al. [4]: (2) \(\begin{eqnarray} M \big (\phi _t^\mathcal {Q}, \phi _k^\mathcal {G} \big) \gt M \big (\phi _k^\mathcal {Q}, \phi _k^\mathcal {G} \big) {\rm \quad with \:} t \gt k, \end{eqnarray}\) where \(t, k \in \lbrace 1,2,\ldots ,T\rbrace\) refer to two different tasks such that task k is processed by the model before task t. The model \(\phi _t\) is compatible with the model \(\phi _k\), when the cross-test \(M (\phi _t^\mathcal {Q}, \phi _k^\mathcal {G})\) between \(\phi _t\) and \(\phi _k\) is greater than the self-test \(M (\phi _k^\mathcal {Q}, \phi _k^\mathcal {G})\) of the model \(\phi _k\). 
The underlying intuition is that if the performance of matching the gallery feature vectors extracted with the old model with the query feature vectors extracted with the new model (i.e., cross-test) is better than the performance of matching the gallery feature vectors with the query feature vectors both extracted with the old model (i.e., self-test), then the system is learning compatible representations. In other words, learning from the new task data improves the representation without breaking the compatibility with the previously learned model. Based on Equation (2), the compatibility matrix C is defined as follows: (3) \(\begin{equation} C_{t, k} = {\left\lbrace \begin{array}{ll} M \big (\phi _t^\mathcal {Q}, \phi _k^\mathcal {G} \big) & \text{if } t \gt k \\ M \big (\phi _k^\mathcal {Q}, \phi _k^\mathcal {G} \big) & \text{if } t = k \\ \qquad 0 & \text{if } t \lt k \end{array}\right.}, \end{equation}\) where the element in row t and column k of the compatibility matrix denotes the evaluation metric M obtained with query features from model \(\phi _t\) and gallery features from model \(\phi _k\). This definition combines the basic intuition of the classification accuracy matrix R defined elsewhere [15, 34], used to evaluate the CiL problem, with the two specific aspects that distinguish the \(\text{CL}^2\text{R}\) learning setting from the CiL one. Namely, (i) in CiL at each task, the train and test data are sampled from the same distribution, whereas in \(\text{CL}^2\text{R}\), the test-set classes are sampled from an unknown distribution (i.e., \(\text{CL}^2\text{R}\) addresses the open-set recognition problem); (ii) in CiL, the test-set is dynamic (i.e., it grows including images from the task distributions), whereas in \(\text{CL}^2\text{R}\), it is assumed static for the purpose of a reliable evaluation [62].
In the \(\text{CL}^2\text{R}\) setting, a dynamic test-set, as used in CiL, is difficult to define, as there are infinitely many ways to make the gallery dynamic and each of them may unexpectedly change the measured performance. We follow Shen et al. [62] and perform the evaluation assuming a static test-set (i.e., a static query-gallery pair). According to this, we set the elements of the matrix C with \(t\lt k\) to zero to indicate the impossibility of a reliable evaluation of a growing test-set that should be sampled from an unknown changing distribution. For the remaining elements, the cross-test values are the elements of the matrix with \(t \gt k\), whereas the self-test values are those of the main diagonal (i.e., when \(t = k\)). Given a compatibility matrix C, the average compatibility (AC) is defined as follows: (4) \(\begin{equation} AC = \frac{2}{T(T-1)} \sum \limits _{1 \le k \lt t \le T} {1\!\!1} \Big (M \big (\phi _t^\mathcal {Q}, \phi _k^\mathcal {G} \big) \gt M \big (\phi _k^\mathcal {Q}, \phi _k^\mathcal {G} \big) \Big), \end{equation}\) where \({1\!\!1}(\cdot)\) denotes the indicator function. AC summarizes the compatibility matrix values in a single number: the fraction of the \(\frac{T(T-1)}{2}\) possible model pairs for which compatibility is verified.
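As a small worked illustration of Equations (3) and (4), the sketch below computes AC from a lower-triangular compatibility matrix stored as a list of rows. The function name `average_compatibility` and the numeric entries are assumptions chosen for illustration; they are not results from the experiments. Indices are 0-based in the code, whereas the text uses 1-based task indices.

```python
def average_compatibility(C):
    """AC of Equation (4): fraction of pairs (t, k) with k < t whose
    cross-test C[t][k] exceeds the self-test C[k][k]."""
    T = len(C)
    hits = sum(1 for t in range(T) for k in range(t) if C[t][k] > C[k][k])
    return 2.0 * hits / (T * (T - 1))


# Illustrative values only (not taken from the paper).
# Row t, column k: queries from model t, gallery from model k.
C = [
    [0.60, 0.00, 0.00],
    [0.62, 0.64, 0.00],
    [0.59, 0.66, 0.67],
]
ac = average_compatibility(C)
```

Here two of the three cross-tests exceed their corresponding self-tests, so AC is 2/3.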
5.1 Proposed CL2R Metrics
The work of Díaz-Rodríguez et al. [15] and Lopez-Paz and Ranzato [34] proposes a set of metrics to assess the ability of the learner to transfer knowledge based on a matrix that reports the test classification accuracy of the model on task j after learning task i. Along a similar vein, we present a set of metrics to evaluate the compatibility between representation models in a compatible lifelong learning setting.
Let \(C \in \mathbb {R}^{ T \times T}\) be the compatibility matrix of Equation (3) for T tasks. The proposed criteria are backward compatibility (BC) and forward compatibility (FC), given by Equations (5) and (6).
From Equations (5) and (6), it can be deduced that BC and FC \(\in [-1,1]\). Backward compatibility for the first task and forward compatibility for the last task are not defined. The larger these metrics, the better the model. When AC values are comparable, BC and FC quantify the positive interaction between search accuracy under catastrophic forgetting and compatibility. This allows evaluating how catastrophic forgetting affects the representation and its compatibility.
As BC evaluates the relationship between the representations learned at the final task T and the previous ones, it is possible to follow their evolution during CL2R training. According to this, we define the backward compatibility at task t as \(BC{(t)} = \frac{1}{t-1} \sum _{k=1}^{t-1} (C_{t,k} - C_{k,k}), \; {\rm with } \; t \gt 1\), where \(t \in \lbrace 1, 2, \ldots , T\rbrace\). This represents the average of the element-wise difference between the first \(t-1\) elements of the t-th row and the first \(t-1\) elements of the main diagonal of the compatibility matrix.
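The per-task backward compatibility \(BC(t)\) defined above can be sketched as follows. The function name and the matrix values are illustrative assumptions; the argument `t` is the 1-based task index used in the text, and the matrix is indexed from 0 internally.

```python
def backward_compatibility(C, t):
    """BC(t): mean of C[t][k] - C[k][k] over the tasks k preceding task t.

    `t` is a 1-based task index, as in the text; BC(t) requires t > 1.
    """
    if t < 2:
        raise ValueError("BC(t) is defined only for t > 1")
    row = t - 1  # convert to 0-based row index
    return sum(C[row][k] - C[k][k] for k in range(row)) / (t - 1)


# Illustrative values only (not taken from the paper).
C = [
    [0.60, 0.00, 0.00],
    [0.62, 0.64, 0.00],
    [0.59, 0.66, 0.67],
]
bc_2 = backward_compatibility(C, 2)  # (0.62 - 0.60) / 1
bc_3 = backward_compatibility(C, 3)  # ((0.59 - 0.60) + (0.66 - 0.64)) / 2
```

Tracking these values over t shows whether the newest model keeps improving on, or falls behind, the self-tests of the models that indexed the gallery.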
6 PROPOSED CL2R TRAINING
To achieve compatibility, we encourage global and local stationarity in the feature representation.
Global stationarity is encouraged according to the approach described in the work of Pernici et al. [46], in which features are learned to follow a set of special fixed classifier prototypes. Pernici et al. [46] impose global stationarity using a classifier in which prototypes cannot be trained (i.e., fixed) and are set before training. Under this condition, only the direction of the features aligns toward the fixed directions of the classifier prototypes and not the opposite. This constraint forces the learned features to follow their corresponding fixed prototypes, therefore encouraging representation stationarity. The lack of trainable classifier functionality is basically replaced by previous layers. Fixed prototypes are set according to the coordinate vertices of a d-Simplex regular polytope that, in addition to stationarity, allows maximally separated features to be learned [44, 45].
We take advantage of this result and perform CiL as a surrogate task to learn a stationary feature representation and thereby achieve compatibility. More formally, let \(\mathbf {W}\) be the d-Simplex fixed classifier, shared by all tasks \(t \in \lbrace 1, 2, \ldots , T\rbrace\). We instantiate the CiL problem as \(\sigma (\phi _t \circ \mathbf {W})\), where \(\sigma\) indicates the softmax function, and perform learning according to incremental fine-tuning. The evolving training-set \(\mathcal {T}_t \leftarrow \mathcal {M}_{t} \cup \mathcal {D}_t\) is computed according to a rehearsal-based strategy using the episodic memory, \(\mathcal {M}_{t}\), which contains a continually updated set of samples from \(\lbrace \mathcal {D}_1, \ldots , \mathcal {D}_{t-1} \rbrace\). The memory is updated as \(\mathcal {M}_{t+1} \leftarrow \mathcal {M}_{t} \cup {\rm Sampling}(\mathcal {D}_{t})\). The loss optimized in the work of Pernici et al. [46] is adapted to CL2R training as follows: (7) \(\begin{eqnarray} \mathcal {L}_t= -\dfrac{1}{|\mathcal {T}_{t}|} \sum \limits _{\mathbf {x}_i \in \mathcal {T}_{t}} \log \! \left(\dfrac{\exp \big ({\mathbf {w}}_{y_i}^{\top }\cdot \phi _t(\mathbf {x}_i) \big)}{\sum \nolimits _{\scriptscriptstyle j \in K_s} \exp \big ({\mathbf {w}}_{j}^{\top }\cdot \phi _t(\mathbf {x}_i) \big) + \sum \nolimits _{\scriptscriptstyle j \in K_u} \exp \big ({\mathbf {w}}_{j}^{\top }\cdot \phi _t(\mathbf {x}_i) \big)} \right) , \end{eqnarray}\) where \(K_s\) is the set of classes learned up to time t, \(|\mathcal {T}_{t}|\) is the number of elements in the training-set, \(K_u\) is the set of the outputs of the classifier that have not yet been assigned to classes at time t (i.e., future unseen classes [47]), \(\mathbf {w}^{\top }_{(\cdot)}\) is a class prototype of the fixed classifier \(\mathbf {W}\), and \(y_i\) is the supervising label of the sample \(\mathbf {x}_i\).
In particular, \(\mathbf {W}\) is the weight matrix of the fixed classifier, which does not undergo learning during model training. In the work of Pernici et al. [46], the d-Simplex prototypes are defined as \(\mathbf {W} = \lbrace e_1,e_2,\dots ,e_{d-1}, \alpha \sum _{i=1}^{d-1} e_i \rbrace ,\) where d is the feature dimensionality of the d-Simplex, \(\alpha =\frac{1-\sqrt {d+1}}{d}\), and \(e_i\) denotes the standard basis in \(\mathbb {R}^{d-1}\), with \(i \in \lbrace 1,2, \dots , d-1\rbrace\).
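The construction of the fixed prototypes can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the sketch builds n + 1 prototypes in \(\mathbb {R}^{n}\) from the n standard basis vectors plus the vector \(\alpha \sum _i e_i\) with \(\alpha = (1-\sqrt{n+1})/n\), so how n relates to the d of the text is an indexing assumption. Centering the vertices at the origin and normalizing them to unit length is a further assumption, added so that prototypes act as directions. The function name `dsimplex_prototypes` is hypothetical.

```python
import math


def dsimplex_prototypes(n):
    """Return n + 1 unit-norm prototypes in R^n forming a regular simplex."""
    alpha = (1.0 - math.sqrt(n + 1.0)) / n
    # n standard basis vectors e_1, ..., e_n ...
    vertices = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    # ... plus the extra vertex alpha * (e_1 + ... + e_n).
    vertices.append([alpha] * n)

    # Assumption: center the simplex at the origin and normalize each vertex.
    centroid = [sum(v[j] for v in vertices) / (n + 1) for j in range(n)]
    prototypes = []
    for v in vertices:
        centered = [v[j] - centroid[j] for j in range(n)]
        norm = math.sqrt(sum(c * c for c in centered))
        prototypes.append([c / norm for c in centered])
    return prototypes


W = dsimplex_prototypes(4)
# Cosine similarity between every pair of distinct prototypes.
cosines = [sum(a * b for a, b in zip(W[i], W[j]))
           for i in range(len(W)) for j in range(i)]
```

With this choice of \(\alpha\), all vertices are mutually equidistant, so after centering every pair of prototypes has the same cosine similarity of -1/n, which is the maximal-separation property the text attributes to the d-Simplex classifier.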
The loss of Equation (7) imposes global stationarity and does not require any knowledge to be extracted from the previously learned models. However, catastrophic forgetting causes misalignment between features and fixed classifier prototypes. Therefore, we further impose additional stationarity constraints in a local neighborhood of a feature by encouraging the current model to mimic the feature representation of the model previously learned. This allows the overall stationarity to also be determined by a local learning mechanism interacting with the global one provided by the d-Simplex classifier of Equation (7). The global-to-local interaction is achieved through the FD loss [56]. Differently from the more common practice of FD in CiL [16, 22, 26] in which each mini-batch is sampled from both the episodic memory and the current task (i.e., \(\mathcal {T}_t \leftarrow \mathcal {M}_{t} \cup \mathcal {D}_t\)), we evaluate the FD loss, at each task t, only on the samples stored in episodic memory \(\mathcal {M}_t\) observed from previous tasks: (8) \(\begin{equation} \mathcal {L}_{\scriptscriptstyle \textrm {FD}}^{\scriptscriptstyle \mathcal {M}}= \frac{1}{|\mathcal {M}_{t}|} \sum _{\mathbf {x}_i \in \mathcal {M}_{t}} \left(1 - \frac{\phi _{t}(\mathbf {x}_i) \cdot \phi _{t-1}(\mathbf {x}_i)}{\left\Vert \phi _{t}(\mathbf {x}_i)\right\Vert \left\Vert \phi _{t-1}(\mathbf {x}_i)\right\Vert } \right), \end{equation}\) where \(\phi _{t-1}\) is the model learned from the previous task. This encourages local stationarity and stability from only the previous classes in the episodic memory and the assimilation of new knowledge (plasticity) from only the classes of the current task. As confirmed by ablation in Section 8, this learning strategy leads to a significant performance improvement. 
The final optimized loss function combines Equations (7) and (8): (9) \(\begin{equation} \mathcal {L} = \mathcal {L}_{t} + \lambda \; \mathcal {L}_{\scriptscriptstyle \textrm {FD}}^{\scriptscriptstyle \mathcal {M}}, \end{equation}\) where \(\lambda\) balances the contribution of global and local alignment provided by the two losses. The pseudo-code in Algorithm 1 and Algorithm 2 details our training procedure and its application in visual search, respectively.
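To make the interaction of the two terms concrete, the following sketch evaluates Equations (7), (8), and (9) on plain Python lists. It is a numerical illustration only, with no gradients or training loop. All function names, the default value of `lam`, and the toy prototypes are assumptions. The cross-entropy normalizes over every row of the fixed classifier, which corresponds to summing over both \(K_s\) and \(K_u\); the distillation term is meant to be called only on features of samples from the episodic memory.

```python
import math


def fixed_classifier_loss(features, labels, W):
    """Equation (7): softmax cross-entropy over all fixed prototypes in W
    (seen and not-yet-assigned classes alike)."""
    total = 0.0
    for f, y in zip(features, labels):
        logits = [sum(wj * fj for wj, fj in zip(w, f)) for w in W]
        m = max(logits)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
        total -= logits[y] - log_norm
    return total / len(features)


def feature_distillation_loss(new_feats, old_feats):
    """Equation (8): mean cosine distance between features of the current
    and previous model (assumes non-zero feature vectors)."""
    total = 0.0
    for a, b in zip(new_feats, old_feats):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        total += 1.0 - dot / (norm_a * norm_b)
    return total / len(new_feats)


def cl2r_loss(train_feats, train_labels, W, mem_new, mem_old, lam=1.0):
    """Equation (9): global term plus lambda times the local term.
    The default lam is a placeholder, not a value from the paper."""
    return (fixed_classifier_loss(train_feats, train_labels, W)
            + lam * feature_distillation_loss(mem_new, mem_old))


# Toy fixed prototypes: three unit vectors at 120 degrees in the plane.
s = math.sqrt(3.0) / 2.0
W = [[1.0, 0.0], [-0.5, s], [-0.5, -s]]
loss = cl2r_loss([[0.0, 0.0]], [0], W, [[1.0, 0.0]], [[0.0, 1.0]], lam=0.5)
```

In the toy call, the zero feature vector gives a uniform softmax, so the first term is log 3, and the orthogonal memory features give a distillation term of 1, so the total is log 3 + 0.5.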
7 EXPERIMENTAL RESULTS
7.1 Datasets and Verification Protocol
We compare our proposed CL2R training procedure and the baseline methods on several benchmarks: CIFAR10 [28], ImageNet20,1 ImageNet100 [22, 53, 71], Labeled Faces in the Wild (LFW) [24], and IJB-C [36]. Evaluation is performed in the open-set 1:1 search problem, with verification accuracy as the performance metric M in Equations (1) and (2) for all datasets except IJB-C, for which the true acceptance rate at a fixed false acceptance rate (TAR@FAR) is used. They are defined as \(\text{TAR} = {\text{TP}}/{(\text{TP} + \text{FN})}\), \(\text{FAR} = {\text{FP}}/{(\text{FP} + \text{TN})}\) and \(\text{ACC} = {(\text{TP} + \text{TN})}/{(\text{TP} + \text{TN} + \text{FP} + \text{FN})}\), where TP, TN, FP, and FN indicate true positives, true negatives, false positives, and false negatives, respectively [27, 58]. Following the verification protocol defined in the work of Huang et al. [24], we generate a set of pairs of images that do or do not belong to the same class. A pair is verified on the basis of the distance between the feature vectors of its samples. During the evaluation of task t, \(\phi _t\) is used to extract the feature representation for the first image of each pair (i.e., the query-set) and \(\phi _k\), with \(k \in \lbrace 1, \ldots , t\rbrace\), is used to extract the feature representation for the second image (i.e., the gallery-set). When \(k=t\), the compatibility test is the self-test; otherwise, it is the cross-test between the two representations learned from the tasks at time t and k. For the LFW and IJB-C evaluation, we use the original pairs provided by the respective datasets; for the CIFAR10, ImageNet20, and ImageNet100 evaluation, the verification pairs are randomly generated.
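The pair-based protocol above can be sketched as follows. This is an illustrative sketch, not the evaluation code of the paper: the Euclidean distance, the fixed threshold, and all function names are assumptions, since the text only states that a pair is verified from the distance between its feature vectors. The models are passed as plain callables so that the same function covers both the self-test (query and gallery model identical) and the cross-test (gallery model from an earlier task).

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two feature vectors (assumed distance)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pair_distances(pairs, query_model, gallery_model):
    """Distance for each (first_image, second_image) pair.

    query_model stands for phi_t and gallery_model for phi_k; k = t gives
    the self-test, k < t the cross-test.
    """
    return [l2_distance(query_model(a), gallery_model(b)) for a, b in pairs]


def verification_metrics(distances, same_class, threshold):
    """TAR, FAR, and ACC as defined in the text; a pair is accepted if d <= threshold."""
    tp = tn = fp = fn = 0
    for d, same in zip(distances, same_class):
        accepted = d <= threshold
        if accepted and same:
            tp += 1
        elif accepted and not same:
            fp += 1
        elif not accepted and same:
            fn += 1
        else:
            tn += 1
    tar = tp / (tp + fn) if (tp + fn) else 0.0
    far = fp / (fp + tn) if (fp + tn) else 0.0
    acc = (tp + tn) / (tp + tn + fp + fn)
    return {"TAR": tar, "FAR": far, "ACC": acc}
```

In a cross-test, only `gallery_model` changes; the query features and the pair labels stay the same, which is what makes the self-test and cross-test values directly comparable.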
As the open-set evaluation requires no overlap between classes of the training-set and test-set, we use CIFAR100 to perform CiL (i.e., classification is the surrogate task from which the feature representation is learned) and the CIFAR10 pairs are used as the verification test-set. Similarly, Tiny-ImageNet200 [29] is used as the training-set to evaluate the ImageNet20 pairs; LFW and IJB-C pairs are evaluated with models trained on CASIA-WebFace [75]. Finally, for ImageNet100, we train the models with images not included in ImageNet100 (i.e., the subset of the images of the remaining 900 classes that we named ImageNet900). These datasets are divided into tasks as described in Section 7.2.


7.2 Implementation Details
Our CL2R training procedure is implemented in PyTorch [42] and uses the publicly available library Continuum [17]. We used four NVIDIA Tesla A100s to train the representation models, and the neural network architectures are based on the PODNet implementation.2 The evaluation is carried out on several ResNet [20] architectures. Specifically, a 32-, 18-, and 50-layer ResNet is used for CIFAR10, ImageNet20 and ImageNet100, and LFW and IJB-C, respectively. As is typically used in CiL [22, 71], the episodic memory \(\mathcal {M}\) contains 20 samples for each class. The value of \(\lambda\) in Equation (9) is set to \(\lambda = \lambda _{\rm base} \sqrt {{k_n}/{k_0}}\) [22], in which \(\lambda _{\rm base}\) is a scalar, \(k_n\) is the number of classes of the current task, and \(k_0\) is the number of old classes in the episodic memory. The training details for each dataset are listed next.
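As a small worked example of the weighting rule above, the sketch below computes \(\lambda\) and plugs it into the total loss of Equation (9). The function names are hypothetical, and treating a task without old classes as an error is an assumption of this sketch, because the text does not say how the first task (with an empty memory) is handled.

```python
import math


def fd_weight(lambda_base, num_new_classes, num_old_classes):
    """lambda = lambda_base * sqrt(k_n / k_0), with k_n new and k_0 old classes."""
    if num_old_classes <= 0:
        # Assumption: the weight is undefined when the memory holds no old class.
        raise ValueError("num_old_classes must be positive")
    return lambda_base * math.sqrt(num_new_classes / num_old_classes)


def total_loss(global_loss, fd_loss, lam):
    """Equation (9): L = L_t + lambda * L_FD^M."""
    return global_loss + lam * fd_loss
```

For instance, in the 10-task CIFAR100 setting with \(\lambda _{\rm base} = 5\) and 10 classes per task, the second task has 10 new and 10 old classes, so \(\lambda = 5\); the fifth task has 10 new and 40 old classes, so \(\lambda = 2.5\). The weight of the distillation term therefore decreases as the memory accumulates classes.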
CIFAR100 and CIFAR10. We train the model for 70 epochs for each task with batch size 128, and optimization is performed with SGD with an initial learning rate of 0.1 and weight decay of \(2\cdot 10^{-4}\). The learning rate is divided by 10 at epochs 50 and 64. The input images are RGB, \(32 \times 32\). \(\lambda _{\rm base}\) is set to 5.
Tiny-ImageNet200 and ImageNet20. We train the model for 90 epochs at each task with batch size 256, and optimization is performed with SGD with an initial learning rate of 0.1 and a weight decay of \(2\cdot 10^{-4}\). The learning rate is divided by 10 at epochs 30 and 60. To properly evaluate the models in this learning setting, input images and the ImageNet test images are resized to match the Tiny-ImageNet200 input size (RGB \(64 \times 64\)). \(\lambda _{\rm base}\) is set to 5.
ImageNet900 and ImageNet100. We train the model for 90 epochs in each task with batch size 256, and optimization is performed with SGD with an initial learning rate of 0.1 and weight decay of \(2\cdot 10^{-4}\). The learning rate is divided by 10 at epochs 30 and 60. The input images are RGB, \(224 \times 224\). \(\lambda _{\rm base}\) is set to 10.
CASIA-WebFace and LFW/IJB-C. For each task, we train the model for 120 epochs with batch size 1,024. Optimization is carried out with SGD with an initial learning rate of 0.1 and a weight decay of \(2\cdot 10^{-4}\). The learning rate is divided by 10 at epochs 30, 60, and 90. The input images are RGB, \(112 \times 112\). \(\lambda _{\rm base}\) is set to 10.
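The four training configurations listed above can be collected in one place, together with the step schedule they share. This is only a sketch: the dictionary keys and the `learning_rate` helper are hypothetical names, and the convention that epochs are counted from 0 and that the drop takes effect from the milestone epoch onward is an assumption (it matches the usual multi-step scheduler behavior, but the text does not specify it).

```python
# Hyperparameters as reported in the text; the key names are illustrative.
TRAINING_CONFIGS = {
    "CIFAR100": {
        "epochs": 70, "batch_size": 128, "lr": 0.1, "weight_decay": 2e-4,
        "milestones": (50, 64), "input_size": 32, "lambda_base": 5,
    },
    "Tiny-ImageNet200": {
        "epochs": 90, "batch_size": 256, "lr": 0.1, "weight_decay": 2e-4,
        "milestones": (30, 60), "input_size": 64, "lambda_base": 5,
    },
    "ImageNet900": {
        "epochs": 90, "batch_size": 256, "lr": 0.1, "weight_decay": 2e-4,
        "milestones": (30, 60), "input_size": 224, "lambda_base": 10,
    },
    "CASIA-WebFace": {
        "epochs": 120, "batch_size": 1024, "lr": 0.1, "weight_decay": 2e-4,
        "milestones": (30, 60, 90), "input_size": 112, "lambda_base": 10,
    },
}


def learning_rate(config, epoch):
    """Learning rate at a 0-indexed epoch: divided by 10 at each milestone reached."""
    drops = sum(1 for milestone in config["milestones"] if epoch >= milestone)
    return config["lr"] / (10 ** drops)
```

With the CIFAR100 configuration, for example, the learning rate is 0.1 up to epoch 49, 0.01 from epoch 50, and 0.001 from epoch 64 until the end of the task.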
In Table 1, we summarize the datasets and the training details of our experiments.
Training-set and test-set of the same configuration have non-overlapping classes to properly evaluate different approaches in an open-set setup.
Table 1. Datasets Used in CL2R Training Procedures
7.3 Baselines and Compared Methods
We compare our training procedure with both the CiL methods and the recently proposed methods for compatible learning. Our baselines include LwF [32], LUCIR [22], BiC [71], PODNet [16], FOSTER [69], FAN [25], and BCT [62]. In particular, FAN and BCT are the only approaches with an explicit mechanism to address feature compatibility. We adapted FAN so that the learned adaptation functions are used to transform the features into compatible features. Since in BCT the model is trained from scratch at each task using all available data, for a fair comparison, we also re-implemented it with an episodic memory and refer to it as lifelong-BCT (\(\ell\)-BCT). At each task, the model is initialized with the parameters of the model of the previous task and the data of the previous tasks can be accessed only through the episodic memory. For LwF, BiC, and PODNet, we use their publicly available implementations,2 whereas for LUCIR and FOSTER, we adopted their official implementations.3 Finally, we also include a traditional Experience Replay (ER)-based baseline, where the model is continuously fine-tuned as new tasks become available. To evaluate our training procedure without considering the catastrophic forgetting phenomenon, we define as upper bound (UB) our training procedure re-trained from scratch at each task using an episodic memory with infinite size.
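The replay setting shared by ER and \(\ell\)-BCT, in which old data is reachable only through the episodic memory, can be sketched as below. The sketch is an assumption-laden illustration: the function names are hypothetical, and uniform random selection of the stored samples is an assumption, since the section only fixes the budget (20 samples per class, Section 7.2) and not the selection rule.

```python
import random
from collections import defaultdict


def update_memory(memory, task_data, samples_per_class=20, seed=0):
    """Return a new memory that also stores up to `samples_per_class` samples
    for every class seen in `task_data`.

    memory:    dict mapping class label -> list of stored samples
    task_data: iterable of (sample, label) pairs of the task just learned
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in task_data:
        by_class[label].append(sample)

    new_memory = {label: list(samples) for label, samples in memory.items()}
    for label, samples in by_class.items():
        k = min(samples_per_class, len(samples))
        # Assumption: uniform random selection of the samples to keep.
        new_memory[label] = rng.sample(samples, k)
    return new_memory


def training_set(memory, task_data):
    """Data visible while learning a task: memory samples plus current-task data."""
    replay = [(sample, label) for label, samples in memory.items() for sample in samples]
    return replay + list(task_data)
```

Under this scheme, the number of old samples seen at each task is bounded by 20 times the number of old classes, whereas BCT and the UB have access to all past data.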
7.4 Evaluation on CIFAR10
In this section, we report the experiments in 2-, 3-, 5-, and 10-task CL2R settings with models trained on CIFAR100 (i.e., using 50, 33, 20, and 10 classes per task) where compatibility is evaluated on the CIFAR10 generated pairs.
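The class counts per task quoted above follow from an integer division of the 100 CIFAR100 classes by the number of tasks. The sketch below reproduces them; the function names are hypothetical, and both the consecutive class ordering and the choice to leave out the classes that do not fill a whole task (one class in the 3-task case) are assumptions of this sketch, since the text does not state how the remainder is handled.

```python
def classes_per_task(num_classes, num_tasks):
    """Number of classes assigned to each task (integer division)."""
    return num_classes // num_tasks


def split_classes(num_classes, num_tasks):
    """Assign consecutive class indices to tasks; leftover classes are not used."""
    size = classes_per_task(num_classes, num_tasks)
    return [list(range(t * size, (t + 1) * size)) for t in range(num_tasks)]
```

For example, `split_classes(100, 3)` yields three tasks of 33 classes each, covering class indices 0 to 98.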
In Table 2, we summarize the performance of our CL2R training procedure with respect to the other baselines in the two-task scenario. We evaluate the compatibility of the updated model according to the ECC (Equation (1)), BC (Equation (5)), and FC (Equation (6)). The first row of Table 2 reports the verification accuracy of the model trained on the first 50 classes of CIFAR100. Experiments show that, among the methods compared, LUCIR and PODNet may have an inherent, although limited, level of compatibility in their representations. This confirms the importance of having some form of mechanism to preserve the local geometry of the learned features. Our training procedure achieves the highest cross-test, BC, and FC, which makes it the best suited training procedure to avoid re-indexing.
Two-task CL2R setting with models trained on CIFAR100. Initial Task (i.e., the previous task) shows the verification accuracy on the first 50 classes, and the other rows represent the performance obtained after two tasks.
*Not subject to catastrophic forgetting.
Table 2. CIFAR10 Evaluation
In the last rows of the table, we report the performance of BCT and our UB, which are not affected by catastrophic forgetting. The effect of catastrophic forgetting and its implications on the reduction of performance in compatibility can be observed in the self-test: the values of BCT and the UB are significantly higher than those of the methods trained using CiL.
In Table 3, results for the 3-, 5-, and 10-task CL2R scenarios are presented. For each experiment, we report AC (Equation (4)), BC (Equation (5)), and FC (Equation (6)). Our method always achieves the highest AC, thus obtaining the largest number of compatible representations between models, and always achieves the highest BC among the methods that are subject to catastrophic forgetting. FAN achieves almost the same performance as our procedure in the 3-task scenario, but its performance decreases significantly as the number of tasks increases, especially in the 10-task setting. This may be due to the increasing number of adaptation functions between different feature spaces that FAN uses to adapt old features to the new ones. As can be noticed from the two tables, FOSTER does not learn compatible features. This may be due to the fact that feature space compression forces the representation to change abruptly, reducing the overall compatibility with previous models. BCT reports higher values since its representation is learned from scratch for each new task. Compared to the UB, our training procedure achieves lower AC and BC, which is due to the influence of catastrophic forgetting. From the table, it can also be noticed that BiC, LUCIR, and PODNet do not satisfy compatibility when catastrophic forgetting is more severe, as, for example, in the case of 10 tasks. Overall, these results suggest that the interaction between local and global stationarity promoted by our training procedure yields a significant improvement in performance that FD alone cannot provide.
7.5 Evaluation on ImageNet
In this section, we conducted the experiments with models trained on Tiny-ImageNet200 in CL2R settings with 2 (Table 4), 3, 5, and 10 (Table 5) tasks.
The two-task CL2R setting with models trained on Tiny-ImageNet200. The Initial Task (i.e., the previous task) shows verification accuracy on the first 100 classes, and the other rows represent the performance obtained after two tasks.
*Not subject to catastrophic forgetting.
Table 4. ImageNet20 Evaluation
Table 4 follows the same structure as Table 2, showing the ECC (Equation (1)), BC (Equation (5)), and FC (Equation (6)) values. For all compared methods, the initial model (i.e., the previous model) is trained on the first 100 classes of Tiny-ImageNet200. As can be seen in the table, our method achieves the best performance. Other methods, such as FAN and LUCIR, reach a certain level of compatibility, although with low values, which confirms again that the distillation they are equipped with is a useful tool to support learning compatible features. As also observed in the CIFAR results, methods not subject to catastrophic forgetting (i.e., BCT and our UB) achieve higher BC and lower FC.
Table 5 shows the 3-, 5-, and 10-task CL2R settings for Tiny-ImageNet200. In these learning scenarios, each task is made up of 66, 40, and 20 classes, respectively. We discuss the results by analyzing the values of AC (Equation (4)), BC (Equation (5)), and FC (Equation (6)). Our approach always achieves the highest value of AC. In particular, ER, LwF, BiC, FAN, and \(\ell\)-BCT do not achieve a lifelong-compatible representation in the 3-task setting, as shown by AC = 0. In the 10-task CL2R setting, it is more evident that, as the number of tasks increases, methods without any specific mechanism to preserve the representation typically cannot learn compatible representations. LUCIR, BiC, and \(\ell\)-BCT obtain significantly lower values than our method. Specifically, the AC of our method is more than twice that of BCT, which means that our CL2R procedure obtains more than twice as many compatible representations as BCT. This may be caused by the fact that the constraints imposed by these techniques on the learned representation seem to have very little effect on its stationarity, and consequently on its compatibility. The results on the 10-task setting are also important, as they suggest that catastrophic forgetting is not an intrinsic impediment to learning compatible representations. The performance difference of 0.11 in AC with respect to the UB can be considered clear evidence of this effect. Finally, the table shows that our training procedure provides the highest FC and is the only case where FC is always positive. As a result, our training procedure achieves, on average, cross-tests higher than self-tests, indicating that the system performs better even without re-indexing the gallery.
Table 6 reports ImageNet100 results when models are trained on ImageNet900 with two and three tasks. We compare our approach with \(\ell\)-BCT, which achieves reasonable performance and has an explicit mechanism to learn compatible features under catastrophic forgetting. As can be noticed from the table, our CL2R training clearly outperforms \(\ell\)-BCT. Our method achieves good scores for AC in both scenarios. The reduced performance of \(\ell\)-BCT appears to be connected to the fact that its training procedure is based only on pairwise model training (i.e., compatibility is only learned from the previous model). In contrast, our method is not based only on pairwise learning and does not use previous classifiers, which may be incorrectly learned.
7.6 Face Verification
In this section, we report the experimental results on the LFW and IJB-C benchmarks in the 2-, 3-, 5-, and 10-task CL2R settings. We incrementally train the representation models on CASIA-WebFace, resulting in tasks composed of 5,287, 3,525, 2,115, and 1,057 classes, respectively.
The results are summarized in Tables 7 and 8 for LFW and IJB-C, respectively. In particular, for IJB-C, we report AC, BC, and FC at different false acceptance rates (FAR): \(10^{-1}\), \(10^{-2}\), and \(10^{-4}\). In this evaluation, we do not report LUCIR when training on CASIA-WebFace due to the excessive memory requirements of the original implementation.3 Although in the 2-task scenario the results are comparable to those of \(\ell\)-BCT, in the 3- and 5-task settings our training procedure achieves complete compatibility, resulting in AC = 1 and an always positive BC. In the 10-task setting, the difference in performance increases more significantly, confirming the overall positive performance of our method. Generally, the reported performance is higher on face datasets than on CIFAR10, ImageNet20, and ImageNet100. A possible reason is that, in face recognition, the domain shift between classes is lower than in CIFAR or ImageNet. Finally, this experiment shows that the proposed method is effective not only with a larger number of model updates but also with larger datasets.
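For readers unfamiliar with the TAR@FAR measure used on IJB-C, the sketch below shows one common way to compute it from pair distances: the acceptance threshold is set from the impostor (different-identity) pairs so that the false acceptance rate does not exceed the target, and the true acceptance rate is then read off the genuine (same-identity) pairs. This is an illustrative sketch under stated assumptions, not the evaluation code used in the paper; the function name, the strict acceptance rule `d < threshold`, and the tie handling are assumptions.

```python
import math


def tar_at_far(genuine_distances, impostor_distances, target_far):
    """Return (TAR, threshold) such that at most target_far of impostor pairs are accepted.

    A pair is accepted when its distance is strictly below the threshold.
    """
    if not genuine_distances or not impostor_distances:
        raise ValueError("both pair lists must be non-empty")
    impostors = sorted(impostor_distances)
    # Maximum number of impostor pairs that may be (wrongly) accepted.
    allowed = math.floor(target_far * len(impostors) + 1e-9)
    if allowed >= len(impostors):
        threshold = math.inf
    else:
        # Strictly fewer than allowed + 1 impostors lie below this value.
        threshold = impostors[allowed]
    accepted = sum(1 for d in genuine_distances if d < threshold)
    return accepted / len(genuine_distances), threshold
```

Lower target FAR values push the threshold toward smaller distances, so TAR at FAR \(10^{-4}\) is a much stricter measure than TAR at FAR \(10^{-1}\).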
Table 7. Face Verification on the LFW Dataset
Table 8. Face Verification on the IJB-C Dataset
7.7 Compatibility and Catastrophic Forgetting
In this section, we study how compatibility is related to the problem of catastrophic forgetting. In Figure 2, we show the evolution of BC in the 5- and 10-task CL2R scenarios. In particular, Figure 2(a) and (b) and Figure 2(c) and (d) show the evaluations on the CIFAR10 and ImageNet20 datasets, respectively. We compared our approach with ER, LwF, BiC, LUCIR, FAN, and \(\ell\)-BCT. As can be observed, our training procedure achieves the highest performance. As its BC is, on average, closer to zero than that of the other evaluated methods, the representation learned by our training procedure can be considered the most compatible and, from the perspective of visual search, equivalent to the representation models learned from previous tasks. In practice, this reduces the computational cost of re-indexing.
Fig. 2. Backward compatibility evolution across tasks t (i.e., \(BC(t)\) ). Comparison between our CL2R training and other methods in 5- and 10-task learning setups. (a, b) CIFAR10 results. (c, d) ImageNet20 results.
In contrast, FAN achieves a negative value of BC in all four settings, confirming that the composition of an increasing number of feature adaptation functions between sequentially learned representations causes a decrease in compatibility. Even when there is no considerable performance loss, as in the case of FAN, negative BC values indicate a constant deterioration in performance as the number of tasks increases. In general, the figure shows that all methods except ours follow a common trend with lower performance.
8 ABLATION STUDIES
We analyze by ablation the main components of our training procedure. The ablation is performed on the CIFAR100 dataset as described in Section 7.4 and considers the 10-task CL2R setting, which can be regarded as a worst-case scenario for this dataset. We analyze the impact of (i) the specific classifier (Trainable vs. fixed d-Simplex, with or without the FD component), (ii) how the FD loss is evaluated, and (iii) the sensitivity to the number of samples reserved per class in the episodic memory.
Impact of the d-Simplex fixed classifier and FD. As can be noticed from Table 9, the Trainable classifier is not able to learn compatible representations. When combined with FD, the performance improves only marginally and not sufficiently to be comparable with the CiL approaches shown in Table 3. FD evaluated only on the samples stored in episodic memory, as defined in \(\mathcal {L}_{\scriptscriptstyle \textrm {FD}}^{\scriptscriptstyle \mathcal {M}}\) (Equation (8)), improves the values of the reported metrics, showing that it provides a better supervision signal for the updated model. The d-Simplex alone improves on the previous components, obtaining values of AC = 0.27 and FC = 0.003, which are higher than those of the Trainable classifier with \(\mathcal {L}_{\scriptscriptstyle \textrm {FD}}^{\scriptscriptstyle \mathcal {M}}\). This underlines the importance of preserving the global geometry of the learned features according to the d-Simplex fixed classifier.
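To give an intuition of the global geometry that a fixed simplex classifier imposes, the sketch below builds a set of fixed, equally spaced class prototypes. It is not the construction of Equation (7), which lies outside this section: it uses one standard way to obtain the vertices of a regular simplex (centering and normalizing the canonical basis vectors), and the function names are hypothetical. The resulting K prototypes are unit vectors whose pairwise cosine similarity is exactly \(-1/(K-1)\), so no class direction is privileged and the prototypes never move during training.

```python
import math


def simplex_prototypes(num_classes):
    """Return K unit-norm prototypes forming a regular simplex.

    Each prototype is a canonical basis vector of R^K with the centroid
    subtracted, rescaled to unit length. The vectors span a (K-1)-dimensional
    subspace; this parametrization is an illustrative assumption.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    k = num_classes
    scale = math.sqrt(k / (k - 1))  # inverse norm of a centered basis vector
    return [
        [((1.0 if j == i else 0.0) - 1.0 / k) * scale for j in range(k)]
        for i in range(k)
    ]


def dot(u, v):
    """Inner product of two vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))
```

Because the prototypes are fixed, a feature that stays aligned with its class prototype across model updates keeps the same position relative to all other classes, which is the global stationarity effect examined in this ablation.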
The evaluation is performed on CIFAR10 and training is based on CIFAR100 with 10 tasks, where Trainable indicates the traditional ER baseline, Fixed indicates ER with stationary features learned from Equation (7) according to the fixed d-Simplex classifier, \(\mathcal {L}_{\scriptscriptstyle \textrm {FD}}\) is the traditional FD, and \(\mathcal {L}_{\scriptscriptstyle \textrm {FD}}^{\scriptscriptstyle \mathcal {M}}\) is the FD evaluated only on the samples stored in episodic memory as defined in Equation (8).
Table 9. Ablation of the Different Main Components of Our CL2R Training Procedure
Impact of memory samples on FD. Table 9 shows that when the distillation loss is evaluated only on the samples stored in episodic memory, \(\mathcal {L}_{\scriptscriptstyle \textrm {FD}}^{\scriptscriptstyle \mathcal {M}}\) (Equation (8)), our approach achieves better overall results. We argue that this positive effect is mostly due to the interaction between the global feature stationarity learned using the fixed classifier and the local one promoted through FD using only the samples observed in the episodic memory. The interaction is most likely related to the fact that the fixed d-Simplex classifier in general does not allow novel classes to interfere in the feature space of the already learned ones. This in turn provides favorable working conditions (i.e., a kind of coarse pre-alignment) for the distillation loss to align the features with those of the previous model. As expected, the impact intensifies when the loss is evaluated only on already known classes, as the alignment is less prone to unexpected noisy features that may reduce its degree. This confirms the effectiveness of restricting FD to memory samples, in contrast to the traditional FD commonly used in CiL.
Impact of the episodic memory size. Figure 3 shows the effect of different numbers of reserved samples per class for both our learning procedure and other baselines. As expected, the more samples per class are reserved in the episodic memory, the better the performance. Our approach, with 20 samples per class, achieves results similar to those obtained by the other methods with more examples per class. Although ER, LUCIR, and FAN have a better relative improvement with 50 samples per class, overall our approach results in the highest performance in learning compatible features.
Fig. 3. The effect of the number of reserved samples per class in the episodic memory.
We also evaluated the methods in the challenging memory-free training setting (i.e., without the episodic memory). Our training procedure achieves the highest results in this condition as well, which underlines the fact that CiL methods typically do not have an inherent mechanism to learn compatible features.
9 CONCLUSION
In this article, we have introduced the problem of CL2R, which considers the compatibility learning problem within the lifelong learning paradigm. We introduced a novel set of metrics to properly evaluate this problem and proposed a novel CL2R training procedure that imposes global and local stationarity on the learned features to achieve compatibility between representations under catastrophic forgetting. Global and local stationarity is imposed according to the d-Simplex fixed classifier and the FD loss, respectively. Empirical evaluation of the learned lifelong-compatible representation shows the effectiveness of our method with respect to baselines and state-of-the-art methods.
Footnotes
1 To meet the open-set protocol, we generated a training set from ImageNet [57] by randomly sampling 20 classes that are not included in the Tiny-ImageNet200 dataset. The indices of the ImageNet classes we use are the following: {n02276258, n01728572, n03814906, n02817516, n03769881, n03220513, n04442312, n04252225, n13037406, n04266014, n03929855, n02804414, n01873310, n03532672, n01818515, n03916031, n03345487, n02114855, n04589890, n03776460}.
2 https://github.com/arthurdouillard/incremental_learning.pytorch.
3 https://github.com/hshustc/CVPR19_Incremental_Learning and https://github.com/G-U-N/ECCV22-FOSTER.
REFERENCES
- [1] 2022. Contrastive supervised distillation for continual representation learning. In Proceedings of the International Conference on Image Analysis and Processing. 597–609.
- [2] 2021. A comprehensive study of class incremental learning algorithms for visual tasks. Neural Networks 135 (2021), 38–54.
- [3] 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 8 (2013), 1798–1828.
- [4] 2021. CoReS: Compatible representations via stationarity. arXiv preprint arXiv:2111.07632 (2021).
- [5] 2021. Asymmetric metric learning for knowledge transfer. In Proceedings of the 2021 Conference on Computer Vision and Pattern Recognition (CVPR’21). IEEE, Los Alamitos, CA, 8228–8238.
- [6] 2019. R3 Adversarial network for cross model face recognition. In Proceedings of the 2019 Conference on Computer Vision and Pattern Recognition (CVPR’19). IEEE, Los Alamitos, CA, 9868–9876.
- [7] 2021. Feature estimations based correlation distillation for incremental image retrieval. IEEE Transactions on Multimedia 24 (2021), 1844–1856.
- [8] 2020. On the exploration of incremental learning for fine-grained image retrieval. In Proceedings of the 31st British Machine Vision Conference (BMVC’20).
- [9] 2018. Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 12, 3 (2018), 1–207.
- [10] 2005. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 1. IEEE, Los Alamitos, CA, 539–546.
- [11] 2021. Sustainable artificial intelligence through continual learning. arXiv preprint arXiv:2111.09437 (2021).
- [12] 2021. Probing representation forgetting in continual learning. In Proceedings of the NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications.
- [13] 2021. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (2021), 3366–3385.
- [14] 2019. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4690–4699.
- [15] 2018. Don’t forget, there is more than forgetting: new metrics for Continual Learning. arXiv preprint arXiv:1810.13166 (2018).
- [16] 2020. PODNet: Pooled outputs distillation for small-tasks incremental learning. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16. Springer, 86–102.
- [17] 2021. Continuum: Simple management of complex continual learning scenarios. arXiv:2102.06253 (2021).
- [18] 1995. Long-term working memory. Psychological Review 102, 2 (1995), 211.
- [19] 2016. Deep image retrieval: Learning global representations for image search. In Computer Vision – ECCV 2016. Springer International Publishing, Cham, Switzerland, 241–257.
- [20] 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
- [21] 2015. Distilling the knowledge in a neural network. In Proceedings of the NIPS Deep Learning and Representation Learning Workshop.
- [22] 2019. Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 831–839.
- [23] 2019. Towards visual feature translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3004–3013.
- [24] 2007. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. Technical Report 07-49. University of Massachusetts, Amherst.
- [25] 2020. Memory-efficient incremental learning through feature adaptation. In Proceedings of the European Conference on Computer Vision. 699–715.
- [26] 2018. Less-forgetful learning for domain expansion in deep neural networks. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
- [27] 2015. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1931–1939.
- [28] 2009. Learning Multiple Layers of Features from Tiny Images. Technical Report TR-2009. University of Toronto.
- [29] 2015. Tiny ImageNet visual recognition challenge. CS 231N 7, 7 (2015), 3.
- [30] 2014. Labeled Faces in the Wild: Updates and New Reporting Procedures. Technical Report UM-CS-2014-003. University of Massachusetts, Amherst.
- [31] 2015. Convergent learning: Do different neural networks learn the same representations? In Proceedings of the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS 2015. 196–212.
- [32] 2017. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 12 (2017), 2935–2947.
- [33] 2021. Rotational dynamics reduce interference between sensory and memory representations. Nature Neuroscience 24 (2021), 715–726.
- [34] 2017. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems 30 (2017).
- [35] 2020. Class-incremental learning: Survey and performance evaluation. arXiv preprint arXiv:2010.15277 (2020).
- [36] 2018. IARPA Janus Benchmark C: Face dataset and protocol. In Proceedings of the 2018 International Conference on Biometrics (ICB’18). IEEE, Los Alamitos, CA, 158–165.
- [37] 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, Vol. 24. Elsevier, 109–165.
- [38] 2021. Learning compatible embeddings. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’21). 9939–9948.
- [39] 2018. Dynamic population coding and its relationship to working memory. Journal of Neurophysiology 120, 5 (2018), 2260–2268.
- [40] 2017. Stable population coding for working memory coexists with heterogeneous neural dynamics in prefrontal cortex. Proceedings of the National Academy of Sciences 114, 2 (2017), 394–399.
- [41] 2019. Continual lifelong learning with neural networks: A review. Neural Networks 113 (2019), 54–71.
- [42] 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019), 8026–8037.
- [43] 2018. Memory based online learning of deep representations from video streams. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18).
- [44] 2019. Fix your features: Stationary and maximally discriminative embeddings using regular polytope (fixed classifier) networks. arXiv preprint arXiv:1902.10441 (2019).
- [45] 2019. Maximally compact and separated features with regular polytope networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
- [46] 2022. Regular polytope networks. IEEE Transactions on Neural Networks and Learning Systems 33, 9 (2022), 4373–4387.
- [47] 2021. Class-incremental learning with pre-allocated fixed classifiers. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR’20). IEEE, Los Alamitos, CA, 6259–6266.
- [48] . 2020. Self-supervised on-line cumulative learning from video streams. Computer Vision and Image Understanding 197 (2020), 102983.Google Scholar
Cross Ref
- [49] 2019. Privacy in the age of medical big data. Nature Medicine 25, 1 (2019), 37–43.
- [50] 2021. Lifelong person re-identification via adaptive knowledge accumulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7901–7910.
- [51] 2022. Forward compatible training for representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- [52] 1990. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review 97, 2 (1990), 285.
- [53] 2017. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2001–2010.
- [54] 1993. Catastrophic forgetting in neural networks: The role of rehearsal mechanisms. In Proceedings of the 1993 1st New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems. IEEE, Los Alamitos, CA, 65–68.
- [55] 1995. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science 7, 2 (1995), 123–146.
- [56] 2015. FitNets: Hints for thin deep nets. In Proceedings of the 3rd International Conference on Learning Representations (ICLR’15).
- [57] 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
- [58] 2021. A unified survey on anomaly, novelty, open-set, and out-of-distribution detection: Solutions and future challenges. arXiv preprint arXiv:2110.14051 (2021).
- [59] 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 815–823.
- [60] 2020. Green AI. Communications of the ACM 63, 12 (2020), 54–63.
- [61] 2014. CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 806–813.
- [62] 2020. Towards backward-compatible representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6368–6377.
- [63] 2014. Deep learning face representation by joint identification-verification. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’14). 1988–1996.
- [64] 2014. DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1701–1708.
- [65] 2016. Particular object retrieval with integral max-pooling of CNN activations. In Proceedings of the International Conference on Learning Representations (ICLR’16).
- [66] 2020. The ethical questions that haunt facial-recognition research. Nature 587, 7834 (2020), 354–358.
- [67] 2021. Continual learning for classification problems: A survey. In Proceedings of the International Conference on Computational Intelligence in Data Science. 156–166.
- [68] 2020. Unified representation learning for cross model compatibility. In Proceedings of the 31st British Machine Vision Conference (BMVC’20).
- [69] 2022. FOSTER: Feature boosting and compression for class-incremental learning. arXiv preprint arXiv:2204.04662 (2022).
- [70] 2018. Towards understanding learning representations: To what extent do different neural networks learn the same representation. In Advances in Neural Information Processing Systems, Vol. 31.
- [71] 2019. Large scale incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 374–382.
- [72] 2021. DER: Dynamically expandable representation for class incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3014–3023.
- [73] 2015. Aggregating local deep features for image retrieval. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV’15). 1269–1277.
- [74] 2021. Deep learning for person re-identification: A survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (2021), 2872–2893.
- [75] 2014. Learning face representation from scratch. arXiv preprint arXiv:1411.7923 (2014).
- [76] 2014. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, Vol. 27.
- [77] 2020. Maintaining discrimination and fairness in class incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13208–13217.
- [78] 2015. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision. 1116–1124.
- [79] 2016. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984 (2016).
- [80] 2019. Omni-scale feature learning for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3702–3712.