Continual Few-shot Learning with Transformer Adaptation and Knowledge Regularization

Continual few-shot learning, a paradigm that simultaneously addresses continual learning and few-shot learning, has become a challenging problem in machine learning. An eligible continual few-shot learning model is expected to distinguish all seen classes as new categories arrive, where each category includes only a few labeled samples. However, existing continual few-shot learning methods consider only the visual modality, where the distributions of new categories often indistinguishably overlap with those of old categories, resulting in severe catastrophic forgetting. To tackle this problem, in this paper we study continual few-shot learning with the assistance of semantic knowledge, simultaneously taking both the visual modality and the semantic concepts of categories into account. We propose a Continual few-shot learning algorithm with Semantic knowledge Regularization (CoSR) that adapts to the distribution changes of visual prototypes through a Transformer-based prototype adaptation mechanism. Specifically, the original visual prototypes from the backbone are fed into a well-designed Transformer together with the corresponding semantic concepts, which are extracted from all categories. The semantic-level regularization forces categories with similar semantics to be closely distributed, while semantically dissimilar ones are constrained to stay far apart. This regularization improves the model's ability to distinguish between new and old categories, significantly mitigating the catastrophic forgetting problem in continual few-shot learning. Extensive experiments on CIFAR100, miniImageNet, CUB200 and an industrial dataset with a long-tail distribution demonstrate the advantages of our CoSR model over state-of-the-art methods.


CCS CONCEPTS
• Computing methodologies → Computer vision; Learning paradigms; Machine learning approaches; Knowledge representation and reasoning.

INTRODUCTION
Deep neural networks (DNNs) have achieved great success when a large amount of labeled data is available. For example, DNNs can accurately perform image classification once well trained on enough labeled data. In practice, new data and tasks always arrive in sequence, calling for an ideal machine learning model that can recognize newly-arrived data associated with new classes while simultaneously maintaining the knowledge of previous classes. Continual learning is such a learning paradigm, aiming to alleviate the catastrophic forgetting of old classes upon the sequential arrival of new classes [16, 23]. Nevertheless, the amount of newly-arrived data is usually limited, requiring the model to quickly adapt to new classes with few-shot data. Continual few-shot learning, as a paradigm that simultaneously solves continual learning and few-shot learning [1, 36], has recently attracted increasing attention in the research community.
Compared with the traditional learning paradigm, continual few-shot learning is more analogous to human learning, since humans can learn new concepts from a limited amount of data while retaining most previously acquired knowledge. There are two learning phases in continual few-shot learning: the base learning phase and the continual learning phase. In the base learning phase, the model is trained on base classes with fully labeled data for each class. In the continual learning phase, the model is expected to learn new classes from a small amount of labeled data while maintaining the knowledge of the base classes. Given that the labels of new classes have not appeared in the base classes, continual few-shot learning requires the model to quickly adapt to the distribution changes introduced by new classes while maintaining the ability to distinguish all old classes. Because the old classes are unavailable during the continual learning phase, the distributions fitted to new classes tend to overlap with those fitted to old classes, which causes traditional machine learning approaches to fail at distinguishing new classes from old ones. In sum, there are two key challenges in continual few-shot learning.
(1) The catastrophic forgetting issue: traditional deep learning models tend to overfit the new classes while forgetting the knowledge of old classes. (2) The requirement of fast learning with only a limited amount of labeled data from new classes.
On the one hand, existing works in continual learning focus on the problem of catastrophic forgetting. Some works [4, 16, 19] tackle forgetting by constraining parameter shifts in the deep learning model. Several works [2, 22, 23, 30] propose to expand the model or learn parameter masks for new classes. Other works [8, 21, 27, 32] use rehearsal memory, storing samples from previous classes or generating samples to alleviate forgetting. On the other hand, these continual learning models suffer from relatively large estimation errors when only a limited number of samples is available, motivating continual few-shot learning, which learns prototypes from the given support images and classifies an input image according to distance criteria such as Euclidean or cosine distance. Several existing approaches [5, 24, 28, 44, 45, 47] generate adaptive prototypes through well-designed adaptation mechanisms that only utilize the visual signals of images, without considering the semantic knowledge hidden in text or the semantic association between the base and new classes. Another work [7] makes use of word embeddings to align visual prototypes in the textual feature space, but the sparsely distributed textual space may not be suitable for visual prototype learning. Overall, existing works largely ignore the importance of semantic knowledge in distinguishing both new and old classes over continually arriving tasks, and thus fail to solve the catastrophic forgetting and fast learning problems simultaneously.
To solve this problem, we propose to extract semantic knowledge from categorical information as a regularization for learning semantically consistent visual prototypes. This design enables us to perform continual classification through nearest-neighbor matching instead of fine-tuning the model with a limited amount of data.

RELATED WORK
In this section, we review related works on traditional continual learning, few-shot learning and the most recent continual few-shot learning, which are most relevant to our work.
Continual Learning. Continual learning aims to continually learn new categories and classify all seen categories. There are three main families of solutions for reducing catastrophic forgetting or interference. Regularization-based methods [4, 16, 19] introduce prior distributions over parameters to penalize shifts of important parameters, or use knowledge distillation as data regularization. Model expansion-based methods [2, 22, 23, 30] dedicate different parameters to each task by freezing old-task parameters or adding parameter masks; these works usually suffer from large model storage as new classes keep arriving. Replay-based methods [8, 21, 27, 32, 42] store or generate old samples in memory to balance old and new tasks and alleviate forgetting. Our work is related to continual learning but more challenging, with fewer samples for the new classes. With only a few training samples, fine-tuning the feature space brings a relatively large estimation error; we therefore solve the problem through prototype adaptation with semantic knowledge.
Few-shot Learning. Traditional few-shot learning aims to distinguish unseen classes with few given samples, but ignores the distinction between new and old categories [38]. There are three mainstreams of few-shot learning. Data-generation-based methods [12, 13, 29, 39] usually apply a pre-defined data generation function to expand the training data of unseen categories. Metric-based methods [6, 25, 33, 34, 37] focus on embedding learning or refinement for few-shot tasks. Meta-learning-based methods [11, 20, 40, 46] use prior knowledge to search for an optimal parameter initialization.
Our work is more related to metric-based methods, since we use semantic knowledge as regularization to refine the prototypes in the feature space. Different from traditional few-shot learning, continual few-shot learning requires the model to not only recognize new categories but also maintain the ability to distinguish between all seen categories.
Continual Few-shot Learning. Continual few-shot learning has been studied recently. F2M [31] suggests that flat local minima in the training of base classes can facilitate the learning of new classes. TOPIC [36] and TPCIL [35] are both topology-preserving methods that learn and preserve the topology of the feature manifold to mitigate forgetting of old classes. ERDIL [10] selects representative samples from old categories to construct a relation graph for knowledge distillation, which eases the forgetting problem in continual few-shot learning. Some works [18, 43] solve the problem by calibrating features or classifiers of new categories to alleviate forgetting. Other works [5, 24, 28, 44, 45, 47] generate adaptive prototypes (reference vectors or classifiers) through well-designed adaptation mechanisms. However, they mostly focus on the single image modality for feature learning, ignoring the semantic association among classes. Semantic knowledge is used as external information for continual few-shot learning in [1, 7]. Cheraghian et al. [7] use word embeddings as semantic information to align visual and semantic vectors with an attention mechanism; however, the visual features are directly projected into the sparse textual space, which may suppress the effect of visual information. Akyürek et al. [1] learn the new classifiers with a regularization based on semantic similarity, where the semantic association among different categories is linearly affine to the space of classifiers.
Our work differs from the above methods in its semantic regularization. We design two schemes for using semantic knowledge: one aligns the visual and textual vectors of each category with a Transformer; the other extracts semantic prototypes in the visual space using textual information and then enhances the visual prototypes with the semantic ones.

THE PROPOSED METHOD
Our proposed CoSR model consists of two stages: the base learning phase and the continual learning phase. The CoSR model is first trained with base data and then continually learns from newly-arrived few-shot data.

Problem Formulation
We define the continual few-shot learning setting as follows. Given a sequence of tasks $\{\mathcal{T}_0, \mathcal{T}_1, \ldots, \mathcal{T}_T\}$, where $T+1$ is the number of tasks, each task $\mathcal{T}_t$ ($0 \le t \le T$) contains a training set $D^t_{train}$ and a test set $D^t_{test}$, where $n_t$ is the number of samples in $D^t_{train}$ and $C^t$ is the label set of task $\mathcal{T}_t$. Considering that there is no category overlap between different tasks, we have $\forall i \ne j,\ C^i \cap C^j = \emptyset$. In continual few-shot learning, task $\mathcal{T}_0$ is named the base task and includes a full training set of base classes. Each task $\mathcal{T}_t$ with $t > 0$ is a novel task (or new task), which includes only a few samples for training. If each novel task contains $N$ new classes with $K$ labeled samples per class, the setting is called an $N$-way $K$-shot problem. Following common practice in continual few-shot learning, the training samples in the base task and the novel tasks are severely unbalanced, with $K$ normally no larger than 5, while the size of the test set in each task is balanced. Therefore, a continual few-shot learning model is first trained on the base task and then learns the novel tasks in sequence. After training on $D^t_{train}$ for the $t$-th task, the model is evaluated on the test sets of the current task and all previous tasks, i.e., $D^0_{test}, D^1_{test}, \ldots, D^t_{test}$. This challenging setting requires a good continual few-shot learning model to quickly learn from a very small number of data samples for each novel task while simultaneously maintaining the capability of distinguishing between previous classes and new classes.
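The task protocol above can be sketched in a few lines. The helper names (`split_tasks`, `sample_episode`) and the 60-base-class, 5-way split mirroring the CIFAR100 setup are illustrative assumptions, not the paper's actual code:

```python
import random

def split_tasks(all_classes, base_size, n_way):
    """Partition a class list into one base task and a sequence of
    class-disjoint N-way novel tasks."""
    base = all_classes[:base_size]
    novel = [all_classes[i:i + n_way]
             for i in range(base_size, len(all_classes), n_way)]
    return base, novel

def sample_episode(data_by_class, classes, k_shot, rng):
    """Draw a K-shot support set for each class of a task."""
    return {c: rng.sample(data_by_class[c], k_shot) for c in classes}

# 100 classes -> 60 base classes + 8 novel tasks of 5 classes each
base, novel = split_tasks(list(range(100)), base_size=60, n_way=5)
assert len(base) == 60 and len(novel) == 8
assert all(set(t).isdisjoint(base) for t in novel)  # no category overlap

rng = random.Random(0)
data_by_class = {c: list(range(10)) for c in range(100)}  # toy sample ids
support = sample_episode(data_by_class, novel[0], k_shot=5, rng=rng)
assert all(len(v) == 5 for v in support.values())   # 5-way 5-shot support
```

The disjointness assertion encodes the constraint $C^i \cap C^j = \emptyset$ from the formulation.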

Continual Few-shot Learning with Semantic Knowledge Regularization
The challenge of continual few-shot learning comes from two aspects: i) fast adaptive learning of new classes, and ii) catastrophic forgetting of old classes. In the base learning phase, the model is trained with base classes to learn the base task distribution in the latent feature space. In the continual learning phase, the distributions of new classes must be quickly learned from only a few samples; these distributions usually tend to overlap with previously learned classes in the latent feature space, leading to the catastrophic forgetting problem. To tackle this issue, we extract semantic information from the textual modality to help the model discover a better feature space in which the new and previous classes do not overlap. As shown in Figure 2 (a), the base learning phase includes multiple training episodes. In each episode, an $N$-way $K$-shot support set is sampled from the base database, and a query set is sampled from the same classes. Both the query and support images are fed into the learnable CNN backbone to obtain the corresponding visual features. From the latent representations of the support images, the visual prototype of each category, denoted as $p_v$, is computed as the center of that category. Intuitively, we can calculate the Euclidean or cosine distance between the query representation and each visual prototype to decide which category the query belongs to. However, the representation space may change over time, because the distributions of new classes can drift away from the base classes. Thus, the visual prototypes of previous categories may shift and even overlap with new prototypes in the latent space, resulting in a performance drop for the continual learning model.
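The prototype-and-distance scheme described above can be sketched as follows; this is a minimal numpy version of the standard prototype classifier (class centers plus cosine nearest-prototype matching), not the paper's full pipeline:

```python
import numpy as np

def prototypes(support_feats, support_labels, n_classes):
    """Visual prototype = mean of the support embeddings of each class."""
    return np.stack([support_feats[support_labels == c].mean(axis=0)
                     for c in range(n_classes)])

def classify(queries, protos):
    """Assign each query to the prototype with highest cosine similarity."""
    q = queries / np.linalg.norm(queries, axis=-1, keepdims=True)
    p = protos / np.linalg.norm(protos, axis=-1, keepdims=True)
    return (q @ p.T).argmax(axis=-1)

# Toy 2-class, 2-shot support set in a 2-d feature space
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
protos = prototypes(feats, labels, n_classes=2)
assert classify(np.array([[0.95, 0.05]]), protos)[0] == 0
assert classify(np.array([[0.05, 0.95]]), protos)[0] == 1
```

The drift problem the text describes is exactly that `protos` computed for old classes stops matching the evolving feature space, which motivates the Transformer adaptation below.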

Base Learning
Naturally, the semantic knowledge of each class can provide useful information for learning visual prototypes. For example, given that both "bulldog" and "cat" belong to "animal", the prototype of cats can be quickly learned and adapted with the help of semantic similarity once the semantic concept of dogs has been learned. Moreover, semantic similarity can provide anchors for the visual prototypes, reducing the distribution drift of old prototypes. Therefore, we introduce semantic knowledge regularization to prevent the prototypes from drifting, with the help of Transformer adaptation. The word embedding of each category can be employed as semantic knowledge, since the distribution of word embeddings may reflect the semantic association among different categories. Specifically, the word embedding of each category is computed via a pre-trained model and then projected into the latent space via the Projector, a learnable linear affine layer on top of the backbone; the projected vector is denoted as the textual prototype $p_t$ of the category.
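The Projector is described as a single linear affine layer mapping word embeddings into the visual latent space. A minimal sketch, with illustrative random weights and dimensions (300-d GloVe in, 64-d latent out; the paper does not report the latent width here):

```python
import numpy as np

rng = np.random.default_rng(0)

class Projector:
    """Single linear affine layer: word embedding -> visual latent space.
    Weights here are random placeholders; in CoSR they are learned."""
    def __init__(self, dim_in, dim_out):
        self.W = rng.normal(0.0, 0.02, (dim_in, dim_out))
        self.b = np.zeros(dim_out)

    def __call__(self, word_vecs):
        return word_vecs @ self.W + self.b

proj = Projector(dim_in=300, dim_out=64)      # e.g. 300-d GloVe input
word_embeddings = rng.normal(size=(5, 300))   # 5 category name embeddings
textual_protos = proj(word_embeddings)        # textual prototypes p_t
assert textual_protos.shape == (5, 64)
```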
Upon obtaining the visual prototypes and textual prototypes of all support classes, we design a semantic fusion Transformer to model the complex semantic relationships among different classes. As shown in Figure 2 (a), the visual features, i.e., the query representation $q$ concatenated with the visual prototypes $p_v$, together with the textual prototypes $p_t$, are fed into the semantic fusion Transformer to obtain the adaptive query representation and prototypes. The detailed structure of the semantic fusion Transformer is illustrated in Figure 2 (b). Through the semantic fusion Transformer, the multi-modal information containing visual features as well as textual prototypes is fused and mutually enhanced via the self-attention mechanism, and the feed-forward layer maps the multi-modal information into a common latent space. The output of the semantic fusion Transformer is the adaptive query representation $q'$, visual prototype $p_v'$ and textual prototype $p_t'$, as expressed in Eq. (1):
$$(q', p_v', p_t') = \mathrm{T}(q, p_v, p_t), \qquad (1)$$
where $\mathrm{T}$ represents the semantic fusion Transformer. We employ the adaptive textual prototype $p_t'$ as the anchor in the common latent space, with which the learnable visual prototype $p_v'$ is expected to align. Therefore, we propose the semantic knowledge regularization term in Eq. (2) to align $p_v'$ and $p_t'$, indicating whether the visual and textual prototypes are semantically consistent:
$$\mathcal{L}_{sem} = \mathrm{CE}(\hat{m}, m), \qquad (2)$$
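The fusion step can be sketched as a single-head self-attention block over the token sequence [query; visual prototypes; textual prototypes]. This is a simplified stand-in for the semantic fusion Transformer of Figure 2 (b), which presumably also uses multiple heads and layer normalization; weight matrices here are random placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fusion_block(tokens, Wq, Wk, Wv, W1, W2):
    """One self-attention + feed-forward block over the token sequence
    [q ; p_v ; p_t], letting visual and textual tokens attend to and
    enhance each other (single-head sketch with residual connections)."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # (n_tokens, n_tokens)
    h = tokens + attn @ V                            # residual self-attention
    return h + np.maximum(h @ W1, 0.0) @ W2          # residual feed-forward

rng = np.random.default_rng(0)
d = 16
q = rng.normal(size=(1, d))        # query representation
p_vis = rng.normal(size=(5, d))    # 5 visual prototypes
p_txt = rng.normal(size=(5, d))    # 5 textual prototypes
tokens = np.concatenate([q, p_vis, p_txt])           # (11, d)
Ws = [rng.normal(0.0, 0.1, (d, d)) for _ in range(5)]
out = fusion_block(tokens, *Ws)
q_ad, p_vis_ad, p_txt_ad = out[:1], out[1:6], out[6:]  # q', p_v', p_t'
assert out.shape == (11, d)
```

Splitting the output back into $q'$, $p_v'$ and $p_t'$ mirrors Eq. (1).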
where $\hat{m}$ is the matching probability between the visual prototype $p_v'$ and the textual prototype $p_t'$, and $m$ is the ground-truth label indicating the true matching between the visual prototype $p_v'$ and its corresponding textual prototype. The cross-entropy loss is used for the semantic consistency loss $\mathcal{L}_{sem}$. Besides, the adaptive query representation $q'$ is categorized by the nearest-neighbor principle: the distance between $q'$ and each visual prototype $p_v'$ is calculated, and $q'$ is assigned to the class with the minimum distance in the latent space. Without loss of generality, we use cosine distance as the metric and cross-entropy loss as the query loss term $\mathcal{L}_q$, as shown in Eq. (3):
$$\mathcal{L}_q = \mathrm{CE}(\hat{y}, y), \qquad (3)$$
where $\hat{y}$ is the predicted class label of the query representation $q'$ and $y$ is the ground-truth class label, indicating which class the query belongs to. The overall training objective is given in Eq. (4):
$$\mathcal{L} = \mathcal{L}_q + \lambda \mathcal{L}_{sem}, \qquad (4)$$
where $\lambda$ is a controlling factor balancing the query loss $\mathcal{L}_q$ and the semantic consistency loss $\mathcal{L}_{sem}$.
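The two loss terms and the combined objective can be sketched as below. The ground-truth matching for the semantic consistency loss is taken to be the identity (each visual prototype should match its own textual prototype), and the choice of cosine similarity as the matching score is an assumption consistent with, but not spelled out by, the text:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, target_idx):
    """Mean cross-entropy over rows of a logit matrix."""
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(logits)), target_idx]).mean()

def cosine_sim(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def cosr_loss(q_ad, p_vis_ad, p_txt_ad, query_labels, lam=0.5):
    """L = L_q + lambda * L_sem, mirroring Eq.(2)-(4): the query loss
    matches adapted queries to visual prototypes by cosine similarity;
    the semantic consistency loss matches each visual prototype to its
    own textual prototype (identity ground truth)."""
    l_q = cross_entropy(cosine_sim(q_ad, p_vis_ad), query_labels)
    l_sem = cross_entropy(cosine_sim(p_vis_ad, p_txt_ad),
                          np.arange(len(p_vis_ad)))
    return l_q + lam * l_sem

rng = np.random.default_rng(0)
p_vis = rng.normal(size=(5, 16))
queries = p_vis + 0.01 * rng.normal(size=(5, 16))  # queries near their protos
loss = cosr_loss(queries, p_vis, p_vis.copy(), np.arange(5))
assert loss > 0
```

Here `lam` plays the role of the controlling factor $\lambda$; its value is a placeholder.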

Discussions
The continual few-shot learning procedure consists of two phases: i) base learning on the base task, where the backbones and the semantic fusion Transformer are trained on the full dataset of the base task; ii) continual learning on sequentially arriving new tasks, where the model first learns visual and textual prototypes of the new classes and then obtains semantically consistent visual prototypes through Transformer adaptation. After learning each observable task, the proposed CoSR model is tested on all available classes.
Base Learning on Base Task. As shown in Figure 2 (a), the base learning phase includes multiple episodes. In each episode, the $N$-way $K$-shot support images as well as the query image are sampled from the database. The word embeddings of all categories are pre-calculated to serve as semantic information. The support and query images are fed into the CNN backbone to obtain visual features, and we calculate the average latent representation of the images in each class as the visual prototype of the corresponding category. The word embedding of each category is projected into the latent space shared with the visual prototypes through a linear affine layer. Then the semantic fusion Transformer depicted in Figure 2 (b) generates adaptive visual and textual prototypes as well as the adaptive query representation. The whole model is trained end-to-end with the total objective $\mathcal{L}$, where the query loss $\mathcal{L}_q$ aims to distinguish different categories in the latent space. The proposed semantic consistency loss $\mathcal{L}_{sem}$ encourages the alignment of the visual and textual prototypes and enhances the visual prototypes with the semantic information carried by the textual prototypes. Compared to existing methods that ignore the semantic association among classes, our proposed CoSR model utilizes this association to facilitate the learning of visual features. The textual prototypes learned in this phase provide anchors in the latent space, which simultaneously alleviates the forgetting issue in continual learning and accelerates the learning of new knowledge.
Continual Learning on New Tasks. After the base learning phase, new tasks arrive in sequence. We assume the backbones have been well trained during the base learning phase, since the visual features usually reflect low-level visual information; the subsequent classification module then plays the more important role in continual few-shot learning. For the $t$-th task, we first obtain the visual prototypes from the newly arriving $N$-way $K$-shot samples and the textual prototypes of the corresponding categories. The old and new textual prototypes carry semantic information and reflect the relationships between different classes. We use the semantic fusion Transformer to produce adaptive visual prototypes, enhanced by the semantic information carried in the textual prototypes. The adaptive visual prototypes benefit from the semantic regularization and are thus capable of reducing estimation errors when only a very small number of samples is available. Finally, the query image is assigned, via the nearest-neighbor principle, to the best-matching category whose adaptive prototype has the minimum cosine distance from the adaptive query representation. The proposed semantic fusion Transformer is able to learn adaptive prototypes for novel and unseen tasks, alleviating the catastrophic forgetting issue.
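The inference-side bookkeeping described above — accumulate (adapted) prototypes task by task and classify over all classes seen so far — can be sketched as follows. The class name `ContinualClassifier` is illustrative, and the prototypes fed to `add_task` would be the Transformer-adapted ones in CoSR:

```python
import numpy as np

class ContinualClassifier:
    """Accumulates class prototypes task by task; inference is
    nearest-neighbor (max cosine similarity) over all prototypes seen
    so far, so no fine-tuning is needed when a novel task arrives."""
    def __init__(self):
        self.protos, self.class_ids = [], []

    def add_task(self, protos, class_ids):
        self.protos.append(np.asarray(protos, dtype=float))
        self.class_ids.extend(class_ids)

    def predict(self, queries):
        p = np.concatenate(self.protos)
        p = p / np.linalg.norm(p, axis=-1, keepdims=True)
        q = queries / np.linalg.norm(queries, axis=-1, keepdims=True)
        return [self.class_ids[i] for i in (q @ p.T).argmax(axis=-1)]

clf = ContinualClassifier()
clf.add_task(np.eye(2), class_ids=[0, 1])                  # base task
clf.add_task(np.array([[1.0, 1.0]]) / np.sqrt(2), [2])     # novel task
# Queries over all seen classes: one near class 0, one near novel class 2
assert clf.predict(np.array([[0.9, 0.1], [0.6, 0.58]])) == [0, 2]
```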

EXPERIMENTS
We conduct extensive experiments to compare the proposed CoSR model with several baselines on three public datasets. To verify the effectiveness of CoSR in a real-world scenario, we further test CoSR on an industrial dataset.

Experimental Settings
Datasets. Following the existing literature [36, 44], we conduct continual few-shot learning experiments on three popular datasets, i.e., CIFAR100 [17], MiniImageNet [37] and CUB200 [41]. For CIFAR100 and MiniImageNet, we sample 60 classes as the base learning task; each novel task includes 5-way 5-shot samples, and there are 8 tasks in the continual learning stage. For CUB200, the base learning task contains 100 classes, each novel task has 10-way 5-shot samples, and the total number of novel tasks is 10. In addition, we test our proposed CoSR model on Goofish, an industrial dataset for online commodity purchase services, where the goal of the service provider is to recognize prohibited commodities through continual classification of new items. The dataset includes images, titles, and descriptions of online items, serving as an appropriate scenario for multi-modal continual classification. The label of each data sample indicates the category of the item, and each category has a human-understandable name that serves as semantic knowledge. The Goofish dataset contains 1.8 million items with 161 classes. We split the dataset into training and test sets according to the timestamp, resulting in a training set of 1.5 million items and a test set of 0.3 million items, with no overlap between them. Several classes have only 30 samples in the test set and never appear in the training set, serving as a typical continual few-shot learning setting.
Semantic Knowledge. We use pre-trained vectorized word embeddings as semantic knowledge in the experiments. For CIFAR100 and MiniImageNet, we employ 300-dimensional GloVe [26] vectors as the extracted semantic knowledge. For CUB200, 768-dimensional BERT [9] vectors are utilized. For the industrial dataset, we use pre-trained 512-dimensional embeddings of the class names as semantic knowledge.
Evaluations. After training on the base learning task $\mathcal{T}_0$, the test accuracy is calculated on the test set of the base learning task. The model is then trained sequentially upon the arrival of new tasks. After learning the $t$-th new task, the mean test accuracy is calculated over all observable tasks $\mathcal{T}_0, \mathcal{T}_1, \ldots, \mathcal{T}_t$. The evaluation metric is the final test accuracy over all categories when learning of the last task finishes. Besides, the performance drop rate (PD rate), i.e., the percentage by which the average accuracy after the last task drops w.r.t. the accuracy after base task learning, is also used to measure the ability to learn new tasks while alleviating catastrophic forgetting.
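As a quick sanity check on the metric definition, the PD rate can be computed as below. The base-task accuracy of 72.49% is back-solved from the 49.69% final accuracy and 31.45% PD rate reported later for CoSR on CIFAR100, not a number stated directly in the text:

```python
def pd_rate(acc_after_base, acc_after_last):
    """Performance drop rate: percentage of accuracy lost between the
    base task and the last task, relative to the base-task accuracy."""
    return 100.0 * (acc_after_base - acc_after_last) / acc_after_base

# 49.69% final accuracy with a 31.45% PD rate implies a base-task
# accuracy near 72.49% (back-solved assumption, not reported here).
assert abs(pd_rate(72.49, 49.69) - 31.45) < 0.05
```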
Baselines. We compare our CoSR with several state-of-the-art baselines. We take the "fine-tune" approach, which simply fine-tunes the model for each task, as the lower bound of model performance. Several continual learning methods [3, 15, 27] are compared as baselines, and existing state-of-the-art approaches [7, 36, 44, 47] for continual few-shot learning are also tested in the experiments.
Implementations. We conduct experiments using the PyTorch library. Following common practice [44], ResNet20 [14] is employed as the CNN backbone for CIFAR100, and ResNet18 [14] is used for MiniImageNet and CUB200. The Projector consists of one linear affine layer whose output dimension matches that of the CNN backbone. We use SGD with momentum for optimization, with the learning rate set to 0.0001 and decayed by 0.5 every 20 epochs. The total number of training epochs in the base learning phase is 100. For all experiments, we report the average performance over 5 runs.
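The stated learning rate schedule (start at 1e-4, halve every 20 epochs over the 100-epoch base phase) corresponds to a standard step decay, sketched here as a plain function:

```python
def lr_at(epoch, base_lr=1e-4, gamma=0.5, step=20):
    """Step decay matching the stated schedule: lr starts at base_lr
    and is multiplied by gamma once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

assert lr_at(0) == 1e-4
assert lr_at(19) == 1e-4           # still in the first step window
assert lr_at(20) == 5e-5           # first decay
assert lr_at(99) == 1e-4 * 0.5**4  # last epoch of the 100-epoch phase
```

In PyTorch this is equivalent to `torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)`.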

Experimental Results on Public Datasets
CIFAR100. We conduct continual few-shot learning experiments on CIFAR100 and visualize the results in Figure 3, which shows the test accuracy over all observable categories after each task. As shown in the figure, the initial test accuracies of CEC and our CoSR after learning the base task are higher than those of the other models, which are about 64%. The reason is that CEC and our CoSR both train the model by sampling the dataset in multiple episodes, while the other methods train the base model on the full dataset. As novel tasks accumulate, the test accuracy of every model drops, due to forgetting of old classes and inefficient learning of new classes from only a few samples. Traditional continual learning approaches such as iCaRL, EEIL and NCM perform worse than the other methods, since they usually need a large amount of data to learn new classes. The continual few-shot learning approaches learn the novel tasks efficiently, adapting quickly to few-shot scenarios with well-designed learning mechanisms. Among all methods, our CoSR performs best after learning all tasks. The performance drop of CoSR is also smaller than that of the other methods, indicating that our method better alleviates forgetting through semantic knowledge regularization.
More specifically, we report the detailed experimental results in Table 1. In addition to the test accuracy after each task, we report the performance drop rate to demonstrate the effect of different approaches on the forgetting issue. As shown in the table, our CoSR performs best in the continual learning phase, with a final test accuracy of 49.69%. Among the baselines, the fine-tuning method reaches only 2.65% accuracy, serving as the lower bound, since it fine-tunes the model with only a few samples and ignores the learning of old classes. Similar to our CoSR, SemanKL also uses word embeddings as semantic knowledge, but it performs worse than our CoSR because it projects the visual features into the word embedding space to regularize visual feature learning; due to the sparsity of the semantic space, it cannot effectively utilize semantic information to learn good visual representations. Differently, we design a semantic fusion Transformer to fuse and align the visual and textual features, which is more effective than SemanKL and achieves higher accuracy. CEC, a strong baseline, performs very close to our method, but our CoSR has a lower performance drop rate of 31.45% compared to CEC's 32.75%, and outperforms CEC from the arrival of the fourth task onward.
MiniImageNet. To evaluate the performance of CoSR, we conduct experiments on MiniImageNet. There are eight novel tasks in the experiment, and the average test accuracy after each task is visualized in Figure 3. The pattern of performance drop is similar to that on CIFAR100. Methods specially designed for continual few-shot learning, such as CEC [44], SemanKL [7], Self-promote [47] and TOPIC [36], perform better than the others, indicating that well-designed few-shot learning algorithms can learn quickly from a few new samples. After learning the base task, the initial test accuracies of the CEC algorithm and our CoSR algorithm are higher than those of the other models. Our CoSR achieves the best test accuracy from the third task onward, which verifies the effectiveness of our Transformer-based prototype adaptation mechanism in continual few-shot learning.
Furthermore, we report detailed experimental results on MiniImageNet in Table 2. In addition to the test accuracy after each task, we report the performance drop rate to demonstrate the severity of the forgetting problem for different methods during the continual learning phase. As shown in the table, our CoSR algorithm achieves a final test accuracy of 47.93% after learning all tasks, outperforming the other benchmark models. Among the baselines, the fine-tuning method reaches an accuracy of only 1.40%, and the accuracies of methods designed only for continual learning are below 20%. The algorithms designed for continual few-shot learning perform relatively well; the final accuracy of the CEC model reaches 47.09%, the best among the existing baselines. In terms of performance drop rate, our algorithm achieves 31.35%, among the best of all algorithms. It is worth noting that the performance drop rate of Self-promote [47] is 31.63%, which is very close to CoSR; however, in the absolute value of the final accuracy, CoSR clearly outperforms Self-promote as new tasks arrive. Overall, the experimental results on MiniImageNet show the effectiveness of our CoSR algorithm in continual few-shot learning scenarios: using semantic knowledge regularization to enhance visual feature learning can significantly improve the generalization performance of the model.
CUB200. We conduct experiments on CUB200 to verify the effectiveness of the CoSR algorithm. In this experiment, the model learns ten new tasks, and the average test accuracy after learning each task is shown in Figure 3. In the continual learning stage, the average accuracy of all models decreases to varying degrees. Among them, iCaRL [27], EEIL [3] and NCM [15] degrade faster, indicating that they cannot quickly learn the knowledge of new categories or distinguish new categories from old ones in the continual few-shot learning scenario; training these algorithms on new tasks relies on a large amount of labeled data, so they perform poorly when only a few new samples are available. In contrast, methods for continual few-shot learning perform much better, and among them our CoSR algorithm achieves the highest accuracy. This demonstrates that using semantic knowledge regularization to constrain the learning of visual features is effective: the generalization performance on new categories with few samples is improved in CoSR. Meanwhile, our CoSR algorithm also outperforms the other benchmark algorithms in terms of performance drop rate.
We show more detailed experimental results in Table 3, including the test accuracy after each task and the final performance drop rate. Through the performance drop rate, we can evaluate how well the model alleviates the catastrophic forgetting problem in the continual few-shot learning scenario. From the experimental results on CUB200, we can conclude that our CoSR algorithm performs the best among all compared methods.

Experimental results on industrial datasets
Goofish. To further evaluate the learning ability of CoSR in an industrial setting, we conduct experiments on a multi-modal classification dataset with a long-tailed distribution. This real-world multi-modal commodity dataset contains images, titles, and descriptions of online commodities. The label of an online commodity indicates whether the product is illegal and, if so, which category it belongs to; each illegal product category has a corresponding name that serves as semantic knowledge for base learning and continual learning. The base task is to determine whether a target product belongs to the illegal categories, which is a binary classification task. Since the dataset has a long-tail distribution, we split it into base classes and few-shot classes: categories with relatively little data are treated as few-shot classes and learned in the continual learning phase, while the other categories serve as base classes to train the model in the base learning phase. Note that the few-shot classes do not overlap with the base classes.
In order to conduct comparative experiments, we design three baseline models on the industrial dataset, namely the binary classification model, the multi-classification model, and the semantic-assisted multi-classification model (multi-classification++). Among them, the binary classification model directly classifies a given product to determine whether it is in an illegal category. The multi-classification model classifies whether the given product is illegal and, if so, which illegal category it belongs to. The semantic-assisted multi-classification model (multi-classification++) utilizes semantic knowledge from the categorical information to assist classification. Our proposed CoSR model concatenates the semantic vectors with the original visual features in order to better handle the real-world industrial scenario. We use the classification precision and recall on the full test set as evaluation metrics.
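The concatenation-based fusion used by CoSR on this dataset can be illustrated as follows; the feature dimensions and values are toy placeholders, and the real features come from the backbone and the semantic encoder.

```python
def fuse_features(visual, semantic):
    """Concatenate a semantic embedding onto the original visual feature.

    The fused vector, rather than the visual feature alone, is what the
    downstream classifier consumes in this sketch.
    """
    return list(visual) + list(semantic)

visual_feat = [0.12, -0.40, 0.88]   # e.g. backbone output (toy dimensions)
semantic_vec = [0.05, 0.31]         # e.g. category-name embedding (toy)
fused = fuse_features(visual_feat, semantic_vec)
print(len(fused))  # 5
```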
For the full categories, we observe that the performance of our proposed CoSR on the Goofish dataset is improved, indicating that CoSR has better classification ability on the long-tail distributed dataset. For the few-shot categories, our CoSR model also improves over the baselines in terms of both precision and recall. Due to the small number of labeled samples in the few-shot categories, traditional classification models usually fail to learn effective information from these few labeled samples and thus ignore them. CoSR extracts semantic knowledge from the long-tailed categories to enhance the learning of multi-modal representations. By adding constraints to multi-modal representation learning, our model can quickly estimate the optimal category prototypes in the continual few-shot learning scenario. The performance of CoSR on the categories with long-tail distribution shows that our proposed model can significantly enhance continual few-shot classification ability through semantic knowledge regularization.
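Prototype-based classification by the nearest neighbor principle, as used in the continual phase, can be sketched as below. This is a minimal illustration only: in CoSR the prototypes are further adapted by the semantic fusion Transformer before classification, and the class names and 2-D features are invented for the example.

```python
import math

def prototype(features):
    """Mean of the (few) support features for one class."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]

def classify(query, prototypes):
    """Assign the query to the class of the nearest prototype (Euclidean)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(prototypes, key=lambda c: dist(query, prototypes[c]))

protos = {
    "cat": prototype([[0.9, 0.1], [1.1, -0.1]]),   # toy 2-D support features
    "dog": prototype([[-1.0, 0.0], [-0.8, 0.2]]),
}
print(classify([0.8, 0.0], protos))  # cat
```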

CONCLUSION
In this paper, we propose a novel approach, CoSR, to tackle the problem of continual few-shot learning. The well-designed Transformer adaptation in CoSR mines complex relationships between visual signals and semantic knowledge to generate suitable visual prototypes for continual few-shot classification. The semantic knowledge provides valid anchors for the visual prototypes in continual learning, thus significantly alleviating catastrophic forgetting with only a limited amount of data. Extensive experiments on both public and industrial data demonstrate the superiority of our proposed CoSR model over state-of-the-art models. For future work, more types of semantic knowledge, such as knowledge graphs and commonsense knowledge, can be explored.

Figure 1:
Figure 1: The concept of our proposed CoSR model. (a) Existing works usually learn prototypes using a single modality to classify the query image. (b) Our CoSR model learns the visual prototypes with semantic knowledge regularization. This Transformer-based prototype adaptation mechanism enhances the visual prototypes with the semantic association among classes and thus alleviates the forgetting issue.

Figure 2:
Figure 2: (a) The framework of our proposed CoSR model. In the base learning phase, the backbone and the semantic fusion Transformer are trained according to the query classification loss and the semantic consistency loss. In the continual learning phase, the model obtains the semantically consistent visual prototypes through the Transformer. The nearest neighbor principle is used for classification. (b) The structure illustration of our semantic fusion Transformer. Best viewed in color.

Figure 3:
Figure 3: The visualization of experimental results on CIFAR100, MiniImageNet and CUB200. We compare the proposed CoSR model with state-of-the-art baselines in continual few-shot learning. Best viewed in color.

Table 4:
Experimental results on the industrial dataset Goofish. We use the precision and recall on the test set as metrics. Among the benchmark models, the final accuracy of the fine-tuning method is only 8.47%, which is the lower bound of the benchmark models. The Self-promote, SemanKL and CEC algorithms perform relatively well, with final accuracy rates all above 30%. The final accuracy of the CEC algorithm is 51.30%, which is lower than the final accuracy of our CoSR algorithm, i.e., 51.75%. From the perspective of performance drop rate, our CoSR has the lowest performance drop rate of 30.87%, indicating that under the constraint of semantic knowledge regularization, CoSR can learn a better visual feature space and alleviate the catastrophic forgetting problem in continual few-shot learning. In contrast, the performance drop rate of the CEC algorithm is 33.58%, which is worse than that of our CoSR algorithm.