Multimodal Prototype-Enhanced Network for Few-Shot Action Recognition

Current methods for few-shot action recognition mainly fall into the metric-learning framework following ProtoNet, which demonstrates the importance of prototypes. Although they achieve relatively good performance, they ignore the effect of multimodal information, e.g. label texts. In this work, we propose a novel MultimOdal PRototype-ENhanced Network (MORN), which uses the semantic information of label texts as multimodal information to enhance prototypes. A CLIP visual encoder and a frozen CLIP text encoder are introduced to obtain features with a good multimodal initialization. Then, in the visual flow, visual prototypes are computed by a visual prototype-computed module. In the text flow, a semantic-enhanced (SE) module and an inflating operation are used to obtain text prototypes. The final multimodal prototypes are then computed by a multimodal prototype-enhanced (MPE) module. Besides, we define a PRototype SImilarity DiffErence (PRIDE) metric to evaluate the quality of prototypes, which is used to verify our improvement on the prototype level and the effectiveness of MORN. We conduct extensive experiments on four popular few-shot action recognition datasets: HMDB51, UCF101, Kinetics and SSv2, and MORN achieves state-of-the-art results. When plugging PRIDE into the training stage, the performance can be further improved.


Introduction
Few-shot learning is a challenging problem in computer vision because of the scarcity of labeled samples. Among few-shot learning problems, few-shot action recognition is one of the hardest due to the complicated temporal evolution of videos. Most of the methods [3,4,11,19,23,26,33,43,50,51,56,57,59,61,62] for few-shot action recognition fall into the metric-learning framework, including a meta-training stage and a meta-test stage. Among various metric-learning works, ProtoNet [39] has been followed by many works and has far-reaching influence. It
proposes the prototype as a representation of each category in a few-shot learning scenario, and the key objective is to obtain representative prototypes. Distance computation with prototypes is efficient, and classification with them is accurate. There exist two metric-learning frameworks for few-shot action recognition. One is the prototype-based framework as shown in Fig. 1a, which uses a certain enhancement strategy to compute more representative prototypes. However, methods with the prototype-based framework [23,33,62] only use the visual modality of videos and thus cannot take full advantage of the scarce video samples. The other is the multimodal-based framework as shown in Fig. 1b. It combines the information of two modalities to obtain a better distance metric. However, methods with the multimodal-based framework [11,57] ignore the importance of prototypes and only combine the two modalities on the feature level. As a result, none of them are comprehensive, and they fail to obtain representative prototypes.
In this case, we propose a multimodal prototype-enhanced framework to enhance prototypes with multimodal information, as shown in Fig. 1c. Inspired by ActionCLIP [49], we note that there is abundant semantic information in label texts to assist the classification. For example, if one is given several short videos of "lifting up one end of something" without being told what action it is, he may not find the same pattern in a short time. Instead, it is much easier to distinguish these videos from other actions when knowing the exact label. Based on this fact, we use the semantic information of label texts to enhance prototypes and assist the classification. Besides, there exist no metrics to evaluate the quality of prototypes. It is necessary to propose a prototype evaluation metric and further evaluate the multimodal prototypes.
To this end, we propose a novel MultimOdal PRototype-ENhanced Network (MORN) based on our proposed framework. It is simple but well-performed, including two modality flows. In the visual flow, we introduce a CLIP [34] visual encoder to obtain more reliable features of video samples. CLIP has superior zero-shot learning ability and has been extended to few-shot classification by several works [12,58,60]. Then, visual prototypes are computed by the TRX [33] baseline model. In the text flow, we first make discrete templates like "a photo of action {}" as text prompts. Some works [34,49] indicate that text prompts can help bridge the distribution gap with the CLIP pre-trained model and enrich the semantic information of only one or several label words. Then, we introduce a frozen CLIP text encoder and use the prompted label texts as inputs. After that, we introduce a semantic-enhanced (SE) module to obtain text features with more reliable semantic information, and the outputs are regarded as text prototypes. Finally, we combine visual prototypes and text prototypes to obtain more representative prototypes in a multimodal prototype-enhanced (MPE) module. The final multimodal prototypes are used during the distance computation with query videos.
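As a toy illustration of the text prompting step, a label can be wrapped in a discrete template as below. The paper only gives "a photo of action {}" as an example; the second template and the helper name are hypothetical.

```python
# Hypothetical template list; the paper's example is "a photo of action {}".
TEMPLATES = [
    "a photo of action {}",
    "a video of a person doing {}",  # assumed extra template, not from the paper
]

def prompt_label(label: str, template: str) -> str:
    """Fill a discrete prompt template with an action label."""
    return template.format(label)

prompted = prompt_label("push up", TEMPLATES[0])
```

In the actual pipeline, the prompted strings would then be tokenized and passed through the frozen CLIP text encoder.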
To further evaluate the effectiveness of prototypes, we propose a metric to evaluate the quality of prototypes called PRototype SImilarity DiffErence (PRIDE). To the best of our knowledge, PRIDE is the first metric to evaluate the quality of prototypes by their ability to discriminate different categories. Inspired by [55], we first compute the real prototype of each category over all samples in the meta-test stage. Then, the similarities of the multimodal prototype in each episode with all real prototypes are computed. The difference between the similarity with its own category and the average similarity over other categories is used to evaluate the effectiveness of the prototype. More details will be further illustrated in Sec. 3.4.
Our contributions can be summarized as follows:
• We use the semantic information of label texts to enhance the prototypes and propose a simple but well-performed Multimodal Prototype-Enhanced Network (MORN).
• Rather than obtaining more reliable features, we focus on computing more representative prototypes with multimodal information. Moreover, to the best of our knowledge, we are the first to propose a prototype evaluation metric, called Prototype Similarity Difference (PRIDE).
• We conduct extensive experiments on four popular action recognition datasets. MORN achieves state-of-the-art results on HMDB51, UCF101, Kinetics and SSv2.

Few-Shot Image Classification
Few-shot image classification methods can be broadly divided into three categories: augmentation-based, optimization-based and metric-based.
Augmentation-based methods. The objective of these methods is to use augmentation techniques or extra data to increase samples for training and improve the diversity of data. Some prior attempts are intuitive, including [32,37]. Besides direct data augmentation, some works focus on the semantic feature level, including [6,7]. Rather than applying augmentation techniques, [30,31] introduce a GAN [14] architecture to generate extra images based on the text description to compensate for the lack of data.
Optimization-based methods. The objective of these methods is to train a model under the meta-learning framework so that it can adapt to novel tasks with only a few optimization steps. [1,16,24,27,38] utilize the meta-learner as an optimizer. MAML [10] and its variants [2,20,41] aim to learn a robust model initialization.
Metric-based methods. The objective of these methods is to learn feature embeddings under a certain distance metric with a better generalization ability. Samples of novel categories can be accurately classified via a nearest neighbor classifier with different distance metrics such as cosine similarity.

Figure 2.
Overview of our proposed MORN on a 2-way 1-shot problem with 1 video for each category in the query set. In the visual flow, a CLIP visual encoder is first applied to videos with L frames in both the support set and the query set to obtain video features. Then, support video features regarded as key and value and query video features regarded as query are passed to the Temporal-Relational CrossTransformer (TRX) module to compute visual prototypes. In the text flow, a frozen CLIP text encoder is first applied to the prompted label texts. Then, the semantic features of label texts are passed to a semantic-enhanced (SE) module and are inflated to compute text prototypes. The visual prototypes and the text prototypes are combined through a multimodal prototype-enhanced (MPE) module. The final multimodal prototypes are obtained and are used during the distance computation with query videos.

Few-Shot Action Recognition
Most of the methods for few-shot action recognition fall into the metric-learning framework. More specifically, we divide them into three categories: feature-based, multimodal-based and prototype-based.
Feature-based methods. The objective of these methods is to obtain reliable video features or apply an effective alignment strategy. CMN [61], TARN [3], OTAM [4] and ARN [56] make preliminary attempts to obtain reliable video features for classification. ITANet [59] introduces a video representation based on a self-attention mechanism and an implicit temporal alignment. MASTAF [26] uses self-attention and cross-attention mechanisms to obtain reliable spatial-temporal features. MT-FAN [51] designs a motion modulator and a segment attention mechanism, and conducts a temporal fragment alignment. STRM [43] proposes a local patch-level and a global frame-level enrichment module. Similarly, HyRSM [50] proposes a hybrid relation module to enrich video features.
Multimodal-based methods. The objective of these methods is to use multimodal information to assist classification under the metric-learning framework. AMeFu-Net [11] introduces depth information as another modality to assist classification. [57] also uses the semantic information of label texts, similar to our multimodal setting. However, it fails to apply a feature extractor with a good multimodal initialization and ignores the importance of prototypes.
Prototype-based methods. The prototype is the representation of each category and is proposed by ProtoNet [39] in a few-shot learning scenario. The objective of these methods is to compute representative prototypes by some prototype-enhanced strategies with unimodal information for classification. ProtoGAN [23] generates extra samples by using a conditional GAN with category prototypes. PAL [62] matches a prototype with all the query samples instead of matching a query sample with the space of prototypes. TRX [33] is our baseline model and is the most relevant to our
work. It applies CrossTransformer [9] to a few-shot action recognition scenario and thus obtains query-specific category prototypes. The details of TRX will be further illustrated in Sec. 3.2.
In addition, CPMT [19] belongs to both the multimodal-based and the prototype-based categories. It uses object features as multimodal information and compound prototypes. Our work also utilizes multimodal information and computes representative prototypes. Furthermore, we notice the remarkable effect of the semantic information of label texts and the prototype-enhanced strategy. As a result, our proposed MORN achieves state-of-the-art results, as will be shown in Sec. 4.1.

Problem formulation
There are two sets in the N-way K-shot few-shot scenario: a support set S = {s^n_k | n = 1, …, N; k = 1, …, K} and a query set Q = {q^n_m | n = 1, …, N; m = 1, …, M}, where N denotes the number of different action categories, K denotes the number of videos of each category in the support set and M denotes the number of videos of each category in the query set. The objective of few-shot action recognition is to learn a model with a great generalization ability using only a few video samples. Specifically, the model classifies a completely novel query action to the right category by matching it with the most similar video in the support set. Meanwhile, the whole dataset is divided into a base dataset C_base = {(x_i, y_i)} and a novel dataset C_novel, where y_i is the action category of a video sample x_i. Note that the categories of C_base and C_novel are non-overlapping, i.e. C_base ∩ C_novel = ∅. To make full use of the few video samples, we follow the episodic training manner [47]. Each episode contains K videos per category in the support set and M videos per category in the query set, over N categories.
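The episodic sampling described above can be sketched as follows; the dictionary layout and function name are illustrative, not the paper's implementation:

```python
import random

def sample_episode(dataset, n_way=5, k_shot=5, m_query=1):
    """Sample one N-way K-shot episode from a {category: [clip ids]} dict.

    Returns (support, query), each a list of (clip, episode_label) pairs.
    """
    categories = random.sample(sorted(dataset), n_way)
    support, query = [], []
    for label, cat in enumerate(categories):
        clips = random.sample(dataset[cat], k_shot + m_query)
        support += [(c, label) for c in clips[:k_shot]]
        query += [(c, label) for c in clips[k_shot:]]
    return support, query

# Toy base dataset: 6 categories with 10 clip ids each.
data = {f"cat{i}": [f"cat{i}_clip{j}" for j in range(10)] for i in range(6)}
support, query = sample_episode(data)
```

A real loader would decode L frames per clip, but the episode structure is the same.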

TRX baseline
Our work adopts the Temporal-Relational CrossTransformer (TRX) [33] as the baseline. It applies CrossTransformer [9] to the few-shot action recognition scenario and obtains query-specific category prototypes. CrossTransformer combines the information of support images and query images to find their spatial correspondence through an attention operation. TRX further samples ordered sub-sequences of video frames called tuples and thus can capture higher-order temporal relationships.
To fully exploit the temporal relationships of a video with L frames, TRX first samples ordered tuples of video frames Π_ω = {(v_{i_1}, …, v_{i_ω}) | 1 ≤ i_1 < … < i_ω ≤ L}, where ω is the length or cardinality of a tuple and v_i is the i-th frame sampled from a video. For example, if L = 8 and ω = 2, the number of tuples is C(8, 2) = 28. The set of cardinalities is denoted as Ω. Regarding query tuples Π^Q_ω as query and support tuples Π^S_ω as key and value, TRX obtains the query-specific category prototypes P^N by an attention operation. Then, the distance T(q_v, p^N_v) between videos in the support set S^N_K and the query set Q^N_M is computed, where q_v denotes the query video and p^N_v denotes the prototype in an episode. The distances T are passed as logits to a cross-entropy loss during training. More details are demonstrated in the original article [33].
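A rough sketch of the tuple construction and of a heavily simplified query-specific prototype: the softmax weighting below is schematic, standing in for the full TRX cross-attention, not its exact formulation.

```python
import itertools
import numpy as np

def frame_tuples(L=8, omega=2):
    """All ordered frame-index tuples of cardinality omega (TRX-style)."""
    return list(itertools.combinations(range(L), omega))

def query_specific_prototype(support_feats, query_feat):
    """Simplified attention: the prototype for one category is a
    softmax-weighted sum of that category's support tuple features,
    weighted by similarity to the query tuple feature."""
    scores = support_feats @ query_feat            # (n_tuples,)
    w = np.exp(scores - scores.max())              # numerically stable softmax
    w /= w.sum()
    return w @ support_feats                       # (d,)
```

For L = 8 and ω = 2 this yields the 28 tuples mentioned above; the prototype depends on the query, which is the key property of TRX.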

MORN
The overall architecture of our proposed MORN is shown in Fig. 2, including a visual flow, a text flow and a multimodal prototype-enhanced (MPE) module.
Visual flow. For each input video, we uniformly sample L frames as in [48]. The video s^n_k in the support set and q^n_m in the query set are thus sequences of L frames with s^n_k, q^n_m ∈ R^{L×H×W×3}, where H denotes the height and W denotes the width of an image. Then, we apply a pre-trained CLIP visual encoder for a better multimodal initialization: each video frame s^n_{ki} ∈ R^{H×W×3} in the support set and q^n_{mi} ∈ R^{H×W×3} in the query set is encoded to obtain the visual features of videos in the support set and the query set. Then, we compute the visual prototypes P^V of each episode through TRX.
Text flow. In the training stage, for each sample (x_i, y_i) ∈ C_base, we first make n_temp discrete templates P_temp = {p_temp^1, …, p_temp^{n_temp}} of y_i as text prompts. Then, we apply a CLIP text tokenizer to obtain the tokenized text sequences T_i by concatenating (denoted ||) the selected template with the label text. In the practical meta-training and meta-test stages, the template is randomly selected once per episode. Then, the features of the prompted label texts are obtained by a frozen CLIP text encoder. To obtain text features with more reliable semantic information, we further apply a semantic-enhanced (SE) module g(·), whose output is P^T_i ∈ R^{1×d_p}; a multi-head attention mechanism with 4 heads is utilized as g(·). Since different frames of videos in the same category have the same label, we simply inflate P^T_i of the same category to keep the same dimension as the visual prototypes. We thus obtain the text prototypes P^T.
Multimodal prototype-enhanced module. To utilize the multimodal information to enhance prototypes, we propose a simple but well-performed multimodal prototype-enhanced (MPE) module. The choice of the MPE module is flexible, including weighted average, multi-head attention and so on. Here, we apply the weighted average to compute the multimodal prototypes, P^M = λ P^V + (1 − λ) P^T, where P^M ∈ R^{NM×d_p} and λ is the multimodal enhanced hyper-parameter. The multimodal prototypes are used as the final prototypes. Then, distances are computed between the multimodal prototypes and videos in the query set.
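A minimal sketch of the weighted-average MPE variant; the exact weighting convention (λ on the visual prototypes, as a convex combination) is our assumption, not the paper's released code:

```python
import numpy as np

def mpe_weighted_average(visual_protos, text_protos, lam=0.5):
    """Weighted-average MPE sketch: P_M = lam * P_V + (1 - lam) * P_T.

    Assumption: lam weights the visual prototypes; the paper's default
    multimodal enhanced hyper-parameter is lambda = 0.5.
    """
    visual = np.asarray(visual_protos, dtype=float)
    text = np.asarray(text_protos, dtype=float)
    return lam * visual + (1.0 - lam) * text
```

With λ = 0.5 this is a plain average; λ = 1 recovers purely visual prototypes, so the hyper-parameter controls the combination level of the two modalities.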

PRIDE
Denote the number of categories in the meta-test stage as N_novel. Following [55], we first compute the real prototype of category i by averaging the query-specific category prototypes of all samples of that category in the novel dataset: P^real_i = (1/|C^i_novel|) Σ_{x ∈ C^i_novel} P_i(x), where P_i(x) is the query-specific category prototype of video x and C^i_novel is the i-th category in C_novel. The whole set of real prototypes is denoted as {P^real_1, …, P^real_{N_novel}}. Then, the cosine similarity sim(P_i(x), P^real_j) can be computed between a given P_i(x) and each real prototype P^real_j, where sim(·, ·) is the cosine similarity operation. PRIDE is then defined as the difference between the similarity with the real prototype of its own category and the average similarity with the real prototypes of the other categories: PRIDE(x) = sim(P_i(x), P^real_i) − (1/(N_novel − 1)) Σ_{j ≠ i} sim(P_i(x), P^real_j). PRIDE is further used to evaluate the performance of prototypes in discriminating different categories. A higher value means a better discriminating ability.
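PRIDE can be sketched in a few lines; the per-sample form below is our reading of the definition above, not the paper's evaluation script:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors (epsilon avoids divide-by-zero)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def pride(proto, own_idx, real_protos):
    """PRIDE sketch: similarity with the real prototype of the prototype's
    own category minus the mean similarity with all other real prototypes."""
    sims = [cosine(np.asarray(proto), np.asarray(r)) for r in real_protos]
    others = [s for j, s in enumerate(sims) if j != own_idx]
    return sims[own_idx] - sum(others) / len(others)
```

A prototype aligned with its own real prototype and orthogonal to the others scores close to 1, while an undiscriminative prototype scores near 0.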

Experiments
Datasets. We evaluate our method on four popular datasets: HMDB51 [22], UCF101 [40], Kinetics [5] and Something-Something V2 (SSv2) [15]. HMDB51 contains 51 action categories, each containing at least 101 clips, for a total of 6,766 video clips. For HMDB51, we adopt the same protocol as [56] with 31/10/10 categories for train/val/test, respectively. UCF101 contains 101 action categories, each containing at least 100 clips, for a total of 13,320 video clips. For UCF101, we also adopt the same protocol as [56] with 70/10/21 categories for train/val/test, respectively. Kinetics contains 400 action categories with 400 or more clips for each category. For Kinetics, we adopt the same protocol as [61] with 64/12/24 categories for train/val/test, respectively. SSv2 contains 220,847 videos of fine-grained actions with only subtle differences between categories, which is regarded as a more challenging action recognition task. For SSv2, we adopt the same protocol as [4] with 64/12/24 categories for train/val/test, respectively.
Implementation details. Previous works [4,19,33,43,50,51,62] utilize a ResNet-50 [17] pre-trained on ImageNet [8] as the backbone. For a fair comparison, we utilize a pre-trained CLIP ResNet-50 as the visual backbone and a frozen CLIP text encoder based on a modified Transformer [46] as in [35]. Each video is re-scaled to height 256, and L = 8 frames are uniformly sampled as in [48]. We follow the TRX augmentation: random horizontal flipping and 224×224 crops in the meta-training stage, and only a center crop in the meta-test stage. We set n_temp = 16, d_p = d = 1024, Ω = {2, 3} and the multimodal enhanced hyper-parameter λ = 0.5. Following [36], we use AdamW [28] as our optimizer with a learning rate of 10^-5 for HMDB51, UCF101, Kinetics and SSv2. We randomly sample 10000 training episodes for HMDB51, UCF101 and Kinetics, and 75000 training episodes for SSv2. We average gradients and backpropagate once every 16 iterations. In the meta-test stage, we employ the standard
5-way 5-shot evaluation on all four datasets.We randomly sample 10000 test episodes and report the average accuracy.

Comparison with State-of-the-arts
As shown in Tab. 1, we comprehensively compare with state-of-the-art methods on four datasets for the standard 5-way 5-shot action recognition task. Using a multi-head attention mechanism with 8 heads as the MPE module, our proposed MORN achieves the best results of 87.1% on HMDB51, 97.7% on UCF101 and 94.6% on Kinetics, and the second-best result of 71.7% on SSv2. A more detailed ablation study of the MPE module will be illustrated in Sec. 4.3. CPMT [19] uses both object features and compound prototypes, and thus achieved the best previous results on HMDB51, Kinetics and SSv2. It indicates the importance of multimodal information and representative prototypes. Compared to CPMT, we use the semantic information of label texts and an MPE module to achieve performance gains on HMDB51, UCF101 and Kinetics but a descent on SSv2. It is probably because some label texts in SSv2 are so similar that it is hard to discriminate them, e.g. pouring something into something against pouring something onto something. The more complex object features are more helpful in this case. In summary, MORN achieves the best average results of 86.5% with the weighted average and 87.8% with the multi-head attention.

Multimodal Prototype-Enhanced Analysis
In this subsection, we explore the performance of our multimodal prototype-enhanced strategy.As mentioned earlier, we use PRIDE to evaluate the quality of prototypes, and further analysis shows the correlation between PRIDE and accuracy.
We randomly sample 10000 test episodes to compute PRIDE values. As shown in Tab. 2, MORN achieves the highest PRIDE values of 17.4%, 24.4% and 26.9% on HMDB51, UCF101 and Kinetics, with significant gains over TRX. Besides, we also compute the PRIDE values of STRM. STRM proposes a patch-level and a frame-level enrichment module based on TRX. However, STRM utilizes enrichment modules on the feature level, and we find it performs worse than TRX on our PRIDE evaluation metric. It is mainly because STRM only focuses on obtaining more reliable features rather than computing more representative prototypes. The results correspond with our motivation for PRIDE and demonstrate the effectiveness of PRIDE. In addition, there exists a certain positive correlation between PRIDE and accuracy, which will be discussed in detail below. The comparison demonstrates that focusing on representative prototypes is of great necessity.
To explore the correlation between PRIDE and accuracy, we conduct an experiment of performance gains and correlation analysis of PRIDE and accuracy on each category of HMDB51. We randomly sample 10000 test episodes to compute accuracy and PRIDE values. As shown in Fig. 3a and 3b, MORN achieves significant gains on both PRIDE and accuracy. Furthermore, we find the two evaluation metrics follow a similar pattern, e.g. pushup ranks top-2 on PRIDE and top-1 on accuracy. To further verify the correlation between the two metrics, we conduct a Pearson correlation analysis between PRIDE and accuracy as shown in Fig. 3c. We find that 44% of the variance can be explained by a positive linear correlation between PRIDE and accuracy. Our results further demonstrate the rationality of our proposed PRIDE evaluation metric and show that PRIDE can complement the accuracy evaluation.
Intuitively, a method with a higher PRIDE value has a better discriminating performance across categories. We randomly sample 10 prototypes in each category and the corresponding real prototypes. Then, we visualize the clusters by t-SNE [45].

Ablation Study
Design choices. As shown in Fig. 5, we compare various choices of the SE module and the multimodal enhanced hyper-parameter (λ) on HMDB51. When a two-layer MLP with a 1024-dimensional hidden layer is used as the SE module, we obtain the lowest and second-lowest classification accuracy with different multimodal enhanced hyper-parameter choices. It is mainly because the MLP has a relatively simple architecture and thus fails to fully explore the semantic information of label texts. Therefore, we focus on the multi-head attention mechanism [46]. As the head number increases, the fitting ability of the SE module improves as well. However, the classification accuracy fluctuates with different head numbers and multimodal enhanced hyper-parameter choices. Firstly, the multimodal enhanced hyper-parameter determines the combination level of the two modalities and is influenced by the specific scenario. Furthermore, the overfitting problem is more likely to arise in the few-shot scenario because of the scarcity of samples. We need to balance the trade-off between head numbers and multimodal enhanced hyper-parameter choices. According to our results, we utilize a multi-head attention mechanism with 4 heads as the SE module and set the multimodal enhanced hyper-parameter to 0.5 in our further experiments. For the MPE module, applying a multi-head attention mechanism can further improve the performance of MORN, as shown in Tab. 3.
It turns out that the multimodal information can further enhance the prototypes with a better multimodal combination strategy. Although the weighted average is relatively simple with no parameters for training, our MORN with the weighted average can also achieve the second-best results of 86.3% on HMDB51 and 96.9% on UCF101, which already outperform those of prior methods. For simplicity and fewer parameters for training, we use the weighted average as our default MPE module.
Multimodal information. We employ the semantic information of label texts as an extra modality. To verify the importance of the multimodal information, we remove the text flow and replace the original ResNet-50 with a CLIP ResNet-50 in TRX. Different from the original ResNet-50, CLIP replaces the average pooling layer with an attention pooling layer. We make no modifications to the CLIP ResNet-50 to avoid catastrophic forgetting [29], which refers to the performance decline of the pre-trained model. As shown in Tab. 4, the model achieves similar performance to TRX when removing the text flow. Specifically, the model achieves performance gains of 2.3% on HMDB51 and 0.8% on Kinetics, but a descent of 0.2% on UCF101 without the text flow. However, MORN achieves significant gains of 10.7%, 0.8% and 5.7% on HMDB51, UCF101 and Kinetics over TRX respectively, which verifies the importance of the semantic information of label texts.
SE module and CLIP text encoder. In the text flow, we first introduce a frozen CLIP text encoder. Then, a semantic-enhanced (SE) module is used to obtain text features with more reliable semantic information. As shown in Tab.
5, MORN with the SE module and a frozen CLIP text encoder achieves the best results on HMDB51, UCF101 and Kinetics. On the one hand, the function of the SE module is similar to that of the enrichment module in [43], which is used to enrich the semantic information. On the other hand, a frozen language encoder is proven to be effective in several works [36,44], and our results further confirm it. The above two modules function together and help MORN perform better. When we remove the SE module, the results are similar and are lower than MORN's, whether we freeze the CLIP text encoder or not. Besides, using the SE module and fine-tuning the CLIP text encoder leads to the lowest results on all three datasets. It is mainly because the overfitting problem becomes more serious as the number of parameters increases during training. Our results indicate the importance of the SE module and a frozen CLIP text encoder.

Conclusion
We propose a novel Multimodal Prototype-Enhanced Network (MORN) for few-shot action recognition. Besides, to the best of our knowledge, we are the first to propose a prototype evaluation metric, called Prototype Similarity Difference (PRIDE). Our MORN uses the semantic information of label texts as multimodal information to compute more representative prototypes. Then, we use PRIDE to evaluate multimodal prototypes and analyze the correlation between PRIDE and accuracy. MORN achieves state-of-the-art results, and our experiments show the importance of both the multimodal information and the prototype-enhanced network.
Limitations. Our MORN only applies TRX as the baseline. In the future, we are interested in extending MORN to more prototype-based methods and further applying our PRIDE evaluation metric.

Figure 1.
Existing metric-learning frameworks (a) and (b) and our multimodal prototype-enhanced framework (c) for few-shot action recognition. The oval indicates the distribution of samples and the rectangle inside indicates the prototype in the final matching space. (a) enhances the prototype but ignores the multimodal information. (b) uses multimodal information but combines the two modalities on the feature level. Our multimodal prototype-enhanced framework (c) pays attention to both the semantic information of label texts and the prototype enhancement.
(a) PRIDE gains by MORN.(b) Accuracy gains by MORN.(c) Pearson correlation analysis.

Figure 3.
Performance gains and correlation analysis of PRIDE and accuracy on HMDB51. MORN achieves PRIDE gains in (a) and accuracy gains in (b) in each category. (c) is a Pearson correlation analysis between PRIDE and accuracy, showing a positive correlation.

Figure 4.
t-SNE [45] projection of prototypes of each episode and real prototypes on HMDB51. Our proposed MORN in (b) has a better category separability and a higher silhouette value than the TRX baseline in (a).

Fig. 4 shows the resulting t-SNE projection. Compared to the TRX baseline, MORN has a better category separability, and prototypes are closer to the real prototypes. To quantify the results, we compute the silhouette values of the two methods. MORN outperforms TRX by 0.117, showing that MORN computes more representative prototypes for classification.
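For reference, the silhouette value used above can be computed with a small NumPy routine; this is a minimal sketch of the standard silhouette definition, not the paper's evaluation code:

```python
import numpy as np

def silhouette_score(X, labels):
    """Plain-NumPy silhouette: mean over samples of (b - a) / max(a, b),
    where a is the mean intra-cluster distance of a sample and b is the
    smallest mean distance to any other cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    vals = []
    for i, li in enumerate(labels):
        same = labels == li
        same[i] = False                      # exclude the sample itself
        if not same.any():                   # singleton cluster: undefined
            continue
        a = D[i, same].mean()
        b = min(D[i, labels == lj].mean() for lj in set(labels.tolist()) - {li})
        vals.append((b - a) / max(a, b))
    return float(np.mean(vals))

# Two well-separated toy clusters should score close to 1.
score = silhouette_score([[0, 0], [0, 1], [9, 9], [9, 10]], [0, 0, 1, 1])
```

Values near 1 indicate tight, well-separated clusters, which is the sense in which MORN's prototypes are "more representative" here.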

Figure 5.
Ablation study of varying the SE module and the multimodal enhanced hyper-parameter (λ) on HMDB51. We utilize a multi-head attention mechanism with 4 heads as the SE module and set the multimodal enhanced hyper-parameter to 0.5.

Table 1.
State-of-the-art comparison on four few-shot action recognition datasets in terms of classification accuracy. "Weighted average" and "Multi-head attention" denote different choices of the MPE module. The bold font and the underline indicate the best and the second-best results, respectively. For simplicity and fewer parameters, we apply the weighted average as our default setting.

Table 2.
Average PRIDE values across prototypes computed in C_novel on HMDB51, UCF101 and Kinetics. MORN achieves the best results on all three datasets.

Table 3.
Ablation study of the MPE module on HMDB51, UCF101 and Kinetics. "Concat" indicates a concatenation operation and "Head" indicates the head number in the multi-head attention mechanism. The bold font and the underline indicate the best and the second-best results, respectively. With a multi-head attention mechanism, MORN can further improve its performance.

Table 4 .
Ablation study of the multimodal information on HMDB51, UCF101 and Kinetics.MORN achieves significant gains with the semantic information of label texts.

Table 5 .
Ablation study of the SE module and the CLIP text encoder on HMDB51, UCF101 and Kinetics.With the SE module and a frozen CLIP text encoder, MORN achieves the best results on all three datasets.