Image Similarity Using an Ensemble of Context-Sensitive Models

Image similarity has been extensively studied in computer vision. In recent years, machine-learned models have shown their ability to encode more semantics than traditional multivariate metrics. However, when labelling semantic similarity, assigning a numerical score to a pair of images is impractical, making it difficult to improve and compare models on the task. In this work, we present a more intuitive approach to building and comparing image similarity models based on labelled data in the form of A:R vs B:R, i.e., determining whether an image A is closer to a reference image R than another image B. We address the challenges of sparse sampling in the image space (R, A, B) and the biases in models trained with context-based data by using an ensemble model. Our testing results show that the constructed ensemble model performs ~5% better than the best individual context-sensitive models. It also performs better than models directly fine-tuned on mixed imagery data, as well as existing deep embeddings, e.g., CLIP and DINO. This work demonstrates that context-based labelling and model training can be effective when an appropriate ensemble approach is used to alleviate the limitations of sparse sampling.


INTRODUCTION
Similarity between images, which has been studied for decades, is crucial for various computer vision tasks, e.g., content-based retrieval [21] and image recognition [31]. In recent years, deep embeddings have become available in metrics or models for estimating image similarity, especially in machine-learned (ML) models, e.g., through metric learning [16] or contrastive self-supervised learning [15]. However, as shown in Figure 1, deep embeddings are not always aligned with human annotations in terms of judging semantic similarity. Moreover, while similarity scores typically take numerical values (e.g., 0.45), there is no easy way to obtain such values as ground truth data for training or testing, making it difficult to improve or compare the performance of those deep models on the task of semantic similarity between images.

Figure 1: Each arrow points to the candidate which is considered closer to the reference by the model(s) or human annotation. Visual similarity scores computed by deep models are not always aligned with human annotations.

All data, annotations, and source code used for this work can be found at https://github.com/Zukang-Liao/Context-Sensitive-Image-Similarity.
Humans' perception of image similarity is often context-sensitive (CS). In some labelling processes, binary scores were assigned to image pairs in relation to a reference image (i.e., a context), i.e., is A more similar to R than B? Such labelling processes have been shown to be more consistent and objective, and have been used in image retrieval [35], face recognition [29], and the evaluation of synthesized images [37]. In these areas, the existing databases are typically used to train models that can identify closely related images, i.e., either the images (A, R) or (B, R) are very similar. For the general problem of image similarity, A and B can both be unrelated to R, as exemplified in Figure 1. Ideally, one might wish to have a vast number of image triples (R, A, B) randomly selected from an image domain D. However, it would be costly to label these triples. Therefore, due to the gigantic data space and the limited amount of labelled data, directly fine-tuning deep models might not be effective (see Section 5.3).
In this work, we considered an alternative approach, in which we selected only a small set of n reference images, and the labelling effort ensured adequate sampling of pairs (A, B) in the context of each selected reference image R_i, i = 1..n, as illustrated on the left of Figure 2. Each group of labelled triples is referred to as a context-sensitive (CS) data cluster. We then obtained n context-sensitive (CS) models, each of which was fine-tuned on one CS data cluster (w.r.t. a reference image R_i). Our experiments show that these CS models are able to improve the performance only when unseen triples contain reference images that are similar to R_i, e.g., when they are both flowers. We refer to such improvement as local improvement, and improvement on the entire dataset as global improvement. We show how the performance of these CS models gradually improves locally but not globally during fine-tuning in Section 5.5. To fully utilize the advantage of each CS model and improve the performance on the entire dataset, we introduce two different ways to build ensemble models. To consolidate our proposed method, we compared our ensemble models with 1) existing deep embeddings, e.g., CLIP [30] and DINO [3], 2) individual CS models, 3) models directly fine-tuned on the entire dataset where all labelled triples (R, A, B) are amalgamated, and 4) elementary ensemble models, e.g., majority voting. Our testing demonstrates that it is feasible to use CS data to develop models with little or very low context sensitivity, providing an efficient and effective approach for sampling image triples in the vast data space D³.
Contributions. In summary, (1) we revisit the problem of semantic similarity between images and introduce a dataset with 30k labelled triples, facilitating improvement and comparison on the task of image semantic similarity; (2) we evaluate and compare the performance of existing deep embeddings, e.g., ViT or CLIP, and image retrieval models/algorithms, e.g., CVNet [18], on the collected dataset; (3) we find that fine-tuning directly on the collected dataset is not effective due to the huge data space and the limited amount of labelled data; (4) we find that, by fixing the reference image R, our CS models are able to improve the performance locally (when unseen reference images are similar to R, e.g., when they are both mountains), but not globally; (5) we provide two novel methods for constructing ensemble models from our CS models to improve the performance globally; and (6) we conduct extensive experiments to compare the proposed approach with existing methods and some more conventional solutions, and we show that the proposed method is efficient and effective when data sampling is sparse and labelling resources are limited.

RELATED WORK
In the literature, prevalent feature extractors, such as histograms of gradient/color or local binary patterns, mainly focus on the visual attributes of images, with semantic information often being overlooked. For this reason, Wang et al. [33] introduced SDML, which utilized the geometric mean with normalized divergences to balance inter-class divergence. Franzoni et al. [11] combined different distance measures, e.g., WordNet [22] and Google similarity [4], and tested their method on 520 random pairs collected from Flickr. Similarly, Deselaers et al. [6] studied the relationship between visual and semantic similarity on ImageNet, and they introduced a new distance metric which was shown to be effective on image classification tasks. Zhang et al. [36] introduced a differentiable earth mover's distance (DeepEMD), and their method was proven effective on various image classification tasks under a k-shot setting. However, unlike context-based similarity, traditional scores are not consistent among different metrics and do not always have physical interpretability. Additionally, to the best of our knowledge, existing datasets containing triples where the two candidates can both be different from the reference are all relatively small, e.g., 520 labelled pairs [11] or 1.7k labelled triples [35]. Therefore, it is necessary to revisit the context-based similarity problem and provide a relatively larger dataset.
The BAPPS dataset [37] consists of 26.9k triples of a reference and two distorted candidates (64×64 patches), with two-alternative forced choice (2AFC) similarity annotations for the triples. Similarly, DreamSim [12] provided 20k triples of a reference and two synthesized candidate images. D'Innocente et al. [7] provided 10,805 triples of women's dress images. Yoon et al. [35] ordered 1,752 triples of random images and defined a metric to evaluate the performance of image retrieval models. However, all the existing triples are carefully selected or synthesized. Therefore, at least one of the two candidates is noticeably similar to or almost the same as the reference image. In this work, we extend the study of image similarity to arbitrarily sampled candidates.
For image similarity, the data space is gigantic. Therefore, Wray et al. [34] used proxies to greatly reduce the labour cost of annotation. Similarly, Movshovitz-Attias et al. [23] used static and dynamic proxies to improve models' performance on image retrieval and clustering tasks. Given an anchor image and a smaller subset of data points (candidates), they defined the proxy as the one with the minimum distance to the anchor image. In this way, they showed that the loss over proxies is a tight upper bound of the original one. Aziere et al. [1] trained an ensemble of CNNs using hard proxies to compute manifold similarity between images, and their method was proven effective for image retrieval tasks. Similarly, Sanakoyeu et al. [28] introduced a divide-and-conquer training strategy which divided the embedding space into multiple sub-spaces and then assigned each training data object a learner according to the sub-space in which the object was located.

DATASET
As part of this work, we provide a new image similarity dataset (CoSIS), which currently consists of 30k labelled triples. The CoSIS dataset has 8k context-sensitive (CS) triples, which are divided into eight CS training sets (1k each), namely Indoor, City, Ocean, Field, Mountain, Forest, Flower, and Abstract. CoSIS also contains 22k context-convolute (CC) triples, which are divided into two subsets: a validation set (12k) and a testing set (10k) for evaluating all models in an unbiased manner. As shown in Table 1, unlike existing datasets of triples in the literature, e.g., BAPPS [37], in both the CS and CC portions of CoSIS, the two candidates A and B are selected randomly. Therefore, there is no guarantee that either candidate is semantically similar to or the same as the reference image R. The images of the triples are from the BG20k dataset [19], which consists of 20k background images. Hence, the data space of the triples is of size ‖D³‖ = (20k)³.
Each triple was labelled by three annotators, with an inter-rater reliability score of 0.947, which is higher than in most cognitive tasks, e.g., emotion detection and many NLP tasks [9]. We discarded triples when: (i) the annotators considered the two candidates A and B to be very similar and equally distant from the reference (e.g., they are both snowy mountains), and (ii) both A and B are totally irrelevant to the reference image R (e.g., a desert and an ocean are both almost completely irrelevant to a kitchen). With random selection, cases of (i) are relatively rare (≤ 4%), while cases of (ii) are more common (around 14%).
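The discard rule above can be sketched as a small filter over per-annotator flags. This is a minimal sketch under our own assumptions: the field names and the use of a majority vote over the three annotators are hypothetical, since the paper does not specify the exact aggregation rule.

```python
def keep_triple(flags):
    """Decide whether a labelled triple (R, A, B) should be kept.

    flags: one dict per annotator with boolean keys
      'tie'        -- case (i): A and B very similar and equally distant from R
      'irrelevant' -- case (ii): both A and B totally irrelevant to R

    A triple is discarded when a majority of annotators flag either case
    (majority voting is our assumption, not stated in the paper).
    """
    n = len(flags)
    tie = sum(f["tie"] for f in flags) * 2 > n
    irrelevant = sum(f["irrelevant"] for f in flags) * 2 > n
    return not (tie or irrelevant)
```

With three annotators, two matching flags are enough to discard a triple under this sketch.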
When training and evaluating each CS model on its CS data cluster, we split the 1,000 triples into 667 for training and 333 for validation.
Context-Convolute (CC) Data. These randomly selected triples were labelled to aid the analytical ensemble strategies and to evaluate fairly the performance of all models concerned. The CC dataset has 22,008 triples with 2,330 unique reference images. The three images in each triple are randomly selected, and each unique reference image has at least 9 labelled triples, so the testing results for each individual reference image are statistically meaningful. We further split the 22k random triples into a validation set (12,006 triples with 1,320 unique reference images) and a testing set (10,002 triples with 1,010 unique reference images). The validation set is used to construct ensemble models and to directly fine-tune deep models (which is not effective) for comparison, while the testing set is used for comparing the global performance of all models/algorithms concerned. Note that the testing set does not overlap with either the validation set or any CS training set, in terms of both reference images and candidate images.
Data Cleaning. When the three annotators disagreed, we used majority votes as the final labels. In the original labelled triples, there were some loops (e.g., with reference R: A is closer than B, B is closer than C, and C is closer than A). We found that only 0.11% of triples were involved in at least one loop, and the longest loop involved four candidates. These triples were manually removed.
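Loops of the kind described above are cycles in a directed graph whose edges encode "closer-than" judgements for one reference image. A minimal sketch of how such cycles can be detected (our own illustration; the paper does not describe its cleaning implementation):

```python
def find_cyclic_candidates(preferences):
    """preferences: list of (winner, loser) pairs for ONE reference image,
    where (a, b) means annotators judged a closer to the reference than b.
    Returns the set of candidates involved in at least one preference cycle."""
    graph = {}
    for a, b in preferences:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())

    def on_cycle(start):
        # Iterative DFS: a path from `start` back to `start` is a cycle.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for nxt in graph[node]:
                if nxt == start:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    return {node for node in graph if on_cycle(node)}
```

Triples whose candidates appear in the returned set would then be flagged for manual removal.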

METHODOLOGY
As discussed earlier, the data space of triples (R, A, B) is huge. Since human intelligence is typically developed in a context-sensitive manner (e.g., most people grew up in one small region), we explore a methodology for developing similarity models based on context-sensitive (CS) learning. As illustrated in Figure 3, we selected several representative reference images, R_i, i = 1..n, and for each R_i, we labelled a set of image pairs (A, B) in relation to R_i. We fine-tune each CS model on one of the CS clusters. We then use the validation set of the context-convolute (CC) dataset (triples with random reference images) to conduct testing/analysis and to construct ensemble models from the n CS models. Finally, we test and compare all models concerned on the testing set of the CC dataset. Due to the limited size of the annotated dataset, in this work, we focus on 1) showing that directly fine-tuning on the entire dataset is not effective (see Section 5.3), 2) demonstrating that, by fixing one reference image (R_i) to form a CS data cluster, our CS models are able to improve local performance, i.e., when unseen reference images are similar to R_i, and 3) addressing the lack of labelled data by constructing ensembles of our CS models to improve global performance. We leave studies on more advanced triplet losses and deep metric training paradigms for future work, when more annotated data are available. In the following subsections, we detail the fine-tuning of CS models and the construction of ensemble models.

Fine-tuning Models
We used a simple paradigm to fine-tune our CS models on each CS data cluster. For the purpose of comparative evaluation, we also used the same approach to fine-tune two global models, M1 and M2. The former is trained on a mixture of CS data (with fixed reference images), while the latter is trained on the validation set of the CC data (triples with random reference images), as shown in Figure 3.
Training Paradigm. As shown in Figure 4, we use a standard training procedure with both a triplet loss and a cross-entropy loss for the ranking block (a binary classifier). The ranking block is helpful when the data is sparse or when the backbone is not a large model, e.g., ResNet18. When the backbone is large, e.g., ViT, LoRA [14] is used to reduce the number of trainable weights.
Context-based Triplet Loss. The (R, A, B) triples used in this work are conceptually similar to the traditional contrastive/triplet loss setting (anchor, positive, negative) in deep metric learning. However, instead of pulling positive examples closer to the anchor whilst pushing negative examples away from the anchor, the selection of the two candidates A and B is random. One can swap A and B, or flip the annotation, for similarity augmentation. Therefore, it is not always appropriate to push B away or pull A closer. Formally, we define the triplet loss function as

L_tri(R_i, A_i, B_i) = s_i · [ d(f(A_i), f(R_i)) − d(f(B_i), f(R_i)) ],

where f represents the backbone of an ML model (M), f(x) denotes the embedding of an input x to f, d is a traditional distance function between two embeddings (e.g., cosine distance), and s_i is the annotated similarity label of the triple (R_i, A_i, B_i), which also controls the sign of the loss function.
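A minimal NumPy sketch of a sign-controlled triplet loss of this kind, using cosine distance. This is our illustration of the idea, not the paper's implementation; the paper's exact formulation (e.g., any margin or hinge) may differ.

```python
import numpy as np

def cosine_distance(u, v):
    # 1 - cosine similarity between two embedding vectors.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def context_triplet_loss(f_r, f_a, f_b, s):
    """Sign-controlled triplet loss.

    f_r, f_a, f_b: embeddings of R, A, B produced by the backbone;
    s: +1 if A is annotated as closer to R than B, -1 otherwise
       (s flips when A and B are swapped for augmentation).
    Minimizing the loss pulls the annotated-closer candidate towards R
    relative to the other candidate."""
    return s * (cosine_distance(f_r, f_a) - cosine_distance(f_r, f_b))
```

Note that because s flips when the candidates are swapped, the loss is symmetric under the (A, B)-swap augmentation described above.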

Ensemble Strategies
Our ensemble strategies are constructed based on the performance of each CS model. In this section, we first detail how we analyse the testing results of each model on the validation set of our CC dataset (triples with random reference images), and then how we construct ensemble models based on this analysis.
Context-Convolute (CC) Testing. While one may train a set of CS models, one would like to use these models to construct an ensemble that can be applied to other contexts, i.e., when the testing triples include unseen reference images. In our CC dataset, we include multiple triples for each of ‖T_R‖ random reference images, and for each random reference image, there are at least nine labelled triples (see Section 3). Therefore, in addition to a global accuracy score, we can report the testing results on the CC dataset per individual reference image. For all triples (R_k, A, B) sharing the same reference image R_k, and each CS model M_j, j = 1..n, the testing yields a correctness indicator. The total number of such indicators is ‖T_R‖ × n. The testing can also yield additional information, such as confusion matrices and uncertainty or confidence values. The CC testing results can inform the construction of ensemble models. For an arbitrary reference image R_k, its feature vector Θ_k determines its m-D coordinates in the feature space. When all triples (R_k, A, B) sharing the same reference image R_k are tested against a CS model M_j, the correctness indicator can be considered as a sample of a correctness manifold at position Θ_k. With n CS models, we have n such manifolds based on correctness indicators. Testing a CS model M_j on the validation set provides us with a way to establish an approximate model of the manifold that can be used to predict the correctness of applying M_j to a previously-unseen image triple, as shown in Figure 5.
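The per-reference aggregation described above (turning triple-level correctness indicators into one score per reference image) can be sketched as follows. The interface is our own; the paper does not prescribe one.

```python
from collections import defaultdict

def per_reference_accuracy(results):
    """results: list of (reference_id, correct) pairs, one per tested triple,
    produced by running one CS model on the CC validation set.
    Returns {reference_id: fraction of that reference's triples the model
    ranked correctly}. Each reference has >= 9 labelled triples in the CC
    set, so these per-reference scores are reasonably stable."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for ref, correct in results:
        totals[ref] += 1
        hits[ref] += int(correct)
    return {ref: hits[ref] / totals[ref] for ref in totals}
```

These per-reference scores are exactly the samples of the correctness manifold at each reference's feature-space position.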
Ensemble based on Credibility Maps. An m-D credibility map of a model M_j is a discrete partition of the m-D feature space into a number of m-D cells, where each cell stores a value (or values) indicating the probability of M_j being correct when it is applied to any image triple (R_k, A, B) whose reference feature vector falls into the cell. Figure 5 illustrates such credibility maps in 1D and 2D sub-spaces. The two plots above show the 2D credibility maps of two CS models fine-tuned on the flower data cluster and the ocean data cluster respectively. Each of the 2D manifolds is sampled on 12k triples with 1,320 different reference images, and the 2D feature subspace is partitioned into 200² cells. The two images below include four line plots representing four 1D credibility maps, each resulting from the projection of a set of testing results. From Figure 5, we can observe that these credibility maps provide useful information about the past and potential performance of different CS models in different parts of the feature space.
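A 1D credibility map of the kind described above can be built by binning one feature dimension and storing the per-cell fraction of correct decisions. This is a minimal sketch: the 200-cell resolution follows the paper, but the interface and the NaN convention for empty cells are our assumptions.

```python
import numpy as np

def build_credibility_map(features, correct, bins=200, lo=0.0, hi=1.0):
    """Build a 1D credibility map for one CS model.

    features: (N,) feature values of the reference images (one per triple);
    correct:  (N,) booleans, whether the model ranked each triple correctly;
    bins:     number of cells partitioning the [lo, hi) feature range.
    Returns a (bins,) array with the fraction of correct decisions per cell;
    cells with no samples are NaN."""
    features = np.asarray(features, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Map each feature value to a cell index, clipping to the valid range.
    cells = np.clip(((features - lo) / (hi - lo) * bins).astype(int), 0, bins - 1)
    totals = np.bincount(cells, minlength=bins)
    hits = np.bincount(cells, weights=correct, minlength=bins)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(totals > 0, hits / totals, np.nan)
```

A 2D map is analogous, with a pair of cell indices per reference instead of one.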
One ensemble strategy is to determine, for any image triple (R, A, B), how much each CS model should contribute to the decision. The feature vector of the reference image R is used to look up relevant cells in one or more credibility maps. Partitioning a high-dimensional feature space will likely result in many empty cells. A practical solution is for an ensemble algorithm to consult several low-dimensional credibility maps for each CS model and aggregate the credibility values into a single credibility score per CS model. The scores of the different CS models can then be used to determine the contribution of each CS model to the final decision, specifically for the image triple(s) (R, A, B) concerned; for any previously unseen triple, these scores serve as the tailored ensemble weights of the CS models.
ML-Based Ensemble Strategy. The weights of the CS models can also be produced by another ML model, which is trained using the feature vector of R_k as the input and the accuracy scores on all triples (R_k, A, B) (sharing the same reference image R_k) as the correctness label. In theory, the weights could be jointly learned by a large ML model with a larger dataset. In this work, we show that, with a relatively small validation set (12k triples with 1,320 unique reference images), we can still train a simple ML model to predict the performance of each CS model. As shown in Figure 6, we first extract features of the reference images in the validation set using a neural net (e.g., ViT), and then use a dimensionality reduction method (e.g., PCA) to counteract the sparseness of the annotated data. In our main implementation, we used 64 dimensions. The feature of a reference image R_k is fed to several multi-layer perceptrons (MLPs), each of which is trained to predict the accuracy score of one CS model on all triples with R_k. At deployment, given a random triple, the MLPs estimate the likelihood of each CS model making a correct decision. The normalized scores are used as the weights determining the contribution of each CS model for the specific triple.
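The deployment step above (normalizing the per-model likelihoods into weights and combining the models' votes) can be sketched as follows. The simple sum-normalization and the weighted-vote combination rule are our assumptions; the paper only states that normalized scores weight each CS model's contribution.

```python
import numpy as np

def ensemble_decision(model_votes, predicted_acc):
    """Combine the CS models' decisions for one triple (R, A, B).

    model_votes:   per-model decisions, +1 ('A is closer to R') or
                   -1 ('B is closer to R');
    predicted_acc: each model's estimated likelihood of being correct for
                   this reference, as produced by the per-model MLPs.
    Returns the ensemble decision, +1 or -1."""
    votes = np.asarray(model_votes, dtype=float)
    acc = np.asarray(predicted_acc, dtype=float)
    weights = acc / acc.sum()  # normalize the predicted likelihoods
    return 1 if np.dot(weights, votes) >= 0 else -1
```

With uniform weights this reduces to majority voting, which is the elementary baseline the paper compares against.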

EXPERIMENT
To consolidate our methodology, we conducted extensive tests. In this section, we use experimental results to show:

• Local improvement of CS models. We fine-tuned eight CS models M_i, i = 1..8, each on the training split of one CS data cluster (667 triples with a fixed reference image R_i, see Section 3). We show that our CS models improve performance only when unseen triples contain reference images that are similar to R_i (local improvement), but not on the entire CC testing set of 10k random triples (no global improvement).
• Limited global performance of direct fine-tuning. We directly fine-tuned two global models, one on the CC validation set (12k triples with random reference images), and the other on the amalgamated CS training sets (8k triples in total). Due to the limited amount of labelled data, neither global model performs well on the CC testing set (10k).
• Effective ensemble models of CS models. We analyse the performance of our CS models on the CC validation set (12k), and construct ensemble models that improve the global performance on the CC testing set (10k).

Fine-tuning Context-Sensitive Models
When fine-tuning each CS model, we use the following settings: 1) learning rate: 10^-4 for ViT with LoRA, and 10^-5 for other architectures; 2) number of epochs: 25; 3) loss function: cross-entropy + 0.1 × triplet loss; 4) batch size: 8; 5) optimizer: Adam; 6) single-image augmentation: random resized crop and horizontal flip; 7) triple augmentation: randomly swap candidates A and B; 8) all images resized to 224×224. We did not carefully tune the hyper-parameters. Due to the small size of each CS cluster, we were able to fine-tune all of our CS models on a single laptop with an Apple M1 chip, including ViT with LoRA. The training and fine-tuning time of each CS model varies from one day to three days (with ViT and LoRA). Even with these limited resources and the limited amount of labelled data, we are able to improve the performance on the problem of context-sensitive image similarity using the proposed methodology.

Performance of Context-Sensitive Models
Performance on the CS Training Sets. As shown in Table 2, using embedding distances from image retrieval models achieved around 60%∼78%, while large self-supervised models, e.g., CLIP/DINO, and supervised models, e.g., ViT, were able to achieve 78%∼82%. To investigate whether we could improve the performance, we fine-tuned several models of different architectures. As shown in Table 3, as expected, the fine-tuned CS models outperform all existing methods on their corresponding CS datasets. In Table 12 in the appendix, we show more results using different architectures. Due to the limited amount of labelled data, larger architectures might not be effective: ResNet34 and ResNet50 reached around 60%∼65%, whilst ResNet18 and VGGNets reached around 81%∼84% on average. Therefore, we fine-tuned larger and deeper models with LoRA [14] to boost the performance. ViT with LoRA (denoted as ViT-LoRA) achieved the best averaged accuracy (around 85%), which is aligned with the most recent studies on evaluating synthesized images using labelled triples [12, 37].
In Table 3, we highlight the best CS model fine-tuned on each CS data cluster. These eight CS models achieved 84%∼91%, which is significantly better than all of the existing methods in Table 2. The results suggest that our CS models can outperform existing models locally: when triples contain a seen reference image R_i and unseen candidates A and B. In Section 5.5 and Figure 7, we visualize the CS training procedure, and the results also show that our CS models are able to improve performance locally: when the triples contain reference images that are similar to R_i.

Performance on the CC Validation Set (12k) / Testing Set (10k)
To evaluate how the selected CS models perform on triples with random unseen reference images, we run each CS model M_i on the CC validation and testing sets. As shown in Table 4, the CS models achieved 73%∼79.5%, which is lower than the results they achieved on their CS clusters. The results are also slightly worse than or similar to those of existing models, e.g., ViT, as shown in Table 6. This shows that our CS models only improve performance locally (on triples with similar reference images) but not globally (on triples with random reference images).

Performance of Global Models
One straightforward potential solution for improving global performance is to fine-tune deep models on triples with random reference images. Therefore, we fine-tuned two global models, M1 and M2. The former was fine-tuned on an amalgamation of the eight CS data clusters (8k triples in total), and the latter on the CC validation set (12k triples with 1,320 random reference images). Three architectures were used. Table 5 shows the results of testing these models on the CC testing set (10k triples with 1,010 unique and random reference images).
The global models trained on the CC validation set (M2) achieved 77%∼80% on average, which is similar to existing models, e.g., ViT: ∼80% and ResNet18: ∼77%, that were not trained on any of our labelled data. Fine-tuning on all of our context training sets (8k) (M1) does not improve the performance on the testing set either. Moreover, fine-tuning a single model on all of the amalgamated context datasets led to a decrease in accuracy on the testing set (69%∼72%) compared with the untrained counterparts (77%∼80%). This might be caused by the sparsity of the context training sets, which contain only eight reference images in total.
The worse performance of these two global models (compared with CLIP or DINO) shows that directly fine-tuning on random triples does not improve performance, due to the huge data space and the limited amount of labelled data. By contrast, fixing the reference image R_i hugely reduces the amount of data needed to fine-tune the neural nets. Therefore, each single CS model is able to outperform existing algorithms on triples with seen reference images, or reference images that are similar to R_i (Figure 7). To boost global performance (on random triples), one plausible approach is to construct ensemble models that utilize the local improvement of each CS model.

Performance of Ensemble Models
Experiments on the Validation Set (12k). Based on each CS model's performance on the validation set (e.g., as visualized in Figures 5 and 7), we obtain the weights of the ensemble models using the two methods specified in Section 4: PCA-based and MLP-based. As shown in Table 6, both of our ensemble models perform 8%∼10% better than existing models, the best CS model, the global models, and the simple ensemble approach (majority voting) on the validation set. The improvement is expected, as the ensemble weights are constructed based on the performance of the CS models on the validation set.
Experiments on the Testing Set (10k). To assess the performance of our ensemble models on random unseen triples, we run the ensemble models on the CC testing set (10k triples with 1,010 random and unique reference images), which has no overlap with the CC validation set or the CS training sets (as stated in Section 3). Table 6 shows that both of our ensemble models perform ∼5% better than existing models, the best CS model, the global models, and majority voting. The results show that our analytical ensemble approaches are able to improve global performance (i.e., on random triples), and perform the best on the task of image semantic similarity.

Result Analysis and Training Visualization
Number of Context-Sensitive Models. Figure 8 shows the accuracy on the testing set of the three ensemble approaches (majority voting, PCA-based, and MLP-based) when using different numbers of CS models to form the ensemble models. For each number of selected CS models, we run experiments on all possible combinations. To be specific, when selecting k of the n = 8 CS models, where k ∈ {1, 2, ..., n}, we run experiments on all C(n, k) = n! / (k!(n − k)!) combinations. For the MLP-based ensemble approach, we repeat the same experiment three times for each combination, which leads to 3 · C(n, k) runs of experiments for that approach. The results show that the MLP-based approach consistently performs the best, and both of our proposed approaches (MLP-based and PCA-based) consistently perform better than the simple ensemble method, i.e., majority voting. In addition, the results indicate that the accuracy scores on the testing set start to saturate when we use more than six CS models. This might be because the field-sensitive, forest-sensitive, and mountain-sensitive models learn similar rules and perform similarly on the testing set; assembling these similar CS models might therefore not lead to a significant increase in global accuracy on the testing set.

Visualization of CS Fine-tuning. We visualize the CS fine-tuning process by showing testing results (on the validation set, 12k random triples), as well as reporting a global accuracy score (on the validation set) at the end of each training epoch. Each scatter point represents the accuracy score of a CS model on all triples (R_k, A, B) sharing the same reference image R_k, and the scatter point is located according to the feature vector of R_k (as stated in Section 4). As shown in Figure 7, we highlight the area where we see the local improvement (around R_i). Whilst the local performance in the highlighted area has improved, the global accuracy (on the entire validation set) remains almost unchanged, i.e., ∼74.5% and
∼76% for the city- and flower-sensitive models respectively. This shows that CS fine-tuning is able to improve local performance but not global performance; we show more visualized fine-tuning processes of other CS models in Figure 10 in the appendix.

Ensemble of Ranking Blocks. As shown in Table 7, the ensemble model of ranking blocks achieved around 78%∼84% on the validation set, and ∼70% on the testing set. The results are significantly better than any single CS model (binary classifier) and majority voting. However, the results are 10%∼15% worse than those obtained using embeddings, as shown in Table 6, which is expected as the ranking blocks were trained from scratch using only 667 triples. Similarly to Figure 8 in Section 5.5, Figure 9 in the appendix shows that the accuracy scores of the ensemble models increase as the number of CS models (using binary classifiers) increases. Due to the worse performance of the ranking blocks on random unseen triples, we focus on the ensemble models constructed from the CS models based on embedding distances, rather than on these binary classifiers.
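The size of the combination sweep in Section 5.5 follows directly from the binomial coefficient; a small sketch (the per-k bookkeeping is our own framing of the experiment counts stated above):

```python
from math import comb

def num_runs(n=8, repeats_mlp=3):
    """Experiment counts when evaluating every k-of-n subset of CS models:
    C(n, k) combinations per k, and the MLP-based ensemble repeats each
    combination `repeats_mlp` times (three in the paper)."""
    per_k = {k: comb(n, k) for k in range(1, n + 1)}
    mlp_runs = {k: repeats_mlp * c for k, c in per_k.items()}
    return per_k, mlp_runs
```

For n = 8 this gives 255 subset evaluations in total per deterministic ensemble strategy, and three times that for the MLP-based one.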

Cross Validation of CS Models.
To investigate how the eight selected CS models perform on other types of unseen reference images, we test each CS model M_i on all CS datasets {D_1, D_2, ..., D_8}. As shown in Table 8 in the appendix, each CS model performs best when the reference image is the one it was trained for, i.e., the accuracy scores on the diagonal are the highest for each CS cluster. Comparing Table 8 with Table 2, we can make the following observations: (1) The fine-tuned CS models performed the best on their corresponding CS data clusters (84%∼91%), suggesting some advantages of context-sensitive training (local improvement). (2) Close examination shows that some CS models perform reasonably well on some other CS data clusters (e.g., the Indoor model on the City data cluster), but this does not occur consistently (e.g., the Forest model on the Indoor and Abstract data clusters). This suggests that (i) our CS models can, in some cases, be used on data that they have not seen, and (ii) if we can predict statistically how our CS models will perform on unseen reference images via testing and analysis, we can produce a stronger ensemble model.

Impact of Each CS Model.
To compare how the eight CS models contribute to the ensemble model, we construct ensemble models using seven CS models with one CS model being left out.
We run each ensemble model on the testing split of each CS cluster. The results are shown in Table 9 in the appendix. All of the ensembles perform relatively satisfactorily, even on the left-out and unseen clusters. One interesting observation is that all ensemble models perform well (≥93%) on the #Mountain data cluster, including the "No Mountain Model" ensemble. This suggests that knowledge of image similarity in the context of mountains might also be learned from other CS data clusters. We also test the eight leave-one-out ensembles on the 10k context-convolute testing set; the results are shown in Table 10 in the appendix. They show that CS models have different impacts under different ensemble strategies, e.g., the mountain-sensitive model is the most important for the MLP-based ensemble, whilst the PCA-based ensemble might consider the indoor-sensitive model the most important. Compared with the ensemble model using all eight CS models (Table 6), the ensemble models with only seven CS models perform slightly worse on average.
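The leave-one-out construction can be sketched as follows. This is an illustrative weighted-vote ensemble under fixed per-model weights (the paper's PCA- and MLP-based strategies instead predict the weights per input triple), and all names are our own:

```python
# Sketch of building a "leave-one-out" ensemble from CS models
# (cf. Tables 9-10). Each model is a callable model(r, a, b) -> 0/1;
# the weights stand in for per-model correctness estimates.

def ensemble_predict(models, weights, r, a, b):
    """Weighted vote: return 1 (B closer) if the weighted score reaches half the total weight."""
    score = sum(w * m(r, a, b) for m, w in zip(models, weights))
    return 1 if score >= sum(weights) / 2 else 0

def leave_one_out(models, weights, left_out):
    """Ensemble built from all CS models except the one at index `left_out`."""
    kept = [(m, w) for i, (m, w) in enumerate(zip(models, weights)) if i != left_out]
    ms = [m for m, _ in kept]
    ws = [w for _, w in kept]
    return lambda r, a, b: ensemble_predict(ms, ws, r, a, b)
```

Running the eight seven-model ensembles produced this way against each CS cluster's testing split mirrors the study reported above.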

5.6.4 Meta-CS models fine-tuned on mixed CS data clusters.
Inspired by [1], where randomized meta-proxies are shown to be more effective, we run experiments on meta-CS models fine-tuned on meta-CS clusters. In Table 11 in the appendix, we show the performance of meta-CS models fine-tuned on two or three CS clusters. To be more specific, we fine-tuned 1) the City/Indoor-sensitive model on the City and Indoor clusters, 2) the Nature-sensitive model on the Mountain, Forest, Ocean, and Field clusters, and 3) the Object-sensitive model on the Abstract and Flower clusters. The three meta-CS models achieved 70%∼73% on average when applied to the other CS clusters, similar to the performance of the directly fine-tuned global models (Table 5). Additionally, compared with the individual CS models (Table 3), the meta-CS models did not perform well on any individual CS cluster.
The results provided further evidence that fine-tuning a global model on a mixture of multiple CS clusters could not improve performance, especially when the number of CS clusters was small. Therefore, when constructing an ensemble model, we used only the CS models, each of which was fine-tuned on a single CS data cluster.

CONCLUSIONS
In this paper, we revisited the problem of image similarity and proposed a solution based on context-sensitive (CS) training datasets that contain image triples (R, A, B) focusing on only a few reference images. We trained a set of CS models, and our tests showed their ability to improve performance locally in their corresponding contexts, but not globally when applied to other contexts. We introduced a new approach to estimate a correctness manifold for each CS model based on imagery features and the testing results of that CS model. The estimated manifolds enable analytical ensemble strategies that predict the correctness probability of each CS model dynamically for each input triple (R, A, B) and determine the contribution of each CS model accordingly. Our extensive experiments showed that our proposed methods performed the best in comparison with existing models, simple ensemble models, individual CS models, and models directly fine-tuned on random triples (global models).
In addition, we have collected a dataset of 30k labelled triples, facilitating improvement and comparison on the task of semantic similarity between images. In future work, we will further explore the paradigm of constructing ensemble models from CS models, which in many ways bears some similarity to human learning.
All data, annotations, and source code used for this work can be found at https://github.com/Zukang-Liao/Context-Sensitive-Image-Similarity.

Figure 2: Given a training set of random triples, each annotated with which candidate is semantically closer to the reference, can a model learn from the training data and predict correctly for unseen triples (i.e., unseen reference images and unseen candidates)?

Figure 3: Workflow overview: each CS model is trained on a CS data cluster. An analytical ensemble model is obtained based on the performance of each CS model on the validation set. We also train global models using amalgamated data from the validation set and CS clusters for comparison.

Figure 4: To train each CS model, we concatenate the embeddings and train a small ranking block to conduct binary classification. The cross-entropy loss of the ranking block, a triplet loss, and LoRA [14] are used to assist in fine-tuning the backbone.

Figure 5: Ensemble approach (PCA): for all triples (R, A, B) sharing the same reference image R, we compute an accuracy score for each model. We visualize the accuracy scores of the reference images in our validation set using PCA or tSNE. Different models perform well in different areas. An ensemble method can be obtained based on the scatter plots.

Figure 6: Ensemble approach (MLP): the input to the MLP regressors is the feature vector of a reference image R, and the outputs are the predicted accuracy scores of each CS model on all triples (R, A, B) sharing that reference image. For any previously unseen triple, the outputs can be used as tailored ensemble weights for the CS models.
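As a rough sketch of this inference step, assuming a trained regressor and feature extractor are available (`mlp_regressor`, `extract_features`, and `mlp_ensemble` are illustrative names, not the released API):

```python
# Sketch of MLP-based dynamic weighting: the regressor predicts one
# accuracy score per CS model from the reference image's features,
# and those scores weight the models' votes for the triple (R, A, B).

def mlp_ensemble(mlp_regressor, extract_features, cs_models, r, a, b):
    weights = mlp_regressor(extract_features(r))   # predicted accuracy per CS model
    votes = [m(r, a, b) for m in cs_models]        # 0: A closer to R, 1: B closer
    score = sum(w * v for w, v in zip(weights, votes))
    return 1 if score >= sum(weights) / 2 else 0
```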

Figure 7: Visualized CS training: the local performance (highlighted areas) improves gradually during training, whilst the global accuracy remains stable. The second row shows the changes in the binary classifiers' performance when trained from scratch; it is more noticeable that the performance in the highlighted areas constantly improves during training. The first row shows the same improvement in the highlighted areas when using embeddings, especially when comparing the results at the beginning and end of CS training. This shows that CS training improves local performance for both binary classifiers and embeddings. The bluer, the more accurate the CS model being trained, whilst red indicates accuracy ≤75%.

Figure 8: Y-axis: accuracy of the ensemble methods on the testing set (10k). X-axis: the number of CS models used to form the ensemble model. Experiments are run on all combinations, e.g., when choosing two CS models, we run experiments on all C(8,2) = 8!/(2!·6!) = 28 combinations. For the MLP, we repeat each experiment three times per combination, e.g., for choosing two CS models, we run 3 × 28 = 84 experiments. The MLP-based approach consistently performs the best. Dashed lines inside the blobs are the quartiles of the data.

APPENDIX
Figure 9: Y-axis: accuracy of the ensemble methods on the testing set (10k). X-axis: the number of CS models used to form the ensemble model. Experiments are run on all combinations, e.g., when choosing two CS models, we run experiments on all C(8,2) = 8!/(2!·6!) = 28 combinations. For the MLP, we repeat each experiment three times per combination, e.g., for choosing two CS models, we run 3 × 28 = 84 experiments. The MLP-based approach consistently performs the best. Dashed lines inside the blobs are the quartiles of the data.

Figure 10: More visualized CS training processes: all of the CS models are able to improve local performance (highlighted areas) but not global accuracy. The local improvement of the abstract-sensitive model (on the right) is less noticeable because 1) there are not many "abstract" reference images in the validation set, and 2) the "abstract" images might not be grouped together when applying dimensionality reduction, e.g., tSNE or PCA.

Table 1: Comparison with existing datasets of triples.

Table 2: Local performance of existing supervised and self-supervised models on different context-sensitive testing clusters.

Table 3: Local performance of different CS models (trained on the CS training dataset) on the corresponding testing dataset.

Table 4: No significant global improvement for CS models. As also shown in Figure 7, CS models are able to achieve local improvement but not global improvement.

Table 5: Performance of global models on the testing set (10k).

Table 6: Global performance comparisons on the validation set (12k random triples) and testing set (10k random triples).

Table 7: Performance of ensemble models (ranking block) on the validation set (12k triples) and testing set (10k triples).
Binary Ranking Blocks. In addition to embeddings, we also construct ensemble models with the binary classifiers and test the ensembles on the randomly collected triples, i.e., our validation set (12k triples) and testing set (10k triples).