Anchor-aware Deep Metric Learning for Audio-visual Retrieval

Metric learning minimizes the gap between similar (positive) pairs of data points and increases the separation of dissimilar (negative) pairs, aiming at capturing the underlying data structure and enhancing the performance of tasks like audio-visual cross-modal retrieval (AV-CMR). Recent works employ sampling methods to select impactful data points from the embedding space during training. However, the model training fails to fully explore the space due to the scarcity of training data points, resulting in an incomplete representation of the overall positive and negative distributions. In this paper, we propose an innovative Anchor-aware Deep Metric Learning (AADML) method to address this challenge by uncovering the underlying correlations among existing data points, which enhances the quality of the shared embedding space. Specifically, our method establishes a correlation graph-based manifold structure by considering the dependencies between each sample as the anchor and its semantically similar samples. Through dynamic weighting of the correlations within this underlying manifold structure using an attention-driven mechanism, Anchor Awareness (AA) scores are obtained for each anchor. These AA scores serve as data proxies to compute relative distances in metric learning approaches. Extensive experiments conducted on two audio-visual benchmark datasets demonstrate the effectiveness of our proposed AADML method, significantly surpassing state-of-the-art models. Furthermore, we investigate the integration of AA proxies with various metric learning methods, further highlighting the efficacy of our approach.


INTRODUCTION
Metric learning endeavors to construct an embedded representation of the data, ensuring that similar pairs remain proximate while dissimilar pairs are distanced from each other within the embedding space [18,23,34,35]. It serves as a cornerstone in various domains, such as image retrieval and clustering [20,35], cross-modal learning [24], and person re-identification [11]. In practice, because metric learning methods establish an embedding space where similar samples are drawn closer together while dissimilar samples are distinctly separated, they have been applied to the AV-CMR task [40], refining the similarity among data samples to enhance the efficacy of retrieval across different modalities.
Recent works [13,18] on metric learning typically incorporate deep learning approaches to leverage their capability of learning robust feature representations. Because such deep-learning-based metric learning approaches [13] often require large amounts of data to acquire intricate feature representations, the imbalanced distribution of similar and dissimilar data pairs in massive training datasets leads to suboptimal results. Existing works [17,21,30,45] generate synthetic hard negative samples, where dissimilar data points are mistakenly identified as similar, by employing adversarial learning [21,45], linear interpolation [17], or non-linear interpolation [30], thereby exploiting the potential of numerous easy negatives. However, these works face challenges in controlling the number of inserted points to obtain optimal models and often overlook the need to identify robust anchors that fully capture the positive and negative distributions to refine the embedding space. Overall, the scarcity of training data points raises the problem of missing embeddings in existing methods. This leads to suboptimal learning of the embedding space and compromises the quality of sample representations, both positive and negative, consequently impairing the performance of deep metric learning. We introduce an AA proxy derived from the correlation graph for each embedding, facilitating the migration toward optimal embedding space learning.
To address this issue, we propose an innovative Anchor-aware Deep Metric Learning (AADML) method to uncover the intrinsic correlations among existing data points and compensate for the incomplete learning of the embedding space due to insufficient data points, as illustrated in Fig. 1. Our approach comprises three fundamental components: (1) as the manifold signifies the foundational structure and relationships within the data, we construct a correlation graph within each modality to facilitate the adept capture of the innate manifold structure. (2) To effectively capture correlations among similar samples and comprehensively consider each sample's contribution to the embedding space learning process, we leverage manifold pairs from the correlation graph across modalities using an attention-driven mechanism. For example, if we select an audio sample as an anchor point and identify the k nearest audio samples relative to the anchor from the correlation graph, we create k tuples as inputs of an attention-driven model to generate an anchor-aware (AA) proxy for the anchor. (3) Because AA proxies directly capture nuanced dependencies among similar pairs while indirectly mitigating the interference of dissimilar pairs, we employ these AA scores as sample proxies to calculate relative distances within a metric learning framework.
To validate the effectiveness of the proposed AADML approach, extensive experiments are conducted on two cross-validated audio-visual benchmark datasets. These datasets encompass annotation labels present in both audio and visual modalities and offer a comprehensive evaluation setup. Experimental results exhibit the superiority of our anchor-aware approach, achieving substantial performance enhancements and outperforming state-of-the-art models in AV-CMR tasks by 3.0% on the large VEGAS dataset and by 45.6% on the small AVE dataset in terms of mean average precision (MAP), respectively. Additionally, we demonstrate the applicability of the AA proxy across various metric learning methods, resulting in performance surpassing existing state-of-the-art losses in AV-CMR tasks.
The remainder of this paper unfolds as follows: Section 2 provides an overview of related works, Section 3 formulates the problem, Section 4 elaborates on the proposed Anchor-Aware Deep Metric Learning (AADML) approach, and Section 5 outlines the experimental setup and results. Lastly, in Section 6, we summarize our work and discuss potential future directions for the AV-CMR task.

RELATED WORK

Audio-visual Cross-modal Retrieval
Audio-visual cross-modal retrieval is a challenging task that aims to retrieve relevant content across different modalities, such as audio and visual [28,38,40]. Canonical Correlation Analysis (CCA) [4] is a classical linear technique used to find correlated patterns between two sets of variables. In the context of AV-CMR, CCA-based methods have been extensively studied to learn joint representations that capture the correlations between audio and visual data. Representative approaches include traditional CCA [4], K-CCA [19], Cluster-CCA [26], Deep CCA (DCCA) [1], C-DCCA [36], Triplet with Cluster-CCA (TNN-C-CCA) [41], and VAE-CCA [43]. While CCA-based methods offer interpretable embeddings, they struggle to capture the complex, non-linear relationships present in real-world audio-visual data.
Deep learning is a powerful approach for representation learning in AV-CMR, employing deep neural networks to learn joint embeddings by minimizing the distance between corresponding samples. Notable examples include CLIP [25], ACMR [32], DSCMR [44], BiC-Net [10], DCIL [46], CCTL [38], EICS [39], and MSNSCA [42]. These models have exhibited exceptional performance in audio-visual cross-modal retrieval [38,39,42], positioning them as valuable contributions to the field of representation learning. While prior efforts emphasize training intricate neural networks to acquire representations for AV-CMR, our approach introduces a novel direction by focusing on advanced metric learning techniques.

Metric Learning for Retrieval
Metric learning is of paramount importance in retrieval tasks, particularly when tailored to the specific requirements of AV-CMR. The objective is to develop a suitable distance or similarity metric that preserves the underlying structure of data across different modalities to enhance retrieval accuracy.
Traditional metric learning methods, such as Mahalanobis distances [5], focus on preserving intra-class similarity while increasing inter-class distance. However, they may struggle to capture non-linear relationships within complex audio-visual data. Deep metric learning has consequently gained significant traction across various retrieval tasks, including AV-CMR. Prominent techniques, such as contrastive loss [9] and triplet loss [11], aim to pull similar samples closer together and push dissimilar samples further apart in the learned embedding space. These methods excel at generating discriminative embeddings by analyzing relative distances between samples. Additionally, an array of deep metric learning variations [12,13,17,21,30,45] has been introduced, offering the potential to further enhance AV-CMR. For instance, the CCTL [38] model adapts triplet loss to audio-visual cross-modal data; however, it has limitations in terms of training difficulty and sensitivity to data scarcity.
Several works [7,15,31] have proposed more advanced metric learning along different lines. Notably, approaches harnessing hyperbolic geometry to capture the intricate correlations present in natural data have emerged [7,15,31], though they carry potential complexity and computational overhead. Similar problems arise with the generator models involved in data augmentation methods [17,22,45]. Data augmentation is an effective strategy to address data scarcity by generating synthetic samples through various transformations; however, its effectiveness is sensitive to the quality of the synthetic points and the specific generation methods employed, raising concerns about model reliability.
Instead of relying on data augmentation or mining effective data configurations through complex selection methods, our proposed approach efficiently computes global attention-driven dependencies between an anchor and other similar samples in parallel to dynamically weigh the correlations of their manifold structure. Using the resulting correlation scores as proxies for data samples in deep metric learning improves the performance of AV-CMR.

PROBLEM FORMULATION
In audio-visual cross-modal retrieval tasks, the primary challenge arises from the intricate features of audio and visual data, coupled with their different feature distributions. This disparity makes direct comparisons using basic distance or similarity metrics infeasible. Hence, cross-modal retrieval methods are needed that map audio and visual features into a shared embedding space, generating new features for each modality that can then be compared directly.
Assume an audio-visual dataset comprising n videos, symbolized as Σ = {v_i}_{i=0}^{n−1}, where each video consists of an audio-visual pair v_i = (a_i, x_i). Here, a_i ∈ R^128 denotes audio features extracted by the VGGish pre-trained model, and x_i ∈ R^1024 signifies visual features extracted by the Inception V3 pre-trained model. Each v_i instance is associated with a semantic label vector y_i ∈ {0, 1}^C, where C represents the number of semantic categories. In this context, y_ij is a binary value: it equals 1 if the sample belongs to the j-th category (j = 0, 1, 2, ..., C − 1), and 0 otherwise.
Our objective is to project each (a_i, x_i) pair into a shared label space, where the feature dimension aligns with the number of pre-defined labels. To enhance the learning of shared features, we introduce a novel metric learning approach aligned with triplet loss. This method creates a transformative proxy for each anchor. For instance, if we designate an audio sample a_i as an anchor, its transformative proxy is denoted as AA(a_i). This proxy captures intricate dependencies and generates associations through attention-driven mechanisms. Empowered by AA(a_i), samples belonging to the same semantic category as the anchor can be drawn closer, while samples from other categories are pushed apart.
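As a rough illustration of this setup, the following NumPy sketch projects a 128-d VGGish audio vector and a 1024-d Inception V3 visual vector into a shared 10-dimensional label space. The layer sizes, random weights, and function names are purely illustrative; the actual model is a trained neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_project(x, w1, b1, w2, b2):
    """Project a modality-specific feature vector into the C-dim label space."""
    h = np.maximum(x @ w1 + b1, 0.0)  # hidden layer with ReLU
    return h @ w2 + b2                # one logit per semantic category

C, HIDDEN = 10, 1024  # e.g. VEGAS has 10 labels

# separate projection branches for audio (128-d) and visual (1024-d) features
wa1 = rng.normal(scale=0.02, size=(128, HIDDEN))
wa2 = rng.normal(scale=0.02, size=(HIDDEN, C))
wx1 = rng.normal(scale=0.02, size=(1024, HIDDEN))
wx2 = rng.normal(scale=0.02, size=(HIDDEN, C))
b1, b2 = np.zeros(HIDDEN), np.zeros(C)

audio_feat = rng.normal(size=128)    # VGGish output
visual_feat = rng.normal(size=1024)  # Inception V3 output
f_a = mlp_project(audio_feat, wa1, b1, wa2, b2)
f_x = mlp_project(visual_feat, wx1, b1, wx2, b2)
# f_a and f_x now live in the same 10-dimensional shared label space
```

Once both modalities live in this common space, distances between f_a and f_x become directly comparable, which is the premise of everything that follows.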

APPROACH
In this section, we comprehensively present the Anchor-Aware Deep Metric Learning (AADML) model, a cohesive framework that combines: (4.1) a correlation graph-based manifold structure, (4.2) the anchor-aware (AA) proxy, and (4.3) leveraging the AA proxy in metric learning to enhance the accuracy of AV-CMR, as seen in Fig. 2.

Correlation Graph-based Manifold Structure
The utilization of manifold structure is crucial for facilitating our Anchor-Aware mechanism to accurately capture dependencies of similar pairs.We employ an approach based on correlation graphs to capture the underlying manifold structure inherent in data samples.The objective of this approach is to ensure that the similarity distance between data samples within the same manifold remains small, thereby promoting retrieval accuracy.
The set k-NN(a_i) contains the k nearest neighbors of a_i within the training set. The k-nearest neighbors algorithm consists of three main steps for identifying the nearest neighbors of a given reference vector a_i.
Firstly, we compute the cosine similarity between a_i from one modality and each vector x_j from the other modality in the dataset. The cosine similarity between two vectors a_i and x_j is defined in Eq. 2 as cos(a_i, x_j) = (a_i · x_j) / (‖a_i‖ ‖x_j‖). Secondly, after computing the cosine similarity for each pair (a_i, x_j), we rank the computed similarities for all pairs in decreasing order as a rank list, computed by Eq. 3,
where, given a query a_i with a fixed index i, we rank the similarity between a_i and all x_j, j = 0, 1, ..., n − 1. Thirdly, we select the top k similarities from the ranked list R(a_i, x_j) to obtain the k nearest neighbors. These selected neighbors, denoted as the set k-NN(a_i), can be represented as in Eq. 4, where j_1, j_2, ..., j_k are the top k indices of the rank list.
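The three steps above can be sketched as follows. This is a minimal NumPy version with illustrative names; the paper's implementation operates on batched embeddings.

```python
import numpy as np

def cosine_similarity(a, B):
    """Cosine similarity between one query vector and each row of B (Eq. 2)."""
    a_n = a / np.linalg.norm(a)
    B_n = B / np.linalg.norm(B, axis=1, keepdims=True)
    return B_n @ a_n

def k_nearest_neighbors(query, candidates, k):
    """Rank candidates by decreasing similarity (Eq. 3), keep the top k (Eq. 4)."""
    sims = cosine_similarity(query, candidates)
    rank_list = np.argsort(-sims)  # indices sorted by decreasing similarity
    return rank_list[:k]

query = np.array([1.0, 0.0])
candidates = np.array([[1.0, 0.0],   # identical direction -> similarity 1
                       [0.0, 1.0],   # orthogonal          -> similarity 0
                       [1.0, 1.0]])  # 45 degrees          -> similarity ~0.71
neighbors = k_nearest_neighbors(query, candidates, k=2)
# neighbors -> indices [0, 2]
```

Running this over every training sample yields the correlation graph from which the manifold pairs are drawn.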
Notably, pairwise information inherently exists in cross-modal data: if an audio sample a_j is within the same manifold as the audio query a_i, its paired visual sample x_j is also within the same manifold as x_i, and vice versa, as seen in component (I) of Fig. 2. With this definition, we aim to leverage the underlying data manifold of the different modalities to construct the anchor-aware proxy.

Anchor-Aware (AA) Proxy
After obtaining the generated manifold pairs, the method proceeds to predict the relevance score of the anchor by leveraging its corresponding manifold pairs using the attention model.The relevance score is represented as an AA proxy for each anchor.

Correlation Graph for AA Proxy. In this context, we select a sample from one modality as the anchor and then identify the k nearest samples relative to the anchor from k-NN(x) (Eq. 4) in the correlation graph. This process forms k manifold pairs as key-value pairs for the attention-driven mechanism, which computes the AA proxy score for the anchor.
For instance, suppose we select the audio sample a_i as the designated anchor point and set the hyperparameter k to 3; we then derive three manifold pairs as key-value pairs. To process them, we utilize scaled dot-product attention to compute attention scores and employ a multi-head self-attention mechanism in a parallel manner. Firstly, we compute the attention score Att(Q, K, V) for each tuple (Q, K, V) using the scaled dot-product attention mechanism, a fundamental aspect of AA proxy computation. The Att(Q, K, V) score quantifies the degree of focus to allocate to other samples within the input sequence, defined by Eq. 5 as Att(Q, K, V) = softmax(QW_Q (KW_K)^T / √d) VW_V, where W_Q, W_K, and W_V are the parameter matrices of Q, K, and V, respectively. Secondly, multi-head attention focuses on different parts of the input simultaneously by splitting the input into multiple "heads" and computing separate attention scores, projecting the queries, keys, and values h times.
Here h is the number of attention heads and W_O represents the output projection matrix, applied to the concatenated attention heads to compute the final output of the multi-head attention layer, MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O. Finally, we compute the AA(·) proxy, using a_i as the input example.

AA(a_i) = (1/k) Σ_{m=1}^{k} Att(a_i, K_m, V_m),

where k is the hyperparameter of Eq. 4. The essence of the AA proxy lies in establishing global dependencies between anchors (queries) and their semantic counterparts across both intra- and inter-modal relationships. Computed in parallel, these relationships yield a dynamic AA proxy score that encapsulates the degree of association between anchor points and analogous samples.
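A minimal single-head sketch of the AA proxy computation is shown below. The paper uses h heads and learned projection matrices; here identity projections and toy vectors are assumed so the result is easy to verify by hand.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def aa_proxy(anchor, keys, values, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over the k manifold pairs.

    The anchor acts as the query; each of its k neighbors contributes a
    key/value pair. The attention-weighted combination of values is the proxy.
    """
    Q = anchor @ Wq
    K = keys @ Wk
    V = values @ Wv
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))  # attention over the k neighbors
    return weights @ V

I = np.eye(2)  # identity projections keep the example easy to check
anchor = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0], [1.0, 0.0]])   # two equally similar neighbors
values = np.array([[2.0, 0.0], [0.0, 2.0]])
proxy = aa_proxy(anchor, keys, values, I, I, I)
# equal attention weights -> proxy is the mean of the two values: [1.0, 1.0]
```

With equally similar neighbors the weights are uniform, so the proxy degenerates to a simple average; dissimilar neighbors would instead be down-weighted, which is the dynamic weighting the method relies on.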

Leveraging the AA Proxy in Metric Learning
We explore the pragmatic applications of the AA proxy within the realm of metric learning. With AA(·) positioned as an innovative anchor proxy, we redefine conventional metric learning techniques to align with it. Specifically, we demonstrate the seamless integration of AA(·) in computing relative distances for both triplet loss [11] and contrastive loss [9].
The AA proxy effectively replaces the conventional anchor, while the positive and negative components remain unaltered. This reformulated paradigm infuses the metric learning process with heightened sensitivity to the intricate dependencies captured by AA(·) and harnesses its inherent parallelism, thereby yielding more accurate and robust embeddings. The AA(·) scores, acting as conduits for similarity measurement, are adeptly utilized as proxies for computing relative distances. For example, triplet loss with the AA proxy can be defined as

L_TL = (1/N) Σ_i Σ_{p ∈ P_i} Σ_{n ∈ N_i} max(0, ‖AA(a_i) − x_p‖² − ‖AA(a_i) − x_n‖² + δ),

where N is the number of samples within one batch, and P_i and N_i are the positive and negative sets of the original anchor sample a_i. The hyperparameter δ serves as the margin in triplet loss, defining a threshold that regulates the permissible degree of dissimilarity between anchor-positive and anchor-negative pairs; ‖·‖ denotes the ℓ2 norm. The contrastive loss can be expressed as

L_CL = y ‖AA(a_i) − x_j‖² + (1 − y) max(0, m − ‖AA(a_i) − x_j‖)²,

where y indicates whether the two data points a_i and x_j are similar (y = 1) or dissimilar (y = 0), and m is the margin hyperparameter that defines the minimum distance between dissimilar pairs. Our final objective combines the AA metric loss, i.e., triplet loss (TL) or contrastive loss (CL), with a label-space alignment term,

L = L_TL/CL + ‖F(·) − G(·)‖_F,

where ‖·‖_F signifies the Frobenius norm, F(·) represents the projected features in the shared label space, and G(·) denotes the label representations. In the end, the final objective loss is optimized using the stochastic gradient descent (SGD) algorithm.
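The two AA-based losses can be sketched as follows. This is a toy NumPy version with the margin values reported in the experiments (1.2 for triplet, 1.0 for contrastive); the real model operates on batched label-space embeddings, and all variable names are illustrative.

```python
import numpy as np

def triplet_loss_aa(aa_anchor, positives, negatives, margin=1.2):
    """Triplet loss where the AA proxy replaces the raw anchor embedding."""
    total = 0.0
    for p in positives:
        for n in negatives:
            d_pos = np.sum((aa_anchor - p) ** 2)  # squared distance to positive
            d_neg = np.sum((aa_anchor - n) ** 2)  # squared distance to negative
            total += max(0.0, d_pos - d_neg + margin)
    return total / (len(positives) * len(negatives))

def contrastive_loss_aa(aa_anchor, other, y, margin=1.0):
    """Contrastive loss with the AA proxy; y=1 for similar, y=0 for dissimilar."""
    d = np.linalg.norm(aa_anchor - other)
    return y * d ** 2 + (1 - y) * max(0.0, margin - d) ** 2

aa = np.zeros(2)
pos = [np.array([0.0, 0.0])]  # a perfect positive, squared distance 0
neg = [np.array([0.5, 0.0])]  # a hard negative, squared distance 0.25
loss_t = triplet_loss_aa(aa, pos, neg)          # max(0, 0 - 0.25 + 1.2) = 0.95
loss_c = contrastive_loss_aa(aa, np.array([1.0, 0.0]), y=1)  # d=1 -> loss 1.0
```

Note that only the anchor term changes relative to the standard losses; positives and negatives still come straight from the batch, which is what makes the proxy a drop-in replacement.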

EXPERIMENT
In this section, we conduct experiments to evaluate our proposed AADML model on AV-CMR tasks, comparing it with existing state-of-the-art methods. To identify whether our method can improve existing deep metric learning methods, we combine it with them to further demonstrate its power. We also perform ablation studies to analyze the effectiveness of our method.

Dataset and Metrics
Our model achieves success in the AV-CMR task based on the assumption that the audio and visual modalities share identical semantic information. As a result, we select video datasets containing audio-visual tracks and ensure that both tracks are labeled identically along the time series. We select two such video datasets: VEGAS [47] and AVE [29]. Both datasets underwent a thorough process of label double-checking, ensuring uniform labeling across all frames in both modalities. The VEGAS dataset consists of 28,103 videos annotated with 10 labels, while the AVE dataset includes 1,955 videos annotated with 15 labels. We adhere to the same data partitioning strategy for training and testing sets, as well as the identical approach for feature extraction, as described in the referenced work [41].
For model evaluation, we adopt the same measures as prior works [38,39]: mean average precision (MAP) and Precision-scope@K. MAP gauges the effectiveness of models in AV-CMR by calculating the average precision (AP) for each individual query and then deriving the mean of these AP scores, providing a comprehensive measure of overall performance. Precision-scope@K quantifies the proportion of relevant items retrieved out of the total relevant items, up to a specified rank K in the ranked list. The model performs better as the values of both metrics increase.
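MAP can be computed from ranked retrieval results as follows. This is a minimal NumPy sketch; the relevance labels here are a toy 0/1 list rather than actual label matches between modalities.

```python
import numpy as np

def average_precision(relevance, scores):
    """AP for one query: relevance is 0/1 per candidate, scores rank them."""
    order = np.argsort(-np.asarray(scores))      # best-scored candidates first
    rel = np.asarray(relevance)[order]
    hits = np.cumsum(rel)                        # relevant items seen so far
    precision_at_rank = hits / np.arange(1, len(rel) + 1)
    return float((precision_at_rank * rel).sum() / max(rel.sum(), 1))

def mean_average_precision(relevance_lists, score_lists):
    """MAP: the mean of per-query average precisions."""
    return float(np.mean([average_precision(r, s)
                          for r, s in zip(relevance_lists, score_lists)]))

# one query: relevant items land at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2 = 5/6
ap = average_precision([1, 0, 1], [0.9, 0.8, 0.7])
# two queries: second query's only relevant item lands at rank 3 -> AP = 1/3
m = mean_average_precision([[1, 0, 1], [1, 0, 0]],
                           [[0.9, 0.8, 0.7], [0.1, 0.9, 0.2]])
```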

Table 1: Model comparison: MAP scores of our approach versus state-of-the-art methods. The best MAP scores are presented in bold, and the second-best MAP scores are indicated with underlining.

For the audio and visual inputs, each fully connected layer was configured with 1024 hidden units, with dropout set to 0.1. To project audio and visual features into a shared label space, we ensure that the feature dimension of the projected features in the label space matches the number of labels. Specifically, we keep both the predicted and pre-defined categories at 10 for VEGAS and 15 for AVE. We achieved optimal results with batch sizes of 400 and 200, respectively, and conducted training over 400 epochs. For the margins in triplet loss and contrastive loss, the best results were obtained with values of 1.2 and 1.0, respectively. We implement all experiments with the machine learning framework PyTorch 3, on Ubuntu Linux 22.04.2 with an NVIDIA GeForce 3080 (10 GB). We use the Adam optimizer [16] with its default parameter configuration to train our model, with the learning rate set to 0.0001.

3 https://pytorch.org/

Comparison with Existing AV-CMR Approaches
To evaluate the efficacy of our proposed method, a comprehensive comparison is conducted against a repertoire of 17 distinct algorithms. This ensemble encompasses seven methods grounded in Canonical Correlation Analysis (CCA), as well as eleven leading-edge techniques rooted in deep learning for cross-modal models. CCA [4] is employed to derive linear transformations of two data sets that maximize their correlation. KCCA [19] and DCCA [1] learn non-linear transformations using the kernel method and deep learning, respectively. C-CCA [26] and C-DCCA [36] discover transformations for both modalities by clustering cross-modal data points into classes to enhance intra-cluster correlation. TNN-C-CCA [41] and VAE-CCA [43] improve the correlation of C-CCA through triplet loss and VAE methods, respectively. ACMR [32] and AGAH [8] utilize adversarial learning to enhance cross-modal representation discrimination. DSCMR [44] and EICS [39] achieve discriminative features by introducing representation and label spaces. CLIP [25] and VideoadViser [33] learn transferable visual models from natural language supervision using a simple pre-training task. BiC-Net [10] and MSNSCA [42] utilize transformer architectures to effectively bridge the two modalities. DCIL [46] introduces an instance loss to improve rank loss by finding appropriate triplets, whereas CCTL [38] considers all triplets across the two modalities. TLCA [37] employs a progressive training approach, prioritizing easy-to-hard triplet learning over a single-pass training strategy. The AV-CMR performances of our method and the comparison approaches on the two audio-visual datasets are shown in Table 1. From the results, we observe that our proposed AADML model achieves the best results on both datasets across all MAP metrics, with gains of 3.5%, 2.6%, and 3.0% on VEGAS and 48.0%, 43.2%, and 45.6% on AVE in terms of A2V, V2A, and Average, which indicates the effectiveness of our proposed model. Fig. 3 illustrates the variation in average precision across scope@K, where K ranges from 10 to 1000. The curve demonstrates the superior performance of our method across all ranges. This consistency aligns with the MAP comparison presented in Table 1 when employing audio and visual as queries.

Ablation Study
5.4.1 Impact of the Number of Selected Samples on AA. The number of selected samples in AA serves as the sole hyperparameter in our proposed method. Fig. 4 presents an experimental analysis of the effect of varying the number of selected samples on AA under three different triplet selection strategies. This exploration aims to demonstrate the significance of this hyperparameter.
For AA combined with either triplet loss or hard triplet loss, the system's performance displays a consistent upward trend until approximately 3 samples are selected; beyond this point, a decrement in performance is observed. In contrast, for AA combined with Triplet† loss, the performance attains its peak when 2 samples are chosen, after which a decline sets in. These trends result from the interaction between the selected samples and the AA methodology. In conclusion, the most optimal outcome of AADML is achieved through the combination of AA and Triplet techniques, particularly when the AA approach involves the selection of three samples. The first strategy involves selecting all possible triplets within a batch, where a triplet consists of an anchor (1st modality), a positive (2nd modality), and a negative (2nd modality), denoted as Triplet. The second strategy chooses a single triplet from each batch, comprising an anchor (1st modality), a positive (2nd modality), and the hardest negative (2nd modality), designated as Hard Triplet. The third strategy, derived from [38], composes a triplet of an anchor (1st or 2nd modality), a positive (1st or 2nd modality), and a negative (1st or 2nd modality), denoted as Triplet†.
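The hardest-negative selection behind the Hard Triplet strategy can be sketched as follows. This is a toy NumPy version; in practice, distances are computed over label-space embeddings within a batch, and the function name is illustrative.

```python
import numpy as np

def hardest_negative_index(anchor, negatives):
    """The 'Hard Triplet' strategy: pick the negative closest to the anchor."""
    distances = np.linalg.norm(negatives - anchor, axis=1)
    return int(np.argmin(distances))

anchor = np.array([0.0, 0.0])
negatives = np.array([[3.0, 0.0],
                      [1.0, 0.0],   # closest to the anchor -> hardest
                      [2.0, 2.0]])
idx = hardest_negative_index(anchor, negatives)  # -> 1
```

Because the hardest negative varies from batch to batch, this strategy can produce the loss fluctuations observed in the early epochs of Fig. 5.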
To investigate the impact of the triplet selection strategy with AA, we conduct experiments involving these three types of triplets with and without the integration of AA, as illustrated in Fig. 5. It is evident that incorporating AA significantly enhances the performance of the triplets compared to scenarios without AA overall. Remarkably, the AA+Triplet configuration achieves exceptional results during both the training and testing phases. However, the AA+Hard Triplet strategy exhibits suboptimal performance prior to approximately 150 epochs. Interestingly, before this threshold, the performance of AA+Hard Triplet even lags behind that of the baseline without AA. Notably, this phenomenon is accompanied by considerable fluctuations in the model's loss value, possibly due to the varying detection of similar pairs within different batches. The performance begins to improve shortly before epoch 150, indicating the model's convergence on the training set.

5.4.3 Impact of AA Proxy on Metric Learning. In this subsection, we assess the AA proxy's impact on existing metric learning techniques, as seen in Table 2 and Table 3. We integrate our AA proxy with existing deep metric learning methods, including contrastive loss and triplet loss, and compare them against state-of-the-art methods designed to enhance these deep metric learning techniques. This comparative analysis aims to determine whether combining our AA proxy with deep metric learning techniques can improve the performance of AV-CMR tasks.
The experimental results notably demonstrate the substantial potential of our AA proxy in enhancing the capabilities of deep metric learning methods for AV-CMR. Furthermore, our approach achieves a significant performance advantage over the DSML [6], HDML [45], SupCon [14], and CrossCLR [48] methods when combined with deep metric learning techniques.
To highlight the efficacy of AA as a complementary enhancement for established deep metric learning methodologies, we extend the application of AA proxies to other advanced metric learning methods, including N-pair loss [27], Angular loss [27], Hinge loss [2], and DSL loss [3].The results conclusively reveal that employing AA proxies in conjunction with these advanced deep metric learning methods consistently yields superior results compared to scenarios where AA is not applied.

CONCLUSION
In this work, we address the challenges of metric learning, which seeks to narrow the gap between similar pairs and enhance the separation of dissimilar pairs for the AV-CMR task. While recent methods show promise by selecting impactful data points during training, the limited number of training samples restricts the full exploration of the embedding space, resulting in an incomplete representation of data distributions. To overcome this, we propose an innovative Anchor-aware Deep Metric Learning approach that adeptly navigates the embedding space even with limited data. Our method simultaneously calculates attention-driven dependencies, considering each sample as an anchor alongside its semantically similar samples. This dynamic correlation weighting within the underlying manifold structure yields Anchor-Aware (AA) scores. Leveraging the parallel computation of AA scores, we employ them as proxies to compute relative distances within the metric learning framework. Extensive experiments on benchmark datasets affirm the effectiveness of AADML, outperforming state-of-the-art models. Furthermore, our exploration of combining AA proxies with various metric learning methods underscores the potency of our approach in advancing the field. In the future, we aim to extend our methodology to cross-modal retrieval across other modalities and to expand its applicability into the realm of unsupervised learning.

Figure 1: This diagram illustrates the role of the anchor-aware (AA) proxy in deep metric learning. Missing embeddings due to a lack of training data points lead to suboptimal learning of the embedding space. We introduce an AA proxy derived from the correlation graph for each embedding, facilitating the migration toward optimal embedding space learning.

Figure 2: The framework of our proposed model. The audio and visual features, extracted by the pre-trained models VGGish and Inception V3 respectively, are projected into the label space as predicted label embeddings. The AADML approach operates within the label space and comprises three distinct components: (I) choosing an audio sample a_i as the anchor, we traverse the correlation graph to discern the k (k = 3) nearest audio samples relative to the anchor, thus forming three manifold pairs as key-value pairs for (II), which computes the attention score Att(·) with the anchor as the query and each neighbor pair supplying the key and value. The anchor-aware AA(·) score (pink box) is then obtained as the average of Att(·) across the three key-value pairs. (III) This score is subsequently utilized as an anchor proxy for foundational metric learning methods like contrastive and triplet loss.

Figure 3: Precision-scope@K curves on the VEGAS dataset for audio→visual and visual→audio retrieval experiments, spanning values of K from 10 to 1000.

Figure 4: MAP trends on the AVE dataset: the AA proxy combined with three distinct triplet methods, varying the number of selected samples (1 to 7) in AA.

5.4.2 Impact of Triplet Selection Strategy. The triplet selection strategy directly impacts model performance by determining the quality and diversity of training triplets. Well-chosen triplets facilitate effective learning of the data distribution and improve the model's ability to discriminate between positive and negative pairs.

Figure 5: Loss value and MAP performance on the training and test sets of the AVE dataset. Comparative analysis of triplet loss variants: exploring Triplet†, Triplet, and Hard Triplet losses with and without AA.
These pairs are then used as inputs for the attention-driven mechanism to compute the AA proxy score for the anchor a_i: each of the k nearest neighbors of the anchor contributes a tuple (Q, K, V) in which the anchor serves as the query and the neighbor pair supplies the key and value.

Table 2: VEGAS Dataset: The impact of the anchor-aware proxy on different metric learning methods. The best MAP scores are presented in bold.

Table 3: AVE Dataset: The impact of the anchor-aware proxy on different metric learning methods. The best MAP scores are presented in bold.