Prototypical Cross-domain Knowledge Transfer for Cervical Dysplasia Visual Inspection

Early detection of dysplasia of the cervix is critical for cervical cancer treatment. However, automatic cervical dysplasia diagnosis via visual inspection, which is more appropriate in low-resource settings, remains a challenging problem. Though promising results have been obtained by recent deep learning models, their performance is significantly hindered by the limited scale of the available cervix datasets. Distinct from previous methods that learn from a single dataset, we propose to leverage cross-domain cervical images that were collected in different but related clinical studies to improve the model's performance on the targeted cervix dataset. To robustly learn the transferable information across datasets, we propose a novel prototype-based knowledge filtering method to estimate the transferability of cross-domain samples. We further optimize the shared feature space by aligning the cross-domain image representations simultaneously at the domain level with early alignment and at the class level with supervised contrastive learning, which endows model training and knowledge transfer with stronger robustness. The empirical results on three real-world benchmark cervical image datasets show that our proposed method outperforms state-of-the-art cervical dysplasia visual inspection methods by an absolute improvement of 4.7% in top-1 accuracy, 7.0% in precision, 1.4% in recall, 4.6% in F1 score, and 0.05 in ROC-AUC.



INTRODUCTION
Cervical cancer is one of the most common cancers among women [59], posing serious risks to their health and spreading through direct or distant metastasis [28]. Especially in developing countries, it is the second most prevalent malignancy after breast cancer and the third leading cause of cancer-related deaths [32], despite being one of the most successfully treatable forms of cancer when diagnosed at an early stage [21]. Cervical dysplasia, also known as cervical intraepithelial neoplasia (CIN), is a precancerous change indicating potential cervical cancer at an early stage. Although it can be detected via several screening methods, most of them are conducted in a laboratory setting requiring special infrastructure and extensively trained personnel. Such constraints significantly limit their wide deployment in low-resource regions. To meet this medical need, visual inspection of the cervix after applying 5% acetic acid to the cervix epithelium (a method known in the medical community as VIA) has been advocated by the WHO because of its simplicity and low cost. In this paper, we focus on improving the performance of computational visual inspection to assist in faster and more accurate inspection. Note that colposcopic photographs from the VIA approach are referred to as cervical images in the rest of the paper.
Although deep neural networks have been widely adopted in computer vision, attaining state-of-the-art performance usually requires vast quantities of labeled data. Unlike natural images, medical image acquisition, annotation, and analysis require significant expert effort [33] and are traditionally part of localized medical studies. Existing methods mostly perform transfer learning based on models pre-trained on natural images, particularly ImageNet [15], to alleviate this situation. While this may work well in some general instances, recent research shows that such task-agnostic transfer learning alone does not necessarily improve performance for medical applications, due to the considerable visual differences between natural and medical images [48]. A dearth of large task-specific datasets still stands in the way of outstanding model performance.
The above findings motivate us to look for new auxiliary data sources to facilitate medical image analysis. In the field of cervical dysplasia visual inspection, we observe that multiple image datasets exist (e.g., NHS [26] and ALTS [22]), which are relevant but differ significantly in their collection environments. Intuitively, the knowledge learned from one dataset (e.g., ALTS) should help improve the robustness of a model trained on another dataset (e.g., NHS), which is, however, ignored by previous methods. We also observe that directly applying existing domain adaptation/generalization methods performs unsatisfactorily due to not only (1) domain shift: datasets are collected using different devices in different environments; but also (2) criterion mismatch: the standards for ground-truth annotation can differ due to subjective variance, since the diagnosis was made by a single medical staff member (e.g., a nurse or doctor) purely based on visual inspection without confirmatory laboratory tests.
To tackle the above challenges, we present the first prototypical cross-domain knowledge transfer framework for cervical dysplasia visual inspection, which learns transferable information from an auxiliary dataset to improve performance on the target dataset. As illustrated in Figure 1, the framework conducts simultaneous feature alignment at two distinct levels: domain level and class level. The Early Domain Alignment (EDA) module generates domain-aligned intermediate features, followed by the Prototypical Semantic Alignment (PSA) module, which produces semantically consistent high-level representations across domains. Moreover, PSA tackles the criterion mismatch challenge by identifying and reducing the impact of auxiliary samples with high-uncertainty labels. Specifically, PSA first computes the class prototypes (i.e., the feature centroid of each class) in the target domain as references to generate soft assignments for auxiliary samples. Next, it measures the cross-domain label consistency by comparing the soft assignments with the ground-truth labels of the auxiliary samples. By thresholding the consistency score, we select reliable auxiliary samples and apply the supervised contrastive loss to pull together samples of the same class and push apart samples of different classes in the shared semantic space. Thereafter, semantically consistent representations are learned across domains, which brings significant benefits for model optimization and knowledge transfer from the auxiliary to the target domain.
Here we summarize the key contributions of this paper as follows:
• To the best of our knowledge, we present the first cross-domain cervical dysplasia visual inspection method, which effectively transfers knowledge from the auxiliary to the target domain. We propose to simultaneously align the intermediate features at both domain level and class level to learn transferable representations across domains.
• We propose a novel prototype-based method to estimate the transferability of samples in the auxiliary domain. The impact of inconsistent labels can thus be reduced by weighting the auxiliary samples according to their transferability, estimated based on the distance to the class prototypes of the target domain.

RELATED WORKS
Cervical Dysplasia Visual Inspection. A significant number of machine-learning-based methods for cervical dysplasia visual inspection have been proposed in recent years [10,16,41,55,60].
CYENet [9] and ColpoNet [50] were network architectures tailored for cervical cancer detection with cross-norm operations. Zhang et al. [8,65] introduced a split-and-aggregation framework to process high-resolution cervical images and provided classification results by summarizing patch features. One alternative for leveraging the high-resolution input is to train a cervix detector to generate the region of interest from the original image. Faster R-CNN [49] was adopted by Hu et al. [27] as the detector, trained on their self-annotated bounding-box labels. Alyafeai et al. [3] proposed a more general pipeline for detector training, with which cervical detectors can be trained using a public dataset. Park et al. [43] applied multiple augmentation schemes to the cropped images, together with a ResNet-50 initialized from an ImageNet pre-trained model. Some studies focused on integrating information from metadata such as Pap results [16] and HPV signal strength [60]. However, only a small number of cervical images are associated with such metadata, which significantly limits the feasibility of such approaches.
Domain Adaptation. Domain adaptation (DA) focuses on transferring label information from the source domain to the target domain. Existing DA methods achieve this mainly based on statistical metrics [19,37,38,46,57,63], semantic clustering [4,23,40,42,52,62,64], adversarial learning [11,17,18,29,36,51,53], or reconstruction [5,6,34,35,45,61]. For example, CCSA [40] matched the cross-domain semantic space by aligning features based on their labels. DSN [5] proposed a disentanglement-based framework to separate style and content information. BrAD [23] designed an auxiliary bridge domain to narrow the gap between different domains. JCL [42] adopted a MoCo-like [24] structure to align unlabeled data, and PAC [39] introduced a pre-training stage for model training. Since our goal differs from the DA task but shares similar properties,
we select some of the existing works for comparison in our experiments. However, the asymmetrical designs of DA methods make them inappropriate for our setting, leading to worse performance compared to our framework.
Contrastive Learning. Contrastive learning was initially proposed to learn high-quality representations in a self-supervised manner, where positive pairs are constructed as multiple augmented views of the same sample [12-14,24]. For example, MoCo [24] proposed a momentum encoder and a large dictionary to improve model stability. SimCLR [12] adopted a non-linear projection head to calculate the NT-Xent loss within the latent space. Recently, contrastive learning has also been investigated in the supervised setting, where positive pairs are defined as same-class samples in a mini-batch [30], and multiple positive pairs are considered jointly in the loss calculation. In this paper, we further investigate supervised contrastive learning in a cross-domain setting for feature alignment.

PROBLEM FORMULATION
Cervical dysplasia visual inspection is usually formulated as an image classification problem based on the CIN grades (CIN0 ∼ CIN4). Such an AI medical system can serve as a useful and efficient tool for alerting potential patients to take further medical examinations, especially in low-resource regions where medical resources are deficient. However, the performance of existing deep learning models for cervical dysplasia visual inspection is generally limited by small-scale cervical datasets. Moreover, integrating multiple datasets may even degrade performance if the aforementioned challenges of domain shift and criterion mismatch are not properly addressed. Following this path, we focus on leveraging data from two different but relevant datasets (domains) to perform more robust cervical dysplasia visual inspection. Given a target domain

PROTOTYPICAL CROSS-DOMAIN KNOWLEDGE ALIGNMENT AND TRANSFER
The architecture of our proposed prototypical cross-domain knowledge alignment and transfer framework is illustrated in Figure 2. As mentioned above, our framework consists of an Early Domain Alignment (EDA) module for domain-level feature alignment and a Prototypical Semantic Alignment (PSA) module for class-level feature alignment. The PSA module further estimates the transferability of the auxiliary samples to reduce the impact of label inconsistency. By jointly optimizing the feature alignment and classification objectives, cross-domain transferable knowledge can be effectively learned and transferred to the target domain.

Early Domain Alignment
Intuitively, a shared encoder is preferred for performance improvement when introducing auxiliary data for training. However, different domains generally differ in their local feature distributions. Therefore, we adopt a Y-shape domain-adapted architecture, as illustrated in Figure 2, to deal with the domain shift.
It consists of two domain-private encoders ($E_t$ and $E_a$) for local information extraction and a shared encoder $E_{sh}$ for high-level semantic deduction. On top of that, an early domain alignment module is introduced to reduce the gap between the intermediate representations extracted by the two domain-private encoders.
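As a minimal sketch of this Y-shape routing, the code below uses toy stand-in encoders (the names E_t, E_a, and E_sh follow the text; in the actual framework they are convolutional network stages, not the hypothetical linear maps used here):

```python
import numpy as np

def y_shape_forward(x, domain, E_t, E_a, E_sh):
    """Route the input through its domain-private encoder (0 = target,
    1 = auxiliary), then through the shared high-level encoder."""
    h = E_t(x) if domain == 0 else E_a(x)
    return E_sh(h)

# Toy stand-ins for the encoders (for illustration only).
E_t = lambda x: x * 2.0    # target-private encoder
E_a = lambda x: x + 1.0    # auxiliary-private encoder
E_sh = lambda h: h - 0.5   # shared encoder

x = np.array([1.0, 2.0])
z_t = y_shape_forward(x, 0, E_t, E_a, E_sh)  # target path
z_a = y_shape_forward(x, 1, E_t, E_a, E_sh)  # auxiliary path
```

Both paths end in the same shared encoder, so the early alignment module only has to close the gap between the two private encoders' outputs.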
To obtain domain-invariant features, we have investigated two major approaches to narrow the gap between domains:
Adversarial-based. Following [18], adversarial alignment is achieved by minimizing the domain classification loss for the domain classifier $D$ while maximizing this loss for the encoders, with the help of a gradient reversal layer. We formulate the adversarial-based objective for early domain alignment as
$$\mathcal{L}_{eda} = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i^d \log \hat{y}_i^d + (1 - y_i^d)\log(1 - \hat{y}_i^d) \right],$$
where $\hat{y}_i^d$ is the output of the domain classifier and $y_i^d$ is the domain index of the input image (i.e., 0 for target and 1 for auxiliary).
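A minimal sketch of the domain-classification term is given below. It only computes the binary cross-entropy over domain predictions; the gradient reversal layer (which flips the gradient sign for the encoders during backpropagation) is omitted, since it only affects training, not the loss value:

```python
import numpy as np

def domain_adversarial_loss(d_logits, d_labels):
    """Binary cross-entropy over domain predictions (0 = target, 1 = auxiliary).

    In the full framework this loss is minimized by the domain classifier
    and, via a gradient reversal layer, maximized by the encoders."""
    p = 1.0 / (1.0 + np.exp(-d_logits))  # sigmoid -> P(domain = auxiliary)
    eps = 1e-12
    return float(-np.mean(d_labels * np.log(p + eps)
                          + (1 - d_labels) * np.log(1 - p + eps)))

# Toy batch: two target samples, two auxiliary samples, mostly classified correctly.
logits = np.array([-2.0, -1.0, 1.5, 2.5])
labels = np.array([0, 0, 1, 1])
loss = domain_adversarial_loss(logits, labels)
```

When the encoders succeed at confusing the classifier, this loss rises toward its maximum; the adversarial game drives the intermediate features toward domain invariance.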
Divergence-based. An alternative for domain-level alignment is the divergence-based approach, whose goal is to minimize the distance between representations of samples from the different domains. Here we investigate a widely used distance metric, MK-MMD [37]. The divergence-based objective is then to minimize the MK-MMD distance between the intermediate features of the two domains to narrow the gap:
$$\mathcal{L}_{eda} = \mathrm{MK\text{-}MMD}\big(\phi(E_t(X_t)),\ \phi(E_a(X_a))\big),$$
where $\phi$ is the global average pooling that maps the outputs of the private encoders $E_t$ and $E_a$ into vectors.
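As an illustrative sketch, the squared MMD between two pooled feature batches can be estimated with a sum of RBF kernels (the specific bandwidths here are assumptions; MK-MMD [37] additionally learns the kernel weights, which this simplified version does not):

```python
import numpy as np

def rbf_mmd2(X, Y, bandwidths=(1.0, 2.0, 4.0)):
    """Biased estimate of squared MMD between batches X and Y,
    using a sum of RBF kernels as a stand-in for MK-MMD."""
    def k(A, B):
        # pairwise squared distances, then kernel values summed over bandwidths
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return sum(np.exp(-d2 / (2 * s ** 2)) for s in bandwidths)
    m, n = len(X), len(Y)
    return k(X, X).sum() / m**2 + k(Y, Y).sum() / n**2 - 2 * k(X, Y).sum() / (m * n)

rng = np.random.default_rng(0)
same = rbf_mmd2(rng.normal(0, 1, (64, 8)), rng.normal(0, 1, (64, 8)))
shifted = rbf_mmd2(rng.normal(0, 1, (64, 8)), rng.normal(3, 1, (64, 8)))
```

Batches drawn from the same distribution yield a near-zero estimate, while a mean shift between "domains" inflates it, which is exactly the signal the divergence-based objective minimizes.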

Prototypical Semantic Alignment
Recall that in our auxiliary dataset, labels are generally provided by a single medical staff member from a local clinic. The large variance in technical skills and subjective deviations leads to the criterion mismatch challenge. To reduce the impact of such label inconsistency, we present a Prototypical Semantic Alignment (PSA) module to align the semantics of the high-level feature representations (i.e., the output of $E_{sh}$) between the target and auxiliary domains. As shown in Figure 3, it consists of target prototype computation, prototype-based soft assignment, cross-domain label consistency examination, and contrastive feature alignment.
Target Prototype Computation. We propose to compute the per-class prototypes in the target domain and use them as references to deduce reliable and transferable information from the auxiliary domain. A prototype is defined as the center of a semantic cluster, consisting of features with the same semantic label. Compared with instance-to-instance matching [31,40], where matching is performed between cross-domain instance pairs, instance-to-prototype matching is more robust to abnormal instances, thus providing a better foundation for the following steps.
Specifically, we append a classification head consisting of two fully-connected layers (i.e., $g_1$ and $g_2$) on top of the shared encoder, and compute the prototype of each target class $k$ after every epoch as the average of all the features from this class:
$$c_k = \frac{1}{N_k} \sum_{y_i^t = k} f(x_i^t),$$
where $f(x^t) = g_1(E_{sh}(E_t(x^t)))$ maps target image $x^t$ to the feature before the last classification layer, $c_k \in \mathbb{R}^{256}$ represents the prototype of class $k$, and $N_k$ is the number of target training samples of class $k$. The prototypes are denoted as $C = \{c_1, \dots, c_K\}$, where $K$ is the total number of classes.
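The prototype computation itself is just a per-class mean over the extracted features, which can be sketched as follows (the toy 2-D features stand in for the 256-dimensional features produced by the classification head):

```python
import numpy as np

def class_prototypes(features, labels, num_classes):
    """Per-class prototypes: the mean feature vector of each target class,
    recomputed after every epoch in the framework described above."""
    return np.stack([features[labels == k].mean(axis=0)
                     for k in range(num_classes)])

# Toy target batch: two samples per class, in 2-D for readability.
feats = np.array([[0., 0.], [2., 2.], [10., 10.], [12., 12.]])
labels = np.array([0, 0, 1, 1])
C = class_prototypes(feats, labels, 2)
# C[0] -> [1, 1], C[1] -> [11, 11]
```

Because each prototype averages over the whole class, a single mislabeled or atypical target sample moves it only slightly, which is the robustness advantage over instance-to-instance matching.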
Prototype-based Soft Assignment. The target prototypes $C$ are next utilized to calculate a distance-based soft assignment for each auxiliary sample $x_i^a$:
$$p_{i,k} = \frac{\exp\big(-\|f(x_i^a) - c_k\|\big)}{\sum_{k'=1}^{K} \exp\big(-\|f(x_i^a) - c_{k'}\|\big)},$$
where $p_i = [p_{i,1}, \dots, p_{i,K}]$ is the soft assignment of $x_i^a$ over the target classes.
Cross-domain Label Consistency Examination. For an auxiliary sample $x_i^a$ with label $y_i^a = k$, it should be close to $c_k$ in the feature space if it is well aligned with the target domain. Based on this observation, we propose to compute a cross-domain label consistency score for each auxiliary sample to measure the reliability of the knowledge learned from it. Given the calculated soft assignment $p_i$ and the one-hot auxiliary ground truth $y_i^a$, the cross-domain label consistency is formally defined as
$$w_i = p_i \cdot y_i^a,$$
where $w_i \in [0, 1]$ is the consistency score and $\cdot$ denotes the dot product of two vectors. The closer $w_i$ is to 1, the higher the confidence that this auxiliary sample lies within the decision boundary of the target domain.
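The two steps can be sketched together as below. Note that the exact mapping from distances to probabilities is an assumption here (a softmax over negative Euclidean distances); the text only specifies that the assignment is distance-based and that the consistency score is the dot product with the one-hot auxiliary label:

```python
import numpy as np

def soft_assignment(feat, protos):
    """Softmax over negative Euclidean distances to the target prototypes
    (assumed form of the distance-based soft assignment)."""
    d = np.linalg.norm(protos - feat, axis=1)
    e = np.exp(-d)
    return e / e.sum()

def consistency_score(feat, protos, label_onehot):
    # dot product of the soft assignment with the one-hot auxiliary label
    return float(soft_assignment(feat, protos) @ label_onehot)

protos = np.array([[0., 0.], [10., 10.]])  # toy target prototypes
# Auxiliary sample near prototype 0, labeled as class 0: highly consistent.
w_good = consistency_score(np.array([0.5, 0.5]), protos, np.array([1., 0.]))
# Same sample but labeled as class 1: the score collapses toward 0.
w_bad = consistency_score(np.array([0.5, 0.5]), protos, np.array([0., 1.]))
```

Samples whose auxiliary label disagrees with the nearest target prototype receive scores near 0 and are filtered out by the thresholds introduced next.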
Contrastive Feature Alignment. Given the consistency score $w_i$ calculated using Eqn. 5 and a predefined threshold $\tau_{psa}$, we filter the auxiliary samples, keeping only those with $w_i \ge \tau_{psa}$ to align the high-level features. Following SimCLR [12], we apply a projection head on top of the shared encoder and perform supervised contrastive learning in the projection space [30]. By pulling together samples of the same class and pushing apart samples of different classes in the projection space, it introduces a consistent performance gain for classification models. In our implementation, positive pairs $(x_i, x_p)$ are defined as images that belong to the same semantic class (i.e., $y_i = y_p$), while negative pairs are images that belong to different semantic classes (i.e., $y_i \ne y_p$); both cross-domain and in-domain pairs are considered. The per-sample supervised contrastive loss is
$$\ell_i = \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\big(\mathrm{sim}(z_i, z_p)/\tau\big)}{\sum_{j \in A(i)} \exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)},$$
where $z_i$ is the output of the projection head corresponding to $x_i$, and $\mathrm{sim}(\cdot,\cdot)$ computes the cosine similarity. $A(i) \equiv I \setminus \{i\}$, where $I$ is the union of the target batch and the filtered auxiliary batch based on $\tau_{psa}$, and $P(i) \equiv \{p \in A(i) : y_p = y_i\}$ contains all the positives for $x_i$ in $A(i)$. The objective is to maximize the cosine similarity between positive pairs while minimizing it between negative pairs. The prototypical semantic alignment loss is then computed as the average of the supervised contrastive loss over all valid training samples:
$$\mathcal{L}_{psa} = \frac{1}{|I|} \sum_{i \in I} \ell_i.$$
Our proposed prototypical semantic alignment loss strengthens the matching of cross-domain samples. Considering that the visual appearances of cervical images are highly similar across cases, it also helps the model concentrate quickly on the information most important for classification.
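A compact sketch of the supervised contrastive loss over a (hypothetical) union batch of target and filtered auxiliary projections is given below; the temperature value is an assumption:

```python
import numpy as np

def supcon_loss(Z, labels, temperature=0.1):
    """Supervised contrastive loss over projections Z.
    In the PSA module the batch is the union of the target batch and the
    auxiliary samples whose consistency score passes the threshold."""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # cosine sim via L2-norm
    sim = Z @ Z.T / temperature
    n = len(Z)
    total, count = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]           # A(i)
        pos = [j for j in others if labels[j] == labels[i]]  # P(i)
        if not pos:
            continue
        denom = np.log(np.sum(np.exp(sim[i, others])))  # log-sum-exp over A(i)
        total += -np.mean([sim[i, p] - denom for p in pos])
        count += 1
    return total / count

# Toy projections: two tight same-class clusters.
Z = np.array([[1., 0.], [0.9, 0.1], [0., 1.], [0.1, 0.9]])
loss = supcon_loss(Z, np.array([0, 0, 1, 1]))
```

With labels matching the geometric clusters the loss is small; scrambling the labels (so positives point at dissimilar samples) increases it, which is the gradient signal that pulls same-class cross-domain samples together.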

Cross-Domain Knowledge Transfer
The aforementioned feature alignments help our model generate domain-invariant features with a global horizon. Next, we perform cross-domain knowledge transfer based on supervised classification. We adopt a threshold $\tau_{ce}$ for cross-domain knowledge transfer with respect to the consistency score computed in Eqn. 5. Let $x_i \in B_t$ and $x_j \in S_a = \{x_j \mid w_j \ge \tau_{ce}\}$ denote the training samples in the target mini-batch and in the filtered auxiliary mini-batch, respectively. The cross-entropy loss for classification is then computed as
$$\mathcal{L}_{cls} = -\sum_{x_i \in B_t} y_i \log \hat{y}_i \;-\; \sum_{x_j \in S_a} \lambda\, w_j\, y_j \log \hat{y}_j,$$
where $\hat{y}$ and $y$ denote the prediction and the ground-truth label of an image, respectively. We further weight the auxiliary samples by $\lambda \cdot w_j$ to enforce stronger supervision from samples with larger cross-domain consistency. This loss serves as the mainstay of our framework, pushing it to constantly focus on extracting informative features for classification.
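The filtered, consistency-weighted classification objective can be sketched as follows (function and variable names are illustrative; the threshold and weight values follow the text's description, not a reference implementation):

```python
import numpy as np

def transfer_ce_loss(p_t, y_t, p_a, y_a, w_a, tau_ce=0.9, lam=1.0):
    """Target cross-entropy plus auxiliary cross-entropy, where auxiliary
    samples are kept only if their consistency score w >= tau_ce and are
    re-weighted by lam * w."""
    eps = 1e-12
    ce_t = -np.log(p_t[np.arange(len(y_t)), y_t] + eps).mean()
    keep = w_a >= tau_ce
    if not keep.any():
        return ce_t  # no auxiliary sample is reliable enough this batch
    ce_a = -(lam * w_a[keep] *
             np.log(p_a[keep][np.arange(keep.sum()), y_a[keep]] + eps)).mean()
    return ce_t + ce_a

# Toy predicted class probabilities and labels.
p_t = np.array([[0.9, 0.1], [0.2, 0.8]])
y_t = np.array([0, 1])
p_a = np.array([[0.7, 0.3], [0.6, 0.4]])
y_a = np.array([0, 0])
# Only the first auxiliary sample passes the 0.9 consistency threshold.
loss = transfer_ce_loss(p_t, y_t, p_a, y_a, w_a=np.array([0.95, 0.5]))
```

If every auxiliary score falls below the threshold, the loss degenerates to plain target-only supervision, so unreliable auxiliary batches cannot inject label noise.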

Overall Objectives
We optimize our model by jointly considering the classification loss $\mathcal{L}_{cls}$ and the feature alignment losses $\mathcal{L}_{eda}$ and $\mathcal{L}_{psa}$. The overall loss function of our proposed prototypical cross-domain knowledge transfer framework is formulated as
$$\mathcal{L} = \mathcal{L}_{cls} + \alpha\,\mathcal{L}_{eda} + \beta\,\mathcal{L}_{psa},$$
where $\alpha$ and $\beta$ are coefficients controlling the balance between the classification loss and the feature alignment losses.

EXPERIMENTS 5.1 Dataset
In total, 17,002 cervical images are used in our experiments, collected from three separate medical studies: the Natural History Study of HPV and Cervical Neoplasia (NHS) [26], the ASCUS-LSIL Triage Study (ALTS) [22], and the Biopsy Study (Biopsy) [56]. We filter the records that are labeled with ground-truth CIN grades (CIN 0-4) within 1 year of the screening date and formulate the task as a binary classification problem to detect abnormal cases, following previous work [65]. Access to these datasets is by request and subject to a constrained agreement. When comparing to the state-of-the-art, we performed two sets of experiments, utilizing NHS and Biopsy as the target dataset, respectively, while the NHS dataset is used in the ablation studies. Please refer to the supplementary material for a detailed description of these datasets.

Implementation Details
The original resolution of the cervical images is generally 2,400×1,600. Following previous work [3], we adopt a cropping scheme to select the region of interest as a preprocessing step. We adopt ResNet-50 [25] as our backbone and initialize it with the self-supervised ImageNet model DINO [7]. For training stability, we first train the model in stages (see the supplementary material for details). We conduct an ablation study to evaluate the impact of the thresholds $\tau_{psa}$ and $\tau_{ce}$ for cross-domain knowledge transfer, based on which we set $\tau_{psa} = 0.4$ and $\tau_{ce} = 0.9$ in the rest of the experiments. More implementation details can be found in the supplementary material.

Comparison to the State-of-the-Art
We compare our proposed framework with nine state-of-the-art methods for both cervical dysplasia visual inspection and domain adaptation. Five commonly used classification measurements, including top-1 accuracy, precision, recall, F1 score, and area under the ROC curve (ROC-AUC), are adopted as evaluation metrics. For a fair comparison, we train each model three times to reduce randomness and report the average results together with the standard deviation of the three independent runs in Tables 1 and 2. Table 1 presents the performance comparison on the NHS dataset. Compared to previous cervical dysplasia visual inspection methods, our proposed framework with either divergence-based or adversarial alignment surpasses them by an overall large margin. Of the two candidates, the adversarial one performs better, outperforming the second-best solution [65] by an average improvement of 4.73% in top-1 accuracy, 7.04% in precision, 1.37% in recall, 4.60% in F1 score, and 0.047 in ROC-AUC. A larger gap can be observed in various metrics against the rest of the methods [3,9,58], where improvements of more than 10% in top-1 accuracy, 8% in precision, 6% in recall, and 0.1 in ROC-AUC are generally obtained. The experimental results verify our motivation of looking for new auxiliary data sources for medical image analysis. Knowledge can be learned and transferred effectively between medical images collected under different trials, provided that the key challenges caused by domain shift and criterion mismatch across medical trials are properly addressed.
Next, we compare our proposed framework to domain adaptation methods DSN [5], JCL [42], CCSA [40], PAC [39], BrAD [23], where auxiliary data are also utilized for training.Originally designed for unsupervised domain adaptation (UDA) or domain generalization, these methods can be applied to our setting by considering our target domain as the target domain in UDA and our auxiliary domain as the source domain in UDA.This approach allows us to utilize both the target and auxiliary labels by adding a target cross-entropy loss on top of their corresponding objective function.
As can be seen, although DSN performs the best among all the existing methods on both datasets, its improvements over cervical dysplasia visual inspection methods are somewhat limited, particularly in terms of top-1 accuracy, recall, and F1 score. We can also see that CCSA outperforms BrAD, PAC, and JCL on the NHS dataset. However, its performance degrades significantly on the Biopsy dataset (see Table 2). Such domain adaptation methods mainly focus on solving the domain shift challenge while ignoring the issue caused by the label inconsistencies that potentially exist across domains. Performing cross-domain knowledge transfer by fetching auxiliary information without selection may introduce undesirable noise into the target domain. Comparatively, we utilize the auxiliary information by first estimating its transferability and then optimizing the classification model jointly with feature alignment in the shared semantic space. From the results, we can see that our proposed solution with the adversarial module is more robust and less vulnerable to label inconsistencies across domains. It achieves the best performance in almost all five metrics and outperforms the domain adaptation methods by at least 4.17% in top-1 accuracy.
Table 2 reports the performance comparison on the Biopsy dataset. This dataset is highly challenging due to its lower quality and smaller scale compared to the NHS dataset, with only 393 valid records available for model training. Similarly, our two variants outperform the existing solutions in all five metrics, and the adversarial-based early alignment is still better. Compared with the domain adaptation methods, it obtains the best result in four of the five metrics, outperforming the second-best method (i.e., DSN) by 2.46% in top-1 accuracy, 18.89% in recall, and 9.89% in F1 score. Compared with Table 1, we can see that the performance of the domain adaptation methods is less stable across datasets. One reason might be their heavy dependence on intra-supervision: most of them were developed under the assumption that the given labels are accurately annotated by humans, which is not always guaranteed in real-world applications. Our method, on the other hand, utilizes both intra-supervision and inter-supervision simultaneously, leading to a more practical and robust solution compared to the previous methods.

Ablation Studies
Model Architecture. We first set our loss function to $\mathcal{L} = \mathcal{L}_{cls} + \alpha\,\mathcal{L}_{eda}$ with $\beta = 0$ and $\alpha = 1$ to remove the impact of the PSA module, and evaluate our adversarial-based domain-level alignment in Table 3. We compare it with three counterparts given both target and auxiliary data: ours without $\mathcal{L}_{eda}$, ours without the domain-private encoders, and a single-branch structure without both components. We can see that the top-1 accuracy degrades by 0.56% if we remove the adversarial loss $\mathcal{L}_{eda}$ from our method or replace the domain-private encoders with a shared encoder. A further degradation of 1.70% can be observed if both components are removed. The experimental results thus verify the effectiveness of our EDA module for cervical dysplasia visual inspection.
Training Strategy. Our training strategy consists of multiple loss functions, as shown in Eqn. 9. Here we examine their effectiveness by removing each of them from $\mathcal{L}$ and report the results in Table 4.
Since supervision from the target data is necessary for our task, we conduct this ablation study only on $\mathcal{L}_{eda}$, $\mathcal{L}_{psa}$, and the auxiliary classification term $\mathcal{L}_{ce\text{-}a}$. We observe that, compared to our proposed adversarial-based objective function, removing any one of the individual losses leads to a performance degradation ranging from 2.46% to 3.03% in top-1 accuracy. Among them, $\mathcal{L}_{ce\text{-}a}$ and $\mathcal{L}_{psa}$ both serve as bridges for integrating auxiliary knowledge, but from different aspects, leading to similar performance decrements. The results indicate that all the proposed losses are indispensable components of our method, working collaboratively to complete the task.

Discussion
Statistical Prediction Distribution. Our binary setting is derived from five categories (CIN 0-4): CIN0 and CIN1 are regarded as normal, while the rest are regarded as abnormal.
In Figure 4(a), we visualize the statistical prediction distribution of our model for each category, where the Y-axis represents the predicted probability of belonging to the abnormal case. We can observe a distinct margin between the first two levels and the last three. The general lack of overlap between the upper bound of the normal classes and the lower bound of the abnormal classes reveals a clear decision boundary learned by our model.
ROC Curve. The receiver operating characteristic (ROC) is a probability curve for classification problems at various threshold settings.
In Figure 4(b), we present the ROC curves of all methods in Table 1 for comparison. The closer the curve is to the top-left (i.e., the larger the area under the curve (AUC)), the better the capability of the method. Our method, shown as the orange line, surpasses all other methods with an AUC of 0.886.
Model Attention Visualization. Based on Grad-CAM [20,54], we visualize the attention map of the last convolutional layer of our framework, as shown in Figure 4(c). Brighter colors in the second row represent higher-focus areas. We can see that our model focuses more on the areas with obvious pathological features around the cervix, providing a more reasonable basis for its predictions.
Percentage of Target Data. We further compare our framework with the baseline model using different percentages of target data and report the results in Table 5. In each column, we randomly select a certain percentage of target samples for training to study the impact of the target data volume. The results show that our adversarial-based approach outperforms the baseline model by a significant margin, with improvements of 6.81%, 5.68%, 2.84%, and 3.71% achieved when the percentage is set to 10%, 20%, 50%, and 100%, respectively. These findings demonstrate that our framework is able to learn complementary and transferable information from the auxiliary data, which is particularly beneficial when the amount of labeled data in the target domain is limited.
Thresholds. We investigate the impact of the two key hyper-parameters, $\tau_{psa}$ and $\tau_{ce}$, on the NHS dataset and compare the top-1 accuracy when varying one while holding the other fixed at the best value found during the experiments. As shown in Figure 5(a), the best classification result is obtained with $\tau_{psa} = 0.4$. Decreasing $\tau_{psa}$ results in supervision with noisy labels, while increasing it results in less information being learned and transferred from the auxiliary domain. For $\tau_{ce}$, which defines the transferability threshold for the auxiliary classification loss, Figure 5(b) shows that the best classification result is obtained with $\tau_{ce} = 0.9$. A similar pattern can be observed: either increasing or decreasing $\tau_{ce}$ leads to performance degradation. However, compared with $\tau_{ce}$, we observe that decreasing $\tau_{psa}$ has less impact than increasing it. This indicates that contrastive feature alignment is more tolerant of out-of-distribution semantics than direct inter-domain supervision. In supervised contrastive learning, multiple positive pairs are constructed, both in-domain and cross-domain, making it more robust to potential inconsistencies between cross-domain classification boundaries.

Results on Visda-2017 Dataset
To verify the generalization capability of our model, we conduct additional experiments on Visda-2017 [47]. Different from the cervix datasets, it is a large, general image dataset consisting of synthetic and real images across 12 classes. The potential noise inside the synthetic images, introduced by the artificial generation process, is a major obstacle to good performance. Therefore, this dataset presents both the domain shift and the label uncertainty challenges. In these experiments, we regard the real images as the target domain and the synthetic images as the auxiliary domain. Compared with existing domain adaptation methods [5, 23, 40, 42], we report the top-1 and top-5 accuracy in Table 6.

Table 6: Performance comparison between our method and domain adaptation methods on the Visda-2017 dataset.

Method        Top-1 (%)   Top-5 (%)
CCSA [40]     77.64       97.06
JCL [42]      78.12       97.48
BrAD [23]     83.89       98.29
DSN [5]       84.00       97.92
Ours-mkmmd    85.59       98.32
Ours-adv      87.96       98.69

We can see that our framework surpasses the other domain adaptation methods on this general dataset, obtaining a 3.96% improvement in top-1 accuracy over the second-best solution. The results show that our method not only works well on small-scale medical datasets that focus on a specific binary classification problem, but also generalizes to large-scale general image datasets and multi-class classification problems. We also investigate the impact of the thresholds on Visda-2017. As shown in Figure 5(c) and (d), the best result is obtained with the transferability threshold set to 0.6 and the filtering threshold set to 0.4. Different from the results on the cervix dataset, a lower transferability threshold is preferable on Visda-2017, possibly due to high cross-domain label consistency. To summarize, the results indicate the potential of our proposed method in applications beyond the medical domain, which we will explore in future work.
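The top-1 and top-5 accuracy reported above can be computed as follows; this is a minimal sketch, and the function name is ours:

```python
import numpy as np

def topk_accuracy(logits, labels, k=1):
    """Fraction of samples whose true label appears among the k
    highest-scoring classes.

    logits: (n, c) array of class scores; labels: (n,) integer labels.
    """
    topk = np.argsort(-logits, axis=1)[:, :k]       # indices of k best classes
    hits = (topk == labels[:, None]).any(axis=1)    # true label among top-k?
    return hits.mean()
```

Top-1 accuracy is the usual classification accuracy; top-5 counts a prediction as correct whenever the ground-truth class is among the five highest scores.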

CONCLUSION
Targeted at cervical dysplasia visual inspection, we present a novel prototypical cross-domain knowledge transfer framework that performs robust auxiliary-to-target knowledge transfer. Two key components are introduced in our method, namely the EDA module and the PSA module. The former addresses the domain shift problem by aligning the intermediate representations, while the latter employs a prototype-based strategy to learn useful and reliable semantic information from the auxiliary domain. Experiments on three benchmark cervical image datasets demonstrate the state-of-the-art performance of our proposed approach, with a 4.7% improvement in top-1 accuracy and 0.05 in ROC-AUC. Additional result visualizations and ablation studies validate our framework design, and experiments on the Visda-2017 dataset demonstrate the effectiveness of our method in a more general problem setting. In the future, we plan to investigate the potential of our method in not only cross-domain but also cross-modal applications with varying label quality.

A APPENDIX
A.1 Cervix dataset
We utilize a total of 17,002 cervical images from the Natural History Study of HPV and Cervical Neoplasia (NHS) [26], the ASCUS-LSIL Triage Study (ALTS) [22], and the Biopsy Study (Biopsy) [56] in this paper. These are three separate clinical studies conducted by the National Cancer Institute (NCI) over the past decades. In these projects, each patient may have participated in multiple screening sessions, and two photographs of the cervix (cervigrams) were taken at each recruitment and clinic visit, as shown in Figure 6.
The cervical intraepithelial neoplasia (CIN) level normally serves as the criterion for judging the severity of cervical cancer. In our dataset, cervical images are labeled from CIN0 to CIN4, where histologic CIN2 or worse (CIN2+: CIN2, CIN3, CIN4) indicates a cancer precursor or cancer. To construct an appropriate dataset for model training, which aims to alert potential patients for further medical examination, we model this task as a binary classification problem: cases with CIN2+ are regarded as abnormal, while the others are regarded as normal. In addition, abnormal cases whose screening dates surpass one year are discarded due to the possible noise introduced by these samples. In this way, we obtain 885 images for the NHS dataset, 15,724 images for the ALTS dataset, and 393 images for the Biopsy dataset. The positive-to-negative ratios are 354:531 for NHS, 1961:13763 for ALTS, and 151:242 for Biopsy. The two target datasets (NHS and Biopsy) are not heavily imbalanced, while the auxiliary dataset (ALTS) is. We therefore apply a balance sampler [2] to handle the class imbalance during training.
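The CIN2+ binarization and class-balance weighting described above can be sketched as follows. This is a minimal illustration; the actual balance sampler [2] may differ, and all names here are ours:

```python
import numpy as np

def binarize_cin(cin_grade):
    """CIN2+ (CIN2, CIN3, CIN4) -> abnormal (1); CIN0, CIN1 -> normal (0)."""
    return 1 if cin_grade >= 2 else 0

def balanced_sample_weights(labels):
    """Per-sample weights inversely proportional to class frequency,
    so that each class is drawn with equal total probability."""
    labels = np.asarray(labels)
    counts = np.bincount(labels)       # samples per class
    return 1.0 / counts[labels]        # weight for each individual sample
```

The resulting weights could, for instance, be passed to a weighted random sampler (e.g., PyTorch's torch.utils.data.WeightedRandomSampler) so that normal and abnormal cases appear equally often per epoch despite the 1961:13763 imbalance in ALTS.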

Figure 1: Illustration of our proposed method. (a) In the original feature space, direct supervised learning with auxiliary samples may degrade the model's performance in the target domain. We thus propose (b) an Early Domain Alignment (EDA) module to reduce the domain gap, and (c) a Prototypical Semantic Alignment (PSA) module to identify auxiliary samples with high-uncertainty labels (i.e., red border) and reduce their impact when aligning the representations at the semantic level.

Figure 2: The overall architecture of our proposed Prototypical Cross-domain Knowledge Alignment and Transfer, comprising a domain-private encoder and a shared encoder.
Given a labeled target domain and an auxiliary domain with its own labels, our goal is to improve the performance of the model on the target domain with the facilitation of the auxiliary domain. Recall that, due to the criterion mismatch challenge, the annotation quality of the auxiliary domain may not meet the standard of the target domain, so the auxiliary labels cannot be used for training directly. We thus propose a novel prototypical cross-domain knowledge alignment and transfer framework. Without loss of generality, in each iteration we sample a target-domain mini-batch and an auxiliary-domain mini-batch, and introduce our proposed model architecture and optimization objective based on the union of the two mini-batches.
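The per-iteration dual-domain sampling can be sketched as follows; batch sizes and names are illustrative assumptions, not values from the paper:

```python
import random

def dual_domain_batches(target, auxiliary, bt, ba, iters, seed=0):
    """Yield, for each training iteration, a mini-batch of bt target
    samples and ba auxiliary samples; their union drives one
    optimization step."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    for _ in range(iters):
        batch_t = rng.sample(target, bt)     # target-domain mini-batch
        batch_a = rng.sample(auxiliary, ba)  # auxiliary-domain mini-batch
        yield batch_t, batch_a
```

In practice the auxiliary batch can be larger than the target batch, since the auxiliary dataset (ALTS) is far larger than the target ones.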

Figure 3: The pipeline of the proposed PSA module.

Figure 4: (a) Predicted probability statistics for each CIN grade. (b) ROC curve comparison among methods from both cervical dysplasia visual inspection and domain adaptation. (c) Visualization of model attention based on GradCAM.

Figure 6: Original cervical images from the NHS dataset. Images in the first row are normal cases, while images in the second row are abnormal cases.
vector representing the probability of an auxiliary sample belonging to each target class. The sample is first mapped into the shared feature space (through the domain-private encoder followed by the shared encoder); the L2 distance between this embedding and each class prototype is then computed, and a softmax normalization ensures that the resulting probabilities sum to 1. Thereby, an auxiliary sample is assigned a large probability for class c if it lies close to the prototype of class c in the shared feature space.
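The prototype-based soft assignment described above can be sketched as a minimal NumPy illustration. Function names are ours, and we assume prototypes are class-mean embeddings of labeled target samples in the shared feature space:

```python
import numpy as np

def class_prototypes(features, labels, num_classes):
    """Prototype of class c = mean embedding of target samples of class c."""
    return np.stack([features[labels == c].mean(axis=0)
                     for c in range(num_classes)])

def prototype_probabilities(aux_features, prototypes):
    """Softmax over negative L2 distances to each prototype: an auxiliary
    sample close to class c's prototype receives a high probability for
    class c, and each row sums to 1."""
    d = np.linalg.norm(aux_features[:, None, :] - prototypes[None, :, :],
                       axis=-1)                      # (n, c) distances
    z = -d                                           # closer -> higher score
    e = np.exp(z - z.max(axis=1, keepdims=True))     # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)
```

The resulting probability vectors are exactly what the thresholds of the PSA module operate on when deciding which auxiliary samples are transferable.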

Table 1: Performance comparison between our method and state-of-the-art methods on the NHS dataset. For simplicity, we omit the superscript in the formulation of the supervised contrastive loss, which is constructed over both inter-domain and intra-domain positive pairs.

Table 2: Performance comparison between our method and domain adaptation methods on the Biopsy dataset.

Table 3: Different architectures trained on both domains without filtering.

Table 4: Ablation study of the training strategy.

Table 5: Performance comparison when training with different numbers of target samples.