Choice Over Effort: Mapping and Diagnosing Augmented Whole Slide Image Datasets with Training Dynamics

In pediatric heart transplantation, manual annotations with interobserver and intraobserver variability among cardiovascular pathology experts lead to significant disagreements about the severity of rejection. Artificial intelligence (AI)-enabled computational pathology usually requires large-scale manual annotations of gigapixel whole-slide images (WSIs) for effective model training. To address these challenges, we develop and validate an AI-enabled rare disease detection framework for automating heart transplant rejection detection from whole-slide images of pediatric patients. Specifically, we conduct a novel dataset cartography with data maps and training dynamics to map and diagnose the augmented samples, exploring the model behavior on individual instances during model training. Extensive experiments on internal and external patient cohorts have demonstrated the feasibility of both tile-level and biopsy-level detection. The proposed data-efficient learning framework may support seamless scalability to real-world rare disease detection without the burden of iterative expert annotations.


INTRODUCTION
Cardiac failure is one of the leading causes of hospital admissions, rapidly growing across the United States and globally [23]. For patients diagnosed with end-stage heart failure, transplantation frequently emerges as the only viable solution [9]. Although pediatric heart transplants constitute a relatively small proportion (approximately 14%) of all cardiac transplant operations, they represent an exceedingly crucial and distinct aspect with a lifelong impact on children [15]. Heart transplantation, while lifesaving, carries a significant risk of organ rejection [1], which persists as the most prevalent and grave complication contributing substantially to post-transplantation mortality [8].
Artificial Intelligence (AI)-enabled clinical decision support systems have advanced the objective and automated assessment of EMBs for improving procedure reproducibility and patient operative outcomes [1, 9-11, 16, 18, 19]. Recent investigations [2, 3, 9, 12] have highlighted the potential of AI models to assist human experts across a wide range of diagnostic tasks, including heart rejection detection. However, prior endeavors [3, 12] have encountered a primary challenge of small datasets with limited manual annotation of gigapixel Whole-Slide Images (WSIs), leading to poor domain adaptation. In addition, with major disagreements about the severity of rejection, learning from noisy labels is a challenge in computer-aided image analysis for complex WSI applications.
Although existing studies [9-11, 16] have recognized the disagreement in expert annotations and interpretations of WSIs as a significant challenge in computational pathology, evaluating and mitigating the effect of noisy labels remains largely unexplored. In Natural Language Processing (NLP), Zhuang et al. [22] proposed a dynamics-enhanced generative model, DyGen, which leverages a largely ignored source of information, the behavior of the model on individual instances during training (i.e., training dynamics), to reveal labeling errors. Similarly, recent advances [6, 17, 21] in machine learning theory have drawn on training dynamics in data maps to facilitate noisy-label detection and prediction.
In this study, we propose a rare disease detection framework to automate heart transplant rejection detection from WSIs for pediatric patients. We implement data augmentation via different generative models (e.g., generative adversarial networks (GANs) [4] and diffusion models such as Denoising Diffusion Probabilistic Models (DDPM) [5]) to facilitate data-efficient learning. Specifically, we leverage training dynamics via data maps to map and diagnose both original and augmented training samples.

MATERIALS AND METHODS
In this study, we present an AI-enabled clinical decision support system for detecting rare diseases, specifically focusing on automating the detection of heart transplant rejection of pediatric patients from WSIs. Given the rarity of heart transplant rejection, we employ generative models to augment the training samples for a balanced training set. These augmented training samples are then used to fine-tune a pre-trained tile-level classifier to distinguish between rejection and non-rejection cases. Subsequently, we train a separate biopsy-level classifier to estimate the rejection grade with tile-level probabilities as input. An overview of the proposed workflow is available in Figure 1.

Data Description and Pre-processing
In this study, pediatric heart biopsies were gathered from two institutions: (1) the multi-center prospective blinded study, DNA-Based Transplant Rejection Test (DTRT) [13], and (2) Children's Healthcare of Atlanta (CHOA). All experiments were conducted in compliance with relevant guidelines and regulations, with informed consent obtained from all participants. Biopsy-level annotations were acquired directly from clinicians at the source institutions. For tile-level annotations, the ground truth of Acute Cellular Rejection (ACR) regions in WSIs was obtained from clinical experts using HistomicsTK. Following a multi-instance learning formulation, training tiles were labeled as 'rejection' if they overlapped more than 60% with an annotated region. For both tile- and biopsy-level model development, a train and test split was performed at the patient level, with the detailed distributions documented in Table 1. The training and validation sets were drawn from DTRT, while the external testing set was obtained from CHOA. This setup ensures a robust evaluation across varied datasets.
Initially, we segmented the WSIs to separate the tissue from the background. This segmentation was achieved using Otsu's thresholding method on a downscaled version of the WSI, where the resolution was reduced by a factor of 10. Following this, we generated non-overlapping tiles from the segmented WSI, each measuring 256 × 256 pixels at 40X magnification. This magnification was chosen because it allows multiple muscle cells and white blood cells to be included within each tile, which is crucial for detecting signs of ACR. For quality control, we retained only tiles whose area overlapped the previously identified tissue by more than 80%.
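The segmentation and tiling steps above can be sketched as follows. This is a minimal NumPy illustration, not the actual pipeline: `otsu_threshold` and `tissue_tiles` are hypothetical names, the factor-of-10 downscaling is omitted, and the assumption that tissue is darker than the slide background is ours.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the Otsu threshold that maximizes between-class variance
    for an 8-bit grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        m0 = (np.arange(t) * prob[:t]).sum() / w0
        m1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t

def tissue_tiles(gray, tile=256, min_tissue=0.8):
    """Yield (row, col) origins of non-overlapping tiles whose area is
    more than `min_tissue` tissue (assumes tissue is darker than background)."""
    mask = gray < otsu_threshold(gray)
    for r in range(0, gray.shape[0] - tile + 1, tile):
        for c in range(0, gray.shape[1] - tile + 1, tile):
            if mask[r:r + tile, c:c + tile].mean() > min_tissue:
                yield r, c
```

In practice the thresholding would run on the downscaled slide and tile coordinates would be mapped back to full resolution.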

Data Augmentation: Image Generation
Given the rarity of heart transplant rejection, we augmented the training set with synthetically generated rejection tiles to facilitate effective pattern recognition. To accomplish this, we utilized two families of generative models: diffusion- and GAN-based models.

Diffusion Models.
We generated synthetic rejection tiles utilizing state-of-the-art diffusion models. Inspired by nonequilibrium thermodynamics, diffusion models establish a Markov chain of diffusion steps to progressively introduce random noise into the data [5]. The model then learns to reverse this diffusion process, thereby generating desired data samples from the noise. The noising process is defined as

$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t \mathbf{I}\big),$

where $q$ denotes the forward transition distribution, $x_t$ denotes the latent at step $t$, and $\beta_t \in (0, 1)$ represents the variance of the noising process. With the exact reverse distribution $q(x_{t-1} \mid x_t)$, we can then generate a synthetic image by sampling $x_T \sim \mathcal{N}(0, \mathbf{I})$ and performing the reverse denoising process. Specifically, we employed a conditional image generation approach using guided diffusion. This process incorporates a classifier to enable conditional training, focusing solely on generating rejection tiles.
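As an illustration of the forward process, the marginal $q(x_t \mid x_0)$ can be sampled in closed form, $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\bar{\alpha}_t = \prod_{s \le t}(1-\beta_s)$. The sketch below assumes a linear variance schedule; it is illustrative, not the training code used in this work.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# Assumed linear variance schedule with beta_t in (0, 1), as in DDPM [5].
T = 1000
betas = np.linspace(1e-4, 0.02, T)
```

By the final step, $\bar{\alpha}_T$ is close to zero, so $x_T$ is essentially pure Gaussian noise, which is what the learned reverse process starts from.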

GANs.
We combine two variations of GANs for high-quality tile generation. Initially, we implemented a Progressive GAN (PGAN) [7] to first generate low-resolution images and progressively increase the size of the output image. Following PGAN training on all rejection and non-rejection training tiles via unconditional training, we then implemented the Inspirational GAN (IGAN) [14] to generate synthetic rejection-specific tiles. IGAN enables the creation of a synthetic image closely aligned with a chosen image by identifying optimal parameters within the latent space. This is achieved by computing the distance between the features of the selected image and those of the GAN output, using a pre-trained VGG-19 model as a feature extractor.
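The latent-space search behind IGAN can be illustrated with a toy numerical sketch. Here `G` and `feat` are stand-ins for the trained PGAN generator and the VGG-19 feature extractor, and a finite-difference gradient replaces backpropagation purely to keep the sketch self-contained; none of this mirrors the actual implementation.

```python
import numpy as np

def invert_latent(G, target_feat, feat, z_dim, steps=500, lr=0.05, seed=0):
    """Search the latent space for z minimizing the feature distance
    || feat(G(z)) - target_feat ||^2 via numerical gradient descent."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(z_dim)
    eps = 1e-4
    for _ in range(steps):
        # Central-difference estimate of the loss gradient w.r.t. z.
        grad = np.zeros_like(z)
        for i in range(z_dim):
            zp, zm = z.copy(), z.copy()
            zp[i] += eps
            zm[i] -= eps
            lp = np.sum((feat(G(zp)) - target_feat) ** 2)
            lm = np.sum((feat(G(zm)) - target_feat) ** 2)
            grad[i] = (lp - lm) / (2 * eps)
        z -= lr * grad
    return z
```

With the real models, the same loop would use autograd through the frozen generator and feature extractor rather than finite differences.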

Tile-level Classification
Utilizing transfer learning, we initialized our proposed tile-level classifier with parameters pre-trained on ImageNet. We then adjusted the original 1000-dimensional output layer to a 2-dimensional output layer for our binary classification task: identifying rejection and non-rejection. The tile-level classification approach embeds high-dimensional gigapixel WSIs into a set of compact low-dimensional feature vectors, thereby allowing for more efficient training and inference.

Data Cartography: Training Dynamics

Confidence.
Initially, we capture the confidence of the model in assigning the true label to an observation $x_i$ based on its probability distribution. We define the model confidence $\hat{\mu}_i$ as the average probability of the true label $y_i^*$ across $E$ epochs:

$\hat{\mu}_i = \frac{1}{E} \sum_{e=1}^{E} p_{\theta^{(e)}}(y_i^* \mid x_i),$

where $p_{\theta^{(e)}}$ denotes the probability assigned by the model and $\theta^{(e)}$ represents the model parameters at the end of the $e$-th epoch.

Correctness.
Secondly, as a more intuitive statistic, we define correctness $\hat{\phi}_i$ as the fraction of times the model correctly labels instance $x_i$ across $E$ epochs:

$\hat{\phi}_i = \frac{1}{E} \sum_{e=1}^{E} \mathbf{1}\!\left[\arg\max_{y} p_{\theta^{(e)}}(y \mid x_i) = y_i^*\right].$

Variability, used as the horizontal axis of the data maps, is the standard deviation of $p_{\theta^{(e)}}(y_i^* \mid x_i)$ across the $E$ epochs.
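Given the per-epoch training dynamics, the three data-map statistics reduce to simple reductions over an epochs-by-instances matrix. The sketch below assumes `true_probs` and `correct` were recorded during training; the function name is ours.

```python
import numpy as np

def data_map_stats(true_probs, correct):
    """Compute data-map statistics from recorded training dynamics.

    true_probs: (E, N) array, probability of the true label per epoch/instance.
    correct:    (E, N) boolean array, prediction matched the true label.
    Returns (confidence, variability, correctness), each of shape (N,)."""
    confidence = true_probs.mean(axis=0)    # mean true-label probability
    variability = true_probs.std(axis=0)    # spread across epochs
    correctness = correct.mean(axis=0)      # fraction of correct epochs
    return confidence, variability, correctness
```

Instances with high confidence and low variability would fall in the easy-to-learn region of the map; low confidence with low variability marks hard-to-learn instances, and high variability marks ambiguous ones.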

Biopsy-level Classification
To enable rejection risk prediction at the biopsy level, we aggregate the relative frequency of rejection tiles within each biopsy image, leveraging the results from the previously discussed tile-level classifier in section 2.3. Specifically, for each biopsy image, we generate a normalized histogram of tile-level rejection probabilities. We then utilize the relative frequencies from each bin as input features to train a machine learning classifier, treating each biopsy histogram as a single training instance.
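The biopsy-level feature construction can be sketched as below; the number of bins is an assumption on our part, as the paper does not state it.

```python
import numpy as np

def biopsy_features(tile_probs, n_bins=10):
    """Normalized histogram of tile-level rejection probabilities:
    the relative frequency in each bin is one biopsy-level feature."""
    counts, _ = np.histogram(tile_probs, bins=n_bins, range=(0.0, 1.0))
    total = counts.sum()
    return counts / total if total else counts.astype(float)
```

Because the histogram is normalized, biopsies with different numbers of tiles map to comparable fixed-length feature vectors.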

Evaluation Metrics
For an objective evaluation, we employed several common metrics for image generation, including Inception Score, sliced Fréchet Inception Distance (sFID), Precision, and Recall. For tile- and biopsy-level classification, considering the class imbalance in the testing set as well as real-world distributions, we leveraged AUROC as our main evaluation metric.
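AUROC admits a simple rank-based computation, equivalent to the Mann-Whitney U statistic: the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counting half. A minimal sketch (in practice a library routine would be used):

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC via pairwise comparisons: fraction of positive/negative pairs
    where the positive is ranked higher, ties counted as half."""
    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    pos, neg = s[y == 1], s[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))
```

Unlike accuracy, this statistic is insensitive to the positive/negative class ratio, which is why it suits the imbalanced testing sets here.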

Image Generation
For both GANs and diffusion models, we generated 330 rejection tiles to augment the training set. For qualitative evaluation, we present examples of synthetic images generated by GANs and diffusion models in Figure 2. Compared to the original images, we observed very similar patterns in synthetic images from both GANs and diffusion models; synthetic images from diffusion models, in particular, cover more diverse patterns. For quantitative evaluation, Table 2 presents the evaluation metrics of generated samples via the Inception Score, sFID, Precision, and Recall. As outlined in this table, the diffusion model performs significantly better across all metrics when compared to the GAN model, with a higher Inception Score of 3.67, a lower sFID of 97.6, a higher precision of 0.61, and a higher recall of 0.69. The experimental results indicate that diffusion models can generate more diverse synthetic tiles while remaining closer to the distribution of real images than GANs. Since the generative models were used to produce training samples, tile- and biopsy-level model performance improvements also serve as an indirect evaluation of augmentation effectiveness (see sections 3.3 and 3.4).

Training Dynamics
In our experimental setup for generating data maps, we incorporated all $E$ epochs into the calculation of training dynamics, beginning with the initial epoch ($e = 1$).

Tile-level Classification
For tile-level classification, we conducted experiments on four state-of-the-art backbone architectures in computer vision and medical image analysis: VGG-19, ResNet-50, ResNet-152, and DenseNet-161. We used 75% of the DTRT patient data for training and the remaining 25% for model optimization. Tables 3 and 4 report the AUROC scores on the external and internal testing sets, respectively, for heart transplant rejection classification. In the case of internal validation, the GAN-based data augmentation method achieved the highest AUROC scores across the VGG-19 (0.9884), ResNet-152 (0.9806), and DenseNet-161 (0.9984) models. This demonstrates that the GAN augmentation technique provided superior performance for these models on the internal testing dataset. Conversely, on the external testing set (CHOA), the baseline models without any data augmentation surprisingly outperformed both the GAN- and diffusion-based augmentations for the VGG-19, ResNet-50, and ResNet-152 models, with competitive AUROC scores of 0.9508, 0.9854, and 0.9863, respectively.

Biopsy-level Classification
For biopsy-level risk prediction, we performed experiments on eight different machine learning models: Random Forest, Decision Tree, XGBoost, XGBoost-Random-Forest, Logistic Regression, Gaussian Naive Bayes, K-Nearest Neighbors, and Support Vector Machines. We selected the best classifier via five-fold cross-validation as the optimal biopsy-level classifier and report the corresponding performance on the testing set. Tables 5 and 6 present the biopsy-level classification results for heart transplant rejection on the external and internal testing sets, respectively. For the internal testing set, the diffusion augmentation method consistently outperformed the others across all evaluated models, registering AUROC scores of 0.7297 for VGG-19, 0.7615 for ResNet-50, 0.7451 for ResNet-152, and 0.7681 for DenseNet-161. On the external testing set (CHOA), the diffusion augmentation method again delivered the highest AUROC scores for both the VGG-19 and ResNet-50 models. However, for the ResNet-152 model, the GAN augmentation method emerged as the most effective, achieving an AUROC score of 0.6551. These results underscore a significant discrepancy between internal and external validations at the biopsy level, highlighting the importance of external validation in assessing the generalizability of models.
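The five-fold model-selection loop can be sketched generically. Here `models` maps names to factories of fit/predict objects and `score_fn` is a scoring callable (e.g., AUROC); both are placeholders rather than the exact implementation used in the study.

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Shuffle indices and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def select_best_model(models, X, y, score_fn, k=5):
    """Return the model name with the highest mean k-fold validation score."""
    folds = kfold_indices(len(X), k)
    best_name, best_score = None, -np.inf
    for name, make_model in models.items():
        scores = []
        for i in range(k):
            val = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            model = make_model()          # fresh model per fold
            model.fit(X[train], y[train])
            scores.append(score_fn(y[val], model.predict(X[val])))
        mean = float(np.mean(scores))
        if mean > best_score:
            best_name, best_score = name, mean
    return best_name, best_score
```

The winning classifier would then be refit on the full training set and evaluated once on the held-out testing set, as done here.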

DISCUSSIONS AND CONCLUSIONS
GANs and diffusion models are both powerful image-generative models capable of producing high-quality images [20]. Experimental results in Figures 2b and 2c have demonstrated visually similar patterns between the generated tiles and the original ones.

Figure 1 :
Figure 1: Overview of the proposed rare disease detection framework for pediatric heart transplant rejection detection. Our proposed method first segments tissue regions in the WSI, patching them into smaller tiles. Considering the rare condition of heart transplant rejection, we employ advanced image generation approaches to augment rejection tiles for tile-level classification. Subsequently, the probabilities of tile-level rejection are used to train a biopsy-level rejection classifier, thus providing a robust basis for clinical decision-making support. Created with BioRender.com.

Table 1 :
Summary of tile-level and biopsy-level patient data: training and validation sets derived from DTRT, and external testing set sourced from CHOA.

Formally, consider a training dataset of size $N$, $\mathcal{D} = \{(x_i, y_i^*)\}_{i=1}^{N}$, where the $i$-th instance consists of the observation $x_i$ and its corresponding ground-truth label $y_i^*$. When minimizing empirical risk, we assume the model defines a probability distribution over labels given an observation. For a stochastic gradient-based optimization, the training instances are randomly reordered during each epoch, across $E$ epochs. We then define the training dynamics of instance $i$ across the $E$ epochs as follows:

Figure 2 :
Figure 2: Examples of (a) original tiles and generated tiles using (b) GAN and (c) diffusion model.
Figure 3 provides an illustrative example of a data map for the original DTRT training set based on a ResNet152 classifier. In this map, the x-axis represents variability, while the y-axis signifies confidence. The colors in the data map denote the correctness of the classifier's predictions. According to these measurements, the top-left corner of the data map, characterized by low variability and high confidence, contains the easy-to-learn examples, which form the majority of the original dataset. The examples with high variability, located on the right side of the map, are inherently ambiguous and represent complex patterns present in some tile-level instances. Conversely, the hard-to-learn examples can be found in the bottom-left corner, defined by low variability and low confidence. As suggested by Figure 3, these hard-to-learn samples are scarce, indicating the relatively high quality of the annotations in the DTRT ground truth.

Figure 3 :
Figure 3: Data map for the original DTRT training set at 10 epochs, based on a ResNet152 classifier. The x-axis shows variability; the y-axis shows confidence; and the colors indicate correctness. The top-left corner of the data map (low variability, high confidence) corresponds to easy-to-learn examples, the bottom-left corner (low variability, low confidence) corresponds to hard-to-learn examples, and examples on the right (with high variability) are ambiguous.

Table 2 :
Evaluation metrics of generated samples via Inception Score, sFID, Precision, and Recall.

Table 3 :
Tile-level classification results (AUROC) on the external testing set (CHOA) for heart rejection classification.

Table 4 :
Tile-level classification results (AUROC) on the internal testing set for heart rejection classification.