LibAUC: A Deep Learning Library for X-Risk Optimization

This paper introduces the award-winning deep learning (DL) library called LibAUC for implementing state-of-the-art algorithms for optimizing a family of risk functions named X-risks. X-risks refer to a family of compositional functions in which the loss function of each data point is defined in a way that contrasts the data point with a large number of others. They have broad applications in AI for solving classical and emerging problems, including but not limited to classification for imbalanced data (CID), learning to rank (LTR), and contrastive learning of representations (CLR). The motivation for developing LibAUC is to address the convergence issues of existing libraries for solving these problems. In particular, existing libraries may not converge or may require very large mini-batch sizes in order to attain good performance for these problems, due to the usage of the standard mini-batch technique in the empirical risk minimization (ERM) framework. Our library is for deep X-risk optimization (DXO) and has achieved great success in solving a variety of tasks for CID, LTR and CLR. The contributions of this paper include: (1) It introduces a new mini-batch based pipeline for implementing DXO algorithms, which differs from existing DL pipelines in the design of controlled data samplers and dynamic mini-batch losses; (2) It provides extensive benchmarking experiments for ablation studies and comparison with existing libraries. The LibAUC library features scalable performance for millions of items to be contrasted, faster and better convergence than existing libraries for optimizing X-risks, seamless PyTorch deployment, and versatile APIs for optimizing various losses. Our library is available to the open source community at https://github.com/Optimization-AI/LibAUC, to facilitate further academic research and industrial applications.

However, it has been observed that these existing platforms and libraries encounter some unique challenges when solving some classical and emerging problems in AI, including classification for imbalanced data (CID), learning to rank (LTR), and contrastive learning of representations (CLR). In particular, prior works have observed that large mini-batch sizes are necessary to attain good performance for these problems [4,5,7,37,43,46], which restricts the capabilities of these AI models in the real world. The reason for this issue is two-fold. First, the standard empirical risk minimization (ERM) framework, which serves as the foundation of the standard mini-batch based methods, does not provide a good abstraction for many non-decomposable objectives in ML and ignores their inherent complexities. Second, all existing DL libraries are developed based on the standard mini-batch technique for ERM, which updates model parameters based on the gradient of a mini-batch loss as an approximation of the objective on the whole data set.
To address the first issue, a novel learning paradigm named deep X-risk optimization (DXO) was recently introduced [60], which provides a unified framework to abstract the optimization of many compositional loss functions, including surrogate losses for AUROC, AUPRC/AP, and partial AUROC that are suitable for CID [39,64,65], surrogate losses for NDCG, top-K NDCG, and listwise losses that are used in LTR [41], and global contrastive losses for CLR [63]. To address the second issue, the LibAUC library implemented state-of-the-art algorithms for optimizing a variety of X-risks arising in CID, LTR and CLR. It has been used by many projects [8,10,19,23,45,57] and achieved great success in solving real-world problems, e.g., the 1st Place at the Stanford CheXpert Competition [64] and the MIT AICures Challenge [56]. Hence, it deserves in-depth discussions about the design principles and unique features to facilitate future research and development for DXO. This paper aims to present the underlying design principles of the LibAUC library and provide a comprehensive study of the library regarding its unique design features and superior performance compared to existing libraries. The unique design features of the LibAUC library include (i) dynamic mini-batch losses, which are designed for computing the stochastic gradients of X-risks by automatic differentiation to ensure convergence; and (ii) controlled data samplers, which differ from standard random data samplers in that the ratio of the number of positive data to the number of negative data can be controlled and tuned to boost performance. The superiority of the LibAUC library lies in: (i) it is scalable to millions of items to be ranked or contrasted with respect to an anchor data point; (ii) it is robust to small mini-batch sizes because all implemented algorithms have theoretical convergence guarantees regardless of mini-batch sizes; and (iii) it converges faster and to better solutions than existing libraries for optimizing a variety of compositional losses/measures suitable for CID, LTR and CLR.
To the best of our knowledge, LibAUC is the first DL library that provides easy-to-use APIs for optimizing a wide range of X-risks. Our main contributions in this work are summarized as follows:
• We propose a novel DL pipeline to support efficient implementation of DXO algorithms, and provide implementation details of two unique features of our pipeline, namely dynamic mini-batch losses and controlled data samplers.
• We present extensive empirical studies to demonstrate the effectiveness of the unique features of the LibAUC library, and the superior performance of LibAUC compared to existing DL libraries/approaches for solving the three tasks, i.e., CID, LTR and CLR.

DEEP X-RISK OPTIMIZATION (DXO)
This section provides necessary background about DXO. We refer readers to [60] for more discussions about theoretical guarantees.

A Brief History
The min-max optimization for deep AUROC maximization was studied in several earlier works [32,64]. Later, deep AUPRC/AP maximization was proposed by Qi et al. [39], which formulates the problem as a novel class of finite-sum coupled compositional optimization (FCCO) problems. The algorithm design and analysis for FCCO were improved in subsequent works [26,50,51]. Recently, the FCCO techniques were used for partial AUC maximization [65], NDCG and top-K NDCG optimization [41], and stochastic optimization of global contrastive losses with a small batch size [63].
More recently, Yang et al. [60] proposed the X-risk optimization framework, which aims to provide a unified venue for studying the optimization of different X-risks.The difference between this work and these previous works is that we aim to provide a technical justification for the library design towards implementing DXO algorithms for practical usage, and comprehensive studies of unique features and superiority of LibAUC over existing DL libraries.

Notations
For CID, let S = {(x_1, y_1), . . ., (x_n, y_n)} denote a set of labeled training data with y_i ∈ {1, −1}, and let S_+ and S_− denote the subsets of positive and negative examples, respectively. For LTR, let Q denote a set of queries and, for each query q ∈ Q, let S_q denote all relevant query-item (Q-I) pairs. Denote by h_w(x; q) : X × Q → ℝ a parametric predictive function that outputs a predicted relevance score for an item x with respect to q.
For CLR, let S = {x_1, . . ., x_n} denote a set of anchor data, and let S_i^− denote a set containing all negative samples with respect to x_i. For unimodal SSL, S_i^− can be constructed by applying different data augmentations to all data excluding x_i. For bimodal SSL, S_i^− can be constructed by including the different view of all data excluding x_i. The goal of representation learning is to learn a feature encoder network h_w(·) ∈ ℝ^{d_o} parameterized by a vector w ∈ ℝ^d that outputs an encoded feature vector for an input data x.

The X-Risk Optimization Framework
We use the following definition of X-risks given by [60].

Definition 1 ([60]). X-risks refer to a family of compositional measures in which the loss function of each data point is defined in a way that contrasts the data point with a large number of others. Mathematically, X-risk optimization can be cast into the following abstract optimization problem:

min_w (1/|S|) Σ_{z_i∈S} f_i(g(w; z_i, S_i)),    (1)

where g(w; z_i, S_i) is a mapping, f_i : ℝ → ℝ is a simple deterministic function, S = {z_1, . . ., z_n} denotes a target set of data points, and S_i denotes a reference set of data points dependent on or independent of z_i.
The most common form of g(w; z_i, S_i) is the following:

g(w; z_i, S_i) = (1/|S_i|) Σ_{z_j∈S_i} ℓ(w; z_i, z_j).

As a result, many DXO problems can be formulated as FCCO [50]:

min_w F(w) = (1/n) Σ_{z_i∈S} f_i(g(w; z_i, S_i)).    (2)

The FCCO problem is subtly different from traditional stochastic compositional optimization [52] due to the coupling of a pair of data in the inner function. Almost all X-risks considered in this paper, including AUROC, AUPRC/AP, pAUC, NDCG, top-K NDCG, the listwise CE loss, and GCL, can be formulated as FCCO or its variants.
Besides the common formulation above, two other optimization problems are also used in the development of the LibAUC library: min-max optimization and multi-block bilevel optimization. The min-max formulation is used to formulate a family of surrogate losses of AUROC, and the multi-block bilevel optimization is useful for formulating ranking performance measures defined only on the top-K items in the ranked list, including top-K NDCG, precision at a certain recall level, etc. In summary, we present a mapping of different X-risks to different optimization problems in Figure 1, which is simplified from [60].

X-risks in LibAUC
Below, we discuss how different X-risks are formulated for developing their optimization algorithms in the LibAUC library.
Area Under the ROC Curve (AUROC). Two formulations have been considered for AUROC maximization in the literature. A standard formulation is the pairwise loss minimization [61]:

min_w (1/(n_+ n_−)) Σ_{x_i∈S_+} Σ_{x_j∈S_−} ℓ(h_w(x_j) − h_w(x_i)),

where ℓ(·) is a surrogate loss. Another formulation follows the min-max optimization [32,64]:

min_{w,a,b} max_{α∈Ω} E_{x_i∈S_+}[(h_w(x_i) − a)²] + E_{x_j∈S_−}[(h_w(x_j) − b)²] + 2α(m + E_{x_j∈S_−}[h_w(x_j)] − E_{x_i∈S_+}[h_w(x_i)]) − α²,

where m > 0 is a margin parameter and Ω ⊂ ℝ. In LibAUC, we have implemented an efficient algorithm (PESG) for optimizing the above min-max AUC margin (AUCM) loss with Ω = ℝ_+ [64]. A comparison between optimizing the pairwise loss formulation and the min-max formulation can be found in [67].
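For concreteness, the min-max AUCM objective above can be evaluated on raw prediction scores in a few lines of dependency-free Python. The sketch below is our illustration (function name and signature are ours), not the AUCMLoss implementation in LibAUC:

```python
def aucm_loss(pos_scores, neg_scores, a, b, alpha, margin=1.0):
    """Min-max AUC-margin objective evaluated at (w, a, b, alpha):
    squared deviation of positive scores from a, of negative scores
    from b, plus a margin term weighted by the dual variable alpha."""
    var_pos = sum((s - a) ** 2 for s in pos_scores) / len(pos_scores)
    var_neg = sum((s - b) ** 2 for s in neg_scores) / len(neg_scores)
    mean_pos = sum(pos_scores) / len(pos_scores)
    mean_neg = sum(neg_scores) / len(neg_scores)
    return var_pos + var_neg + 2 * alpha * (margin - mean_pos + mean_neg) - alpha ** 2
```

With perfectly separated scores and a, b set to the class means, the quadratic terms vanish and the dual-weighted margin term alone determines the value, which is what the inner maximization over α exploits.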
Partial Area Under the ROC Curve (pAUC) is defined as the area under the ROC curve with a restriction on the range of the false positive rate (FPR) and/or the true positive rate (TPR). For simplicity, we only consider pAUC with FPR restricted to be less than β ∈ (0, 1]. Let S↓[k_1, k_2] ⊂ S be the subset of examples whose ranks in terms of their prediction scores in descending order are in the range [k_1, k_2], where k_1 ≤ k_2. Then, optimizing pAUC with FPR ≤ β can be cast into:

min_w (1/n_+) Σ_{x_i∈S_+} (1/k) Σ_{x_j∈S_−↓[1,k]} ℓ(h_w(x_j) − h_w(x_i)),

where k = ⌊n_− β⌋. To tackle the challenge of handling S_−↓[1, k] for data selection, we consider the following FCCO formulation [65]:

min_w (1/n_+) Σ_{x_i∈S_+} τ log(E_{x_j∈S_−} exp(ℓ(h_w(x_j) − h_w(x_i))/τ)),    (3)

where τ > 0 is a temperature parameter that plays a similar role to β. Let g(w; x_i, S_−) = E_{x_j∈S_−} exp(ℓ(h_w(x_j) − h_w(x_i))/τ) and f_i(g) = τ log(g). Then (3) is a special case of FCCO. In LibAUC, we have implemented SOPAs for optimizing the above objective of one-way pAUC with FPR ≤ β, and SOTAs for optimizing a similarly formed surrogate loss of two-way pAUC with FPR ≤ β and TPR ≥ α as proposed in [65].
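The smoothed objective (3) is easy to check numerically: each positive contributes a τ-scaled log-mean-exp over the negatives. A dependency-free sketch, where the squared-hinge surrogate and the function names are our choices:

```python
import math

def squared_hinge(t, margin=1.0):
    # surrogate loss on the score difference h(x_neg) - h(x_pos)
    return max(0.0, margin + t) ** 2

def pauc_surrogate(pos_scores, neg_scores, tau=1.0):
    """Smoothed pAUC objective: average over positives of
    tau * log( mean over negatives of exp(loss / tau) )."""
    total = 0.0
    for sp in pos_scores:
        inner = sum(math.exp(squared_hinge(sn - sp) / tau)
                    for sn in neg_scores) / len(neg_scores)
        total += tau * math.log(inner)
    return total / len(pos_scores)
```

A small τ concentrates the inner average on the hardest (top-ranked) negatives, mimicking the FPR restriction; as τ grows the objective approaches the full AUROC surrogate.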
Area Under the Precision-Recall Curve (AUPRC) is an aggregated measure of the precision of the model at all recall levels. A nonparametric estimator of AUPRC is Average Precision (AP) [3]:

AP = (1/n_+) Σ_{x_i∈S_+} (Σ_{x_j∈S_+} I(h_w(x_j) ≥ h_w(x_i))) / (Σ_{x_j∈S} I(h_w(x_j) ≥ h_w(x_i))).

By using a differentiable surrogate loss ℓ(h_w(x_j) − h_w(x_i)) in place of I(h_w(x_j) ≥ h_w(x_i)), we consider an FCCO formulation for AP maximization, where the inner function has two components: one averaging the surrogate loss over the positive data and one averaging it over all data. In LibAUC, we implemented the SOAP algorithm with a momentum SGD or Adam-style update [39], which is a special case of SOX analyzed in [50].

Normalized Discounted Cumulative Gain (NDCG) is a ranking performance metric for LTR tasks. The averaged NDCG over all queries can be expressed by

(1/|Q|) Σ_{q∈Q} (1/Z_q) Σ_{x∈S_q} (2^{y_x} − 1)/log_2(r(w; x, S_q) + 1),

where r(w; x, S_q) = Σ_{x'∈S_q} I(h_w(x', q) − h_w(x, q) ≥ 0) denotes the rank of x in the set S_q with respect to q, and Z_q is the DCG score of a perfect ranking of items in S_q, which can be pre-computed. For optimization, the rank function r(w; x, S_q) is replaced by a differentiable surrogate, e.g., g(w; x, S_q) = Σ_{x'∈S_q} ℓ(h_w(x', q) − h_w(x, q)). Hence, NDCG optimization is formulated as FCCO. In LibAUC, we implemented the SONG algorithm with a momentum or Adam-style update for NDCG optimization [41], which is a special case of SOX analyzed in [50].
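The nonparametric AP estimator can be computed directly from scores and labels; a minimal sketch (our code, not LibAUC's APLoss):

```python
def average_precision(scores, labels):
    """AP = (1/n+) * sum over positives x_i of
    (# positives scored >= x_i) / (# examples scored >= x_i)."""
    pos_scores = [s for s, y in zip(scores, labels) if y == 1]
    ap = 0.0
    for si in pos_scores:
        pos_at_or_above = sum(1 for s in pos_scores if s >= si)
        all_at_or_above = sum(1 for s in scores if s >= si)
        ap += pos_at_or_above / all_at_or_above
    return ap / len(pos_scores)
```

Replacing each indicator with a differentiable surrogate turns this ratio into the two-component inner function of the FCCO formulation described above.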
Top-K NDCG only computes the corresponding score for those items that are ranked in the top-K positions. We follow [41] to formulate top-K NDCG optimization as a multi-block bilevel optimization, in which σ(·) is a sigmoid function used for selecting the top-K items, Z_q^K is the top-K DCG score of a perfect ranking of items, and a lower-level solution λ_q(w) approximates the (K+1)-th largest score of data in the set S_q. The detailed formulation of the lower-level problem can be found in [41]. In LibAUC, we implemented the K-SONG algorithm with a momentum or Adam-style update for top-K NDCG optimization [41].
Listwise CE loss is defined as a cross-entropy loss between two probability distributions over the list of scores, similar to ListNet [6], where the two distributions are obtained by applying a softmax to the relevance labels and to the predicted scores, respectively.

Global Contrastive Losses (GCL) are the global variants of contrastive losses used for unimodal and bimodal SSL. For unimodal SSL, GCL can be formulated as:

min_w (1/n) Σ_{x_i∈S} τ log(E_{z∈S_i^−} exp((h_w(x_i)^⊤h_w(z) − h_w(x_i)^⊤h_w(x_i^+))/τ)),

where τ > 0 is a temperature parameter and x_i^+ denotes a positive data of x_i. Different from [7,42], GCL uses all possible negative samples S_i^− for each anchor data instead of mini-batch samples [63], which helps address the large-batch training challenge in [7]. In LibAUC, we implemented an optimization algorithm called SogCLR [63] for optimizing both unimodal and bimodal GCL.
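Per anchor, the GCL above is again a τ-scaled log-mean-exp, this time over similarity gaps between negatives and the positive. A dependency-free sketch of that per-anchor term under the formulation above (the function name is ours):

```python
import math

def gcl_anchor_loss(sim_pos, sims_neg, tau=0.1):
    """Global contrastive loss for one anchor: tau * log of the average
    exponentiated gap between each negative similarity and the positive
    similarity. sims_neg ranges over ALL negatives, not a mini-batch."""
    inner = sum(math.exp((s - sim_pos) / tau) for s in sims_neg) / len(sims_neg)
    return tau * math.log(inner)
```

When every negative is as similar as the positive the loss is zero; well-separated negatives drive it negative, and the average over all of S_i^− is exactly the inner function that SogCLR tracks with a moving-average estimator.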
As of June 4, 2023, the LibAUC library has been downloaded 36,000 times. We also implemented two additional algorithms, namely MIDAM for solving multi-instance deep AUROC maximization [66] and iSogCLR [40] for optimizing GCL with individualized temperature parameters, which are not studied in this paper.

The Mini-batch Loss module in LibAUC is referred to as Dynamic Mini-batch Loss, which uses dynamically updated variables to adjust the mini-batch loss. The dynamic variables are defined in the dynamic mini-batch loss, which can be evaluated by forward propagation. In contrast, we refer to the Mini-batch Loss module in existing libraries as Static Mini-batch Loss, which only uses the sampled data to define a mini-batch loss in the same way as the objective but on mini-batch data. The Data Sampler module in LibAUC is referred to as Controlled Data Sampler, which differs from standard random data samplers in that the ratio of the number of positive data to the number of negative data can be controlled and tuned to boost performance. Next, we provide more details of these two and other modules.

Dynamic Mini-batch Loss
We first present the stochastic gradient estimator of the objective function, which directly motivates our design of the Dynamic Mini-batch Loss module.
For simplicity of exposition, we will mainly use the FCCO problem of pAUC optimization (3) to demonstrate the core ideas of the library design. The designs of other algorithms follow in a similar manner. The key challenge is to estimate the gradient using a mini-batch of samples. To motivate the stochastic gradient estimator, we first consider the full gradient, given by

∇F(w) = E_{x_i∈S_+} [∇f_i(g(w; x_i, S_−)) E_{x_j∈S_−} ∇_w exp(ℓ(w; x_i, x_j)/τ)].
To estimate the full gradient, the outer average over all data in S_+ can be estimated by sampling a mini-batch B_1 ⊂ S_+. Similarly, the average over x_j ∈ S_− in the parentheses can also be estimated by sampling a mini-batch B_2 ⊂ S_−. A technical issue arises when estimating g(w; x_i, S_−) inside ∇f_i. A naive mini-batch approach is to simply estimate g(w; x_i, S_−) using the data in B_2 ⊂ S_−, i.e., g(w; x_i, B_2). However, the resulting estimator ∇f_i(g(w; x_i, B_2)) is biased because f_i is a non-linear function, and its estimation error depends on the batch size |B_2|. As a result, the algorithm will not converge unless the batch size |B_2| is very large. To address this issue, a moving average estimator is used to estimate g(w_t; x_i, S_−) at the t-th iteration [39,41,50,63,65], which is updated for sampled data x_i ∈ B_1^t according to:

u_i^{t+1} = (1 − γ) u_i^t + γ g(w_t; x_i, B_2^t),

where γ ∈ (0, 1) is a hyper-parameter. It has been proved that the averaged estimation error of u_i^{t+1} for g(w_t; x_i, S_−) is diminishing in the long run. With the moving average estimators, the gradient of the objective function is estimated by¹

G_t = (1/|B_1^t|) Σ_{x_i∈B_1^t} ∇f_i(u_i^{t+1}) ∇_w g(w_t; x_i, B_2^t).

The key steps of SOPAs for optimizing the pAUC loss are in Algorithm 1 [65]. To facilitate the implementation of computing the gradient estimator G_t, we design a dynamic mini-batch loss. The motivation of this design is to enable us to simply use the automatic differentiation of PyTorch or TensorFlow for calculating the gradient estimator G_t. In particular, on PyTorch we aim to define a loss such that we can directly call loss.backward() to compute G_t. To this end, we define a dynamic variable p_i = ∇f_i(u_i^{t+1}) for x_i ∈ B_1^t and then define the dynamic mini-batch loss as

loss = (1/|B_1^t|) Σ_{x_i∈B_1^t} p_i g(w; x_i, B_2^t).

¹ For theoretical analysis, u_i^{t+1} is replaced by u_i^t in [50,65].
Algorithm 1: SOPAs for solving pAUCLoss.
1: for t = 1, . . ., T do
2:   Sample mini-batches B_1^t ⊂ S_+ and B_2^t ⊂ S_−
3:   Update the moving average estimator u_i^{t+1} for each x_i ∈ B_1^t
4:   Compute the gradient estimator G_t by backpropagating the dynamic mini-batch loss
5:   Update the model parameter by an optimizer
6: end for

Algorithm 2: High-level pseudocode for SOPAs.
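The moving-average update is simple to verify in isolation: even with a tiny inner batch, u_i tracks the full-population value of the inner function, which is what makes small |B_2| workable. A self-contained sketch on a toy population (names and data are ours):

```python
import math
import random

random.seed(0)

def track_inner_mean(population, batch_size, gamma, steps):
    """u <- (1 - gamma) * u + gamma * (mini-batch estimate of g):
    a noisy but convergent tracker of the full-population average."""
    u = 0.0
    for _ in range(steps):
        batch = random.sample(population, batch_size)
        u = (1 - gamma) * u + gamma * sum(batch) / batch_size
    return u

# "inner function" values for one anchor over the whole negative set
population = [math.exp(i / 100) for i in range(100)]
true_g = sum(population) / len(population)
u = track_inner_mean(population, batch_size=4, gamma=0.1, steps=2000)
```

The naive estimator f_i(g(w; x_i, B_2)) would stay biased no matter how long training runs, because f_i is nonlinear; the tracker's error instead shrinks with a smaller γ and more iterations.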

Controlled Data Sampler
Unlike traditional ERM, DXO requires sampling to estimate both the outer average and the inner average. In the example of pAUC optimization by SOPAs, we need to sample two mini-batches B_1^t ⊂ S_+ and B_2^t ⊂ S_− at each iteration t. We notice that this is common for optimizing areas under curves and ranking measures. For some losses/measures (e.g., AUPRC/AP, NDCG, top-K NDCG, Listwise CE), both the sampled positive and negative samples will be used for estimating the inner functions. According to our theoretical analysis [50], balancing the mini-batch size for the outer average and that for the inner average can be beneficial for accelerating convergence. Hence, we design a new Data Sampler module to ensure that both positive and negative samples are sampled and that the proportion of positive samples in the mini-batch can be controlled by a hyper-parameter.
For CID problems, we introduce DualSampler, which takes as input hyper-parameters such as batch_size and sampling_rate to control the proportion of positive samples in each mini-batch. For LTR problems, we introduce TriSampler, which has hyper-parameters sampled_tasks to control the number of sampled queries for backpropagation, batch_size_per_task to adjust the mini-batch size for each query, and sampling_rate_per_task to control the ratio of positives in each mini-batch per query. The TriSampler can also be used for multi-label classification problems with so many labels that sampling labels becomes necessary, which makes the library extendable for our future work.

To improve the sampling speed, we have implemented an index-based approach that eliminates the need for computationally intensive operations such as concatenation and append. Figure 4 shows an example of DualSampler constructing mini-batch data with even positive and negative samples on an imbalanced dataset with 4 positives and 9 negatives. We maintain two lists of indices for the positive data and the negative data, respectively. At the beginning, we shuffle the two lists and then take the first 4 positives and 4 negatives to form a mini-batch. Once the positive list is used up, we re-shuffle only the positive list and take 4 shuffled positives to pair with the next 4 negatives in the negative list as a mini-batch. Once the negative list is used up (an "epoch" is done), we re-shuffle both lists and repeat the same process as above. For TriSampler, the main difference is that we first randomly select some queries/labels before sampling the positive and negative data for each query/label. The following code snippet shows how to define DualSampler and TriSampler.
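The index-based procedure described above can be written out in a few lines. The class below is our plain-Python illustration of the described behavior, not the actual libauc.sampler.DualSampler implementation:

```python
import random

class DualSamplerSketch:
    """Index-based sampler: keeps shuffled index lists of positives and
    negatives and reshuffles each list independently when it is exhausted
    (an 'epoch' ends when the negative list is used up)."""
    def __init__(self, labels, batch_size, sampling_rate=0.5, seed=0):
        self.rng = random.Random(seed)
        self.pos = [i for i, y in enumerate(labels) if y == 1]
        self.neg = [i for i, y in enumerate(labels) if y != 1]
        self.n_pos = max(1, int(batch_size * sampling_rate))
        self.n_neg = batch_size - self.n_pos
        self.rng.shuffle(self.pos)
        self.rng.shuffle(self.neg)
        self.p = self.q = 0  # cursors into the two index lists

    def next_batch(self):
        if self.p + self.n_pos > len(self.pos):   # positives exhausted
            self.rng.shuffle(self.pos)
            self.p = 0
        if self.q + self.n_neg > len(self.neg):   # negatives exhausted: new epoch
            self.rng.shuffle(self.neg)
            self.q = 0
        batch = (self.pos[self.p:self.p + self.n_pos]
                 + self.neg[self.q:self.q + self.n_neg])
        self.p += self.n_pos
        self.q += self.n_neg
        return batch
```

On the Figure 4 example (4 positives, 9 negatives, batch size 8, sampling_rate 0.5), every batch contains exactly 4 positives and 4 negatives, and only list shuffles and slicing are performed, with no concatenation of data tensors.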

Optimizer
With a calculated gradient estimator, the updating rules for the model parameter of different DXO algorithms follow similarly as (momentum) SGD or Adam [41,50,63-65,67]. Hence, optimizer.step() is essentially the same as that in existing libraries. In addition to our built-in optimizer, users can also utilize other popular optimizers from the PyTorch/TensorFlow library, such as Adagrad, AdamW, RMSprop, and RAdam [12,30,33,48]. Hence, we provide an optimizer wrapper that allows users to extend and choose appropriate optimizers. For the naming of the optimizer wrapper, we use the name of the optimization algorithm corresponding to each specific X-risk for better code readability. An example of the optimizer wrapper for pAUC optimization is given below, where mode='adam' allows users to use an Adam-style update. Another mode is 'SGD', which takes a momentum parameter as an argument to use the momentum SGD update.
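The wrapper idea, one optimizer class per algorithm name that dispatches on mode, can be mimicked without any framework. The class below is a hypothetical stand-in (name, signature, and constants are ours, not the LibAUC API):

```python
class PESGStyleWrapper:
    """Toy optimizer wrapper: mode='SGD' does momentum SGD,
    mode='adam' does an Adam-style update. Illustrative only."""
    def __init__(self, n_params, lr=0.1, mode='SGD', momentum=0.9):
        self.lr, self.mode, self.momentum = lr, mode, momentum
        self.m = [0.0] * n_params   # momentum / first-moment buffer
        self.v = [0.0] * n_params   # second-moment buffer (adam mode)
        self.t = 0

    def step(self, params, grads):
        self.t += 1
        for i, g in enumerate(grads):
            if self.mode == 'SGD':
                self.m[i] = self.momentum * self.m[i] + g
                params[i] -= self.lr * self.m[i]
            else:  # 'adam'-style update with bias correction
                self.m[i] = 0.9 * self.m[i] + 0.1 * g
                self.v[i] = 0.999 * self.v[i] + 0.001 * g * g
                m_hat = self.m[i] / (1 - 0.9 ** self.t)
                v_hat = self.v[i] / (1 - 0.999 ** self.t)
                params[i] -= self.lr * m_hat / (v_hat ** 0.5 + 1e-8)
```

Both modes drive a quadratic toy problem to its minimum; swapping mode changes only the update rule, which is exactly the point of the wrapper.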

Other Modules
In addition, we provide useful functionalities in other modules, including libauc.datasets, libauc.models, and libauc.metrics, to help users improve their productivity. The libauc.datasets module provides pre-processing functions for several widely-used datasets, including CIFAR [28], CheXpert [25], and MovieLens [17], allowing users to easily adapt these datasets for use with LibAUC in benchmarking experiments. It is important to note that the definition of the Dataset class is slightly different from that in existing libraries. An example is given below, where __getitem__ returns a triplet that consists of the input data, its label, and its corresponding index in the dataset; the index is returned to accommodate DXO algorithms updating the u_i^{t+1} estimators. The libauc.models module offers a range of pre-defined models for various tasks, including ResNet [18] and DenseNet [22] for classification and NeuMF [20] for recommendation. The libauc.metrics module offers evaluation wrappers based on scikit-learn for various metrics, such as AUC, AP, pAUC, and NDCG@K. Moreover, it provides an all-in-one wrapper (shown below) to evaluate multiple metrics simultaneously to improve production efficiency.
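The triplet-returning Dataset convention can be shown with a framework-free stand-in; a real LibAUC dataset subclasses a PyTorch Dataset, but the key point is just the extra index in __getitem__ (the class name is ours):

```python
class IndexedDataset:
    """Dataset whose __getitem__ returns (data, target, index).
    The index lets a DXO algorithm address the per-example
    moving-average estimator u_i it must update."""
    def __init__(self, data, targets):
        assert len(data) == len(targets)
        self.data, self.targets = data, targets

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.targets[idx], idx
```

A data loader built on such a dataset yields (inputs, labels, indices) per batch, and the indices flow straight into the dynamic loss's u-update.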

Deployment
Before ending this section, we present a list of different losses and their corresponding data samplers and optimizer wrappers of the LibAUC library in Table 1. Finally, we present an example below of building the pipeline for optimizing pAUC using our designed modules.
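To make the pipeline shape concrete, the pieces above (controlled sampling, moving-average estimators, a dynamic mini-batch loss, and an optimizer step) can be chained end to end on toy one-dimensional data. Everything below (the data, the names, and the finite-difference gradient standing in for autodiff) is an illustrative sketch, not LibAUC code:

```python
import math
import random

random.seed(0)

# toy 1-D data scored by a linear model h_w(x) = w * x
pos_x = [1.0, 0.8, 1.2, 0.9]                                    # 4 positives
neg_x = [-1.0, -0.7, -1.1, -0.9, -0.8, -1.2, -1.0, -0.6, -0.9]  # 9 negatives
TAU, GAMMA, LR = 1.0, 0.9, 0.1
u = [1.0] * len(pos_x)   # per-positive moving-average estimators u_i

def hinge_sq(t, m=1.0):
    return max(0.0, m + t) ** 2

def inner_g(w, i, bneg):
    # mini-batch estimate of the inner function g(w; x_i, S-)
    return sum(math.exp(hinge_sq(w * neg_x[j] - w * pos_x[i]) / TAU)
               for j in bneg) / len(bneg)

def sopas_step(w, bpos, bneg):
    """One SOPAs-style iteration: update u_i, build the dynamic
    mini-batch loss weighted by f'(u_i) = TAU / u_i, and take a
    gradient step (finite differences stand in for autodiff)."""
    for i in bpos:
        u[i] = (1 - GAMMA) * u[i] + GAMMA * inner_g(w, i, bneg)
    def dyn_loss(wv):
        return sum((TAU / u[i]) * inner_g(wv, i, bneg) for i in bpos) / len(bpos)
    eps = 1e-5
    grad = (dyn_loss(w + eps) - dyn_loss(w - eps)) / (2 * eps)
    return w - LR * grad

w = 0.0
for _ in range(100):
    bpos = random.sample(range(len(pos_x)), 2)   # controlled sampler: 2 pos
    bneg = random.sample(range(len(neg_x)), 4)   # + 4 neg per mini-batch
    w = sopas_step(w, bpos, bneg)
```

After training, w has grown until every positive outscores every negative by the margin, driving the smoothed pAUC surrogate toward zero, even though each step only sees 2 positives and 4 negatives.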

EXPERIMENTS
In this section, we provide extensive experiments on three tasks: CID, LTR and CLR. Although individual algorithms have been studied in their original papers for individual tasks, our empirical studies serve as a complement to prior studies in that (i) ablation studies of the two unique features on all three tasks provide coherent insights into the library for optimizing different X-risks; (ii) a comparison with an existing optimization-oriented library, TFCO [9,34], for optimizing AUPRC is conducted; (iii) a larger-scale dataset is used for LTR, and a re-implementation of our algorithms for LTR is done in TensorFlow for fair comparison with the TF-Ranking library [35]; (iv) evaluation of different DXO algorithms based on different areas under the curves is performed, exhibiting useful insights for practical use; (v) larger image-text datasets are used for evaluating SogCLR for bimodal SSL. Another difference from prior works [39,41,64,65] is that all experiments for CID and LTR are conducted in an end-to-end training fashion without using a pretraining strategy. However, we did observe that pretraining generally helps improve performance (cf. the Appendix).

Classification for Imbalanced Data

We choose three datasets from different domains, namely CIFAR10 - a natural image dataset [28], CheXpert - a medical image dataset [25], and OGB-HIV - a molecular graph dataset [21]. For CIFAR10, we follow the original paper [64] to construct an imbalanced training set with a positive sample ratio (referred to as imratio) of 1%. For evaluation, we sample 5% of the data from the training set as a validation set, re-train the model using the full training set after selecting the parameters, and finally report the performance on a testing set with balanced positive and negative classes. For CheXpert, we follow the original work [64] by conducting experiments on 5 selected diseases, i.e., Cardiomegaly (imratio=12.2%), Edema (imratio=32.2%), Consolidation (imratio=6.8%), Atelectasis (imratio=31.2%), and Pleural Effusion (imratio=40.3%), with an average imratio of 24.54%. We use the downsized 224 × 224 frontal images only for training. Due to the unavailability of the testing set, we report the averaged results of the 5 tasks on the official validation set. For OGB-HIV, the dataset has an imratio of 1.76%; we use the official train/valid/test split for experiments and report the final performance on the testing set.
For each setting, we repeat experiments three times using different random seeds and report the final results in mean±std.
For modeling, we use ResNet20, DenseNet121, and DeepGCN [18,22,29] for the three datasets, respectively. We consider optimizing three losses, namely AUCMLoss, APLoss, and pAUCLoss, by using PESG, SOAP, and SOPAs, respectively. For the latter two, we use the pairwise squared hinge loss with a margin parameter in their definitions. Thus, all losses have a margin parameter, which is tuned in [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]. For APLoss and pAUCLoss, we tune the moving average estimator parameter γ in the same range. For pAUCLoss, we also tune the temperature parameter in [0.1, 1.0, 10.0]. For DualSampler, we tune sampling_rate in [0.1, 0.3, 0.5]. For baselines, we compare two popular loss functions used in the literature, i.e., the CE loss and the Focal loss. For the Focal loss, we tune α in [1, 2, 5] and γ in [0.25, 0.5, 0.75]. For optimization, we use the momentum SGD optimizer for all methods with a default momentum parameter of 0.9 and an initial learning rate tuned in [0.1, 0.05, 0.01]. We decay the learning rate by 10 times at 50% and 75% of the total training iterations. For CIFAR10, we run all methods using a batch size of 128 for 100 epochs. For CheXpert, we train models using a batch size of 32 for 2 epochs. For OGB-HIV, we train models using a batch size of 512 for 100 epochs. To evaluate the performance, we adopt three different metrics, i.e., AUROC, AP, and pAUC (FPR<0.3). We select the best configuration based on the performance metric to be optimized, e.g., using AUROC for model selection of AUCMLoss. The results are summarized in Table 2.
We have several interesting observations. Firstly, directly optimizing performance metrics leads to better performance compared to baseline methods based on the ERM framework. For example, PESG, SOAP, and SOPAs outperform the CE and Focal losses by a large margin on all datasets. This is consistent with prior works. Secondly, optimizing a specific metric does not necessarily yield the best performance on other metrics. For example, on the OGB-HIV dataset, PESG has the highest AUROC but the lowest AP score, while SOAP has the highest AP score but the lowest AUROC and pAUC, and SOPAs has the highest pAUC score. This confirms the importance of choosing the appropriate methods in LibAUC for the corresponding metrics. Thirdly, on CheXpert, it seems that optimizing pAUC is more beneficial than optimizing the full AUROC: SOPAs achieves better performance than PESG and SOAP in all three metrics.
Comparison with the TFCO library. We compare LibAUC (SOAP) with TFCO [9,34] for optimizing AP. We run both methods using a batch size of 128 for 100 epochs with the Adam optimizer, a learning rate of 1e-3, and a weight decay of 1e-4 on the constructed CIFAR10 with imratio={1%, 2%}. We plot the learning curves on the training and testing sets in Figure 5. The results indicate that LibAUC consistently performs better than TFCO.

Learning to Rank
We evaluate LibAUC on an LTR task for movie recommendation. The goal is to rank movies for users according to their potential interest in watching them, based on their historical ratings of movies. We compare the LibAUC library for optimizing ListwiseCELoss, NDCGLoss, and the top-K NDCG loss, denoted by NDCGLoss(K), against the TF-Ranking library [35] for optimizing ApproxNDCG, GumbelNDCG, and ListMLE on two large-scale movie datasets, MovieLens20M and MovieLens25M, from the MovieLens website [17]. MovieLens20M contains 20 million movie ratings from 138,493 users and MovieLens25M contains 25 million movie ratings from 162,541 users. Each user has at least 20 rated movies. Different from [41], we re-implement SONG and K-SONG (its practical version) in TensorFlow for optimizing the three losses, for a fair comparison of running time with TF-Ranking since it is implemented in TensorFlow. To construct the training/validation/testing sets, we first sort the ratings for each user by timestamp from oldest to newest. Then, we put the 5 most recent ratings in the testing set and the next 5 most recent in the validation set. For training, at each iteration we randomly sample 256 users, and for each user we sample 5 positive items from the remaining rated movies and 300 negatives from all unrated movies. For computing the validation and testing performance, we sample 1000 negative items from the movie list, similar to [41]. For modeling, we use NeuMF [20] as the backbone network for all methods. We use the Adam optimizer [27] for all methods with an initial learning rate of 0.001 and a weight decay of 1e-7 for 120 epochs, following settings similar to [41]. During training, we decrease the learning rate by 10 times at 50% and 75% of the total iterations. For evaluation, we compute and compare NDCG@5 and NDCG@20 for all methods. For NDCGLoss, NDCGLoss(K) and ListwiseCELoss, we tune the moving average estimator parameter γ in the range [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]. For NDCGLoss(K), we tune K in [50, 100, 300]. We repeat the experiments three times using different random seeds and report the final results in mean±std. To measure the training efficiency, we conduct the experiments on an NVIDIA V100 GPU and compute the average training time over 10 epochs.
As shown in Figure 6 (left), LibAUC achieves better performance on both datasets. It is worth mentioning that the results of all methods we report are generally worse than those reported in [41], likely due to different negative items being used for evaluation. In addition, optimizing NDCGLoss(K) is not as competitive as optimizing NDCGLoss, because we did not use the pretraining strategy used in [41]. In the Appendix, we show that pretraining is helpful for boosting the performance of optimizing NDCGLoss(K). The runtime comparison, where we report the average runtime in seconds per epoch, is shown in Figure 6 (right). The results show that our implementation of LibAUC in TensorFlow is even faster than the three methods in TF-Ranking. It is interesting to note that LibAUC for optimizing the ListwiseCE loss is 1.6× faster than TF-Ranking for optimizing the Gumbel loss yet has better performance.

Contrastive Learning of Representations
In this section, we demonstrate the effectiveness of LibAUC (SogCLR) for optimizing GCLoss on both unimodal and bimodal SSL tasks. For unimodal SSL, we use two scales of the ImageNet dataset: a small subset of ImageNet with 100 randomly selected classes (about 128k images), denoted as ImageNet-100, and the full version of ImageNet (about 1.2 million images), denoted as ImageNet-1000 [11]. For bimodal SSL, we use MS-COCO and CC3M [16,47] for experiments. MS-COCO is a large-scale image recognition dataset containing over 118,000 images and 80 object categories, and each image is associated with 5 captions describing the objects and their interactions in the image. CC3M is a large-scale image captioning dataset that contains almost 3 million image-caption pairs. For evaluation, we compare the feature quality of the pretrained encoders on the ImageNet-1000 validation set, which consists of 50,000 images belonging to 1000 classes. For unimodal SSL, we conduct linear evaluation by fine-tuning a new classifier in a supervised fashion after pretraining. For bimodal SSL, we conduct zero-shot evaluation by computing similarity scores between the image embeddings and the embeddings of the prompted class names.

Table 3: Results for Self-Supervised Learning. Numbers are denoted in %. SogCLR [63] is re-implemented in PyTorch.

For unimodal SSL, we follow the same settings as SimCLR [7]. We use ResNet-50 with a two-layer non-linear head with a hidden size of 128. We use the LARS optimizer [62] with an initial learning rate of 0.075 × √batch_size and a weight decay of 1e-6. We use a cosine decay strategy to decrease the learning rate. We use a batch size of 256 to train on ImageNet-1000 for 800 epochs and on ImageNet-100 for 400 epochs, with a 10-epoch warm-up. For linear evaluation, we train the classifier for an additional 90 epochs using the momentum SGD optimizer with no weight decay. For bimodal SSL, we use a transformer [44,49] as the text encoder (cf. the Appendix for structure parameters) and ResNet-50 as the image encoder [42]. Similarly, we use the LARS optimizer with the same learning rate strategy and weight decay. We use a batch size of 256 for 30 epochs, with a 3-epoch warm-up. For zero-shot evaluation, we compute the accuracy based on the cosine similarities between image embeddings and text embeddings using 80 different prompt templates, similar to [42]. Note that we randomly sample one out of the five text captions to construct a text-image pair for pretraining on MS-COCO. We compare SogCLR with SimCLR for unimodal SSL and with CLIP for bimodal SSL. For SogCLR, we tune γ in [0.1, 0.3, 0.5, 0.7, 0.8, 0.9, 1.0] and the temperature τ in [0.07, 0.1]. All experiments are run on 4-GPU (NVIDIA A40) machines. The results are summarized in Table 3.
The results demonstrate that SogCLR outperforms SimCLR and CLIP for optimizing mini-batch contrastive losses in both tasks. In particular, SogCLR improves over SimCLR by 2.2% and 2.9% on the two ImageNet datasets, and over CLIP by 0.5% and 1.6% on the two bimodal datasets. It is notable that the pretraining on ImageNet lasts up to 800 epochs, while the pretraining on the two bimodal datasets is only run for 30 epochs due to limited computational resources. According to the theorems in [63], the optimization error of SogCLR diminishes as the number of training epochs increases. We therefore expect SogCLR to exhibit larger improvements over CLIP with longer training.

Ablation Studies
In this section, we present more ablation studies to demonstrate the effectiveness of our design and the superiority of our library. We directly use the best hyper-parameters tuned in Sections 4.1 and 4.2, except for γ, which is tuned from 0.1 to 1.0. The performance is evaluated using AP (SOAP), pAUC (SOPAs), NDCG@5 (SONG), and Top-1 Accuracy (SogCLR), respectively. The final results of this comparison are summarized in Table 4. Overall, we find that all methods achieve their best performance when γ is less than 1. For LTR, we use the MovieLens20M dataset. We fix the number of sampled queries (i.e., users) to 256 in each mini-batch and vary the numbers of positive and negative items, which are tuned in {1, 5, 10} and {1, 5, 10, 100, 300, 500, 1000}, respectively. We fix γ = 0.1 and train the model for 120 epochs with the same learning rate, weight decay, and learning rate decay strategies as in Section 4.2. The results are evaluated in NDCG@5 and are shown in Table 6. Both results demonstrate that tuning the positive sampling rate is beneficial for performance improvement.
The results reveal that DualSampler substantially boosts the performance of AUCMLoss on CIFAR10 and OGB-HIV when the sampling rate (sr) is set to 10%. Interestingly, fully balancing the data (sr=50%) did not necessarily improve performance in the three cases. Generally speaking, however, using a sampling rate higher than the original imbalance ratio is useful. For LTR with TriSampler, we observe a dramatic performance increase when increasing the number of positive samples from 1 to 10 and the number of negative samples from 1 to 300. However, when the number of negatives is further increased from 300 to 1000, the improvement saturates.

4.4.3
The Impact of Batch Size. We study the impact of batch size on our methods (SOAP, SOPAs, SONG, SogCLR) with dynamic mini-batch losses and with static mini-batch losses (i.e., γ = 1). We follow the same experimental settings as in the previous section and only vary the batch size. For each batch size, we tune γ accordingly, as the theory indicates that its best value depends on the batch size. For SogCLR, we train ResNet50 on ImageNet-1000 for 800 epochs using batch sizes in {8192, 2048, 512, 128}. For SOAP and SOPAs, we train ResNet20 on OGB-HIV for 100 epochs using batch sizes in {512, 256, 128, 64}. For SONG, we train NeuMF for 120 epochs on MovieLens20M using batch sizes in {256, 128, 64, 32}. The results are shown in Figure 7, which demonstrates that our design is more robust to the mini-batch size.

4.4.4
Convergence Speed. Finally, we compare the convergence curves of selected algorithms on the OGB-HIV, MovieLens20M, and ImageNet-100 datasets. We use the tuned parameters from the previous sections to plot the convergence curves on the testing sets. The results are illustrated in Figure 8. For classification, we observe that PESG and SOPAs converge much faster than optimizing the CE and Focal losses. On the MovieLens20M dataset, SONG has the fastest convergence among all methods, and K-SONG (without pretraining) is faster than the other baselines but slower than SONG. In the case of SSL, SogCLR and SimCLR achieve similar performance in the early stage, but SogCLR gradually outperforms SimCLR as training proceeds.

CONCLUSION & FUTURE WORKS
In this paper, we have introduced LibAUC, a deep learning library for X-risk optimization. We presented the design principles of LibAUC and conducted extensive experiments to verify them. Our experiments demonstrate that the LibAUC library is superior to existing libraries/approaches for solving a variety of tasks, including classification for imbalanced data, learning to rank, and contrastive learning of representations. Finally, we note that our current implementation of the LibAUC library is by no means exhaustive. In the future, we plan to implement algorithms for more X-risks, including performance at the top, such as recall at the top-K positions and precision at a certain recall level.

Figure 1 :
Figure 1: Mappings of X-risks to optimization problems.
P(x_q,i | S_q) ∝ exp(h_w(x_q,i; q)) denotes the probability for the relevance score of x_q,i to be ranked at the top. Equation (4) is a special case of FCCO by setting g(w; x_q,i, S_q) = E_{x∈S_q} exp(h_w(x; q) − h_w(x_q,i; q)) and f_q,i(g) = y(x_q,i) log(g). In LibAUC, we implemented an optimization algorithm, similar to SONG, for optimizing the listwise CE loss. Global Contrastive Losses (GCL) are the global variants of mini-batch contrastive losses.

Figure 2 :
Figure 2: The pipeline of LibAUC modules. Highlighted blocks denote the unique modules of the LibAUC library.

The pipeline of training a DL model in the LibAUC library is shown in Figure 2; it consists of five modules, namely Dataset, Data Sampler, Model, Mini-batch Loss, and Optimizer. The Dataset module allows us to get a training sample, including its input and output. The Data Sampler module provides tools to sample a mini-batch of examples for training at each iteration. The Model module allows us to define different deep models. The Mini-batch Loss module defines a loss function on the selected mini-batch data for backpropagation. The Optimizer module implements methods for updating the model parameters given the gradients computed by backpropagation. While the Dataset, Model, and Optimizer modules are similar to those in existing libraries, the key differences lie in the Mini-batch Loss and Data Sampler modules.
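The five-module decomposition can be sketched as a toy training loop in plain Python. Everything below is illustrative: a one-parameter logistic scorer with a hand-written gradient stands in for the Model/Optimizer pair, and none of the names are LibAUC APIs.

```python
import math, random

# Dataset: (input, label) pairs
dataset = [(+1.0, 1), (+0.8, 1), (-0.9, 0), (-1.1, 0), (-0.7, 0)]

# Data Sampler: yields index lists for each mini-batch
def sampler(n, batch_size, rng):
    idx = list(range(n))
    rng.shuffle(idx)
    for i in range(0, n, batch_size):
        yield idx[i:i + batch_size]

# Model: a one-parameter logistic scorer
def model(w, x):
    return 1.0 / (1.0 + math.exp(-w * x))

# Mini-batch Loss: cross-entropy on the sampled batch, plus its gradient in w
def batch_loss_and_grad(w, batch):
    loss = grad = 0.0
    for x, y in batch:
        p = model(w, x)
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        grad += (p - y) * x              # d(loss)/dw for one example
    return loss / len(batch), grad / len(batch)

# Optimizer: plain SGD step
def sgd_step(w, grad, lr=0.5):
    return w - lr * grad

rng, w = random.Random(0), 0.0
for epoch in range(50):
    for idx in sampler(len(dataset), batch_size=2, rng=rng):
        batch = [dataset[i] for i in idx]
        _, g = batch_loss_and_grad(w, batch)
        w = sgd_step(w, g)
```

In LibAUC the Dataset, Model, and Optimizer roles are filled by standard PyTorch components, while the Data Sampler and Mini-batch Loss are the library's own controlled samplers and dynamic losses.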

1Figure 3 :
Figure 3: Left: SOPAs for optimizing pAUC; Right: its pseudo code using automatic differentiation of a dynamic mini-batch loss.The corresponding parts of the algorithm and pseudocode are highlighted in the same color.

Figure 4 :
Figure 4: Illustration of DualSampler for an imbalanced dataset with 4 positives • and 9 negatives •. The DualSampler generates customized mini-batch samples, where sampling_rate controls the number of positive samples in the mini-batch according to the formula # positives = batch_size*sampling_rate. For LTR problems, we introduce TriSampler, which has hyperparameters sampled_tasks to control the number of sampled queries for backpropagation, batch_size_per_task to adjust the mini-batch size for each query, and sampling_rate_per_task to control the ratio of positives in each mini-batch per query. The TriSampler can also be used for multi-label classification problems with many labels, where sampling labels becomes necessary, which makes the library extendable for our future work. To improve the sampling speed, we have implemented an index-based approach that eliminates the need for computationally intensive operations such as concatenation and append. Figure 4 shows an example of DualSampler constructing mini-batch data with an even split of positive and negative samples.
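An index-based sampler of this kind can be sketched as follows. This is an illustrative reimplementation, not the library's code; the function name and arguments are ours, and only `sampling_rate` and the # positives formula come from the text above.

```python
import random

def dual_sample(pos_idx, neg_idx, batch_size, sampling_rate, rng):
    """Return one mini-batch of dataset indices with
    n_pos = int(batch_size * sampling_rate) positives.
    Operating purely on index lists avoids tensor concatenation/append."""
    n_pos = int(batch_size * sampling_rate)
    pos = rng.choices(pos_idx, k=n_pos)              # with replacement: positives
    neg = rng.sample(neg_idx, k=batch_size - n_pos)  # are usually scarce
    batch = pos + neg
    rng.shuffle(batch)
    return batch

rng = random.Random(0)
pos_idx = list(range(10))          # 10 positives
neg_idx = list(range(10, 1000))    # 990 negatives: a highly imbalanced set
batch = dual_sample(pos_idx, neg_idx, batch_size=64, sampling_rate=0.1, rng=rng)
```

With sampling_rate=0.1 every 64-example batch carries 6 positives, far above the 1% base rate of the index pool.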

Figure 6 :
Figure 6: Left: Results on MovieLens datasets. Right: Comparison of training time for LibAUC and TF-Ranking.

4.4.1
Effectiveness of Dynamic Mini-batch Losses. To verify the effectiveness of the dynamic mini-batch losses, we compare them with conventional static mini-batch losses. To this end, we focus on SOAP, SOPAs, SONG, and SogCLR, and compare their performance with different values of γ in our framework. When setting γ = 1, our algorithms degrade into their conventional mini-batch versions.
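The role of γ can be illustrated with the moving-average estimator that underlies these algorithms: each anchor keeps a running state u that aggregates noisy mini-batch estimates g of an inner quantity across iterations, and γ = 1 discards all history, recovering the static mini-batch loss. A toy sketch with our own variable names:

```python
def update_state(u, g, gamma):
    """SogCLR/SOAP-style moving-average update of a per-anchor
    inner estimate: u <- (1 - gamma) * u + gamma * g.
    gamma = 1 keeps only the current batch (static mini-batch loss)."""
    return (1.0 - gamma) * u + gamma * g

# a stream of noisy mini-batch estimates of the same inner quantity
estimates = [1.2, 0.8, 1.1, 0.9, 1.0]

u_dyn = u_static = 0.0
for g in estimates:
    u_dyn = update_state(u_dyn, g, gamma=0.1)        # dynamic: averages history
    u_static = update_state(u_static, g, gamma=1.0)  # static: last batch only
```

The dynamic state smooths the mini-batch noise over iterations, which is why its best γ shrinks as batches get smaller and noisier.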

4.4.2
Effectiveness of Data Sampler. We vary the positive sampling rate (denoted sr) in the DualSampler for CID by optimizing AUCMLoss, and in the TriSampler for LTR by optimizing NDCGLoss. For CID, we use three datasets: CIFAR10 (1%), CheXpert, and OGB-HIV, and tune sr ∈ {original, 10%, 30%, 50%}, where sr=original means that we simply use the random data sampler without any control. Other hyper-parameters are fixed to those found in Section 4.1. The results are evaluated in AUROC and summarized in Table 5.
Let S = {(x_1, y_1), . . ., (x_n, y_n)} denote a set of training data, where x_i ∈ X ⊂ R^d denotes the input feature vector and y_i ∈ {1, −1} denotes the corresponding label. Let S_+ = {x_i : y_i = 1} contain n_+ positive examples and S_− = {x_i : y_i = −1} contain n_− negative examples. Denote by h_w(x) : X → R a parametric predictive function (e.g., a deep neural network) with a parameter w ∈ R^d. We use E_{x∼S} and (1/|S|) Σ_{x∈S} interchangeably below. For LTR, let Q denote a set of N queries. For a query q ∈ Q, let S_q = {x_q,i, i = 1, . . ., N_q} denote a set of N_q items (e.g., documents, movies) to be ranked. For each q, let S_q^+ ⊂ S_q denote the set of N_q^+ relevant (positive) items, whose relevance scores are non-zero.
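With this notation, the AUC risk that motivates AUCMLoss can be written as a pairwise expectation over one positive and one negative example; below it is shown alongside a standard squared-margin surrogate for illustration (the margin m is a hyperparameter; the exact min-max reformulation used by AUCMLoss/PESG is given in the cited references, not here):

```latex
\mathrm{AUC}(\mathbf{w}) \;=\;
\mathbb{E}_{\mathbf{x}_+ \sim \mathcal{S}_+,\; \mathbf{x}_- \sim \mathcal{S}_-}
\Big[\mathbb{I}\big(h_{\mathbf{w}}(\mathbf{x}_+) \ge h_{\mathbf{w}}(\mathbf{x}_-)\big)\Big],
\qquad
\min_{\mathbf{w}} \;
\mathbb{E}_{\mathbf{x}_+ \sim \mathcal{S}_+,\; \mathbf{x}_- \sim \mathcal{S}_-}
\Big[\big(m - h_{\mathbf{w}}(\mathbf{x}_+) + h_{\mathbf{w}}(\mathbf{x}_-)\big)^2\Big]
```

The expectation couples every positive with every negative, which is exactly why a plain mini-batch estimate is biased and why controlled sampling of both classes matters.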

Table 1 :
The list of losses, corresponding samplers and optimizer wrappers in libauc.For a complete list, please refer to the documentation of LibAUC.

Table 2 :
Results on three classification tasks. Best results are marked in bold and second-best results are underlined.
Due to the high training cost, we only run each experiment once. It is worth noting that the two bimodal datasets were not used in [63].

Table 5 :
Tuning the sampling rate is beneficial for AUCMLoss.

Table 6 :
Tuning the sampling rate is beneficial for NDCGLoss on MovieLens20M.