Cross-Model Comparative Loss for Enhancing Neuronal Utility in Language Understanding

Current natural language understanding (NLU) models have been continuously scaling up, both in terms of model size and input context, introducing more hidden and input neurons. While this generally improves performance on average, the extra neurons do not yield a consistent improvement for all instances. This is because some hidden neurons are redundant, and the noise mixed in input neurons tends to distract the model. Previous work mainly focuses on extrinsically reducing low-utility neurons by additional post- or pre-processing, such as network pruning and context selection, to avoid this problem. Beyond that, can we make the model reduce redundant parameters and suppress input noise by intrinsically enhancing the utility of each neuron? If a model can efficiently utilize neurons, no matter which neurons are ablated (disabled), the ablated submodel should perform no better than the original full model. Based on such a comparison principle between models, we propose a cross-model comparative loss for a broad range of tasks. Comparative loss is essentially a ranking loss on top of the task-specific losses of the full and ablated models, with the expectation that the task-specific loss of the full model is minimal. We demonstrate the universal effectiveness of comparative loss through extensive experiments on 14 datasets from 3 distinct NLU tasks based on 5 widely used pretrained language models and find it particularly superior for models with few parameters or long input.


INTRODUCTION
Natural language understanding (NLU) has been pushed a remarkable step forward by deep neural models. To further enhance the performance of deep models, enlarging model size [7,8,33,57] and input context [6,32,67] are two conventional and effective ways, where the former introduces more hidden neurons and the latter brings more input neurons. Although neural models with more hidden or input neurons have higher accuracy on average, large-scale models do not always beat small models. For example, on the one hand, many network pruning methods have shown that compressed models with significantly reduced parameters (neuron connections) can maintain accuracy [27,39,45] and even improve generalization [2]; Meyes et al. [48] find that ablation of neurons can consistently improve performance in some specific classes; and Zhong et al. [77] empirically demonstrate that larger language models indeed perform worse on a non-negligible fraction of instances. These phenomena indicate that some hidden neurons in the currently trained model are dispensable or even obstructive. On the other hand, much of the work on question answering [14,74] and query understanding [18,50,81] has noted that feeding more contextual information is more likely to distract the model and hurt performance. This is not surprising, as more input neurons not only mean more relevant features but are also likely to introduce more noise that interferes with the model. Similar to network pruning that cuts out inefficient parameters through post-processing, many context selection methods [23,49,62,76] trim off noisy segments from the input context by pre-processing. In essence, both network pruning and context selection reduce inefficient hidden or input neurons through additional processing. However, apart from extrinsically reducing inefficient neurons, can we intrinsically improve the utility of neurons during model training?
Imagine an ideal neural network in which all neurons cooperate efficiently to maximize the utility of each neuron. If a fraction of the input or hidden neurons in this network are ablated (disabling part of the input context or model parameters), the ablated submodel is not supposed to perform better, even if the ablated neurons are noisy. This is because an efficient model should have already suppressed this noise. Following this intuition, we can roughly find a comparison principle between the original full model and its ablated model: the fewer neurons are ablated in a model, the better the model should perform. During training, we can use task-specific losses as a proxy for model performance on training samples, with lower task-specific losses implying better performance.
For example, the task-specific loss of the efficient full model (a) in Fig. 1 is supposed to be minimal, and if the ablated model (b) is also efficient with respect to its restricted parameter space, the task-specific loss of the ablated model (d) is supposed to be greater than that of (b) because (d) ablates one more input neuron than (b).
Noting the gap between the ideal model and reality [49,77], we aim to ensure this necessity (comparison principle) during training to improve the model's utilization of neurons. Based on the natural comparison principle between models, we propose a cross-model comparative loss to train models without additional manual supervision. In general, the comparative loss is a ranking loss on top of multiple task-specific losses. First, these task-specific losses are derived from the full neural model and several comparable ablated models whose neurons are ablated to varying degrees. Next, the ranking loss is a pairwise hinge loss that penalizes models that have fewer ablated neurons but larger task-specific losses. Concretely, if a model with fewer ablated neurons acquires a larger task-specific loss than another model with more ablated neurons, then the difference between the task-specific losses of the pair will be taken into account in the final comparative loss; otherwise the pair complies with the comparison principle and does not incur any training loss. In this way, the comparative loss can drive the order of task-specific losses to be consistent with the order of the ablation degrees. Through theoretical derivation, we also show that comparative loss can be viewed as a dynamic weighting of multiple task-specific losses, enabling adaptive assignment of weights depending on the performance of the full/ablated models.
The comparability among multiple ablated models is a fundamental prerequisite for comparative loss. As a counterexample, although the ablated model (c) in Fig. 1 ablates fewer neurons than (d), they are not comparable and so no comparative loss can be applied. To make the ablated models comparable with each other, we progressively ablate the models. The first ablated model is obtained by performing one ablation on the basis of the full model. If more ablated models are needed, in each subsequent ablation step we construct a new ablated model by performing a further ablation on top of the ablated model from the previous step, which makes the newly ablated model certainly a comparable submodel of the previous ones. We provide two alternative controlled ablation methods for each ablation step, called CmpDrop and CmpCrop. CmpDrop ablates hidden neurons by the dropout [30] technique, which is theoretically applicable to all dropout-compatible models. CmpCrop, in contrast, ablates input neurons by cropping extraneous context segments and is theoretically applicable to all tasks that contain extraneous content in the input context.
We apply comparative loss with CmpDrop and/or CmpCrop on 14 datasets from 3 NLU tasks (text classification, question answering and query understanding) with distinct prediction types (classification, extraction and ranking) on top of 5 widely used pretrained language models (PLMs) [3,15,21,37,44]. The empirical results demonstrate the effectiveness of comparative loss over state-of-the-art baselines, as well as the enhanced utility of parameters and context. Our analysis also confirms that comparative loss can indeed more appropriately weight multiple task-specific losses, as indicated by our derivation. By exploring different comparison strategies, we observe that comparing the models ablated by first CmpCrop and then CmpDrop can bring the greatest improvement. Interestingly, we find that comparative loss is particularly effective for models with few parameters or long inputs. This may imply that comparative loss can help models with lower capacity to fit more or longer training samples better, while models with higher capacity are inherently prone to fit less data, so comparative loss is less helpful for them. Moreover, we discover that different ablation methods have different effects on training, with CmpDrop helping task-specific loss to decrease to lower levels faster and CmpCrop alleviating overfitting to some extent.
Manuscript submitted to ACM
The main contributions can be summarized as follows:
• We propose comparative loss, a cross-model loss function based on the comparison principle between the full model and its ablated models, to improve the neuronal utility without additional human supervision.
• We progressively ablate the models to make multiple ablated models comparable and present two controlled ablation methods based on dropout and context cropping, applicable to a wide range of tasks and models.
• We theoretically show how comparative loss works and empirically demonstrate its effectiveness through experiments on 3 distinct natural language understanding tasks.
We release the code and processed data at https://github.com/zycdev/CmpLoss.

PRELIMINARIES
Before introducing our cross-model comparative loss, we review some of the concepts and notations needed afterward.
We first introduce typical training methods for the model, followed by formalizations of network pruning and context selection methods that can further improve the model performance by removing inefficient input or hidden neurons.
Finally, we elaborate on the concept of ablation, which recurs throughout the paper.

Conventional Training
Given a training dataset $\mathcal{D}$ for a specified task and a neural network $f$ parameterized by $\theta \in \mathbb{R}^{|\theta|}$, the training objective for each sample $(x, y) \in \mathcal{D}$ is to minimize the empirical risk
$$\mathcal{L}(y, f(x; \theta)), \tag{1}$$
where $x$ is the input context, $y$ in output space $\mathcal{Y}$ is the label, and $\mathcal{L}: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$ is the task-specific loss function, with $\mathbb{R}_{\ge 0}$ denoting the set of non-negative real numbers. In NLU tasks, $x$ is typically a sequence of words, while $y$ can be a single category label for classification [55,66,73], a pair of start and end boundaries for extraction [54,59,74], or a sequence of relevance levels for ranking [51,52,56,78].

Network Pruning
After training a neural model $f(x; \theta)$, to reduce memory and computation requirements at test time, network pruning [5] entails producing a smaller model $f(x; M \odot \theta')$ with similar accuracy through post-hoc processing. Here $M \in \{0, 1\}^{|\theta|}$ is a binary mask that fixes certain pruned parameters to 0 through the elementwise product $\odot$, and the parameter vector $\theta'$ may be different from $\theta$ because $M \odot \theta'$ is usually retrained from $M \odot \theta$ to fit the pruned network structure.

Although pruning is often viewed as a way to compress models, it has also been motivated by the desire to prevent overfitting. Pruning systematically removes redundant parameters and neurons that do not significantly contribute to performance and thus have much less prediction variance, which is reminiscent of dropout [36], another widely used technique to avoid overfitting. Similarly, dropout also uses a mask to disable a fraction (such as $p\%$) of parameters or neurons. The significant difference, though, is that the mask $M$ in dropout is randomly sampled from a Bernoulli$(1 - p\%)$ distribution, rather than deterministically defined by a criterion (e.g., the bottom $p\%$ of parameters in magnitude should be masked) as in pruning. This in turn brings convenience: a model trained with dropout does not need to be retrained for a specific mask, because the model's neurons have already started to learn how to adapt to the absence of some neurons in the previous training.
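The contrast between the two kinds of mask can be illustrated with a minimal sketch. The helper names `magnitude_prune_mask` and `dropout_mask` and the toy parameter values are hypothetical, and magnitude-based pruning is used only as one example of a deterministic criterion.

```python
import random

def magnitude_prune_mask(params, prune_fraction):
    """Deterministic mask: zero out the smallest-magnitude fraction of parameters."""
    n_prune = int(len(params) * prune_fraction)
    # Indices sorted by |value|, smallest first; the first n_prune are masked.
    order = sorted(range(len(params)), key=lambda i: abs(params[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(params))]

def dropout_mask(n, drop_rate, rng):
    """Random mask: each entry is kept independently with probability 1 - drop_rate."""
    return [1 if rng.random() >= drop_rate else 0 for _ in range(n)]

# Toy parameter vector (illustrative values only).
params = [0.9, -0.05, 0.4, 0.01, -1.2, 0.3]
prune_mask = magnitude_prune_mask(params, prune_fraction=0.5)
drop_mask = dropout_mask(len(params), drop_rate=0.5, rng=random.Random(0))
```

The pruning mask is fully determined by the parameter values, whereas the dropout mask changes with the random state, which is why a dropout-trained model sees many different masks during training.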

Context Selection
To eliminate the noisy content in the input context $x$ and further improve the model performance, context selection selectively crops out a condensed context $x' \sqsubseteq x$ to produce the final model prediction. In general, the model requires specialized training to fit the selected context. Therefore, context selection is pre-hoc processing relative to training, requiring removing the noise from the training samples in advance. With a slight abuse of notation, here we use $x' \sqsubseteq x$ to denote that $x'$ is a condensed subsequence (possibly equal) of $x$. In general, $x'$ is an ordered combination of segments of $x$, where the segments are usually at the sentence [49,53], chunk [76], paragraph [14], or document [23,62] granularity. It is worth noting that the selector for segment selection generally requires additional supervised training and needs to be run in advance of the prediction, which introduces additional computation overhead.

Ablation
To assess the contribution of certain components to the overall model, ablation studies investigate model behavior by removing or replacing these components in a controlled setting [17].Here, in the context of machine learning, "ablation" refers to the removal of components of the model, which is an analogy to ablative brain surgery (removal of components of an organism) in biology [48].We refer to the model after component removal as the "ablated model", which should continue to work.However, if the removed components are responsible for performance improvement, the performance of the ablated model is expected to be worse [24].
In this paper, we use "ablation" to refer specifically to the removal of some neurons of a neural model, i.e., to set the output of some specific neurons to zero.From such a neuronal perspective, network pruning and context selection can be viewed as two kinds of ablation, the former removing some low-contributing hidden neurons after training and the latter removing some low-information input neurons before training.However, in contrast to ablation studies that aim to investigate the role of the ablated neurons, we aim to learn to improve the utility of the ablated neurons.

METHODOLOGY
The primary motivation of this work is to inherently improve the utility of neurons in NLU models through a cross-model training objective, rather than post-hoc network pruning or pre-hoc context selection to eliminate inefficient neurons.
In the following, we first describe a comparison principle. Then, we propose a novel comparative loss based on the corollary of the comparison principle and present how to train models with comparative loss by two controlled ablation methods. Finally, we discuss how comparative loss works.

Comparison Principle
For an efficient model, we believe that all its neurons should be able to work together efficiently to maximize the utility of each neuron. This means that each neuron should contribute to the overall model, or at least be harmless, because the cooperation of neurons is supposed to eliminate the negative effects of noise that may be produced by individual neurons. Thus, if we ablate some neurons, even those that produce noise, the ablated submodel should perform no better than the original full model because the contribution of the ablated neurons is missing; in other words, its task-specific loss should be no smaller than that of the original.
Formally, we define a neural model as an efficient model if and only if it performs no worse than any of its ablated models, and we formalize the comparison principle between an efficient model and its ablated models as follows.
Comparison Principle. Suppose $f(x; \theta)$ is an efficient neural model for the input $x$ with respect to the parameter space $\mathbb{R}^{|\theta|}$, let $x' \sqsubset x$ be the ablated input and $\theta' = M \odot \theta$ be the ablated parameters. Then, for any subsequence $x'$ of $x$ whose label is still $y$, the input-ablated model $f(x'; \theta)$ should not perform better than the full model $f(x; \theta)$, and for any parameters $\theta'$ masked by arbitrary $M$, the parameter-ablated model $f(x; \theta')$ should not perform better than $f(x; \theta)$, i.e.,
$$\mathcal{L}(y, f(x; \theta)) \le \mathcal{L}(y, f(x'; \theta)), \quad \forall x' \sqsubset x, \tag{2}$$
$$\mathcal{L}(y, f(x; \theta)) \le \mathcal{L}(y, f(x; \theta')), \quad \forall M \in \{0, 1\}^{|\theta|}. \tag{3}$$
In the above definition, we consider that an efficient neural model should be input-efficient and parameter-efficient.
In particular, the input-efficient property means that the model can efficiently utilize the input neurons (words). If the model $f(x; \theta)$ satisfies Eq. (2), we say that $f(\cdot; \theta)$ is input-efficient for the input $x$. The parameter-efficient property means that the model can utilize the hidden neurons efficiently. If the model $f(x; \theta)$ satisfies Eq. (3), we say $f(x; \theta)$ is parameter-efficient for the input $x$ with respect to the parameter space $\mathbb{R}^{|\theta|}$. According to Eq. (3), we can definitely find at least one vector $\theta$ that is parameter-efficient for the input $x$, e.g., the zero vector or the optimal parameter vector that minimizes the empirical risk. If the parameter space is large enough, from those vectors that are parameter-efficient for $x$, we can find some parameters that simultaneously satisfy Eq. (2), i.e., the input-efficient property. That is, if the parameter space $\mathbb{R}^{|\theta|}$ is large enough, there exists at least one parameter vector $\theta$ that makes the model $f(x; \theta)$ efficient for $x$. In particular, if all activation functions in the neural model $f$ output zero for zero input, then $f(x'; 0) = f(x; 0)$ for all $x' \sqsubset x$, and hence the parameter vector $\theta = 0$ is efficient for any $x$.
Notably, we restrict the ablated input $x'$ for comparison to only those subsequences whose ground-truth output $y(x')$ remains unchanged, i.e., $y(x') = y(x) = y$. This is because ablation may remove some key information from the original input $x$, such as the trigger words in classification, resulting in an unknown change in the label $y$. In this unusual case, $\mathcal{L}(y, f(x'; \theta))$ will no longer be a reasonable proxy of the performance of the input-ablated model, so it makes no sense to compare it to the task-specific loss of the original model. For example, in binary classification ($y \in \{0, 1\}$), suppose that for the original input $x$, $f(x; \theta)$ predicts the correct category $y$ with low confidence, whereas for the ablated input $x'$ whose category label changes to $y(x') = 1 - y$, $f(x'; \theta)$ predicts $1 - y$ with high confidence. Even though $\mathcal{L}(y, f(x; \theta)) \le \mathcal{L}(y, f(x'; \theta))$, the input-ablated model $f(x'; \theta)$ actually outperforms the original $f(x; \theta)$, i.e., $\mathcal{L}(y, f(x; \theta)) \ge \mathcal{L}(1 - y, f(x'; \theta))$, and we cannot consider $f(\cdot; \theta)$ to be input-efficient for $x$. Although we can use $\mathcal{L}(y(x'), f(x'; \theta))$ as a performance proxy for the input-ablated model in Eq. (2), in practice it is difficult to know how the labels of the ablated inputs will change, so we try to avoid such label-changing scenarios. For the sake of concision, from here on, we assume by default that ablating the input context does not change the output label unless otherwise specified, i.e., $y(x') = y$.
Formally, we define an efficient model to be hereditarily efficient if and only if its ablated models are all efficient.
Similarly, if the parameter-ablated models of a parameter-efficient model are all parameter-efficient, we call this parameter-efficient model hereditarily parameter-efficient. A hereditarily input-efficient model is defined in the same way. Specifically, the parameter vector $\theta = 0$ is hereditarily parameter-efficient, and $f(x; 0)$ is also hereditarily efficient for any $x$ if all activation functions of $f$ output zero for zero input.
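Based on the definition of the hereditarily efficient model and the comparison principle, we can draw the following corollary.
Corollary 3.1. Suppose $f(x^{(0)}; \theta^{(0)})$ is a hereditarily efficient neural model for the input $x^{(0)}$ with respect to the parameter space $\mathbb{R}^{|\theta|}$, and let $f(x^{(i)}; \theta^{(i)})$ be the model after $i$ steps of progressive ablation, with $x^{(i)} \sqsubseteq x^{(i-1)}$ and $\theta^{(i)} = M^{(i)} \odot \theta^{(i-1)}$. Then the task-specific losses $\mathcal{L}^{(i)} = \mathcal{L}(y, f(x^{(i)}; \theta^{(i)}))$ of the full model and its $c$ progressively ablated models satisfy
$$\mathcal{L}^{(0)} \le \mathcal{L}^{(1)} \le \cdots \le \mathcal{L}^{(c)}. \tag{4}$$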

Comparative Loss
Based on Corollary 3.1, we can train a hereditarily efficient model with the objective of the ordered comparative relation in Eq. (4). To measure the difference from the desirable order, we can use pairwise hinge loss [29] to evaluate the ranking of the task-specific losses of the full model and its ablated models, like $\sum_{i=0}^{c-1} \sum_{j=i+1}^{c} \max(0, \mathcal{L}^{(i)} - \mathcal{L}^{(j)})$. However, optimizing this ranking loss alone cannot guarantee that these task-specific losses are minimized, i.e., the full/ablated models may not be empirical risk minimized (ERM) [63] with respect to their parameter spaces. To push these models to be ERM, we introduce a special scalar $b$ as the baseline value of the task-specific loss and assume that it is derived from a dummy ablated model $f(x^{(c+1)}; \theta^{(c+1)})$. The dummy model is set to have the highest degree of ablation, and in principle, its task-specific loss $\mathcal{L}^{(c+1)}$ should be the highest. However, to push the task-specific losses of the real models $\{f(x^{(i)}; \theta^{(i)})\}_{i=0}^{c}$ down, we usually set $\mathcal{L}^{(c+1)} = b$ to a small value (e.g., 0) and expect all $\{\mathcal{L}^{(i)}\}_{i=0}^{c}$ to be reduced by this target. In this way, our comparative loss can still be written as a pairwise ranking loss, except that it is on top of the $c + 2$ task-specific losses:
$$\mathcal{L}_{\mathrm{cmp}} = \sum_{i=0}^{c} \sum_{j=i+1}^{c+1} \max(0, \mathcal{L}^{(i)} - \mathcal{L}^{(j)}). \tag{5}$$
Fig. 2. The Venn diagram for some of the concepts in this paper. The empirical risk minimized (ERM) refers to the minimization of Eq. (1), which is a subset of the parameter-efficient (satisfying Eq. (3)). The efficient (intersecting purple region) model in the comparison principle, in addition to being parameter-efficient, also needs to be input-efficient (satisfying Eq. (2)). The hereditarily efficient model requires not only the full model to be efficient, but also any of its ablated models to be efficient, i.e., satisfying Eq. (4) in Corollary 3.1. The training objective of the comparative loss Eq. (5) is both hereditarily efficient and ERM, i.e., the central overlapping grid region.
Fig. 2 visualizes the localization (central grid region) of the ideal model of comparative loss, which is both ERM and hereditarily efficient. The hereditarily efficient is a subset of the efficient, and the efficient is the intersection of the input-efficient and parameter-efficient. In this light, comparative loss sets a stricter training objective than ERM. When we set $c$ and $b$ to 0, Eq. (5) degenerates to Eq. (1). Further, the comparative loss is equivalent to
$$\mathcal{L}_{\mathrm{cmp}} = \sum_{i=0}^{c} \max(0, \mathcal{L}^{(i)} - b) + \sum_{i=0}^{c-1} \sum_{j=i+1}^{c} \max(0, \mathcal{L}^{(i)} - \mathcal{L}^{(j)}),$$
where the first term is to minimize the empirical risk of those not reaching the target $b$, and the second term constrains the comparative relation to pursue the full model being hereditarily efficient.
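The pairwise form of the comparative loss and its split into a risk term and a ranking term can be checked numerically with a minimal sketch. It assumes the task-specific losses are given as a plain Python list ordered by ablation degree (index 0 is the full model); the function names `comparative_loss` and `decomposed_loss` and the example values are hypothetical.

```python
def comparative_loss(losses, baseline=0.0):
    """Pairwise hinge over the real losses plus the dummy baseline loss.

    losses[i] is the task-specific loss after i ablation steps (0 = full model).
    """
    ranked = list(losses) + [baseline]  # the dummy model counts as the most ablated
    total = 0.0
    for i in range(len(ranked) - 1):
        for j in range(i + 1, len(ranked)):
            # Penalize a less ablated model whose loss exceeds a more ablated one.
            total += max(0.0, ranked[i] - ranked[j])
    return total

def decomposed_loss(losses, baseline=0.0):
    """Same quantity, split into a risk term and a ranking term."""
    risk = sum(max(0.0, loss - baseline) for loss in losses)
    ranking = sum(
        max(0.0, losses[i] - losses[j])
        for i in range(len(losses))
        for j in range(i + 1, len(losses))
    )
    return risk + ranking

# Illustrative losses: the once-ablated model (0.7) is worse than the twice-ablated one (0.6).
example = [0.5, 0.7, 0.6]
pairwise_value = comparative_loss(example)
split_value = decomposed_loss(example)
```

With a single full model and a zero baseline, `comparative_loss([loss])` returns the task-specific loss itself, matching the degenerate case described above.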
To train using comparative loss, we first need to obtain several comparable ablated models and task-specific losses.
As shown in Fig. 3, we consider the original model with the input of the entire context as the full model $f(x^{(0)}; \theta^{(0)})$.
According to Corollary 3.1, we progressively perform $c$-step ablation based on the full model. At the $i$-th ablation step, we use CmpCrop or CmpDrop to ablate a small portion of the input or hidden neurons based on the model $f(x^{(i-1)}; \theta^{(i-1)})$ from the previous step, which makes the newly ablated model $f(x^{(i)}; \theta^{(i)})$ comparable to all its ancestor models. After all these models have predicted once, we have $c + 1$ comparable task-specific losses. Together with $\mathcal{L}^{(c+1)} = b$ from the dummy ablated model, we can calculate the final loss using Eq. (5). Using stochastic gradient descent optimization as an example, Algorithm 1 illustrates the training process more formally.
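The progressive ablation described above can be sketched as a forward-only loop that collects the comparable losses. This is not the paper's Algorithm 1: the toy linear model, the squared-error loss, and the helpers `crop_input` and `drop_params` are all hypothetical, and the ablated positions are fixed rather than random so the example is reproducible.

```python
def progressive_losses(x, theta, y, model, loss_fn, ablation_steps):
    """Return the losses of the full model and each progressively ablated model."""
    losses = [loss_fn(y, model(x, theta))]
    for ablate in ablation_steps:
        # Each step starts from the previous ablated model, so every new model
        # is a submodel of all its ancestors and the losses stay comparable.
        x, theta = ablate(x, theta)
        losses.append(loss_fn(y, model(x, theta)))
    return losses

def linear_model(x, theta):
    return sum(xi * wi for xi, wi in zip(x, theta))

def squared_error(y, pred):
    return (y - pred) ** 2

def crop_input(indices):
    """Input ablation: zero out the given input positions."""
    def step(x, theta):
        return [0.0 if i in indices else xi for i, xi in enumerate(x)], theta
    return step

def drop_params(indices):
    """Parameter ablation: zero out the given parameters."""
    def step(x, theta):
        return x, [0.0 if i in indices else wi for i, wi in enumerate(theta)]
    return step

# Two ablation steps: first crop one input position, then drop one parameter.
losses = progressive_losses(
    x=[1.0, 2.0, 3.0, 0.5],
    theta=[0.5, 0.25, 0.1, 0.4],
    y=1.5,
    model=linear_model,
    loss_fn=squared_error,
    ablation_steps=[crop_input({3}), drop_params({2})],
)
```

In an actual training step, the collected losses would be combined by the pairwise ranking loss and backpropagated; the sketch stops at the point where the losses are available.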
CmpDrop and CmpCrop in Algorithm 1 are the two alternative ablation methods we present for each ablation step, the former for ablating the parameters (hidden neurons) and the latter for ablating the input context (input neurons).

They both randomly ablate neurons in a controlled manner on top of the previous model, which allows the coverage of all potential ablated models without retraining each ablated model. This is because the randomly ablated models are jointly trained and adapt to the absence of some neurons during the training process. Which one to use at each ablation step depends on the model and task dataset. Ideally, CmpDrop can be used as long as the model is dropout compatible, and CmpCrop can be used as long as the input context of the task contains dispensable segments. Below we will introduce CmpDrop and CmpCrop in detail.

CmpDrop: Ablate Parameters by Dropout
Dropout randomly disables each neuron with probability $p$, which coincides with our need to randomly ablate hidden neurons. To obtain a model $f(\cdot; \theta^{(i)})$ with more ablated parameters, instead of simply applying a larger dropout rate on the original model $f(\cdot; \theta^{(0)})$, we ablate the surviving neurons from the previous ablated model $f(\cdot; \theta^{(i-1)})$ with probability $p$ consistently. Specifically, the output values of those dropped neurons are set to zeros, and the output values of the surviving neurons are scaled by $1/(1 - p)$ to ensure consistency with the expected output value of a neuron in all full/ablated models [36]. This is equivalent to applying a mask with scaling $M^{(i)} \in \{0, 1, 1/(1 - p)\}^{|\theta|}$ to the previous parameters $\theta^{(i-1)}$ to obtain the ablated parameters $\theta^{(i)} = M^{(i)} \odot \theta^{(i-1)}$. (Slightly different from the binary mask in the comparison principle, we incorporate the scaling factors into the mask so that the parameter ablation can still be expressed concisely in this form.)
In practice, we can leverage the existing dropout to implement it. However, for the comparability of the ablated models, we must use the same random seed and the same state of the random number generator in each CmpDrop. In this way, assuming that the current ablation step is the $k$-th execution of CmpDrop, we can simply run the model with the dropout rate of $1 - (1 - p)^k$.
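The nesting that makes successive CmpDrop masks comparable can be illustrated with a small sketch. It assumes that reusing the random state amounts to reusing one set of uniform draws per neuron, which is an assumption of this example rather than a statement about any particular dropout implementation; the name `cmpdrop_masks` is hypothetical.

```python
import random

def cmpdrop_masks(n_neurons, p, n_steps, seed=0):
    """Scaled dropout masks for steps 1..n_steps, built from one set of draws."""
    rng = random.Random(seed)
    # One uniform draw per neuron, shared by every step (stands in for
    # restoring the same random number generator state before each dropout).
    draws = [rng.random() for _ in range(n_neurons)]
    masks = []
    for k in range(1, n_steps + 1):
        keep_prob = (1.0 - p) ** k  # effective dropout rate is 1 - (1 - p)^k
        # Survivors are rescaled so the expected output of a neuron is unchanged.
        masks.append([1.0 / keep_prob if u < keep_prob else 0.0 for u in draws])
    return masks

masks = cmpdrop_masks(n_neurons=200, p=0.1, n_steps=3, seed=1)
```

Because the keep probability shrinks with each step while the draws stay fixed, every neuron dropped at one step remains dropped at all later steps, so each mask describes a submodel of the previous one.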

CmpCrop
Then, CmpCrop can produce a streamlined context by randomly cropping out several insignificant segments from the non-support context $x \setminus x^\star$, where $x^\star$ denotes the minimum support context. In this way, the trimmed streamlined context is sure to contain the minimum support context, so the ground-truth output does not change.
In practice, to use CmpCrop, we must ensure that enough insignificant segments are set aside in the original context $x^{(0)}$ for cropping. The segments can be of document, paragraph or sentence granularity. For example, in question answering, an insignificant segment can be any retrieved paragraph that does not affect the answer to the question. If the dataset does not annotate the minimal support context, we can manually inject a few extraneous noise segments into $x^{(0)}$.
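A minimal sketch of the cropping step follows, assuming the context is a list of `(segment_id, text)` pairs and the support segments are known by id. The function name `cmpcrop`, the data layout, and the example paragraphs are all hypothetical.

```python
import random

def cmpcrop(segments, support_ids, n_crop, rng):
    """Randomly remove up to n_crop segments that are not in the support context."""
    croppable = [sid for sid, _ in segments if sid not in support_ids]
    removed = set(rng.sample(croppable, min(n_crop, len(croppable))))
    # Keep the original order of the surviving segments.
    return [(sid, text) for sid, text in segments if sid not in removed]

# Illustrative context: paragraphs 0 and 2 are assumed to support the answer.
x0 = [
    (0, "Paris is the capital of France."),
    (1, "The Nile flows through Egypt."),
    (2, "France is a country in Europe."),
    (3, "Mount Fuji is in Japan."),
]
support = {0, 2}
rng = random.Random(0)
x1 = cmpcrop(x0, support, n_crop=1, rng=rng)  # first ablation step
x2 = cmpcrop(x1, support, n_crop=1, rng=rng)  # further crop of the previous context
```

Cropping from the previous step's output rather than from the original context is what keeps each new context a subsequence of its predecessors.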

Discussion
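Further deriving Eq. (5), we find that comparative loss can be viewed as a dynamic weighting of multiple task-specific losses. In particular, the loss can be rewritten as follows,
$$\mathcal{L}_{\mathrm{cmp}} = \sum_{i=0}^{c} \sum_{j=i+1}^{c+1} \mathbb{1}_{\mathcal{L}^{(i)} > \mathcal{L}^{(j)}} \left(\mathcal{L}^{(i)} - \mathcal{L}^{(j)}\right) = \sum_{i=0}^{c+1} \sum_{j \ne i} \mathrm{CMP}(i, j, \mathcal{L}^{(i)}, \mathcal{L}^{(j)}) \, \mathcal{L}^{(i)}, \tag{6}$$
where $\mathbb{1}_{A}$ is an indicator function equal to 1 if condition $A$ is true and 0 otherwise, and the CMP function determines whether model $f(x^{(i)}; \theta^{(i)})$ complies with the comparison principle compared to $f(x^{(j)}; \theta^{(j)})$ and adjusts the weight of $\mathcal{L}^{(i)}$. There are two cases of non-compliance: for the case where $f(x^{(i)}; \theta^{(i)})$ is less ablated ($i < j$) but obtains a larger loss, we increase the weight of $\mathcal{L}^{(i)}$; for the case where $f(x^{(i)}; \theta^{(i)})$ is more ablated ($i > j$) but obtains a smaller loss, we decrease the weight of $\mathcal{L}^{(i)}$. Formally, the CMP function can be written as
$$\mathrm{CMP}(i, j, \mathcal{L}^{(i)}, \mathcal{L}^{(j)}) = \begin{cases} 1, & i < j \wedge \mathcal{L}^{(i)} > \mathcal{L}^{(j)}, \\ -1, & i > j \wedge \mathcal{L}^{(i)} < \mathcal{L}^{(j)}, \\ 0, & \text{otherwise}. \end{cases}$$
Here we can notice that for a pair of models that do not conform to the comparison principle, we increase (+1) the weight of the task-specific loss of the model that is ablated less and equally decrease (-1) the weight of the loss of the model that is ablated more. Thus, let $w^{(i)} = \sum_{j \ne i} \mathrm{CMP}(i, j, \mathcal{L}^{(i)}, \mathcal{L}^{(j)})$ denote the weight of $\mathcal{L}^{(i)}$; then the sum of the weights of all task-specific losses (including the dummy one) is 0, i.e., $\sum_{i=0}^{c} w^{(i)} = -w^{(c+1)} = \sum_{i=0}^{c} \mathbb{1}_{\mathcal{L}^{(i)} > b}$. Since $\mathcal{L}^{(c+1)} = b$ is a constant, Eq. (6) is also equivalent to $\sum_{i=0}^{c} w^{(i)} \mathcal{L}^{(i)}$, i.e., the total weight equals the number of task-specific losses worse than the virtual baseline $b$, and is adaptively assigned to the $c + 1$ losses according to their performance. In this way, poorly performing full/ablated models will be more heavily optimized. We empirically compare other heuristic weighting strategies in §5.1.
For parameter ablation, in addition to being able to weight each task-specific loss differentially, the comparative loss with CmpDrop can also differentially calculate the gradients of the parameters in different parts. According to Eq. (6), comparative loss is equal to the sum of all differences of task-specific loss pairs that violate the comparison principle, i.e., $\sum_{i < j \wedge \mathcal{L}^{(i)} > \mathcal{L}^{(j)}} \left(\mathcal{L}^{(i)} - \mathcal{L}^{(j)}\right)$, so we can analyze the gradient of comparative loss from the difference of each task-specific loss pair. For ease of illustration, we take the original model $f(x; \theta)$ and a model $f(x'; \theta')$ whose parameters have been ablated $k$ times as an example, and other model pairs with different parameters are similar. Assume that the original parameters $\theta = (\theta_n, \theta_s, \theta_a)$ and the ablated parameters $\theta' = (\theta'_n, \theta'_s, \theta'_a) = (\theta_n, \theta_s/(1 - p)^k, 0)$, where $\theta_n$ is the parameters from the layers without dropout, $\theta_a$ is the parameters ablated by $k$ times CmpDrop, and $\theta'_s$ is the scaled parameters surviving from $k$ times dropout. Then, if their task-specific losses violate the comparison principle, i.e., $\mathcal{L} > \mathcal{L}'$, the gradient of the comparative loss contributed by this model pair is
$$\nabla_\theta \left(\mathcal{L} - \mathcal{L}'\right) = \left( \frac{\partial \mathcal{L}}{\partial \theta_n} - \frac{\partial \mathcal{L}'}{\partial \theta'_n},\ \frac{\partial \mathcal{L}}{\partial \theta_s} - \frac{1}{(1 - p)^k} \frac{\partial \mathcal{L}'}{\partial \theta'_s},\ \frac{\partial \mathcal{L}}{\partial \theta_a} \right).$$
We can see that the gradient of the comparative loss with respect to $\theta_a$ is generally larger than that with respect to the other parameters, since it is not offset by any term from the ablated model. This is intuitive because the model instead performs better after ablating away $\theta_a$, indicating that the current $\theta_a$ is inefficient, so we need to focus on updating $\theta_a$.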
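The dynamic-weighting view can be verified on a small example. The sketch below assumes the losses are plain floats ordered by ablation degree with the baseline appended as the dummy loss; the names `cmp_sign` and `loss_weights` and the example values are hypothetical.

```python
def cmp_sign(i, j, loss_i, loss_j):
    """+1 / -1 when the pair (i, j) violates the comparison principle, else 0."""
    if i < j and loss_i > loss_j:
        return 1   # less ablated but larger loss: raise the weight of loss_i
    if i > j and loss_i < loss_j:
        return -1  # more ablated but smaller loss: lower the weight of loss_i
    return 0

def loss_weights(losses, baseline=0.0):
    """Weight of every loss, with the dummy baseline loss as the last entry."""
    ranked = list(losses) + [baseline]
    n = len(ranked)
    return [
        sum(cmp_sign(i, j, ranked[i], ranked[j]) for j in range(n) if j != i)
        for i in range(n)
    ]

example = [0.5, 0.7, 0.6]          # illustrative losses, full model first
weights = loss_weights(example)     # last weight belongs to the dummy model
weighted_sum = sum(w * l for w, l in zip(weights, example + [0.0]))
# Direct pairwise hinge over the same losses, for comparison.
ranked = example + [0.0]
hinge = sum(
    max(0.0, ranked[i] - ranked[j])
    for i in range(len(ranked))
    for j in range(i + 1, len(ranked))
)
```

In this example the once-ablated model, which is worse than both the baseline and its further-ablated successor, receives the largest weight, while the twice-ablated model's two contributions cancel.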
In addition to the dynamic weighting perspective, comparative loss can also be considered as an "inverse ablation study" during training.This is because, in contrast to ablation studies that determine the contribution of removed components during validation, comparative loss believes that the ablated neurons should contribute and optimizes parameters with this objective.
For training complexity, given a generally small number of comparisons $c$ (i.e., number of ablation steps), the overhead of computing the final comparative loss is negligibly small, and the increased computation overhead per update step comes mainly from the multiple forward and backward propagations of the models. Specifically, the overhead of a training step using comparative loss is $1 + c$ times that of conventional training for the same batch size. For inference complexity, however, models trained using comparative loss are the same as conventionally trained models at test time.

EXPERIMENTS
To evaluate the effectiveness and generalizability of our approach for natural language understanding, we conduct experiments on 3 tasks with representative output types, including classification (8 datasets), extraction (2 datasets), and ranking (4 datasets). Among them, the classification task requires predicting a single category for a piece of text or a text pair, the extraction task requires predicting a pair of boundary positions to extract the span between the start and end boundaries, and the ranking task requires predicting a list of relevance levels to rank candidates. Specifically, the three distinct tasks are text classification (see §4.1), reading comprehension (extraction, see §4.2), and pseudo-relevance feedback (ranking, see §4.3), respectively. We evaluate the comparative loss with just CmpDrop in text classification and reading comprehension, the comparative loss with just CmpCrop in reading comprehension and pseudo-relevance feedback, and the comparative loss with both CmpDrop and CmpCrop in reading comprehension. For each task, we first introduce the datasets used, then present the implementation of our models as well as the baselines, and finally show the experimental results.
Before we start each experiment, we explain some common experimental settings. For the baseline value $b$ of the task-specific loss in Algorithm 1, we provide two setting options. One is to simply set $b = 0$, which is equivalent to setting an unreachable target value for all full/ablated models and thus pushing their task-specific losses to decrease. However, this results in the exposure of all training data to the full model and may aggravate overfitting. Therefore, to reduce the number of times the full model is optimized, our second option is to set the baseline value to the task-specific loss of the full model, i.e., $b = \mathcal{L}^{(0)}$. In this way, the full model is optimized only when it performs worse than its ablated model. In practice, we prefer setting $b = 0$, and change to setting $b = \mathcal{L}^{(0)}$ if we find that the model is prone to overfitting on the dataset. For the dropout rate $p$ in each CmpDrop, we use the same setting as the baseline models, which is 0.1 in all our experiments. For other conventional training hyperparameters, such as batch size and learning rate, we also keep them the same as the carefully tuned baseline models unless otherwise specified. We implement our models and baseline models in PyTorch with HuggingFace Transformers [71]. All models are trained on Tesla V100 GPUs. In the text classification and reading comprehension tasks, we trained each model with 5 random seeds. In the pseudo-relevance feedback task, we trained all models with a fixed random seed 42. In the tables, the results presented as mean ± standard deviation are tallied on the evaluation results of the five random seeds; otherwise, they are the performance of the model trained with the random seed 42. For convenience, in the tables, we use 'Cmp' to represent the comparative loss and use 'Drop' and 'Crop' in parentheses to refer to CmpDrop and CmpCrop, respectively.
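To make the two baseline options discussed above concrete, the following worked example evaluates the comparative loss for two ablation steps with illustrative losses 0.5 (full model), 0.7 and 0.6. The values are hypothetical, and the second case assumes the baseline is treated as a constant copy of the full model's loss, so no gradient flows through it.

```latex
% Illustrative values (not from the experiments): c = 2,
% (L^{(0)}, L^{(1)}, L^{(2)}) = (0.5, 0.7, 0.6).
% The only violated pair among the real models is (1, 2), since 0.7 > 0.6.
\begin{align*}
b = 0:\quad
  \mathcal{L}_{\mathrm{cmp}}
  &= \underbrace{(0.5 + 0.7 + 0.6)}_{\text{losses above the baseline}}
   + \underbrace{(0.7 - 0.6)}_{\text{violated pair}} = 1.9, \\
b = \mathcal{L}^{(0)} = 0.5:\quad
  \mathcal{L}_{\mathrm{cmp}}
  &= \underbrace{(0 + 0.2 + 0.1)}_{\text{losses above the baseline}}
   + \underbrace{(0.7 - 0.6)}_{\text{violated pair}} = 0.4.
\end{align*}
```

In the second case the full model's loss appears in no active hinge term, so under the stated assumption the full model receives no update on this sample, whereas with a zero baseline it is always optimized.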

Classification: Application to Text Classification
Text classification is a fundamental task in natural language understanding, which aims to assign a predefined category to a piece or a group of text. In many text classification datasets, all segments of the input context seem to play an important role in determining the text category and there is almost no annotation of the minimal support context, so it is difficult for us to construct an input-ablated model by directly cropping the original input without changing the classification label. That is, cropping is likely to violate the constraint in the comparison principle that the label of the ablated input is unchanged, and thus we cannot apply CmpCrop to this task. However, many current neural classification models use dropout during training, so in this task, we only validate the comparative loss that uses just CmpDrop.

Datasets. The General Language Understanding Evaluation (GLUE) benchmark [66] is a collection of diverse natural language understanding tasks. Following [21], we exclude the problematic WNLI set and conduct experiments on 8 datasets: (1) Multi-Genre Natural Language Inference (MNLI) [70] is a sentence pair classification task that aims to predict whether the second sentence is an entailment, contradiction, or neutral to the first one. (2) Microsoft Research Paraphrase Corpus (MRPC) [22] aims to predict if two sentences in the pair are semantically equivalent. (3) Question Natural Language Inference (QNLI) [66] is a binary sentence pair classification task that aims to predict whether a sentence contains the correct answer to a question. (4) Quora Question Pairs (QQP) [13] is a binary sentence pair classification task that aims to predict whether two questions asked on Quora are semantically equivalent. (5) Recognizing Textual Entailment (RTE) [4] is a binary entailment task similar to MNLI, but with far fewer training samples. (6) Stanford Sentiment Treebank (SST-2) [61] is a binary sentence sentiment classification task consisting of sentences extracted from movie reviews. (7) The Semantic Textual Similarity Benchmark (STS-B) [9] is a sentence pair regression task that aims to predict how semantically similar two sentences are. (8) The Corpus of Linguistic Acceptability (CoLA) [69] is a binary sentence classification task aimed at judging whether a single English sentence is linguistically acceptable.

Models & Training. We take BERT base [21], RoBERTa base [44] and ALBERT base [37] as our backbones to perform finetuning. The task-specific loss is mean squared error (MSE) for STS-B and cross-entropy for other datasets. We use different training hyperparameters for each dataset.
For baseline models and our models trained with comparative loss, we independently select the learning rate within {1e-5, 2e-5, 3e-5, 4e-5}, the warmup rate within {0, 0.1}, the batch size within {16, 24, 32}, and the number of epochs from 2 to 5. For our models, we tune the number of ablation steps (i.e., the number of CmpDrop operations) from 1 to 4. Following the hyperparameter setup in R-Drop [40], we also implement R-Drop for all backbone models as a competitor, which, like CmpDrop, performs dropout multiple times.

Results.
We present classification performance in Table 1, where the evaluation metrics are Pearson correlation for STS-B, Matthews correlation for CoLA, and accuracy for the others. For models based on BERT base, we can see that our model (+ Cmp) comprehensively outperforms the well-tuned baseline BERT base and achieves an average improvement of 1.04 points, which demonstrates the effectiveness of comparative loss in classification tasks. Moreover, our model trained with comparative loss also outperforms the model trained with the state-of-the-art R-Drop by 0.58 points on average, which demonstrates the superiority of comparative loss. For models based on the more advanced RoBERTa base and ALBERT base, we find consistent improvement. In addition, since ALBERT reuses parameters across multiple layers, it has the least room for improvement in parameter utilization, which is consistent with our observation that comparative loss brings the smallest boost to ALBERT.

Extraction: Application to Reading Comprehension
Extractive reading comprehension (RC) [43,59] is an essential technical branch of question answering (QA) [11,31,41,60,79]. Given a question and a context, extractive RC aims to extract a span from the context as the predicted answer. Current dominant RC models basically use pretrained Transformer [64] architectures, which employ dropout in many layers during finetuning. This allows us to use CmpDrop to improve the utility of the model parameters. Additionally, the given context is usually lengthy and contains many distracting noise segments, which also allows us to use CmpCrop to improve the model's utilization of the context by randomly deleting the labeled distracting paragraphs. Therefore, we intend to verify the effectiveness of comparative loss using CmpDrop and/or CmpCrop in this task.

Datasets.
We evaluate the comparative loss using only CmpDrop on SQuAD [59], which contains 100K single-hop questions with 9832 for validation, and HotpotQA [74], which contains 113K multi-hop questions with 7405 for validation. For HotpotQA, we consider the distractor setting, where the context of each question contains 10 paragraphs, but only 2 of them are useful for answering the question, and the remaining 8 are retrieved distracting paragraphs that are relevant but do not support the answer. This allows us to evaluate the comparative loss with CmpCrop on HotpotQA distractor.
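To make the cropping on HotpotQA distractor concrete, the following minimal sketch randomly deletes labeled distracting paragraphs while keeping every supporting paragraph, so the answer label is unchanged. The function name `cmp_crop`, its arguments, and the way the number of dropped paragraphs is chosen are illustrative assumptions, not the released implementation.

```python
import random


def cmp_crop(paragraphs, is_supporting, num_to_drop, rng=random):
    """Randomly delete distracting paragraphs (illustrative sketch).

    paragraphs:    list of paragraph strings, in their original order
    is_supporting: parallel list of booleans from the dataset annotation
    num_to_drop:   how many distracting paragraphs to remove (assumed knob)
    """
    distractor_ids = [i for i, s in enumerate(is_supporting) if not s]
    n = min(num_to_drop, len(distractor_ids))
    dropped = set(rng.sample(distractor_ids, n))
    # Supporting paragraphs are never dropped and the order is preserved.
    return [p for i, p in enumerate(paragraphs) if i not in dropped]


# Hypothetical context: 10 paragraphs, of which 2 support the answer.
paragraphs = [f"p{i}" for i in range(10)]
is_supporting = [i in (2, 7) for i in range(10)]
cropped = cmp_crop(paragraphs, is_supporting, num_to_drop=3)
print(len(cropped))  # 7 paragraphs remain, including p2 and p7
```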

Models & Training.
We follow simple but effective RC models based on PLMs [3,15,21,37,44], which take as input a concatenation of the question and the context and use a linear layer to predict the start and end positions of the answer. We use the cross-entropy of answer boundaries as the task-specific loss function following [21] and use a learning rate warmup over the first 10% of steps. For SQuAD, we use the popular BERT [21], RoBERTa [44], ELECTRA [15] and ALBERT [37] with a maximum sequence length of 512 as the backbones, all of which have successively achieved top rankings in multiple QA benchmarks [58,59,74]. We first tune the learning rate in {1e-5, 3e-5, 5e-5, 8e-5, 1e-4, 2e-4}, the batch size in {8, 12, 32} and the number of epochs in {1, 2, 3} for the baseline models. Then, with two ablation steps, we keep these hyperparameters and train our models using the comparative loss with two CmpDrop operations. For HotpotQA, we use the state-of-the-art Longformer [3] with a maximum sequence length of 2048 as the backbone, which is fed with an input sequence whose special tokens (e.g., [P] for paragraphs) mark yes/no answers and the beginning of questions, titles, and paragraphs. Similarly, we select the learning rate in {1e-5, 3e-5}, the batch size in {6, 9, 12} and the number of epochs in {3, 5, 8} for the baseline model. We then train our models with three comparative losses respectively: the first two apply one CmpDrop or one CmpCrop (a single ablation step), while the third applies one CmpCrop followed by one CmpDrop (two ablation steps). Besides, inheriting common hyperparameters and searching for the coefficient weight in {0.1, 0.5, 1, 1.5}, we also implement R-Drop [40] as a competitor to CmpDrop.
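As a minimal sketch of the cross-entropy over answer boundaries, the code below computes the loss from per-token start and end logits. Averaging the two boundary losses is an assumption borrowed from common BERT-style implementations; the function names and the toy logits are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def span_loss(start_logits, end_logits, start, end):
    """Task-specific loss for extraction: mean of start and end cross-entropy."""
    return 0.5 * (cross_entropy(start_logits, start)
                  + cross_entropy(end_logits, end))


# Toy example with a 4-token context and gold span [1, 2].
start_logits = [0.1, 2.0, 0.3, -1.0]
end_logits = [-0.5, 0.2, 1.5, 0.0]
print(span_loss(start_logits, end_logits, start=1, end=2))
```

The same scalar would be computed once for the full model and once for each ablated model before the comparative loss ranks them.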

4.2.3 Results.
Since we focus on extraction here, we only measure the extracted answers using EM (exact match) and F1, which is a little different from the official HotpotQA setting that simultaneously evaluates the identification of support facts. From Table 2 we can see that our implemented baseline models trained directly using the task-specific loss Eq. (1) largely achieve better results than those reported in their original papers. Once trained using comparative loss Eq. (5) instead, our models can still significantly outperform these well-tuned baseline models even without re-searching the training hyperparameters, demonstrating the effectiveness of comparative loss on the extraction task. Also, the consistent improvement based on the three different PLMs demonstrates the model-agnostic nature of comparative loss.
Furthermore, from the results on HotpotQA we can find that although both CmpDrop and CmpCrop deliver significant improvement, CmpCrop + CmpDrop achieves the best results, suggesting that CmpDrop and CmpCrop may bring different benefits to the trained models.
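For reference, the EM and F1 metrics used above can be sketched as follows. This is a simplified version that only lowercases and splits on whitespace; the official evaluation scripts additionally strip punctuation and articles, so the numbers it produces are not directly comparable to Table 2.

```python
from collections import Counter


def exact_match(prediction, gold):
    """1.0 if the token sequences match exactly after lowercasing, else 0.0."""
    return float(prediction.lower().split() == gold.lower().split())


def f1_score(prediction, gold):
    """Token-level F1 between a predicted and a gold answer span."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the Eiffel Tower", "Eiffel Tower"))
print(f1_score("the Eiffel Tower", "Eiffel Tower"))
```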

Ranking: Application to Pseudo-Relevance Feedback
Pseudo-relevance feedback (PRF) [1] is an effective query understanding [10] technique to improve ranking accuracy, which aims to alleviate the mismatch of linguistic expressions between a query and its potential relevant documents.
Given an original query $q$ and a document collection, a base ranking model returns a ranked list of documents. Let $D_{\le k}$ denote the feedback set containing the top $k$ documents, where $k$ is usually referred to as the PRF depth. The goal of PRF is to reformulate the original query $q$ into a new representation $q^{(k)}$ using the query-relevant information in $D_{\le k}$, i.e., $q^{(k)} = f((q, D_{\le k}); \theta)$, where $q^{(k)}$ is expected to yield better ranking results. Although PRF methods do usually improve ranking performance on average [16], individual reformulated queries inevitably suffer from query drift [50,81] due to the objectively present noise in the feedback set, causing them to be inferior to the original ones.
Therefore, we can use comparative loss with CmpCrop to train PRF models to suppress the extra increased noise by comparing the effect of queries reformulated using feedback sets with different PRF depths.

Datasets.
We conduct experiments on the MS MARCO passage [51] collection, which consists of 8.8M English passages collected from the search results of Bing's 1M real-world queries. The Train set of MS MARCO contains 530K queries (about 1.1 relevant passages per query on average), the Dev set contains 6980 queries, and the online Eval set contains 6837 queries. Apart from these, we also consider TREC DL 2019 [20], TREC DL 2020 [19], and DL-HARD [47], three offline evaluation benchmarks based on the MS MARCO passage collection, which contain 43, 54, and 50 fine-grained (relevance grades from 0 to 3) labeled queries, respectively. Among them, DL-HARD [47] is a recent evaluation benchmark focusing on complex queries. We use the MS MARCO Train set to train models, and evaluate trained models on the MS MARCO Dev set to tune hyperparameters and select model checkpoints. The selected models are finally evaluated on the online MS MARCO Eval set and the three other offline benchmarks.

Models & Training.
We carry out PRF experiments on two base retrieval models, ANCE [72] (dense retrieval) and uniCOIL [42] (sparse retrieval). For their PRF models, we do not explicitly modify the query text, but directly generate a new query vector for retrieval following the current state-of-the-art method ANCE-PRF [75]. This allows us to directly optimize the retrieval of reformulated queries end-to-end with the negative log likelihood of the positive document [34] as the task-specific loss: $-\log \frac{e^{\mathrm{sim}(q^{(k)}, d^+)}}{e^{\mathrm{sim}(q^{(k)}, d^+)} + \sum_{d^- \in D^-} e^{\mathrm{sim}(q^{(k)}, d^-)}}$, where $d^+$ is the vector of a sampled document relevant to the original query $q$ and its reformulation $q^{(k)}$, $\mathrm{sim}(\cdot, \cdot)$ is the dot product of two vectors, and $D^-$ is the collection of negative documents for them. Since only vectors of queries are updated, we mine a lite collection (5.3M for dense retrieval and 3.7M for sparse retrieval) containing positive and hard negative documents of all training queries. In this way, for each query, all documents in the lite collection except its positive documents can be used as its $D^-$. In general, our PRF model consists of an encoder, a vector projector, and a pooler. First, the original query and the feedback documents are concatenated in order with [SEP] as separator and input to the encoder to get the contextual embedding of each token. Then, the projector maps the contextual embeddings to vectors with the same dimension as the document vectors. Finally, all token vectors are pooled into a single query vector. For dense retrieval, the encoder is initialized from ANCE FirstP, the projector is a linear layer, and the pooler applies a layer normalization on the first vector ([CLS]) in the sequence, as in the previous work [75]. For sparse retrieval, the encoder and projector are initialized from BERT base with the masked language model head, where the projector is an MLP with GeLU [28] activation and layer normalization, and the pooler is composed of a max pooling operation and an L2 normalization.
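The negative log likelihood of the positive document can be sketched in a few lines. The code assumes the standard softmax form over dot-product scores between the reformulated query vector and one positive plus several negative document vectors; the function names and toy vectors are hypothetical.

```python
import math


def dot(u, v):
    """Dot-product similarity of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def nll_positive(query_vec, pos_vec, neg_vecs):
    """-log softmax probability of the positive document among all candidates."""
    scores = [dot(query_vec, pos_vec)] + [dot(query_vec, n) for n in neg_vecs]
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[0]


# Toy 2-d example: the positive scores 2.0, the single negative scores 0.0.
loss = nll_positive([1.0, 0.0], [2.0, 0.0], [[0.0, 1.0]])
print(loss)
```

With CmpCrop, this scalar would be computed once per PRF depth, so that the query reformulated with more feedback documents can be compared against the one reformulated with fewer.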
We finetune PRF baseline models for up to 12 epochs with a batch size of 96, a learning rate selected from {2e-5, 1e-5, 5e-6}, and a PRF depth randomly sampled from 0 to 5 for each query. We then finetune our PRF models using the comparative loss with one CmpCrop for up to 6 epochs with a batch size of 48. In this way, the maximum number of training steps for our models remains the same as for the baseline models, i.e., up to 12 optimizations per original query.
Because training with multiple random seeds is costly, we use a paired t-test to assess the significance of differences in retrieval performance.

Results.
We report the official metrics (MRR@10 for MARCO and NDCG@10 for others) and Recall@1K of the models on multiple benchmarks in Table 3. In addition to reporting results for the best-performing PRF depths (numbers in superscript brackets), for a fair comparison with ANCE-PRF (3) (second row), we also present the results of ANCE-PRF + Cmp (3), both of which use the first 3 documents as feedback. We can see that PRF baseline models (+ PRF) indeed generally outperform their base retrieval models, except that uniCOIL-PRF degrades by 0.67 percentage points in NDCG@10 of TREC DL 2019, which reflects the presence of query drift. Our PRF models (+ Cmp) trained with comparative loss, however, outperform their base retrieval models across the board. With the same 3 feedback documents, our ANCE-PRF + Cmp also outperforms the published state-of-the-art ANCE-PRF [75] on all metrics except NDCG@10 on DL-HARD. Moreover, when 5 feedback documents are used, ANCE-PRF + Cmp overtakes ANCE-PRF on NDCG@10 of DL-HARD. For sparse retrieval, our PRF model (+ Cmp) trained with comparative loss also surpasses the strong baseline uniCOIL-PRF implemented following ANCE-PRF. All of the results above demonstrate the effectiveness of comparative loss on the ranking task.
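For readers unfamiliar with the retrieval metrics, the sketch below computes per-query MRR@10 and Recall@K from a ranked list of document ids; the reported numbers are averages of such values over all queries. The function names and the example ids are hypothetical, and graded metrics such as NDCG@10 are omitted for brevity.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=1000):
    """Fraction of the relevant documents that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


ranking = ["d3", "d1", "d2"]
print(mrr_at_k(ranking, {"d1"}))               # first relevant hit at rank 2
print(recall_at_k(ranking, {"d1", "d9"}, k=2))  # one of two relevant found
```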

ANALYSIS
In this section, we further conduct several experiments for a more thorough analysis. First, from the dynamic weighting perspective found in §3.3, we examine whether the adaptive weighting of comparative loss is more effective than other weighting strategies (§5.1). Next, we try several other comparison strategies to find some guiding experience in choosing the number of ablations and ablation methods in practice (§5.2). Then, to confirm the enhancement of comparative loss on the utility of hidden and input neurons, we investigate the performance of models with different numbers of parameters (§5.3) and context lengths (§5.4). Furthermore, we visualize the loss curves to find the impact of the comparative losses with different ablation methods on the task-specific loss (§5.5). Finally, we show the actual training overhead of comparative loss in detail (§5.6).

Effect of Weighting Strategy
To verify the role of comparative loss from the dynamic weighting perspective, we keep all the training settings of Longformer + CmpCrop + CmpDrop from the last row of Table 2 unchanged and replace only the weighting strategy of task-specific losses with some heuristics. Table 4 shows their performance on the HotpotQA development set. AVERAGE, FIRST and LAST are three static weighting strategies. AVERAGE assigns equal weights to all task-specific losses, while FIRST and LAST assign weight to only the first and last task-specific loss, respectively, i.e., FIRST optimizes the task-specific loss of the full model without dropout and LAST optimizes the task-specific loss of the last ablated model with the regular dropout rate (equivalent to the baseline Longformer in Table 2). MAX is another dynamic weighting strategy that assigns weight to only the largest task-specific loss. We can see that dynamic weighting in comparative loss is significantly better than these heuristic weighting strategies, which indicates that comparative loss can assign weights more appropriately. In addition, AVERAGE is better than the latter three strategies that consider only one task-specific loss, indicating that it is beneficial to consider multiple task-specific losses. Moreover, although the latter three all assign weight to only one task-specific loss, MAX is better than the other two, which indicates that dynamic assignment is better than static assignment.
Notably, FIRST, which directly optimizes the full model, outperforms LAST, which is trained with dropout, suggesting that the inconsistency of dropout between the training and inference stages [82] may indeed lead to underfitting of the full model. The fact that Cmp far outperforms FIRST and LAST indicates that comparative loss can automatically strike a balance between ensuring training-inference consistency and preventing overfitting.
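The four heuristic strategies compared above can be written down compactly. In this sketch the weights are applied to the list of task-specific losses ordered from the full model to the most ablated model; normalizing AVERAGE to sum to one is an assumption, and the function name `heuristic_weights` is hypothetical.

```python
def heuristic_weights(losses, strategy):
    """Return one weight per task-specific loss for a heuristic strategy."""
    n = len(losses)
    if strategy == "AVERAGE":   # static: equal weight for every loss
        return [1.0 / n] * n
    if strategy == "FIRST":     # static: only the full model's loss
        return [1.0] + [0.0] * (n - 1)
    if strategy == "LAST":      # static: only the most ablated model's loss
        return [0.0] * (n - 1) + [1.0]
    if strategy == "MAX":       # dynamic: only the currently largest loss
        top = max(range(n), key=lambda i: losses[i])
        return [1.0 if i == top else 0.0 for i in range(n)]
    raise ValueError(f"unknown strategy: {strategy}")


losses = [0.5, 0.9, 0.7]  # hypothetical losses: full, ablated once, ablated twice
for name in ("AVERAGE", "FIRST", "LAST", "MAX"):
    weights = heuristic_weights(losses, name)
    total = sum(w * l for w, l in zip(weights, losses))
    print(name, weights, total)
```

Only MAX depends on the current loss values, which is why it is described as dynamic; comparative loss goes further by letting several losses receive non-zero weight at once.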

Effect of Comparison Strategy
To study the impact of comparison strategies, i.e., how many ablation steps we should use for comparison and which ablation method we should choose at each step, we try a variety of comparison strategies on HotpotQA with different numbers of comparisons and ablation orders. As shown in Table 5, the results are not significantly further improved when we repeat CmpDrop/CmpCrop twice, but the results are further improved when we apply CmpCrop first and then CmpDrop. This indicates that comparing multiple models ablated by the same method, i.e., encouraging the model to be either hereditarily input-efficient or hereditarily parameter-efficient, seems to have little effect on the performance of the full model, but the successive use of two different ablation methods, i.e., encouraging the model to be efficient (both input-efficient and parameter-efficient), is helpful. However, applying CmpDrop followed by CmpCrop does not perform as well as applying CmpDrop only, suggesting that the order of the ablation methods is important and perhaps the ablation should be done in the order of the information flow in the model.
To further confirm the influence of the number of ablation steps, we show in Fig. 4 the relationship between the model's Average metric over the eight GLUE datasets and the number of ablations. We find little difference in the average performance of the models trained with different numbers of CmpDrop; the model trained with one CmpDrop performs best on average mainly because its large advantage on two of the datasets pulls up the average. Therefore, if there is no extreme demand for performance, we usually do not need to tune the number of ablation steps.

Effect of Model Parameters
To investigate the impact of model parameters, we explore the application of the comparative loss with CmpDrop on different-sized versions of BERT, RoBERTa and ELECTRA. From Table 6 we can see that the comparative loss with CmpDrop achieves a consistent improvement over the baselines based on these backbone models, which indicates that the comparative loss can improve model performance by increasing parameter utility without increasing the number of parameters. Moreover, except for the one outlier of BERT Medium, we can roughly find that the fewer parameters the model has, the greater the relative gain from comparative loss. This is reasonable because the individual hidden neurons in a model with lower capacity play a larger role, so the improvement in the utility of hidden neurons is more strongly reflected in the final performance. In contrast, a model of higher capacity fits limited training data more easily, i.e., its task-specific loss is already low, so comparative loss has less room to reduce the task-specific loss further. In addition, we observe that the boost to BERT from the comparative loss with CmpDrop is generally higher compared to RoBERTa and ELECTRA with more complicated pretraining, suggesting that the comparative loss helps the model escape from local optima due to parameter initialization.

Effect of Input Context
To review the utility of the input context (i.e., input neurons) to models, we plot in Fig. 5 the performance trends of the models using different context sizes. First, in both datasets, our models trained with comparative loss consistently outperform the baseline models for all context sizes, indicating that our models are able to utilize input neurons more efficiently with equal amounts of input context. Second, this shows that our comparative loss can further improve the model performance after streamlining the input with context selection. In addition, we notice that our ANCE-PRF + CmpCrop in Fig. 5(a) improves retrieval performance as expected as the number of feedback documents increases, while ANCE-PRF reaches peak performance at 4 feedback documents and then suffers performance degradation, implying that our model is more robust and able to mine and exploit relevant information in the added feedback documents. In contrast to PRF, for HotpotQA in Fig. 5(b), the performance of all RC models decreases as the number of paragraphs increases. This is understandable, since only 2 paragraphs in HotpotQA are supporting facts, and the remaining 8 mostly serve as a distraction, so the ideal performance curve can just be a horizontal line that does not drop when the paragraph number increases. Interestingly, we find that the degradation of Longformer + CmpDrop (2.7%) and Longformer + CmpCrop + CmpDrop (3.0%) from the oracle setting (2 gold paragraphs) to the distractor setting (10 paragraphs) is lower than that of the baseline Longformer (3.4%). This suggests that comparative loss can help the models suppress the noisy information in the added context. Although Longformer + CmpCrop (3.7%) has a larger degradation than Longformer, we believe this is because Longformer + CmpCrop needs to be optimized for various numbers of paragraphs, unlike other models without CmpCrop that focus on learning for one input form (i.e., always ten paragraphs). However, this variety of input forms makes Longformer + CmpCrop perform better than Longformer + CmpDrop when the number of paragraphs is small (≤ 5).
To further quantitatively demonstrate the help of comparative loss in the robustness of the PRF model to context size, we report in Table 7 the robustness indexes [18] of ANCE-PRF + CmpCrop and ANCE-PRF at different numbers of feedback documents. The robustness index is defined as $\frac{N_+ - N_-}{|Q|}$, where $|Q|$ is the total number of evaluated queries and $N_+$ and $N_-$ are the numbers of queries that the PRF model improves or degrades when one more feedback document is used. The value of the robustness index is in [-1, 1], and the model with a higher robustness index is more robust. We can see that the PRF model trained using comparative loss with CmpCrop is significantly more robust than the baseline model. Besides, from the gaps in their robustness indexes (only 0.03 or 0.02 for 1 or 2 documents, but 0.05 for more documents), we can find that the comparative loss is more helpful for long-form inputs.
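The robustness index is simple to compute from per-query scores. The sketch below takes the scores of the same queries before and after adding one more feedback document; the function name and the toy scores are hypothetical.

```python
def robustness_index(scores_before, scores_after):
    """(improved - degraded) / total, computed over paired per-query scores."""
    if len(scores_before) != len(scores_after) or not scores_before:
        raise ValueError("need two non-empty score lists of equal length")
    improved = sum(a > b for b, a in zip(scores_before, scores_after))
    degraded = sum(a < b for b, a in zip(scores_before, scores_after))
    return (improved - degraded) / len(scores_before)


# Four hypothetical queries: two improve, one degrades, one is unchanged.
before = [0.5, 0.5, 0.5, 0.5]
after = [0.6, 0.4, 0.7, 0.5]
print(robustness_index(before, after))  # (2 - 1) / 4
```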

Loss Visualization
To figure out the impact of comparative loss on task-specific loss, we plot the curves of task-specific loss for the full model in Fig. 6. From Fig. 6(a) and Fig. 6(b) we can see that with the same batch size, the comparative loss can [...] comparative loss with CmpCrop. We can see that while the training loss of the model in Fig. 6(c) does not drop as low as the baseline, its evaluation loss in Fig. 6(d) drops to a lower level, significantly mitigating overfitting.

Training Efficiency
We present in Table 8 the performance gain and relative change in training FLOPs of BERT base + Cmp compared to BERT base, as well as the specific number of comparisons (i.e., number of ablation steps) chosen for each dataset. We find that the actual overhead of training with comparative loss is usually less than (1 + number of ablation steps) times that of conventional training, and sometimes even less than that of conventional training (e.g., on QQP). This is because models trained with comparative loss tend to converge earlier than baselines. Combined with the insensitivity of comparative loss to the number of comparisons found from Fig. 4, we believe that setting the number of ablation steps to 1 or 2 can lead to effective and fast training when data is sufficient.
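A back-of-the-envelope calculation shows how earlier convergence can offset the extra passes. The sketch assumes that each ablation step adds one forward-backward pass of roughly the same cost as the full model, so the per-step cost grows by a factor of one plus the number of ablation steps; the function name and the step ratios are hypothetical and are not taken from Table 8.

```python
def relative_training_cost(num_ablation_steps, step_ratio):
    """Training cost relative to conventional training (illustrative).

    num_ablation_steps: ablated models compared per training step
    step_ratio: optimization steps until convergence, divided by the number
                of steps the conventionally trained baseline needs
    """
    per_step_factor = 1 + num_ablation_steps  # full model plus ablated models
    return per_step_factor * step_ratio


# Same number of steps as the baseline: the cost doubles with one ablation.
print(relative_training_cost(1, 1.0))
# Converging in 40% of the steps makes training cheaper than the baseline.
print(relative_training_cost(1, 0.4))
```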

RELATED WORK
In this section, we introduce and discuss some work that has different motivations but is technically relevant to us, starting with contrastive learning [38] that learns by comparing, followed by recent training methods that also use dropout multiple times.

Contrastive Learning
Contrastive learning has recently achieved significant success in representation learning in computer vision and natural language processing. At its core, contrastive learning aims to learn effective representations by pulling semantically similar neighbors together and pushing apart non-neighbors [26]. Instead of learning a signal from individual data samples one at a time, it learns by comparing different samples [38]. The comparison is performed between positive pairs of similar samples and negative pairs of dissimilar samples. The positive pair must ensure that the two samples are similar, which can be constructed either by using supervised similarity annotation or by self-supervision. In self-supervised contrastive learning, a positive pair can consist of an original sample and its data augmentation. For example, SimCLR [12] in computer vision uses a crop, flip, distortion or rotation of an original image as its similar view, and SimCSE [25] in natural language processing applies two dropout masks to an input sentence to create two slightly different sentence embeddings that are then used as a positive pair. To share more computation and save cost, negative pairs usually consist of two dissimilar samples within the same training batch. Although both learn through comparison, contrastive learning aims at pursuing alignment and uniformity [68] of representations, while our comparative loss aims at pursuing orderliness of the task-specific losses of the full model and its ablated models. Moreover, as the lexical meaning suggests, contrastive learning only classifies the relationship (i.e., similar or dissimilar) between two data samples in a binary manner, whereas our comparative loss compares multiple full/ablated models by ranking. However, these two are not in conflict, and our comparative loss can be applied on top of contrastive losses that serve as task-specific losses.

Dropout-based Comparison
Dropout is a family of stochastic techniques used in neural network training or inference that have attracted extensive research interest and are widely used in practice. The standard dropout [30] aims to avoid overfitting of the network by reducing the co-adaptation of neurons, where the outputs of individual neurons only provide useful information in combination with other neuron outputs. After this, a line of research focused on improving the standard dropout by employing other strategies for dropping neurons, such as dropconnect [65] and variational dropout [35].
A line of research that is relevant to us is the use of dropout multiple times in training. SimCSE [25] forwards the model twice with different dropout masks of the same rate and uses a contrastive loss to constrain the distribution of model outputs in the representation space. A possible side effect of dropout revealed by the existing literature [46,82] is the non-negligible inconsistency between the training and inference stages of the model, i.e., the submodels are optimized during training, but the full model without dropout is used during inference. To address this inconsistency, R-Drop [40] runs the model forward multiple times with different dropout masks to obtain multiple predicted probability distributions and applies KL-divergence on them to constrain their consistency. Unlike their multiple dropout masks that are sampled independently, in our CmpDrop the multiple dropout rates are increasing and the masks are progressive, with each subsequent mask obtained by further randomly discarding elements from the previous one. In addition, we impose constraints on the task-specific losses at the end rather than on the representations and probabilities upstream.
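The difference between independently sampled masks and progressive masks can be sketched as follows. The code assumes that each ablation step drops every surviving element with the same probability, which makes the effective dropout rate grow with each step; the rescaling of kept activations used by standard dropout is omitted, and the function name is hypothetical.

```python
import random


def progressive_masks(size, steps, rate, rng=random):
    """Build nested dropout masks: each one only removes more elements.

    Returns steps + 1 masks; masks[0] keeps everything (the full model) and
    masks[i] is obtained from masks[i - 1] by dropping each kept element
    with probability `rate`.
    """
    mask = [1] * size
    masks = [mask]
    for _ in range(steps):
        # An element that is already 0 stays 0; a kept element may be dropped.
        mask = [m if rng.random() >= rate else 0 for m in mask]
        masks.append(mask)
    return masks


masks = progressive_masks(size=10, steps=2, rate=0.1)
for m in masks:
    print(m, "kept:", sum(m))
```

Because every mask is a subset of the previous one, each ablated model is a submodel of the one before it, which is what makes their task-specific losses comparable.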
Notably, the full model is also optimized in due time when trained using the comparative loss with CmpDrop, which we argue is important to mitigate the inconsistency between training and inference. This is because, while dropout avoids co-adaptation of neurons, it also weakens the cooperation between neurons (§5.1 gives some empirical support). In particular, in cases where all neurons are involved, the full model trained with dropout has not been taught how to make them work together efficiently and thus cannot be fully exploited during testing. Surprisingly, our comparative loss with CmpDrop can balance between promoting the cooperation of neurons and preventing their co-adaptation.

CONCLUSION
In this paper, we propose cross-model comparative loss, a simple task-agnostic loss function, to improve the utility of neurons in NLU models. Comparative loss is essentially a ranking loss based on the comparison principle between the full model and its ablated models, with the expectation that the less ablation there is, the smaller the task-specific loss. To ensure comparability among multiple ablated models, we progressively ablate the models and provide two controlled ablation methods based on dropout and context cropping, applicable to a wide range of tasks and models.
We show theoretically how comparative loss works, suggesting that it can adaptively assign weights to multiple task-specific losses. Extensive experiments and analysis on 14 datasets from 3 distinct NLU tasks demonstrate the universal effectiveness of comparative loss. Interestingly, our analysis confirms that comparative loss can indeed assign weights more appropriately, and finds that comparative loss is particularly effective for models with few parameters or long input.
In the future, we would like to apply comparative loss in other domains, such as natural language generation and computer vision, and explore its applications on other model architectures beyond Transformer. It could also be interesting to explore the application of comparative loss on top of self-supervised losses (e.g., contrastive loss) during pretraining. For training costs, how to reduce the overhead by reusing more shared computations is a direction worth exploring. Further, more advanced ablation methods in training, such as dropconnect [65] rather than standard dropout and adversarial rather than stochastic ablation, may deserve future research efforts.

Fig. 1. An illustration of a full neural model (a) and its ablated models (b, c, and d), where a hidden neuron is ablated in (b), an input neuron is ablated in (c), and (d) additionally ablates another input neuron based on (b). According to the comparison principle, if the full model (a) is an efficient model, the comparative relation between the task-specific losses obtained by these models should be (a) ≤ (b), (c), (d). If the ablated model (b) is also efficient in its parameter space, then their comparative relation can be further written as (a) ≤ (b) ≤ (d). Note that (b, c) and (c, d) are two non-comparable model pairs. This is because the ablated model (c) is not a submodel of (b) and (d), and vice versa.

Fig. 3. The overview of comparative loss (best viewed in color). Given a data sample, conventional training typically feeds the input context into the neural model to obtain the prediction and then just minimizes the task-specific loss of the full model. In contrast, comparative loss not only progressively ablates the original model to minimize the task-specific losses of the full and ablated models, but also constrains their comparative relation with a pairwise hinge loss.
: Ablate Input by Cropping. Given an input context $x$, CmpCrop aims to crop out a condensed context $x'$ that does not change the original ground-truth output, i.e., $x' \sqsubset x$ and $f(x') = f(x) = y$. Assume that we know the minimum support context $x^{\star}$ for $y$ at training time, i.e., $\forall x'$

Fig. 4. Average results on eight GLUE datasets as the number of ablation steps changes.

Fig. 5. Performance curves using different context sizes. (a) PRF models on MARCO Dev, where the horizontal dotted line represents the base retrieval model. (b) RC models on HotpotQA Dev.
Fig. 6. Task-specific loss curves for the full model.
Algorithm 1 Training with Comparative Loss
Input: Training dataset $\mathcal{D}$, number of ablation steps $c$, dropout rate $p$, baseline value of task-specific loss $b$, learning rate $\eta$.
Output: Model parameters $\theta$.

Table 2. Question answering performance on the development sets of SQuAD and HotpotQA distractor. Results marked with † were obtained from the authors of the corresponding paper.

Table 3. Retrieval performance on benchmarks built on the MS MARCO passage collection. ANCE and uniCOIL are base retrieval models, + PRF denotes the PRF baseline model, + Cmp denotes our PRF model trained with the comparative loss of 1 CmpCrop, and superscript $(k)$ represents the PRF depth used during testing. Superscript * indicates statistically significant improvements over its PRF baseline model with $p \le 0.1$.

Table 4. QA performance on the development set of HotpotQA distractor with different weighting strategies. Cmp refers to Longformer + CmpCrop + CmpDrop that adaptively weights multiple task-specific losses through comparative loss. The others are heuristics, where AVERAGE assigns the same weights to all task-specific losses, FIRST and LAST assign weight only to the first or last, and MAX dynamically assigns weight only to the largest one.
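The four heuristic weighting strategies named in this caption can be sketched as follows. The function name and the normalization (AVERAGE weights summing to 1, the other strategies placing a weight of 1 on a single loss) are assumptions made for illustration; the caption only specifies which losses receive weight.

```python
from typing import List, Sequence


def heuristic_weights(task_losses: Sequence[float], strategy: str) -> List[float]:
    """Return one weight per task-specific loss for a heuristic strategy."""
    n = len(task_losses)
    if n == 0:
        raise ValueError("at least one task-specific loss is required")

    if strategy == "AVERAGE":
        # Same weight for every task-specific loss.
        return [1.0 / n] * n
    if strategy == "FIRST":
        # Only the first loss (the full model) is weighted.
        return [1.0] + [0.0] * (n - 1)
    if strategy == "LAST":
        # Only the last loss (the most ablated model) is weighted.
        return [0.0] * (n - 1) + [1.0]
    if strategy == "MAX":
        # Only the currently largest loss is weighted (changes dynamically).
        largest = max(range(n), key=lambda i: task_losses[i])
        return [1.0 if i == largest else 0.0 for i in range(n)]
    raise ValueError(f"unknown strategy: {strategy}")


def weighted_loss(task_losses: Sequence[float], strategy: str) -> float:
    """Combine the task-specific losses with the chosen heuristic weights."""
    weights = heuristic_weights(task_losses, strategy)
    return sum(w * l for w, l in zip(weights, task_losses))


example_losses = [0.3, 0.8, 0.5]
max_weights = heuristic_weights(example_losses, "MAX")
```

Unlike these fixed rules, the Cmp row in the table relies on the comparative loss itself to determine how much each task-specific loss contributes.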

Table 5. QA performance on the development set of HotpotQA distractor with different comparison strategies. $c$ is the number of ablation steps. ×2 indicates that an ablation method is repeated twice, and $A + B$ means that $A$ is used followed by $B$.

Table 7. The robustness index of $q^{(k)}$ with respect to $q^{(k-1)}$ on MARCO Dev at each PRF depth $k$, where $q^{(k)}$ and $q^{(k-1)}$ are query vectors reformulated by the PRF model, the latter having one less document in the input context than the former.

Table 8. Specific settings for the number of ablation steps of BERT + Cmp on each GLUE dataset, as well as the performance gain and increase in training computation overhead compared to BERT.