Sexism Detection on a Data Diet

The proliferation of online hate has increased in step with the rising use of social media. In response, there have been significant advances in automated tools that identify harmful text content using approaches grounded in Natural Language Processing and Deep Learning. Although training Deep Learning models is known to require a substantial amount of annotated data, a recent line of work suggests that models trained on specific subsets of the data still retain performance comparable to a model trained on the full dataset. In this work, we show how influence scores can be leveraged to estimate the importance of a data point during training and to design a pruning strategy, applied to the case of sexism detection. We evaluate models trained on data pruned with different pruning strategies on three out-of-domain datasets and find that, in accordance with other work, a large fraction of instances can be removed without a significant performance drop. However, we also discover that the strategies for pruning data that were previously successful in Natural Language Inference tasks do not readily apply to the detection of harmful content; instead, they amplify the already prevalent class imbalance, leading in the worst case to a complete absence of the hateful class. Warning: This paper contains instances of hateful and sexist language to serve as examples.


INTRODUCTION
Social media platforms have evolved into vital instruments enabling individuals to maintain connections with others and express their views on various topics, including politics, technology, and everyday life. However, there has also been an increase in the amount of harmful content, targeting people of different demographics, races, religions, and sexual orientations, that users generate every day on these platforms (e.g., Twitter, Facebook, Reddit). As a natural consequence, it has become imperative to monitor (and potentially regulate) such harmful content.
In this work we focus on sexism, which is widely defined as any "prejudice, stereotyping, or discrimination, typically against women, on the basis of sex" (Oxford English Dictionary). Attempts have been made to detect sexism by using Natural Language Processing (NLP) techniques. Curation of datasets [5,16,35] can also aid in sexism detection. With advancements made in Deep Learning (DL), especially after the introduction of the transformer architecture [34], models like BERT [9] or RoBERTa [19] have become the de facto models applied to detect sexism in text data [13,23,29]. Even though the aforementioned publications use the whole dataset to train and evaluate their models, some researchers [2-4, 12, 22] suggest that some data instances are more useful than others for driving the learning process and impacting the final model performance. In particular, researchers in Computer Vision (CV) explore the usefulness of influence scores [1,17,24,25] to quantify the importance of a particular datapoint when gauging the performance of a model after training or fine-tuning (in the case of pre-trained models). Influence scores use information present during the training process (e.g., confidence or gradient loss) to estimate the contribution of a datapoint to the final model performance.
We build upon the first insights from Ethayarajh et al. [11], Fayyaz et al. [12] and Anand et al. [2], who first investigated influence scores and their usefulness for NLP problems. Our research specifically aims to examine the effects of various pruning strategies that utilize influence scores, with a particular focus on sexism detection, a domain notably affected by significant class imbalance.
The rest of the manuscript is organized as follows. Section 2 provides readers with a brief review of the state of the art. Section 3 introduces the basic concepts needed to follow the rest of the work. Section 4 presents the materials and methods used in the experiments. Section 5 describes the main results. Section 6 summarizes concluding remarks and points out future work. Finally, Section 7 declares the limitations of this work.

RELATED WORK
Training models in a data-efficient fashion and identifying important data points has long been a challenging problem in Machine Learning (ML), and attempts to solve it have led to the rise of a body of methods called robust statistics [15].
Fayyaz et al. [12] were the first to use influence scores in the NLP domain. They employed EL2N scores to identify the highest scoring data points within the SNLI and AGNews datasets. They found that models trained on approximately the top 70% of the dataset, after pruning certain portions of the highest scoring examples from both datasets, achieved test accuracy scores surpassing those of models trained on the entirety of the dataset. Attendu and Corbeil [3] designed a dynamic pruning strategy using a metric inspired by EL2N for binary and multi-class classification tasks on popular datasets like MNLI, SST-2, and QNLI. They used their metric to retain a subset of the highest scoring examples after each epoch and found that they could prune up to 50% of the data and still retain performance comparable to the model trained on the full data.
Anand et al. [2] performed an extensive investigation of different influence scores on SNLI [6] and proprietary user-utterances datasets. They discovered that, for both SNLI and the user-utterances dataset, pruning the low scoring examples (which they termed easy examples) based on their VoG scores and training models on the resulting data led to an increase in test accuracy compared to the random baseline. Pruning hard examples led to a decrease in model performance compared to the random baseline. This confirmed the findings of Sorscher et al. [31] that hard data points contain critical information that helps a model determine the decision boundary. However, for the other influence scores that they tested (PVI [11], EL2N [24], TracIn [25] and Forget Scores [33]), they found that pruning the harder or easier examples based on their respective scores did not lead to a gain in performance (measured by test accuracy) compared to random pruning, but rather led to a sharp drop after a certain pruning rate (30%).
In this study, we concentrate on three distinct influence scores that have shown the most promising results for Natural Language Understanding (NLU) tasks: PVI [11], EL2N [24], and VoG [1]. Each of these scores measures importance in a fundamentally different way, encompassing information-theoretic, margin-based, and gradient-based approaches, respectively. To the best of our knowledge, this is the first study that explores the utility of influence scores for the detection of hateful online communication. For convenience, we refer readers to Table 6 for a list of abbreviations and their corresponding full forms.

INFLUENCE SCORES

Pointwise V-Information
Pointwise V-Information (PVI) is a method that extends the Pointwise Mutual Information (PMI) proposed in [36] to NLP and tries to measure the usable bits of information for a model in predicting the corresponding label. The proposed method trains two models ($g$ and $g'$), one on the combination of null inputs ($\varnothing$) and labels ($y$), and another on the combination of text inputs ($x$) and labels ($y$), to calculate the PVI of each datapoint.
The PVI for a datapoint $(x, y)$ is calculated as follows:
$$\mathrm{PVI}(x \rightarrow y) = -\log_2 g[\varnothing](y) + \log_2 g'[x](y)$$
A negative PVI implies that the instance was harder for the model to predict [11]. We closely follow the implementation delineated in [11] in calculating the PVI scores after training the model. More details about the experimental setup can be found in Section 4.
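As a rough illustration of the PVI definition in [11], the sketch below computes the score from the gold-label probabilities of the two models. The function name `pvi` and the probability values are hypothetical and chosen only for illustration; in practice the probabilities would come from the null-input model and the text-input model.

```python
import math

def pvi(p_null, p_text, label):
    """PVI(x -> y) = -log2 g[null](y) + log2 g'[x](y) for a single datapoint.

    p_null: class probabilities from the model trained on null inputs
    p_text: class probabilities from the model trained on text inputs
    label:  index of the gold label
    """
    return -math.log2(p_null[label]) + math.log2(p_text[label])

# Hypothetical probabilities for a binary task (0 = non-sexist, 1 = sexist).
p_null = [0.75, 0.25]                       # the null model only reflects the label prior
easy = pvi(p_null, [0.02, 0.98], label=1)   # the text makes the gold label very likely
hard = pvi(p_null, [0.90, 0.10], label=1)   # the text points away from the gold label

print(f"easy instance: {easy:.3f}, hard instance: {hard:.3f}")
```

Under these assumed values the first instance receives a positive PVI and the second a negative one, which matches the reading that negative scores mark instances that are harder to predict.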

Error L2-Norm
Error L2-Norm (EL2N) was introduced in [24]. As mentioned by Anand et al. [2], Paul et al. [24], and Sorscher et al. [31], EL2N is a margin-based influence score, implying that the data points which are harder for the model to classify have high EL2N scores and are closer to the decision boundary. Conversely, data points that are easier for the model are farther away from the decision boundary. Hence, EL2N scores give us an idea of how easy or hard a particular data point was for a model. The EL2N score was introduced as a metric to detect data points that can be pruned early in training [24] and was compared with the forgetting scores of Toneva et al. [33]. Anand et al. [2] found that EL2N is also an informative influence score that helps in pruning data points after training a model. The EL2N score for a data point $(x_i, y_i)$ is given by
$$\mathrm{EL2N}(x_i, y_i) = \lVert f(x_i) - y_i \rVert_2$$
where $f$ denotes the model, $f(x_i)$ its predicted class-probability vector, and $y_i$ the one-hot encoded label.
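To make the margin-based reading of EL2N concrete, the following minimal sketch computes the L2 norm of the difference between a predicted probability vector and the one-hot label, as defined in [24]. The function name `el2n` and the example probabilities are assumptions made for illustration.

```python
import math

def el2n(probs, label):
    """L2 norm of (predicted class probabilities - one-hot gold label)."""
    return math.sqrt(sum(
        (p - (1.0 if k == label else 0.0)) ** 2
        for k, p in enumerate(probs)
    ))

# Hypothetical softmax outputs for a binary classifier.
confident_correct = el2n([0.9, 0.1], label=0)  # small error, far from the boundary
confident_wrong = el2n([0.9, 0.1], label=1)    # large error, a "hard" instance
near_boundary = el2n([0.5, 0.5], label=0)      # in between

print(confident_correct, near_boundary, confident_wrong)
```

In this toy setting the score grows as the prediction moves away from the gold label, so sorting by EL2N in descending order lists the hardest instances first.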

Variance of Gradients
Variance of Gradients (VoG), introduced in Agarwal et al. [1], is another influence score that helps in understanding how hard or easy a datapoint in a training dataset is. It was proposed as a method to rank the hardness of examples for models trained on standardized CV datasets: CIFAR-10, CIFAR-100 [18], and ImageNet [8].
VoG captures the "per example change in explanations over time" [1], as opposed to saliency maps, which score the features of the input data based on their contribution to the final output [30]. The method works by calculating the gradient of the pre-softmax activation at the true label position with respect to the input, then calculating the mean and variance of these gradients across checkpoints, and finally averaging the per-pixel variance to get the final VoG of an input. Agarwal et al. [1] also proposed VoG as a metric for data auditing; it helped them identify the noisy examples (corresponding to higher scores) in the aforementioned image recognition datasets, and they found that removing the noisier examples led to better model performance.
In our case, since we work with language data, we calculated the gradient with respect to the input embeddings at each checkpoint, as done in [2]. Mathematically, our method can be formulated as follows. Let $G_{it}$ be the gradient with respect to input $i$ at checkpoint $t$. It is calculated from the pre-softmax output $A_{it}$ with respect to the input embedding $E_{it}$ using the following equation:
$$G_{it} = \frac{\partial A_{it}}{\partial E_{it}}$$
The VoG score for each input example $i$ is computed over the $T$ checkpoints as follows:
$$\bar{G}_i = \frac{1}{T} \sum_{t=1}^{T} G_{it}, \qquad V_i = \frac{1}{T} \sum_{t=1}^{T} \left( G_{it} - \bar{G}_i \right)^2$$
The unnormalized VoG score $v_i$ is calculated by taking the mean of $V_i$ (average over the input embeddings [2]). For normalization of the VoG scores we use the class-normalized VoG prescribed by [1], which is calculated by computing the mean and standard deviation of the VoG scores per class and then normalizing the VoG score of each data point belonging to the corresponding class. Let the class mean be denoted as $\mu_c$ and the standard deviation as $\sigma_c$. Then the normalized VoG for datapoint $i$ belonging to that particular class is calculated as:
$$\tilde{v}_i = \frac{v_i - \mu_c}{\sigma_c}$$
It is worth noting that Anand et al. [2] also introduced dataset-normalized VoG scores on top of class-normalized VoG scores, but we experiment with the class-normalized VoG score as prescribed in the original paper [1].
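The VoG computation and its class normalization can be sketched as follows. This is a simplified illustration, not the implementation used in the experiments: it assumes the gradients with respect to the input embeddings have already been extracted and flattened into one list per checkpoint, and it uses the population variance and population standard deviation, which is an assumption. The names `vog` and `class_normalize` are hypothetical.

```python
import statistics

def vog(grads_per_checkpoint):
    """Unnormalized VoG of one example.

    grads_per_checkpoint: list over checkpoints, each a flat list holding the
    gradient of the true-label logit w.r.t. the input embedding entries.
    """
    n_dims = len(grads_per_checkpoint[0])
    per_dim_variance = [
        statistics.pvariance([g[d] for g in grads_per_checkpoint])
        for d in range(n_dims)
    ]
    # Average the per-dimension variance over the whole input embedding.
    return statistics.fmean(per_dim_variance)

def class_normalize(scores, labels):
    """Z-score every VoG score with the mean and std of its own class."""
    normalized = [0.0] * len(scores)
    for c in set(labels):
        idx = [i for i, y in enumerate(labels) if y == c]
        mu = statistics.fmean([scores[i] for i in idx])
        sigma = statistics.pstdev([scores[i] for i in idx])
        for i in idx:
            normalized[i] = (scores[i] - mu) / sigma if sigma > 0 else 0.0
    return normalized

# Toy gradients: two checkpoints, two embedding dimensions.
stable = vog([[1.0, 2.0], [1.0, 2.0]])     # gradients do not change
unstable = vog([[0.0, 0.0], [2.0, 4.0]])   # gradients change between checkpoints
print(stable, unstable)
print(class_normalize([1.0, 3.0, 10.0, 20.0], [0, 0, 1, 1]))
```

Normalizing per class keeps the scores of the minority class comparable to those of the majority class, which matters when the two classes have very different gradient magnitudes.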

METHODOLOGY

Datasets
For the experimental setup we created a training dataset by combining two well-known datasets: the EDOS (Explainable Detection of Online Sexism) dataset by Kirk et al. [16] and the Call Me Sexist But dataset by Samory et al. [28]. We will refer to the resulting dataset as Combined Data or in-domain data. We then split the resulting data 70/30 into train and test sets for in-domain evaluation.
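The construction of the in-domain data can be pictured with the short sketch below. It is only an approximation under stated assumptions: the paper does not say whether the split was stratified or which seed was used, so a plain seeded shuffle is assumed here, and the function name `combine_and_split` and the integer placeholders for examples are hypothetical.

```python
import random

def combine_and_split(datasets, train_fraction=0.7, seed=0):
    """Concatenate several datasets, shuffle them, and split into train/test."""
    combined = [example for dataset in datasets for example in dataset]
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    rng.shuffle(combined)
    cut = round(train_fraction * len(combined))
    return combined[:cut], combined[cut:]

# Integers stand in for the examples of the two source datasets.
edos_like = list(range(60))
cmsb_like = list(range(60, 100))
train, test = combine_and_split([edos_like, cmsb_like])
print(len(train), len(test))
```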
For cross-evaluation and for determining the out-of-domain performance of the various classifiers, we filtered data from the Hatecheck dataset [27] based on the target group women. Additionally, we used the whole EXIST dataset [26] and the Misogyny dataset [14], which consists of Reddit posts and comments collected to detect online misogyny. The class distribution of our training and out-of-domain evaluation datasets can be inspected in Table 1.

Model and Settings
For testing the influence-score-based pruning we used the BERT [10] model variant bert-base-cased from Hugging Face and the AdamW optimizer [20] with the parameter settings summarized in Table 2. We saved the weights of the model (checkpoint) after every epoch and calculated the PVI [11] and EL2N [24] scores, as these two scores help us understand the behaviour of data points at each epoch of model training [2,24,33]. VoG scores [1] were calculated after the end of training. We performed this step three times and took the average of the influence scores for each datapoint. After the scores were calculated, we sorted them from the hardest to the easiest (i.e., from the lowest scoring to the highest scoring in the case of PVI, and from the highest scoring to the lowest scoring in the case of EL2N and VoG). We used the following pruning rates (5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 50%, 60%) to prune data for the experiments after sorting the data from the hardest to the easiest for all the influence scores. When pruning hard examples, we pruned from the top of the sorted list for all the influence scores; when pruning easy examples, we pruned from the bottom. We fine-tuned pre-trained BERT models [10] on each of the resulting pruned datasets (easy and hard), with the hyperparameter settings in Table 3.
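The sorting and pruning procedure described above can be sketched as follows. This is a minimal illustration rather than the code used in the experiments; the function names `average_runs` and `prune`, the flag `hard_is_high`, and the toy scores are assumptions introduced here.

```python
def average_runs(scores_per_run):
    """Average the influence score of every datapoint over several runs."""
    return [sum(values) / len(values) for values in zip(*scores_per_run)]

def prune(examples, scores, rate, mode, hard_is_high=True):
    """Remove a fraction `rate` of the data and return the kept examples.

    mode:         "hard" removes the hardest instances, "easy" the easiest
    hard_is_high: True for EL2N and VoG (high score = hard),
                  False for PVI (low score = hard)
    """
    # Indices sorted from the hardest to the easiest instance.
    order = sorted(range(len(examples)), key=lambda i: scores[i],
                   reverse=hard_is_high)
    n_remove = round(rate * len(examples))
    if mode == "hard":
        kept = order[n_remove:]                  # prune from the top
    elif mode == "easy":
        kept = order[:len(order) - n_remove]     # prune from the bottom
    else:
        raise ValueError("mode must be 'hard' or 'easy'")
    return [examples[i] for i in sorted(kept)]   # restore original order

examples = ["a", "b", "c", "d"]
el2n_scores = average_runs([[0.8, 0.1, 0.5, 0.3],
                            [1.0, 0.1, 0.5, 0.3],
                            [0.9, 0.1, 0.5, 0.3]])
pvi_scores = [-2.0, 3.0, 0.5, 1.0]

print(prune(examples, el2n_scores, 0.25, "hard"))
print(prune(examples, pvi_scores, 0.25, "easy", hard_is_high=False))
```

Keeping the direction of "hard" as an explicit flag avoids mixing up PVI, where low scores mark hard instances, with EL2N and VoG, where high scores do.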

Furthermore, we randomly pruned the training data at the pruning rates mentioned before and fine-tuned pre-trained BERT models on the result to compare performance. We chose the hyperparameter settings shown in Table 3 because we found empirically that the models tend to overfit on the pruned data, so we decided to lower the learning rate and train for fewer epochs. All the models were trained on a single 40 GB partition of an NVIDIA A100 GPU.
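The random pruning baseline can be sketched in a few lines. The seed handling and the function name `random_prune` are assumptions made for illustration; the paper does not specify how the random subsets were drawn.

```python
import random

def random_prune(examples, rate, seed=0):
    """Drop a random fraction `rate` of the examples, keeping original order."""
    rng = random.Random(seed)
    n_keep = len(examples) - round(rate * len(examples))
    kept = sorted(rng.sample(range(len(examples)), n_keep))
    return [examples[i] for i in kept]

data = list(range(100))
pruning_rates = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.50, 0.60]
sizes = [len(random_prune(data, rate)) for rate in pruning_rates]
print(sizes)
```

Because the selection ignores the labels, the expected class ratio of the kept data equals that of the full data, which is what makes this baseline a useful reference for the score-based strategies.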

General Findings
Figure 1 shows the macro-F1 scores of models trained on datasets at different pruning rates and with different pruning types (only easy instances, only hard instances, and random). One of our key observations is that randomly removing up to 50% of instances from the training data does not significantly affect performance. In other words, we can eliminate a large portion of data instances without impacting the results. This is in line with Anand et al. [2], who found similar patterns for the SNLI dataset. Additionally, we observed that pruning the easy examples and training the models on the remaining examples did not, in general, lead to improvements over random pruning. We observed a similar trend for the out-of-domain performance on both the EXIST and Misogyny datasets, evident in Figures 2 and 3, respectively. Furthermore, for the Hatecheck test data, we discovered that the models generally do not achieve a satisfactory out-of-domain performance even when trained on the combined data. This may be because the evaluation data is particularly hard, as all of its instances contain the identity term "woman". We know from previous studies that models predicting sexism are prone to learning spurious correlations (associating the term "woman" with the positive class). Hence, they fail in this particular setup. Moreover, Hatecheck was developed as a stress test for models and already consists of more complicated examples (e.g., containing leet speech or misspellings). Additionally, pruning the easy and hard data points based on any of the influence scores did not even marginally improve upon the random pruning baseline, as evident in Figure 4.
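Since all results are reported as macro-F1, the sketch below shows how the metric is computed and why it is sensitive to the minority class. It is a plain-Python illustration with hypothetical labels, not the evaluation code used for the figures.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical imbalanced test set: 8 non-sexist (0) and 2 sexist (1) instances.
y_true = [0] * 8 + [1] * 2
always_negative = [0] * 10   # a model that never predicts the sexist class

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
print(accuracy, macro_f1(y_true, always_negative))
```

In this toy case the accuracy remains high while macro-F1 falls below 0.5, so a model that loses the positive class after aggressive pruning is penalized clearly.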

Particular Findings from Influence Scores
Here, we discuss particular observations for each influence score individually.
PVI: When we increased the pruning of easy examples, we consistently observed a significant drop in performance for both in-domain and out-of-domain test sets.
On closer inspection, we discovered that increasing the pruning rate of easy examples introduced more class imbalance into the training data (see Figure 5). By applying a pruning rate of 20%, we effectively eliminated almost all sexist instances from the dataset, leaving the model with insufficient data from the positive class to establish a clear decision boundary. When we pruned the hard examples, we observed that it did not exacerbate the imbalance problem. For the in-domain performance, we observed that pruning the difficult data points did not lead to a performance gain over random pruning. A similar pattern was observed for the Misogyny dataset: pruning difficult examples did not enhance performance compared to random pruning. In fact, performance declined with the removal of more data, as illustrated in Figure 3. However, we observed that pruning the difficult data points and training the models on the pruned data resulted in a performance gain over random pruning in the case of the EXIST dataset [26]; this trend is depicted in Figure 2.
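The mechanism behind this imbalance can be reproduced on toy data. The sketch below assumes a hypothetical training set in which the few sexist instances have the highest PVI scores (that is, they are PVI-easy) and tracks the sexist to non-sexist ratio as easy examples are pruned. All values are invented for illustration and are not taken from the experiments.

```python
def sexist_ratio(labels):
    """Ratio of sexist (1) to non-sexist (0) instances."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return n_pos / n_neg if n_neg else float("inf")

def ratio_after_pruning_easy(labels, pvi_scores, rate):
    """Class ratio after removing the `rate` fraction with the highest PVI."""
    order = sorted(range(len(labels)), key=lambda i: pvi_scores[i])  # hard -> easy
    n_keep = len(labels) - round(rate * len(labels))
    return sexist_ratio([labels[i] for i in order[:n_keep]])

# Toy data: 8 non-sexist and 2 sexist instances; the sexist ones are PVI-easy.
labels = [0] * 8 + [1] * 2
pvi_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 2.0, 2.5]

ratios = {rate: ratio_after_pruning_easy(labels, pvi_scores, rate)
          for rate in (0.0, 0.1, 0.2)}
print(ratios)
```

Monitoring this ratio before fine-tuning on a pruned dataset is a cheap check for the failure mode described above, in which the positive class disappears entirely.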
EL2N: As we pruned the hard examples from the training data based on EL2N scores, we discovered that at pruning rates from 5% to 25% the performance closely follows that of random pruning on both in-domain and out-of-domain test sets. With pruning rates greater than 25%, we effectively eliminated most of the sexist examples, as evident from Figure 5. This left too few examples of the positive class to help the models establish a decision boundary and resulted in a drop in performance.
When we pruned the easy examples, we observed that it did not intensify the imbalance problem. In the case of in-domain performance, we observed that pruning the easy data points did not lead to a gain in performance over the random baseline. A similar trend was observed regarding performance on all the out-of-domain datasets, as illustrated in Figures 2, 3, and 4.
VoG: On pruning the hard data points based on VoG scores, we did not observe any performance improvement over random pruning at rates from 5% to 30% for in-domain as well as out-of-domain test sets. In the case of in-domain data, on further pruning the hard data points we did observe an improvement in performance, but with a high margin of error when compared to random pruning. A similar trend was observed in out-of-domain performance as well. On pruning the easy examples, we did not observe any improvement in performance compared to the random pruning baseline for pruning rates from 5% to 30%.
Further pruning of the easy examples led to a significant drop in in-domain performance when compared to the random pruning baseline. We also observed similar trends in performance on all the out-of-domain test sets. These findings contrast with those of [2], who found that pruning easy examples from SNLI based on their VoG scores led to better test accuracy than random pruning.

Analysis of examples after fine-tuning
In this section we provide the distribution of the three influence scores for the sexist datapoints, calculated after fine-tuning the BERT model [10] on our training data. Figure 6 showcases the distribution of the misclassified and correctly classified sexist data points in the train split of the combined data after training the models. We observed that EL2N and PVI have a distinct boundary between correctly classified and misclassified data points. VoG scores, on the other hand, resulted in no distinct boundary. We also present the top-5 high scoring and low scoring examples obtained after calculating the PVI scores in Tables 4 and 5. Example instances for the remaining metrics can be inspected in Appendix B. We chose to elaborate on the findings for the PVI scores, as they show the most interesting deviations from random pruning, as described in the previous section. Several noteworthy characteristics emerge in the more challenging examples listed in Table 4. For instance, the data point with the lowest score includes a misspelling of "woman", introducing noise into the text that complicates model interpretation. Furthermore, the examples ranked 3rd and 4th exhibit implicit sexism, posing classification challenges for the model as well. The last example, marked as non-sexist, posed a challenge for our models due to the presence of a trigger word ('pussy'). This instance could also be seen as a borderline case, potentially warranting a sexist classification.
Looking at the examples in Table 5, we observed that they were relatively easy for the model to classify because of the presence of dedicated trigger words (explicit sexism). We further investigated the misclassified sexist instances of the training dataset with negative scores and found that the majority were from the EDOS dataset. The same phenomenon was observed for misclassified non-sexist instances.

CONCLUSIONS AND FUTURE WORK
In this work, we have used several influence scores in the domain of sexism detection and evaluated their usefulness in the context of data pruning. As a main result, we found that we can remove up to 50% of the data without real impact on the out-of-domain performance on all datasets, indicating that the training data might contain a lot of information that either confuses the model or is not relevant for the task [2,3,12].
In contrast to other work, we do not recommend relying on pruning strategies that focus on the removal of only easy or only hard instances, as this may lead to catastrophic class imbalance. Rather, we need to design strategies to sample data from datasets based on their influence scores, especially when dealing with the noisy and imbalanced datasets that are used to study hate speech and related constructs [21]. Furthermore, we observed that most of the sexist instances in our training data are considered PVI-easy or EL2N-hard. If EL2N-hard instances are really "examples that contain critical information about the decision boundaries of classes in larger, less noisy datasets" [2], then either our training data contains significant noise (misannotations) or the datasets in general lack high-quality examples that help to shape the decision boundaries.
We hope that our experiments help steer the field of hate speech detection towards including influence scores when determining the quality of the datasets that are usually curated for hate speech detection. Future work will incorporate what we learned from this study and design better sampling strategies that do not worsen the imbalance problem. In the future, we will also look at designing pruning strategies that can utilize the dynamic nature of the influence scores (especially PVI and EL2N) to inform the selection of data points. On further investigation, we found that most of the easy instances were from the EDOS dataset, whereas the hard examples came from both the EDOS and Call Me Sexist But datasets.

LIMITATIONS
This study is not without limitations. First, merging two sexism datasets to create our training set may have reduced the general quality due to different construct operationalizations. Additionally, we did not experiment with different hyperparameter optimization techniques while training the models on the dataset and went forward with default settings. We note that the main purpose of this work was not to identify the best performing influence score; rather, this work serves as a proof of concept for using a diverse set of influence scores to score the effect of data points after training or fine-tuning a model.

A ABBREVIATIONS
In this Appendix section we will provide the table of abbreviations and their corresponding full forms that have been used in the paper.

B DATASET SAMPLES
In this Appendix section we will provide the top-5 hard and easy examples and their corresponding scores with respect to EL2N and VoG. Tables 7 and 8 showcase the hard and easy examples based on their respective EL2N scores. Tables 9 and 10 contain the hard and easy examples based on their corresponding VoG scores.

Figure 5: Sexist to non-sexist data ratio for each of the pruning rates and influence scores.

Figure 6: Proportion of correctly and incorrectly classified sexist examples after training for each influence score.

Table 4: Top 5 low scoring examples (hard) based on PVI scores.
Text | Label | PVI score
even notice that. I saw she/ and my brain immediately classified that person as retarded and ignored the rest... | sexist | -5.61337
I wonder which of the blonde ones screwed Colin to get that score? Probably both. | sexist | -5.16273
Here are the reasons men get married: She threatens to cut off the pussy. Your family says you ain't a real man. You want to make her haaappy. | |

Table 5: Top 5 high scoring examples (easy) based on PVI scores.

Table 6: Table of Abbreviations.

Table 7: Top 5 high scoring examples (hard) based on EL2N scores.

Table 9: Top 5 high scoring examples (hard) based on VoG scores.

Table 10: Top 5 low scoring examples (easy) based on VoG scores.