Algorithmic Arbitrariness in Content Moderation

Machine learning (ML) is widely used to moderate online content. Despite its scalability relative to human moderation, the use of ML introduces unique challenges to content moderation. One such challenge is predictive multiplicity: multiple competing models for content classification may perform equally well on average, yet assign conflicting predictions to the same content. This multiplicity can result from seemingly innocuous choices during model development, such as random seed selection for parameter initialization. We experimentally demonstrate how content moderation tools can arbitrarily classify samples as toxic, leading to arbitrary restrictions on speech. We discuss these findings in terms of human rights set out by the International Covenant on Civil and Political Rights (ICCPR), namely freedom of expression, non-discrimination, and procedural justice. We analyze (i) the extent of predictive multiplicity among state-of-the-art LLMs used for detecting toxic content; (ii) the disparate impact of this arbitrariness across social groups; and (iii) how model multiplicity compares to unambiguous human classifications. Our findings indicate that the up-scaled algorithmic moderation risks legitimizing an algorithmic leviathan, where an algorithm disproportionately manages human rights. To mitigate such risks, our study underscores the need to identify and increase the transparency of arbitrariness in content moderation applications. Since algorithmic content moderation is being fueled by pressing social concerns, such as disinformation and hate speech, our discussion on harms raises concerns relevant to policy debates. Our findings also contribute to content moderation and intermediary liability laws being discussed and passed in many countries, such as the Digital Services Act in the European Union, the Online Safety Act in the United Kingdom, and the Fake News Bill in Brazil.


Introduction
Algorithmic content moderation is at the crossroads of two major challenges. First, there are growing legal, economic, and social pressures on companies to employ automated solutions for managing the vast quantities of undesirable online speech. Automated moderation tools pose new risks to individuals' ability to express themselves and to seek out information. Second, the increasing reliance on machine learning (ML) models to flag undesirable speech can inadvertently lead to harmful outcomes stemming from the inherent technical limitations of ML. A key concern is predictive multiplicity [49]: competing models with similar average performance can produce conflicting individual predictions. Predictive multiplicity captures arbitrariness in ML model development, where seemingly innocuous choices made during training, such as the random seed used for parameter initialization, can affect individual prediction outcomes. Predictive multiplicity has recently been documented in a range of classification and prediction tasks [38,67] and can lead to disparate treatment of individual data points [16,7,49]. In this paper, we examine the impact of predictive multiplicity and the ensuing arbitrariness in content moderation, focusing on its potentially harmful effects on freedom of expression, non-discrimination, and procedural fairness.
Context, relevance, and objectives. The delegation of resource access and rights management to algorithms is a growing concern in the law and policy literature [57,51,64]. This issue is particularly problematic in scenarios described as "algorithmic leviathans", a term introduced by König [44] and adopted by Creel and Hellman [16], where algorithms excessively control the exercise of freedoms and access to resources. Our research specifically examines ML in content moderation, a critical instance of this potential "algorithmic leviathan", where models are tasked with moderating content generated by billions of users worldwide. ML-based content moderation occurs with limited accountability, with platforms' moderation policies often applied indiscriminately across jurisdictions. For instance, companies' moderation rules and their application are inconsistent with respect to definitions, tools, and policies [31,71].
Algorithmic content moderation is the application of algorithmic systems to classify user-generated content, leading to governance decisions such as content removal, geoblocking, or account takedowns [31]. While content moderation has traditionally been an industrial practice, algorithm-driven approaches adopted by social media platforms have come under increased scrutiny due to economic, social, and legal factors, such as COVID-19 disinformation and online extremism. These societal dynamics have prompted substantial legislative changes globally, ushering in new regulatory frameworks for online and third-party content [48,52].
Though algorithmic content moderation is not always directly regulated, regulatory changes have increased pressure on companies to expedite content moderation through AI models. Examples of legislative shifts include the European Digital Services Act (DSA) [62], which adopts a risk-based approach for Very Large Online Platforms [9], and Germany's NetzDG law [10], which mandates rapid content removal with minimal human oversight. Another notable example is the 2022 resolution passed by Brazil's Superior Electoral Court (TSE), which implemented a stringent 1-hour content removal window during the second round of the Brazilian presidential elections, significantly increasing pressure to scale up algorithmic moderation [25]. Content removal is also an ongoing debate in the United States. Federal laws regarding platforms' duties on third-party content are under intense debate [37,43], with states such as Florida enacting their own laws on online content governance [61]. Though algorithmic content moderation may not be a direct target of these legislative efforts, it is nevertheless affected by such policy changes.
Content governance on social media platforms can inherit limitations intrinsic to ML models. We focus on one critical limitation: predictive multiplicity and the ensuing arbitrariness in models that classify toxic content. We demonstrate that predictive multiplicity is rampant in state-of-the-art language models used for toxic text classification: multiple models can achieve similar average accuracy yet conflict in classifying individual text samples as toxic. This algorithmic arbitrariness results from seemingly innocuous but impactful choices made during the development process, such as the choice of a random seed for initializing a model and the parameters of differentially private training [60,45]. These technical choices lead to outcomes that lack consistency, predictability, and adherence to established principles or logic [17]. Note that we use the term arbitrariness in this sense (i.e., how unjustified choices in model development may lead to conflicting individual-level predictions), which might not fully correspond to other notions of arbitrariness in ethics or law.
Research Questions and Main Contributions. We explore the role of predictive multiplicity in algorithmic content moderation. Our main research questions (RQs) are:
1. RQ1 - Model Disagreement: What is the extent of disagreement in state-of-the-art ML models fine-tuned to classify toxic content? We analyze how different models classify and conflict on the same content, highlighting inconsistencies and potential biases in model predictions.
2. RQ2 - Impact of Arbitrariness: What are the varying impacts of arbitrariness across toxicity detection models on content targeting different social groups? We investigate whether the arbitrary elements in ML model development result in disparate impacts on textual content related to different demographics, potentially leading to biases against certain groups.
3. RQ3 - Forms of Harm: What forms of harm stem from the results of RQ1 and RQ2? We assess the broader implications of model disagreements and arbitrariness, their impact on freedom of expression, and describe ensuing societal and individual-level harms.
We address the above research questions by analyzing state-of-the-art models for toxic (textual) content classification. Our research results are based on large language models (LLMs) fine-tuned for toxicity detection, specifically, the ToxDectRoBERTa [72] model fine-tuned on the Toxigen [34] dataset and the RoBERTa base [46] model fine-tuned on the Jigsaw [12,11,42] dataset. Our main contributions are:
• We find that arbitrary decisions are rampant in LLMs fine-tuned for content moderation. In our experiments, approximately 30% of samples receive moderation decisions that can change by varying the random seed used to initialize training (i.e., LLM fine-tuning). Our results illustrate how arbitrary decisions in model development influence prediction outcomes in content moderation (Table 2).
• We argue that predictive multiplicity represents a selective break from a rule-based approach to content moderation, which should be based on the law and on content moderation policies, and that it infringes upon procedural fairness. Multiplicity in algorithmic content moderation can unduly restrict individual and collective rights to freedom of expression via a random or unjustified model selection procedure.
• We show how predictive multiplicity and, consequently, arbitrary content moderation decisions are unequally distributed across different demographic groups targeted by the text being moderated. The incidence of arbitrary decisions can be discriminatory (Figure 1). In our experiments, fine-tuned LLMs assign a higher rate of arbitrary predictions to textual content that mentions LGBTQ-related topics relative to textual content tagged as misogynistic or misandrist.
• Finally, aiming to understand the source of predictive multiplicity, we compare the arbitrariness in fine-tuned LLMs to that of human annotators. We show that models can often disagree on examples that human annotators unanimously agree should (or should not) be moderated (Figure 2). This result demonstrates that (i) there are content moderation decisions obvious to humans on which ML models disagree, and (ii) ML models can introduce additional arbitrariness to content moderation.

Related Work
Predictive Multiplicity as a Risk. Marx et al. [49] showed the prevalence of arbitrary decisions in simple classification problems using tabular data (e.g., income and recidivism prediction). In the same paper [49], the authors elaborate on the potential harms of predictive multiplicity and argue that it should be measured and reported just as test error is measured and reported. Follow-up work has analyzed how to measure and report multiplicity [38,67], the sources of this phenomenon [45,60,35,47], and its inevitability [54]. Creel and Hellman [16] discuss the harms of predictive multiplicity and arbitrary decisions, leading the authors to adopt the term algorithmic leviathan, initially introduced by König [44]. Aiming to avoid predictive multiplicity, Black et al. [6] proposed selective ensembles to decrease the number of conflicting predictions by taking a majority vote for the prediction across all competing models. The work closest to ours is [16], which defines algorithmic arbitrariness and argues about its harms. Our paper differs from [16] by (i) focusing on specific harms of arbitrary decisions in content moderation and (ii) experimentally discovering and analyzing the harms of disparate arbitrary decisions across content targeting different demographic groups.
Predictive Multiplicity as an Opportunity. Scholars also consider predictive multiplicity and model indeterminacy as an opportunity rather than a source of harm. For example, multiplicity was leveraged to identify models that satisfy additional constraints beyond accuracy in [27,15,70]. Fisher et al. [27] used predictive multiplicity to generate model explanations. Xin et al. [70] took advantage of predictive multiplicity to identify interpretable models. Coston et al. [15] developed a reduction approach to choose fairer models across all equally performing models. This prior work clearly demonstrates the potential benefits of multiplicity. In contrast, we focus on the potential harms when ML models used for content moderation are oblivious to predictive multiplicity and analyze the ensuing impact on freedom of expression.

Legal and policy aspects of content moderation
The law and policy literature on algorithmic content moderation has focused on issues related to procedural fairness, inconsistent restriction of human rights (especially free speech), and discrimination, as studied by scholars such as Gorwa et al. [31], Gillespie [29,28], and Douek [21]. We summarize these risks below.
Inconsistency in Moderation: Different algorithms might produce divergent classifications for the same piece of content. Effectively, this means that either protected speech is being taken down or harmful speech is being tolerated. This can happen with regard to individual expressions or groups and their specific dialects. Keller [40] listed a number of studies and resources that indicate the systematic over-removal of content for various reasons, including copyright infringement and toxic speech content moderation. Douek [21] explains that the process by which platforms enforce their rules has shifted from a rule- and proportionality-based approach to an algorithmic, probability-based evaluation. The policy report produced by Duarte and Llansó [23] offers a useful summary of the policy challenges around algorithmic tools for content moderation. In particular, the report points out that state-of-the-art content moderation algorithms have very limited capability of parsing meaning from text to make content moderation decisions. Algorithmic approaches also depend on clear-cut definitions of types of protected or illegal speech, which are very hard to obtain given the plurality and volume of online speech. Finally, content moderation risks having disparate impacts across social groups. Our study on multiplicity in content moderation demonstrates experimentally how arbitrariness can further increase these harms by introducing ambiguity in model outcomes, which highlights the models' inability to handle different meanings and contexts. Our study also identifies how multiplicity has disparate impacts across social groups, which can aggravate already existing discriminatory inconsistencies in content moderation. In Section 2, we offer a legal definition of freedom of expression, which we use as a reference to discuss harms.
Bias Amplification: Expanding on the inconsistencies listed above, biased and discriminatory moderation may occur if algorithms used to moderate speech are inconsistent across different groups. For example, Dias Oliva et al. [20] and Gonçalves et al. [30] describe how certain social groups have been targeted by over-moderation due to the dialects they use. Our work demonstrates that inconsistency and arbitrariness in algorithmic content removal can depend on the subject of the statements being moderated and correlate with socio-demographic factors.
Opacity in Policy Enforcement: Predictive multiplicity makes enforcing a consistent content policy difficult. When algorithms and company policies are opaque, it is difficult to identify the different results from competing algorithms. In this scenario, understanding which decisions align with the platform's guidelines becomes challenging. The inability to distinguish between correct decisions, errors, and arbitrary decisions can conceal over-restriction or under-restriction of content. As Gillespie [28] explains, vaguely worded terms and conditions of service may legitimize unfair outcomes. Multiplicity can aggravate this harm, as the ambiguity across models is not easily identified, nor is it usually presented in transparency metrics and reports.
Lack of revision, explanation, and accountability: Pasquale [56] explains that understanding why algorithms produce conflicting decisions is crucial to correcting algorithmic outcomes. In the context of algorithmic content moderation, we can only review and repair harmful moderation outcomes if we have a clear understanding of how these models classify statements. Here, we examine arbitrariness that is embedded in the process of training and fine-tuning models for content moderation and that affects classification outcomes. Such instances of arbitrariness and normatively incorrect moderation are impossible for humans to identify, contest, and appeal by probing a deployed model [5], since they depend on choices made during training, thus violating the premise that restrictions on freedom of expression need to be justified [3].
Conflicting Jurisdictions: Each country has different laws regarding social media platforms. These laws uniquely impact the rights and concerns outlined above. See, for example, the different approaches to intermediary liability pointed out by Machado and Aguiar [48] and Keller [39], or the different legislative approaches to algorithmic discrimination referenced by Wachter et al. [65] and Binns et al. [4] when arguing for EU or UK legal frameworks.
Based on the concerns above, in Section 2 we conceptualize the harms of arbitrariness in terms of the human rights principles of freedom of expression, non-discrimination, and procedural justice, so as to ground the ethical debate in violations of specific legal entitlements. Then, in Sections 3 and 4, we experimentally investigate arbitrariness in state-of-the-art models, as outlined in our research questions. Finally, in Section 5, we interpret the harms of arbitrariness according to the concepts of freedom of expression, non-discrimination, and procedural justice laid out in Section 2. In this way, we are able to trace a causal relation between the technical phenomenon of arbitrariness and its infringement of legal values. We intentionally avoid local legislation and the granular matters of each jurisdiction in order to observe the overarching legal effects of arbitrariness in terms of specific international human rights principles.
We are aware that international human rights laws and their principles are primarily applicable to states and do not directly impose obligations on private entities, including internet content companies. Each state enforces these principles within its own jurisdiction, regulating how businesses will respect these rights and how companies should govern content in their services. Nonetheless, we understand companies to be directly or indirectly bound by these human rights principles, either through platform laws such as the DSA or the UK Online Safety Act, or through international frameworks and recommendations such as the UN Guiding Principles on Business and Human Rights [58].

Judges Flipping Coins: Conceptualizing Harms of Arbitrariness in content moderation
Algorithmic arbitrariness in content moderation is a complex issue that traverses the fields of ML development, algorithmic fairness, tech policy, and human rights law. To connect the concerns identified by this literature, we define harm associated with multiplicity as an infringement of legal principles. We identify legal principles as standards that should be observed because they attend to notions of justice and morality [24], and use these principles to establish a common concept of harm for the purposes of this research. To this end, we use the International Covenant on Civil and Political Rights (ICCPR), a widely ratified international treaty to which 173 countries are parties, as our core legal reference [2]. Our analysis focuses on the impact of arbitrariness on freedom of expression, non-discrimination, and procedural justice. We chose to use International Human Rights Law because it gives us overarching global rules and common concepts to discuss the issues related to fundamental rights in content moderation [22]. Although the field has limitations in terms of direct applicability to national jurisdictions, International Human Rights Law allows us to make claims related to multiplicity in content moderation that are transferable across legal systems and can be further explored through the lens of local jurisprudence and statutes. For example, arbitrariness may cause inconsistent removal of transgender community dialects in Brazil or may cause the illegal censoring of nude art as pornography in France, both of which are locally protected forms of speech. These countries can use our findings to discuss causes of discriminatory moderation caused by predictive multiplicity and develop safeguards for these issues considering local realities and rules.
Building on the related work outlined in Section 1.1, we define harm due to algorithmic arbitrariness as an infringement of three human rights and principles: Freedom of Expression, Non-Discrimination, and Procedure (including Procedural Justice). In essence, this work is not primarily focused on the errors of these content moderation algorithms but instead on identifying when and how these decision-making models produce arbitrary outcomes. To illustrate the harms, we use an analogy, comparing a model's decision to that of a judge flipping coins to decide the outcome of a case. Though imperfect, we find this comparison makes the harms due to multiplicity more palpable, since the analogy emphasizes that the source of harm is the randomness inherent in ML models. Next, we establish our working definition and reference for each principle.

Freedom of Expression. Freedom of Expression (FoE) is defined in Article 19 of the ICCPR as:
1. Everyone shall have the right to hold opinions without interference.
2. Everyone shall have the right to freedom of expression; this right shall include freedom to seek, receive and impart information and ideas of all kinds, regardless of frontiers, either orally, in writing or in print, in the form of art, or through any other media of his choice.
3. The exercise of the rights provided for in paragraph 2 of this article carries with it special duties and responsibilities. It may therefore be subject to certain restrictions, but these shall only be such as are provided by law and are necessary: (a) For respect of the rights or reputations of others; (b) For the protection of national security or of public order (ordre public), or of public health or morals.
We interpret this rule in light of UN General Comment 34 [13], which emphasizes that freedom of expression is a broad and fundamental human right for realizing other human rights. It encompasses all forms of expression, including political discourse, journalism, artistic works, and religious dialogue, across various mediums like broadcasting, the internet, and public protest. The comment underscores the right to access information and recognizes the critical role of the internet and digital media in enabling and enhancing the exercise of freedom of expression, advocating for universal access to these platforms.
This right is expansive but not absolute and, therefore, can be subject to certain restrictions. However, these restrictions must be clearly defined by law, serve a legitimate aim (such as protecting national security, public order, or the rights of others), and be necessary and proportionate. General Comment 34 [13] explicitly denounces certain prohibitions, such as blasphemy laws and unreasonable restrictions on media, as incompatible with the ICCPR.
In the context of content moderation, the existence of predictive multiplicity in ML algorithms calls into question their ability to meet all the requirements for a lawful restriction of freedom of expression. As an example, an ML model trained with random seed 1 could misapply a restriction to protected speech (e.g., journalistic speech), whereas the same model trained with random seed 42 would have correctly tolerated the statement. Such an event would be equivalent to a judge flipping a coin to decide whether the speech should be protected or taken down. This is not a contrived example: in Section 5 we observe that varying the random seed causes fine-tuned large language models to assign conflicting toxic speech predictions to 34% of statements from a large-scale dataset.
These articles are intended to protect individuals from discrimination. We argue that ML algorithms can discriminate against specific individuals or groups for two reasons. First, toxic statements target specific societal groups; therefore, biased under-moderation means that these groups receive inferior protection from toxic speech. Second, language that is discriminatory in a broader context can be part of the dialect of a community and not offensive in that space. If this language is censored within this space, that social group suffers over-moderation and is less able to exercise free speech. For example, this was the case identified by Dias Oliva et al. [20] with content moderation in LGBTQ discussion spaces. Based on these two observations, our experiments are able to infer the presence of discrimination by analyzing the group targeted by the statements.
Based on the discussion above, we claim that any algorithm that causes a particular individual or group to receive more or fewer restrictions on their speech compared to others is a discriminatory algorithm. In particular, if the magnitude of predictive multiplicity in an ML algorithm differs across groups, then that ML algorithm is discriminatory.
In Section 5 we experimentally observe exactly this phenomenon: varying the random seed causes fine-tuned large language models to assign conflicting toxic speech predictions to 38% of race-based statements from a large-scale dataset compared to 20% of misogynistic/misandrist statements. Such an algorithm is blatantly discriminatory, and the discrimination stems from the unequal protection of groups that are entitled to the same rights.

Procedural Justice
The UN Guiding Principles on Business and Human Rights [58] emphasize that businesses should identify, prevent, and mitigate human rights abuses. The human right to due process is established by Article 14(1) of the ICCPR, which states: All persons shall be equal before the courts and tribunals. In the determination of any criminal charge against him, or of his rights and obligations in a suit at law, everyone shall be entitled to a fair and public hearing by a competent, independent and impartial tribunal established by law. The press and the public may be excluded from all or part of a trial for reasons of morals, public order (ordre public) or national security in a democratic society, or when the interest of the private lives of the parties so requires, or to the extent strictly necessary in the opinion of the court in special circumstances where publicity would prejudice the interests of justice; but any judgement rendered in a criminal case or in a suit at law shall be made public except where the interest of juvenile persons otherwise requires or the proceedings concern matrimonial disputes or the guardianship of children.
We interpret Articles 14 and 19 (mentioned above) as jointly demanding that a restriction of a fundamental right be impartial, fair, and prescribed by law. This means providing remedies through operational grievance mechanisms when harm occurs and ensuring that processes are transparent and accountable. When we translate this to ML models for content moderation, moderation needs to be explainable, accountable, and have a rule-based approach for limiting free speech. In this regard, the outcomes of ML models must conform to these legal requirements. This interpretation includes, for example, respecting the requirements from General Comment 34 [13] (i.e., legality, necessity, proportionality, and pursuit of a legitimate aim) for restricting speech. This joint interpretation establishes the obligation of common procedural guidelines for removing speech.
The existence of predictive multiplicity in ML algorithms calls into question their ability to satisfy values of procedural justice. The "decision-making process" used by ML algorithms is fundamentally probabilistic and often random. As evidenced by the experimental observation in Section 5, varying the random seed can dramatically alter the predictions made by fine-tuned large language models. This observation is equivalent to a judge sometimes flipping coins to determine whether to restrict or order the removal of speech. Continuing the judge analogy, the act of flipping coins to determine when to restrict speech violates procedural justice for a few reasons. First, it does not respect a rule-based approach to restricting speech, as it is fundamentally random. Second, it is not impartial, as it is disparate across groups. Third, it is not accountable because this decision-making process is concealed. By "concealed", we mean that we cannot know whether a given prediction is an instance of predictive multiplicity. In fact, this information is impossible to obtain if we analyze a single model alone, as multiplicity can only be identified when we compare predictions from multiple models. It follows that both the judge who flips coins and the ML algorithm violate procedural justice and fairness. Since the source of the violation is randomness, this violation is independent of whether the final outcome is legally correct.
Experimentally Measuring Harm. To study multiplicity using the framework of legal harms outlined above, we run multiple experiments in which we fine-tune various state-of-the-art models for toxic speech detection, test them across different datasets of toxic and non-toxic statements, and observe the incidence of predictive multiplicity across models and across the groups targeted by toxic speech. We also compare disagreement among models to disagreement among human annotators. These experiments are designed to allow us to quantitatively measure multiplicity and its potential harms, e.g., violations of FoE and procedural justice.
Below we summarize our experimental development, which is later detailed in Section 3 and complemented in the Appendix. We first acquire competing models by repeatedly fine-tuning the RoBERTa base [46] and the ToxDectRoBERTa [72] models for toxicity detection on large-scale datasets (with computational costs of hundreds of GPU hours); we use these model architectures and datasets with the goal of simulating how a company would approach content moderation. To quantify the extent of predictive multiplicity, hence answering RQ1, we compute arbitrariness (Definition 1) and pairwise disagreement (Definition 2) on our competing fine-tuned models and show the prevalence of arbitrary decisions in state-of-the-art (SOTA) toxicity detectors (Table 2). Aiming to assess how arbitrary decisions are spread across demographic groups, answering RQ2, we compute arbitrariness and pairwise disagreement in sentences targeting specific social groups (Figure 1).
Next, we provide the necessary theoretical background on predictive multiplicity (Section 3) and define the setup for the described experiments (Section 4). Finally, in Section 5, we display and analyze our experimental results.

Background on Predictive Multiplicity
In this section, we discuss setup and notation, mathematically define the set of all competing models (the Rashomon set), and define the multiplicity metrics of interest in this paper: pairwise disagreement and arbitrariness.
Preliminaries. We focus on the task of binary classification of toxic speech. Consider a dataset with $n \in \mathbb{N}$ examples $D \triangleq \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is a sentence (e.g., "I love you" or "I hate you") and $y_i \in \{0, 1\}$ is a binary label that is 1 when the sentence is "Toxic" and 0 when it is "Not Toxic". In the open-source datasets used in this work, labels were generated by human annotators (see Appendix B.1 for details). As usual, the dataset $D$ is partitioned into three subsets, one for training $D_{\text{train}}$, one for validation $D_{\text{val}}$, and one for testing $D_{\text{test}}$, i.e., $D = D_{\text{train}} \cup D_{\text{val}} \cup D_{\text{test}}$. We use the training dataset ($D_{\text{train}}$) to further train (fine-tune) a machine learning model $h \in H$ that takes a sentence $x_i$ and returns a binary label $h(x_i) \in \{0, 1\}$, where $H$ is the model class/architecture (e.g., all RoBERTa base [46] models with different parameters). We use the validation dataset ($D_{\text{val}}$) to perform hyper-parameter tuning and performance evaluation during training, and the test dataset ($D_{\text{test}}$) for final evaluations.
We use error to measure the quality of a model. Formally, the error of a model h ∈ H over a dataset S ⊆ D is given by

\[ \mathrm{Err}_{S}(h) \triangleq \frac{1}{|S|} \sum_{(x_i, y_i) \in S} \mathbb{1}\left[ h(x_i) \neq y_i \right], \]

where 1[condition] is the indicator function that outputs 1 if condition is true and 0 otherwise.
The training error, i.e., the error over D_train, is denoted by Err_train(h), and similarly for the testing error.
Competing Models and the Rashomon Effect We call a fixed (e.g., deployed) model for flagging toxic content a reference model and denote it by h ref .
The reference model can be, for example, the empirical risk minimizer over a training set or an already deployed model. We call the set of all models with less than 1 + ϵ times the training error from h_ref the Rashomon set [27,8] and denote it by R(ϵ, h_ref). Here ϵ is the Rashomon parameter, which measures how close the performance of the models is to the performance of the reference model; see [27,7,38,49] for related definitions. For the LLMs considered in this work, the Rashomon set is theoretically and computationally challenging to characterize. We resort to empirically estimating the Rashomon set by re-running the same fine-tuning pipeline with different random seeds. Each fine-tuned model gives us a sample from the Rashomon set if the model is close in performance to the reference model. We denote these Rashomon set model samples by R(ϵ) when h_ref is clear from the context. In practice, to explore the Rashomon set, we fix a dataset D_train and model architecture H, and fine-tune as many models on D_train as our computational resources allow, each time varying the random seed. We discard any models that are not within ϵ of h_ref and choose h_ref to be a language model freely available on HuggingFace.
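The filtering step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the paper: models are represented as plain callables, the names `error` and `rashomon_samples` are hypothetical, and "within ϵ of h_ref" is read here as an additive margin on the error, which is an assumption (a multiplicative bound would only change the threshold line).

```python
from typing import Callable, Dict, Sequence, Tuple

Example = Tuple[str, int]      # (sentence, label) with label in {0, 1}
Model = Callable[[str], int]   # maps a sentence to a 0/1 prediction


def error(model: Model, data: Sequence[Example]) -> float:
    """Fraction of examples in `data` that `model` misclassifies."""
    return sum(model(x) != y for x, y in data) / len(data)


def rashomon_samples(candidates: Dict[int, Model], reference: Model,
                     data: Sequence[Example], eps: float) -> Dict[int, Model]:
    """Keep the candidates (keyed by random seed) that are close to the reference.

    ASSUMPTION: closeness is an additive margin, err(h) <= err(h_ref) + eps.
    """
    threshold = error(reference, data) + eps
    return {seed: m for seed, m in candidates.items()
            if error(m, data) <= threshold}
```

Each entry of `candidates` would correspond to one fine-tuning run with a different seed; the surviving entries play the role of the Rashomon set samples R(ϵ).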
There is no standard method for selecting the Rashomon parameter ϵ. Most papers on predictive multiplicity resort to showing how results vary when the Rashomon parameter is changed [49,38,6,60,45]. Recently, Paes et al. [54] proposed a principled manner of choosing the Rashomon parameter based on Clopper-Pearson confidence intervals. This approach, which we refer to as the CP method, selects ϵ based on a confidence parameter, dataset size, and the error of the reference model. We follow their approach using a confidence parameter of 95% for a conservative analysis. We also explore different confidence values in Appendix C.
Measuring Predictive Multiplicity A classification problem exhibits predictive multiplicity when models in the Rashomon set assign conflicting predictions to the same data point, as formally defined in Marx et al. [49, Definition 2]. To measure predictive multiplicity, we use the following two metrics: arbitrariness, which is a generalization of ambiguity [49], and pairwise disagreement [7,19].
While ambiguity computes the fraction of points for which at least one model in the Rashomon set disagrees with the reference model (h_ref), arbitrariness measures the fraction of points in the dataset that receive conflicting predictions from any two models in the Rashomon set (competing models). It is formally defined next.
Definition 1 (Arbitrariness) The arbitrariness on a set of inputs S = {x_1, ..., x_n} ⊆ D over the Rashomon set model samples R(ϵ, h_ref) is the proportion of inputs in the set S that receive conflicting predictions from any two models in the Rashomon set model samples:

\[ \mathrm{Arbitrariness}(S) \triangleq \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[ \exists\, h, h' \in R(\epsilon, h_{\mathrm{ref}}) : h(x_i) \neq h'(x_i) \right]. \]

Pairwise disagreement is a per-sample measure that approximates the fraction of models in the Rashomon set that disagree on a particular prediction. Formally, pairwise disagreement is defined as follows.
Definition 2 (Pairwise Disagreement [7,19]) The pairwise disagreement for a given input x ∈ D over the Rashomon set model samples R(ϵ, h_ref) = {h_1, ..., h_M} is the proportion of pairs of models that disagree on the given input:

\[ \mathrm{PD}(x) \triangleq \frac{2}{M(M-1)} \sum_{1 \le j < k \le M} \mathbb{1}\left[ h_j(x) \neq h_k(x) \right], \]

where M = |R(ϵ)|, i.e., M is the number of models we sample from the Rashomon set via retraining. Throughout this paper, we will report the average pairwise disagreement, given by averaging the empirical pairwise disagreement across all points in a dataset. Formally, given a set of inputs S = {x_1, ..., x_n} ⊆ D, the average pairwise disagreement is given by:

\[ \overline{\mathrm{PD}}(S) \triangleq \frac{1}{n} \sum_{i=1}^{n} \mathrm{PD}(x_i). \]

Arbitrariness and pairwise disagreement are both defined as point-wise estimates over the Rashomon set samples. To account for the error due to sampling in a finite dataset instead of using the true data distribution, we also report 95% confidence intervals for their estimates using the bootstrap method from Seaborn [66] across the available dataset; see Figures 1 and 2 for an example.
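Definitions 1 and 2 can be computed directly from a matrix of 0/1 predictions. The sketch below is illustrative only: the function names and the layout `pred_matrix[i][m]` (prediction of model m on input i) are assumptions, not part of the paper's code. For binary predictions, the number of disagreeing pairs equals the number of models predicting 1 times the number predicting 0, which avoids enumerating pairs.

```python
from typing import Sequence


def pairwise_disagreement(preds: Sequence[int]) -> float:
    """Fraction of unordered model pairs that disagree on a single input.

    `preds[m]` is the 0/1 prediction of model m for that input.
    """
    num_models = len(preds)
    ones = sum(preds)
    disagreeing_pairs = ones * (num_models - ones)
    total_pairs = num_models * (num_models - 1) / 2
    return disagreeing_pairs / total_pairs


def arbitrariness(pred_matrix: Sequence[Sequence[int]]) -> float:
    """Fraction of inputs on which at least two models disagree."""
    return sum(len(set(row)) > 1 for row in pred_matrix) / len(pred_matrix)


def average_pairwise_disagreement(pred_matrix: Sequence[Sequence[int]]) -> float:
    """Mean of the per-input pairwise disagreement over all inputs."""
    return sum(pairwise_disagreement(row) for row in pred_matrix) / len(pred_matrix)
```

With four models and three inputs predicted as `[0, 0, 0, 0]`, `[1, 1, 1, 0]`, and `[1, 0, 1, 0]`, two of the three inputs are arbitrary, and the per-input pairwise disagreements are 0, 1/2, and 2/3.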
We select the above metrics because they quantify two important aspects of predictive multiplicity: (i) the fraction of samples in a dataset for which predictions are arbitrary (Defn. 1), in that a competing model would have assigned a different prediction, and (ii) the extent to which models disagree on individual samples (Defn. 2).
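The bootstrap confidence intervals mentioned alongside the definitions can be illustrated with a plain percentile bootstrap over per-sample values (a 0/1 arbitrariness indicator or a per-sample pairwise disagreement). This is a stand-in written with the standard library, not Seaborn's implementation; the function name, the number of resamples, and the fixed seed are assumptions made for the example.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(values: Sequence[float], n_boot: int = 1000,
                 level: float = 0.95, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample the dataset with replacement and record the mean each time.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = max(lo_idx, int((1 + level) / 2 * n_boot) - 1)
    return means[lo_idx], means[hi_idx]
```

Resampling is done over data points rather than over models, which matches the stated goal of accounting for the finite dataset.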
Given a set of models sampled from the Rashomon Set (e.g., by varying random seeds), we quantify predictive multiplicity in two steps.First, we measure the number of arbitrary decisions (arbitrariness) made by competing models.Here, arbitrariness captures how many moderation decisions were not rule-based but just a consequence of random seed selection.As discussed in Section 2, such random decisions go against procedural fairness because they violate due process, are not accountable, and, if the magnitude of arbitrariness is different across groups, then the impact of randomness is also disparate.Second, we compute pairwise disagreement to estimate the number of models that disagree on their predictions.If the number of conflicting predictions was, on average, negligible, one might argue that ignoring this conflicting minority is acceptable [6].However, our experimental results show that such disagreement is high (Table 2), especially in specific targeted demographic groups (Figure 1).In the next section, we apply this measurement pipeline to state-of-the-art toxic text detectors.

Experimental Setup
This section outlines the datasets, ML models, and methodology used for evaluating predictive multiplicity in content moderation.Our goal is to describe our overall experimental approach and provide a rationale for the choice of datasets and base LLM models.
Our experiments involve fine-tuning state-of-the-art language models on large-scale datasets. Fine-tuning refers to the act of taking a general-purpose LLM trained on a large corpus of text, e.g., RoBERTa [46], and further training it on a specific objective, such as toxicity classification. Typically, this training is shorter (fewer epochs) and less intense (smaller learning rate, fewer updated layers) than the original training (commonly called pre-training), which is what motivates the term fine-tuning. All language models referred to in this section have been fine-tuned for toxicity classification, meaning they take as input a piece of text and output either 0, denoting a non-toxic rating, or 1, denoting a toxic rating.
On state-of-the-art model selection Our first goal is to identify the state-of-the-art open-source language models that have been fine-tuned for toxicity detection. We begin by evaluating the performance of all Hugging Face [68] toxicity-detection language models with more than 3000 downloads. As of January 1st, 2024, this results in 8 models (see Appendix B.2). The best-performing model (see Table 1) was tomh TR [34], which we will refer to as ToxiGen-RoBERTa. This model is the ToxDectRoBERTa [72] model fine-tuned on the ToxiGen dataset [34]. We fix ToxiGen-RoBERTa as our reference model. Our second goal is to create competing models to ToxiGen-RoBERTa, which we did by taking the base model architecture (ToxDectRoBERTa) and fine-tuning it 40 times on the ToxiGen dataset while only varying the random seed between runs. See Appendix B.3 for details on the fine-tuning procedure. We then discard the models that are worse than the reference model using the CP method from [54] outlined in Section 3, with a confidence of 95%. This choice enables a conservative estimate of the size of the Rashomon set and, therefore, of multiplicity across datasets. This results in a Rashomon parameter of ϵ = 0.016, and we keep 35 of the 40 models as Rashomon set samples (R(ϵ)).
On dataset selection Next, we used these 35 models to quantitatively measure predictive multiplicity across datasets and social groups. We use the following publicly available datasets: ToxiGen [34], DynaHate [63], SBF (Social Bias Frames) [59], HateExplain [50], MHS (Measuring Hate Speech) [41], and WikiDetox [69]. These datasets were chosen for two main reasons. First, they were purposefully designed to challenge ML-based toxic text classification. For example, ToxiGen and Social Bias Frames (SBF) contain mostly "implicit" toxic speech [34,59], and DynaHate uses a human-and-model-in-the-loop process to generate a dataset designed to challenge ML models. Second, these datasets have labels for demographic groups targeted by the text. We use this information to quantify and compare arbitrariness and pairwise disagreement across different targeted groups (Figure 1). We also use the Measuring Hate Speech (MHS) [41] and the WikiDetox [69] datasets. We chose these datasets because they add one additional dimension to our analysis: the labels of multiple human annotators who assessed the toxicity of the sentences in the dataset. This information enables us to compare human annotators' disagreement with model disagreement (Figure 2). See Appendix B.1 for further details on these datasets.
Further model selection Moreover, we repeat the multiplicity experiment outlined above with the second-best-performing model from HuggingFace to check that our experimental results are not a mere artifact of model architecture or training data selection. This model is s-nlp RTC [18], which we will refer to as RoBERTa-Toxicity-Classifier from here on. This model is a base RoBERTa model [46] fine-tuned on the Jigsaw dataset [12,11,42]. Due to computational limitations, we fine-tune this model 20 times and use the same CP method outlined above to discard the worst-performing models with a confidence of 95%. This results in a Rashomon parameter of ϵ = 0.002, and we keep 16 of the 20 fine-tuned models.
Having fine-tuned our models, in the next section, we will present how these models exhibit predictive multiplicity in accordance with the mathematical formulation in Section 3. For each of our findings, we also draw connections between our experimental results and their impact on the principles defined in Section 2.
Table 1: Test accuracy for all Hugging Face toxicity detection models with more than 3k downloads and ToxiGen across different datasets. The best-performing model accuracy is shown in green and the second best in blue. See Table 4 for the full list of selected models along with their references.

Data Analysis
In this section, we present our experimental results and discuss their meaning in terms of the principles defined in Section 2. As we did in Section 2, we will often refer to the illustration of a judge flipping a coin to discuss the harms identified.

Procedural Justice, Freedom of Expression, and Judges Flipping Coins
Technical Analysis Our first experimental result regards the extent of arbitrariness and disagreement in our fine-tuned state-of-the-art toxicity detectors.Table 2 shows the prevalence of arbitrariness for the fine-tuned Toxigen and Jigsaw models across all tested datasets.We also observe that for the fine-tuned Toxigen, more than 34% of all decisions made by the models at the test time are arbitrary, i.e., there exists another competing model with a conflicting prediction.
For the fine-tuned Jigsaw models, this number decreases to closer to 23%. Moreover, both the fine-tuned Toxigen and Jigsaw models produced a high number of conflicting predictions. Table 2 also shows a high percentage of pairwise disagreement for the fine-tuned Toxigen and Jigsaw models across all tested datasets. Our experiments show that, on average, 8.3% of the pairs of fine-tuned Toxigen models disagree in their prediction, i.e., the average pairwise disagreement is 8.3%, while 6.9% of the pairs disagree for the fine-tuned Jigsaw models. This implies that, on average, for each point on which the models disagree, 14% of the fine-tuned Toxigen models made one prediction about sentence toxicity and 86% of the models predicted the opposite. This high pairwise disagreement is especially relevant for methods that aim to decrease arbitrary decisions by taking a majority vote across fine-tuned competing models, such as [6].
Table 2: Average pairwise disagreement and arbitrariness at test time for the Toxigen fine-tuned and Jigsaw fine-tuned models on different datasets. The confidence in the CP method was chosen to be 95% for a more conservative analysis. 95% confidence intervals are shown using the standard error of the mean.
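The 14% / 86% split quoted above can be recovered with a short calculation. Non-arbitrary points contribute zero pairwise disagreement, so the average disagreement restricted to arbitrary points is roughly 0.083 / 0.34 ≈ 0.244. If a fraction p of the models takes the minority side, the share of disagreeing pairs is approximately 2p(1 − p) when the number of models is large, and solving 2p(1 − p) = 0.244 gives p ≈ 0.14. The sketch below encodes this reasoning; the function name is hypothetical, and the large-ensemble approximation and the use of averages as a "typical" point are simplifying assumptions.

```python
import math


def minority_share(avg_pairwise_disagreement: float, arbitrariness: float) -> float:
    """Approximate fraction of models on the minority side of an arbitrary point.

    Solves 2 * p * (1 - p) = d for the smaller root, where d is the average
    pairwise disagreement restricted to arbitrary points.
    """
    d = avg_pairwise_disagreement / arbitrariness
    return (1 - math.sqrt(1 - 2 * d)) / 2


# Values reported for the fine-tuned Toxigen models.
toxigen_minority = minority_share(0.083, 0.34)
```

Rounded to two decimals, `toxigen_minority` is 0.14, which matches the split reported in the text.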
A Violation of Procedure and Freedom of Expression We picture each of our models as a judge, and each statement in the training dataset as a court case on toxic online content on which they must rule. The judge's decision is binary: either take down the online post or not. Recall that the models we developed and tested are part of a Rashomon set, meaning they all have very similar accuracy and are, therefore, equally good. On average, all judges make the same number of correct rulings. However, in 34% of court cases, at least two judges disagree on the ruling (arbitrariness). These conflicting rulings are not a result of judges having different interpretations of the law or having different ideologies (e.g., more or less punitive). These conflicts stem from purely random events, e.g., in 34% of court cases the judge flips a coin to decide whether to take down the online post or not. Per Section 2, such decisions are entirely detached from notions of due process, legality, and impartiality, and hence constitute a violation of procedure and freedom of expression. Bringing the discussion back to ML models, the fact that we measure a 34% arbitrariness value due solely to random events means these ML models, if deployed in the real world, would blatantly violate procedure and FoE (as defined in Section 2). We emphasize that if the 34% arbitrariness value could be attributed to clear and explainable differences in decision-making, then this value would not be a violation of procedure and FoE. The randomness is the source of the violation, not the magnitude of the value.

Disparate Arbitrariness: Different Content Gets Different Coin Flips
Technical Analysis Figure 1 indicates that the incidence of arbitrariness is not the same across all targeted groups. We observe that anti-LGBTQ speech consistently receives more arbitrary decisions than misogynist/misandrist speech for both Toxigen and Jigsaw fine-tuned models. Across the Toxigen fine-tuned models, anti-LGBTQ speech receives arbitrary decisions 35% of the time, while misogynist/misandrist speech receives arbitrary decisions around 30% of the time. These differences are even greater on Jigsaw fine-tuned models. Moreover, racist speech has more than twice the arbitrariness of misogynist/misandrist speech on Jigsaw fine-tuned models.
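The per-group comparison amounts to computing arbitrariness separately for the sentences that target each group. A minimal sketch, assuming each sentence comes with a single target-group label and a list of predictions from the competing models (the function name and data layout are hypothetical):

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def arbitrariness_by_group(
        rows: Sequence[Tuple[str, Sequence[int]]]) -> Dict[str, float]:
    """Arbitrariness per targeted group.

    Each row is (target_group, predictions), where predictions holds the
    0/1 output of every competing model for one sentence.
    """
    total = defaultdict(int)
    arbitrary = defaultdict(int)
    for group, preds in rows:
        total[group] += 1
        # A sentence is arbitrary if the models do not all agree on it.
        arbitrary[group] += len(set(preds)) > 1
    return {group: arbitrary[group] / total[group] for group in total}
```

Differences between the returned values are what the text refers to as disparate arbitrariness.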
A Violation of Non-discrimination Returning to the judge analogy, our experimental results indicate that decisions based on coin flips occur more frequently for certain marginalized groups than for others. An example would be that in 35% of court cases concerning LGBTQ content, the judge flips a coin to decide the outcome, whereas the judge does this only 30% of the time for misogynist and misandrist content. This unequal application of arbitrariness based on socio-demographic characteristics is a blatant violation of non-discrimination as defined in Section 2. Bringing the discussion back to ML models, the fact that we measure a difference in arbitrariness values across different groups due solely to random events means these ML models, if deployed in the real world, would violate non-discrimination. Unlike Section 5.1, even if this effect could be attributed to clear and explainable differences in model decision-making, it would still constitute a violation of the principle of non-discrimination. People are entitled to a rule-based evaluation of whether their speech should be restricted. The uneven application of different approaches is therefore, in itself, a violation of the principle of non-discrimination. Moreover, we expand on the problems of abandoning the rule-based approach in the next sub-section.

Comparing Human and Machine Arbitrariness: Who is Flipping Coins?
Finally, we compare the arbitrariness across competing ML models in the Rashomon set and across human annotators.Our goal is to verify if disagreements in predictions between fine-tuned LLMs match the disagreement observed in human annotators, in which case ML models would be replicating disagreement already present in the training data.As we see next, that is not the case.
Technical Analysis From Figure 2, we observe two results. First, model disagreement tends to be higher in sentences where humans do not agree (i.e., unclear statements). This is an interesting finding because the models were blind to the divergence between annotators. Effectively, this implies that models, like humans, struggle with classifying certain statements. Our second finding is that models in the Rashomon set can display a high level of disagreement and, hence, arbitrariness in sentences on whose toxicity humans unanimously agreed (i.e., clear statements). In these cases, models output conflicting predictions when faced with evaluations that would be obvious to human annotators. Note that WikiDetox is part of the training data for the fine-tuned Jigsaw models, which is why the arbitrariness and disagreement values are noticeably small. Even in this extreme, our first observation holds. This is further evidence that there are certain statements in these datasets that both humans and models struggle to correctly classify.
Figure 2: We consider a sentence Unclear when at least one annotator labeled the sentence differently than others and Clear otherwise. The confidence in the CP method was chosen to be 95%.
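The Clear/Unclear rule stated above can be written down explicitly and combined with the arbitrariness indicator. The following sketch assumes that each sentence carries the raw labels of its human annotators and the predictions of the competing models; the function names and data layout are illustrative assumptions.

```python
from typing import Dict, Sequence, Tuple


def clarity(annotator_labels: Sequence[int]) -> str:
    """'Clear' if all annotators gave the same label, 'Unclear' otherwise."""
    return "Clear" if len(set(annotator_labels)) == 1 else "Unclear"


def arbitrariness_by_clarity(
        items: Sequence[Tuple[Sequence[int], Sequence[int]]]) -> Dict[str, float]:
    """Arbitrariness of model predictions on Clear and on Unclear sentences.

    Each item is (annotator_labels, model_predictions) for one sentence.
    """
    counts = {"Clear": [0, 0], "Unclear": [0, 0]}  # [arbitrary, total]
    for labels, preds in items:
        bucket = counts[clarity(labels)]
        bucket[0] += len(set(preds)) > 1
        bucket[1] += 1
    return {name: arb / tot for name, (arb, tot) in counts.items() if tot > 0}
```

A non-zero value for the Clear bucket corresponds to the second finding: models disagree even where annotators were unanimous.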
A violation of rule-based approach This point is where our analogy (un)fortunately reaches a limitation. In real life, judges will make similar decisions on most easy cases, such as an appeal for a parking ticket or the lawfulness of a social media post stating "Hello world". However, our findings indicate that ML models struggle over certain statements that would be obvious to any human judge, whose role is to deliver rule-based decisions. This means that models are struggling with unambiguous statements, which raises concerns about the models' capacity to deliver good outcomes even in situations that are obvious to humans.

On the Consequences of Arbitrariness
In this section, we illustrate how selecting and deploying content moderation models at large scale, under predictive multiplicity, resembles the moral dilemma of the trolley problem. We make this comparison to discuss the relevance of our work for law and policy decision-makers and scholars. Finally, we offer insights for the path forward.
The inscrutable trolley problem The harms of arbitrariness reflect a fundamental problem for the use of ML in content moderation.ML models make statistical predictions and not interpretations of rules -unlike human decision-makers.However, the purpose of these proxy adjudicators is to replicate the criteria of a rule-based decision-making process, and not a probable outcome.
Replacing human interpretation with statistical models to control the exercise of a right is only tolerable if the models deliver similar expected outcomes, operate on the same criteria, and offer procedural guarantees. We have empirically shown that this is not the case. The criteria used by statistical models are often random, and oftentimes these stochastic elements are concealed from the end user.
Our work also identifies the harms stemming from arbitrary model selection (e.g., which model of the Rashomon set is chosen and deployed).When there is no clear reason for choosing one model over the other, an artificial "lottery" is created on which data points will draw the fate of being subjected to random treatment.Our results indicate that this "lottery" is not fair: different population groups targeted by the text have different likelihoods of arbitrary treatment.
If we can draw a final analogy, this creates a troubling scenario where choosing ML models is an inscrutable trolley problem.The trolley problem is a famous thought experiment in ethics and psychology involving a moral dilemma where a person must choose between actively diverting a runaway trolley to harm one person or passively allowing it to continue on its path and harm five people.Here, we do not know why and how companies choose between equally good models, but each one of them will cause the undue moderation of different individuals.
In the context of content moderation, from the point of view of the individuals affected by the arbitrary predictions, we effectively have a misapplication of freedom of expression rules.In other words, if arbitrariness creates harm, this is no less severe if the same prediction undid an error elsewhere.The fact that an individual suffered undue harm is a violation on its own.The fact that there was an equally possible and justifiable scenario where a different algorithm could have been deployed and preserved their rights raises questions of moral and legal responsibility.Exploring such questions in the context of specific jurisdictions is an interesting future research direction.
Impact on ongoing law and policy debates on content moderation One important debate in the platform regulation field is directly affected by these findings: the ongoing discussion of laws affecting content moderation, such as platform liability rules [48,9]. Legal responsibilities imposed on service providers push companies to perform more content moderation focused on particular types of expression. Striking the right balance between free speech and expeditious response, considering the volume and plurality of online communication, is a hard legal and technical task. Adding to these challenges, scientific disinformation, electoral integrity, and online extremism are all topics that have fuelled heated discussions on the need to prevent online harms while balancing international human rights, or even on whether international human rights are sufficient to tackle this issue [22].
Our findings on predictive multiplicity increase the complexity of these tasks. Arbitrariness in algorithmic content moderation is an intrinsic characteristic of ML models with non-negligible potential for harmful outcomes, which increases the obligations companies should have in terms of accountability, transparency, and mechanisms of revision. Moreover, as companies are assigned increasingly complex content moderation tasks, and therefore employ more complex models, we do not know how arbitrariness might affect outcomes. For instance, in 2022 the Brazilian Electoral Courts [25] ordered the removal of content that was "similar" to content that had previously been subject to a removal order. The time frame for companies to respond during election periods varied between 1 and 3 hours, after which the companies were subject to heavy fines. To meet these legal requirements, companies might rely on other ML models to assess "similarity" at scale (whatever that might mean).
This context of heightened reliance on models to perform highly subjective and interpretative legal tasks raises the question of what the outcomes of content moderation might be when the arbitrariness of multiple models is combined. We must also consider that these various models are employed to analyze increasingly complex media (e.g., voice, images, and video). These concerns about arbitrariness in legally mandated content moderation can be extended to other statutes such as the UK Online Safety Act [53].
In sum, a number of laws around the world are pressuring industry to expand content-moderation tools for various legal reasons, ranging from copyright, to public health, to national security. The consequence of these legislative and regulatory changes might be the intensified use of complex content moderation models, which are part of the infrastructure governing online speech [29]. Our findings shed light on the intrinsic legal complexities of these models.
Limitation of our work Some limitations of our research offer paths for future work and further development of this process.We only measured multiplicity across binary toxicity detection.However, models that predict beyond binary toxicity (e.g., models that predict the level of toxicity) may be used, affecting our results.We also didn't investigate the possibility of a statement fulfilling multiple categories of toxic speech, and we also know that content moderation may prompt different governance decisions other than simply content removal (e.g., reducing reach and labeling).Future work for a more nuanced discussion of arbitrariness in content moderation should explore these dimensions.

Conclusion
This paper shows that predictive multiplicity is present in state-of-the-art content moderation models, particularly in large language models for toxicity detection, and generates arbitrary moderation decisions. We explore the impact of this finding on the principles of freedom of expression, non-discrimination, and procedural justice. Then, we show that arbitrary decisions are not uniformly spread across all texts and that they are more common in texts that target specific demographic groups (e.g., anti-LGBTQ posts), and we discuss the implications of this finding in terms of the principle of non-discrimination. Finally, we check whether arbitrary decisions from content moderation models align with conflicting moderation from humans and find that the arbitrary decisions of models are also present in sentences on which human annotators unanimously agree.
The path forward We conclude that ML models are not perfect proxies for humans when evaluating free speech. With this conclusion, we do not mean to claim that algorithmic content moderation should not be used; instead, we think its deployment needs to be much more nuanced and accountable. First, we need explainability and transparency over arbitrary decisions in the development and deployment process, and we need to analyze whether the criteria used to produce the moderation decision respect the criteria we expect in terms of company policies and legal rules. Second, we need to understand how arbitrariness disparately affects subsets of the population and develop techniques to mitigate this impact. Finally, it is also necessary to investigate a more nuanced approach to content moderation, where certain variables (e.g., thematic content, socio-demographic factors, type of illegal or harmful speech) should prompt more controls and human revision.

B.3 Hyperparameters
The accuracy of fine-tuned language models depends heavily on a multitude of hyperparameters.
In the main body, we retrain two different model types multiple times: the ToxiGen-RoBERTa [34] and the RoBERTa-Toxicity-Classifier [18].In this section, we detail the hyperparameters used in the main body.
ToxiGen-RoBERTa: Retraining the ToxiGen-RoBERTa model was done by fine-tuning the ToxDectRoBERTa model [72] (∼355 million trainable parameters) on 4,601 training examples from the human-annotated subset of the ToxiGen dataset [34]. In particular, we trained on a subset of the ToxiGen data used by [36] that removed prompts for which 3 annotators disagreed on the target group. Moreover, no quantization was done on the ToxDectRoBERTa model, and all training runs were performed on an 80 GB A100 GPU. We fixed the number of epochs to 10 and performed an extensive hyper-parameter sweep over:
• Learning rate: Logarithmically spaced values from 10^-6 to 10^-4.
• Weight decay: Linearly spaced values from 0 to 0.1 with a 0.01 spacing.
• Warmup steps: Linearly spaced values from 0 to 30% of an epoch with a 5% spacing.
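For concreteness, the three search ranges listed above can be enumerated as follows. This is only an illustration of the ranges: the number of logarithmically spaced learning rates is not stated, so the value of 5 below is an assumption, and the helper name is hypothetical. It also does not reproduce how the sweep backend sampled from these ranges.

```python
from typing import List


def log_spaced(low_exp: float, high_exp: float, num: int) -> List[float]:
    """`num` values spaced evenly in log10 between 10**low_exp and 10**high_exp."""
    step = (high_exp - low_exp) / (num - 1)
    return [10 ** (low_exp + i * step) for i in range(num)]


# ASSUMPTION: 5 learning rates; the text only gives the end points.
learning_rates = log_spaced(-6, -4, num=5)

# 0, 0.01, ..., 0.10
weight_decays = [round(0.01 * i, 2) for i in range(11)]

# 0%, 5%, ..., 30% of an epoch, expressed as fractions
warmup_fractions = [round(0.05 * i, 2) for i in range(7)]
```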
All other hyperparameters were set to the defaults that Huggingface's sequence classification routine uses. In particular, this means a linear learning rate schedule with the AdamW optimizer. The sweep was done via the Trainer API from HuggingFace Transformers with the Optuna [1] backend, which used evaluation accuracy to prune unpromising trials early in training. In total, Optuna made 60 complete training runs (the average run took an hour and 20 minutes on an 80 GB A100 GPU). The optimal parameters were found to be: learning rate: 1e-5, batch size: 32, weight decay: 0.09, and warmup ratio: 0.1. The random seed used for the best run was 6. All ToxiGen fine-tuned models (i.e., those used in the multiplicity experiments) used these hyperparameters, except for the random seed. The seeds used for the ToxiGen fine-tuned models were randomly generated 3- and 4-digit integers sampled using [32]. See Figure 4 for a plot of the training trajectories of 10 of the random seeds.
RoBERTa-Toxicity-Classifier: Retraining the RoBERTa-Toxicity-Classifier was done by fine-tuning the base RoBERTa model [46] (∼124 million trainable parameters) on 100,000 training examples sampled uniformly from the concatenated Jigsaw dataset [12,11]. Moreover, no quantization was done on the RoBERTa model, and all training runs were performed on an 80 GB A100 GPU. In practice, the significantly larger dataset size meant that fine-tuning this RoBERTa model was approximately 3 times slower than fine-tuning the Toxigen models. Due to the increased computational cost of training these models compared to the ToxiGen models, we did not perform as extensive a hyperparameter sweep. We set the batch size to 8 (for faster training time) and did a grid search for 4 epochs over four learning rates {10^-6, 10^-5, 2 × 10^-5, 10^-4}. The best was found to be 2 × 10^-5. Then, we increased the batch size to as large as our memory allowed (32) and kept all other hyperparameters set to the defaults in Huggingface's sequence classification routine (notably: weight decay: 0 and no warmup steps). All Jigsaw fine-tuned models used these hyperparameters. The seeds used for the Jigsaw models were randomly generated 3- and 4-digit integers sampled using [32]. The average Jigsaw model took approximately 3 hours and 15 minutes to fine-tune. See Figure 4 for a plot of the training trajectories of 10 of the random seeds.

B.4 Fine-Tuned Models Performance
In Table 5, we show the performance of the models we fine-tuned and compare it against the reference models. The line Reference in Table 5 shows the train and test accuracies of the reference ToxiGen-RoBERTa model [34] and RoBERTa-Toxicity-Classifier [18]. The lines Minimum, Mean, and Maximum show the minimum, average, and maximum accuracies across all our fine-tuned models. We observe that both the train and test performance of our models approximate those of the reference models deployed on Hugging Face. Surprisingly, the fine-tuned Jigsaw models perform as well as their reference model, which was trained on 10 times more data from the same dataset.
Table 5: Accuracy of the reference models from Hugging Face and our Fine-tuned models.The column Toxigen represents the accuracy of the models fine-tuned in the Toxigen dataset.The column Jigsaw represents the accuracy of the models fine-tuned in the Jigsaw dataset.The reference line shows the accuracy from the models deployed in Hugging Face.The lines Minimum, Mean, and Maximum show the minimum, average, and maximum accuracies across all our fine-tuned models.

C Further Experimental Results
In this section, we show the main results of the paper for different values of the Rashomon parameter, obtained by varying the confidence value in the CP method [54]. Additionally, we show arbitrariness and pairwise disagreement across demographic groups for these datasets.

C.1 Arbitrariness with Different Confidences
We start by showing the pairwise disagreement and arbitrariness values for the testing partitions of ToxiGen, DynaHate, SBF, and HateExplain. We show these results for two different confidence levels in the CP method: 50% and 1%. When the confidence is smaller, more models are considered to be in the Rashomon set, at the cost of a higher probability of wrongly including a model in the set. Table 6 shows pairwise disagreement and arbitrariness for a confidence level of 50% in the CP method, and Table 7 shows the results for a confidence level of 1%. We observe that, compared with Table 2, the disagreement and arbitrariness values in Tables 6 and 7 are higher, as a consequence of models with higher error being included as samples of the Rashomon set.

Table 6: Average pairwise disagreement and arbitrariness at test time for the ToxiGen fine-tuned and Jigsaw fine-tuned models in different datasets. The confidence in the CP method was chosen to be 50% for a more conservative analysis. 95% confidence intervals are shown using the standard error of the mean.
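The two quantities reported in these tables can be illustrated with a small sketch. The definitions used here are assumptions made for illustration: pairwise disagreement is taken to be the fraction of samples on which two models differ, averaged over all model pairs, and arbitrariness is taken to be the fraction of samples that receive conflicting predictions from at least two models in the set. The function names and the toy predictions are hypothetical.

```python
from itertools import combinations


def pairwise_disagreement(preds):
    """Mean, over all model pairs, of the fraction of samples where the pair differs.

    preds: list of per-model prediction lists, all of the same length.
    """
    n_samples = len(preds[0])
    pairs = list(combinations(preds, 2))
    total = sum(
        sum(a != b for a, b in zip(p, q)) / n_samples
        for p, q in pairs
    )
    return total / len(pairs)


def arbitrariness(preds):
    """Fraction of samples that receive conflicting labels from the model set."""
    n_samples = len(preds[0])
    # zip(*preds) iterates over samples, yielding one prediction per model.
    conflicted = sum(len(set(col)) > 1 for col in zip(*preds))
    return conflicted / n_samples


# Toy example: 3 models (rows), 4 samples (columns), binary toxicity labels.
preds = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
]
disagreement = pairwise_disagreement(preds)
arb = arbitrariness(preds)
```

Under this assumed definition, adding a model to the set can never lower arbitrariness, since a sample that already has conflicting predictions stays conflicted.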

C.2 Multiplicity Across Demographics
Here, we also show how arbitrariness and pairwise disagreement vary across different targeted demographic groups. Figures 5 and 6 indicate that, even under the lower confidence values of 50% and 1%, arbitrariness and disagreement are still non-uniformly distributed, as shown in Figure 1, leading to disparate algorithmic treatment.
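A per-group breakdown of this kind can be sketched as follows. As before, this is an illustration under assumptions rather than the analysis code behind the figures: arbitrariness is taken to be the fraction of samples with conflicting predictions across models, and the function name, group labels, and toy data are hypothetical.

```python
from collections import defaultdict


def arbitrariness_by_group(preds, groups):
    """Per-group fraction of samples with conflicting predictions across models.

    preds: list of per-model prediction lists, all of the same length.
    groups: list with the targeted group of each sample (same length).
    """
    conflicted = defaultdict(int)
    totals = defaultdict(int)
    # zip(*preds) yields, for each sample, the predictions of all models.
    for group, col in zip(groups, zip(*preds)):
        totals[group] += 1
        conflicted[group] += int(len(set(col)) > 1)
    return {g: conflicted[g] / totals[g] for g in totals}


# Toy example: 3 models, 6 samples, two hypothetical target groups.
preds = [
    [1, 0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 0],
]
groups = ["A", "A", "A", "B", "B", "B"]
by_group = arbitrariness_by_group(preds, groups)
```

In this toy example the conflicting predictions fall entirely on group A, which is the kind of non-uniform distribution the figures are meant to expose.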

C.3 Human vs. Model arbitrariness
We also display the arbitrariness and pairwise disagreement values across clear and unclear toxic content. Recall that we consider a sentence clear when all human annotators agreed on its toxicity, and unclear when not all annotators classified its toxicity equally. Figures 7 and 8 present the same pattern of higher arbitrariness and pairwise disagreement in unclear sentences, while arbitrariness and pairwise disagreement also remain high in clear sentences, which we discuss in Section 5.
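The Clear/Unclear split, together with the standard-error confidence intervals reported in the corresponding figure captions, can be sketched as follows. This is a minimal illustration under assumptions: the 95% interval is computed with the normal approximation (mean ± 1.96 standard errors), which the text does not specify, and the function names, record layout, and toy values are hypothetical.

```python
import math
from statistics import mean, stdev


def is_clear(annotations):
    """A sentence is Clear when every annotator assigned the same label."""
    return len(set(annotations)) == 1


def mean_with_ci(values, z=1.96):
    """Mean and an approximate 95% interval from the standard error of the mean.

    Assumption: normal approximation; requires at least two values.
    """
    m = mean(values)
    se = stdev(values) / math.sqrt(len(values))
    return m, m - z * se, m + z * se


# Toy records: annotator labels and a per-sentence disagreement rate.
sentences = [
    {"annotations": [1, 1, 1], "disagreement": 0.0},
    {"annotations": [1, 0, 1], "disagreement": 0.5},
    {"annotations": [0, 0, 0], "disagreement": 0.2},
    {"annotations": [0, 1, 1], "disagreement": 0.3},
]

clear = [s["disagreement"] for s in sentences if is_clear(s["annotations"])]
unclear = [s["disagreement"] for s in sentences if not is_clear(s["annotations"])]

clear_mean, clear_lo, clear_hi = mean_with_ci(clear)
unclear_mean, unclear_lo, unclear_hi = mean_with_ci(unclear)
```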

Figure 1: Average pairwise disagreement and arbitrariness in different target groups for the fine-tuned ToxiGen and Jigsaw models. The results show the pairwise disagreement in percentage (x-axis) for the union of four different datasets: DynaHate, SBF, ToxiGen, and HateExplain. The results are shown for the training and test partitions of each dataset. The confidence in the CP method was chosen to be 95%.

Figure 2: Average pairwise disagreement and arbitrariness for Clear and Unclear sentences using the ToxiGen fine-tuned and Jigsaw fine-tuned models. The figure shows the estimated pairwise disagreement values along with 95% confidence intervals computed from the standard error of the mean. We consider a sentence Unclear when at least one annotator labeled the sentence differently from the others, and Clear otherwise. The confidence in the CP method was chosen to be 95%.

Figure 3: Screenshot of the Hugging Face platform's most popular toxicity detection models at the time of writing.

Figure 4: Training trajectories for the fine-tuned ToxiGen and Jigsaw models over 10 randomly chosen seeds.

Figure 5: Average pairwise disagreement and arbitrariness in different target groups for the fine-tuned ToxiGen and Jigsaw models. The results show the pairwise disagreement in percentage (x-axis) for the union of four different datasets: DynaHate, SBF, ToxiGen, and HateExplain. The results are shown for the training and test partitions of each dataset. The confidence in the CP method was chosen to be 50%, leading to the selection of 38 out of 40 RoBERTa models fine-tuned on the ToxiGen dataset and 17 out of 20 Jigsaw fine-tuned models in the Rashomon set.

Figure 6: Average pairwise disagreement and arbitrariness in different target groups for the fine-tuned ToxiGen and Jigsaw models. The results show the pairwise disagreement in percentage (x-axis) for the union of four different datasets: DynaHate, SBF, ToxiGen, and HateExplain. The results are shown for the training and test partitions of each dataset. The confidence in the CP method was chosen to be 1%, so that the Rashomon set contains all fine-tuned models: 40 out of 40 RoBERTa models fine-tuned on the ToxiGen dataset and 20 out of 20 Jigsaw fine-tuned models.

Figure 7: Average pairwise disagreement and arbitrariness for Clear and Unclear sentences using the ToxiGen fine-tuned and Jigsaw fine-tuned models. The figure shows the estimated pairwise disagreement values along with 95% confidence intervals computed from the standard error of the mean. We consider a sentence Unclear when at least one annotator labeled the sentence differently from the others, and Clear otherwise. The confidence in the CP method was chosen to be 1%, including all fine-tuned models in the above analysis.

Figure 8: Average pairwise disagreement and arbitrariness for Clear and Unclear sentences using the ToxiGen fine-tuned and Jigsaw fine-tuned models. The figure shows the estimated pairwise disagreement values along with 95% confidence intervals computed from the standard error of the mean. We consider a sentence Unclear when at least one annotator labeled the sentence differently from the others, and Clear otherwise. The confidence in the CP method was chosen to be 50%, including all fine-tuned models in the above analysis.
contains implicit toxic content, which may indicate that when the toxicity is implicit, arbitrary decisions are more common.

Table 3: Summary of all datasets used.

Table 4: All considered Hugging Face models.

Table 7: Average pairwise disagreement and arbitrariness for the ToxiGen fine-tuned and Jigsaw fine-tuned models in different datasets. The confidence in the CP method was chosen to be 1%, including all fine-tuned models.