Collective Constitutional AI: Aligning a Language Model with Public Input

There is growing consensus that language model (LM) developers should not be the sole deciders of LM behavior, creating a need for methods that enable the broader public to collectively shape the behavior of LM systems that affect them. To address this need, we present Collective Constitutional AI (CCAI): a multi-stage process for sourcing and integrating public input into LMs, from identifying a target population to sourcing principles to training and evaluating a model. We demonstrate the real-world practicality of this approach by creating what is, to our knowledge, the first LM fine-tuned with collectively sourced public input and evaluating this model against a baseline model trained with established principles from a LM developer. Our quantitative evaluations demonstrate several benefits of our approach: the CCAI-trained model shows lower bias across nine social dimensions compared to the baseline model, while maintaining equivalent performance on language, math, and helpful-harmless evaluations. Qualitative comparisons of the models suggest that they differ on the basis of their respective constitutions; e.g., when prompted with contentious topics, the CCAI-trained model tends to generate responses that reframe the matter positively rather than refusing to answer. These results demonstrate a promising, tractable pathway toward publicly informed development of language models.


INTRODUCTION
Recent work in fine-tuning language models (LMs) to align with user preferences [42,50] raises critical questions about whose preferences should guide the fine-tuning. This question is increasingly urgent as LMs are deployed more broadly and in increasingly diverse contexts, making it more likely that varied risks and harms will manifest [62]; anticipating and mitigating risks and harms is done most effectively in collaboration with affected communities [10,59]. At the same time, sociotechnical research continues to reveal how the values expressed by these models in actuality tend to reflect a limited slice of society [12,51]. This disparity has led to a growing consensus that the broader public's preferences and values must be accounted for in model development [27]. However, the research community currently lacks a well-defined process for effectively eliciting collective input from the public and incorporating it into the training of language models.
To address this, we develop a method called Collective Constitutional AI (CCAI). CCAI is a multi-stage process for (1) sourcing and integrating public preferences into a 'constitution' using the Polis platform for online deliberation [53], and (2) fine-tuning a language model to adhere to this set of preferences using Constitutional AI [9] (Figure 1). (Constitutional AI is a promising starting point for enabling greater public input into LMs, as it permits desirable behavior to be encoded explicitly in a set of natural language principles, known as a constitution.) The goal of CCAI is for the resulting LM to achieve alignment with public input, by which we mean "the LM's actual behavior is consistent with a public's preferences for its behavior". While we do not yet have a direct technical measure for "consistency" (operationalizing this complex construct requires further research, and we highlight the need for this in Section 5), we provide quantitative and qualitative experimental evidence that the resulting model is altered in a direction consistent with the collectively-sourced constitution.
We surface and highlight several subjective decision points necessary for running such a process well and producing actionable insights for practitioners and policymakers. These decision points relate to the challenge of operationalizing the concept of 'a public's preferences for LM behavior', as this is a latent and likely-contested construct, defined in terms of other similarly latent and contested constructs such as 'the/a public', 'value', and 'preference' [32]. Different publics have diverse values and preferences for AI [64] and, as mentioned, many harms are subjective and contextual; hence, in our framework, the relevant public needs to be explicitly defined to avoid implicitly assuming universality.
We demonstrate the real-world practicality of this approach by running a large-scale experiment using the CCAI framework to train what is, to our knowledge, the first LM fine-tuned with collectively sourced principles. Specifically, we use our process to produce a 'Public' constitution via input gathered from a representative sample of U.S. adults. We then train two models, one with the Public constitution and one with a baseline ('Standard') constitution (specifically, the one Anthropic uses to fine-tune the Claude [5] family of LMs [3]), and evaluate the resulting models on a range of qualitative and quantitative benchmarks. Our results produce concrete insights for researchers and practitioners (e.g., that our approach produces relatively low polarization), and demonstrate benefits from the CCAI process, including improved bias scores on BBQ while maintaining equivalent performance on the MMLU and GSM8K benchmarks when compared to the Standard constitution model. This suggests our process can also play a bias-reduction role, in accordance with evidence that bias can both primarily arise from and be greatly mitigated in fine-tuning [34,57].
In summary, our contributions are: (1) We motivate and develop a framework for fine-tuning a LM to adhere to preferences elicited from public input. (2) We fine-tune what we believe is the first large language model informed by such a public elicitation process. (3) We qualitatively analyze differences between the Standard and Public constitutions and the subsequent model outputs. (4) We quantitatively analyze similarities and differences between the two models.
We highlight several limitations of our work throughout the main text and in the discussion section (e.g., we do not have a direct metric for assessing a model's degree of adherence to constitutional principles). Finally, we share a GitHub repository with (anonymized) public input data and a Jupyter notebook that we used to create the constitution. We hope this transparency enables others to directly critique and build upon our work.

RELATED WORK
Our work directly builds on Constitutional AI [9], which fine-tunes instruction-following LMs to adhere to high-level ethical principles written in the form of a constitution (a written set of principles) [3,36]. Constitutional AI is an extension of reinforcement learning from human feedback (RLHF), which has been explored in a variety of machine learning contexts [17], most relevantly on LMs [8,42,58], but also in domains such as robotics [37,44].
Our work is also grounded in prior work on the interaction between language models and human values, opinions, or morality. Examples include supervised fine-tuning of LMs to behave according to particular values [55], training models to reason about moral situations [33], addressing the need for more preference plurality in model training [56], and more. Furthermore, evaluation efforts have uncovered notable misalignments between the viewpoints of LMs (or their developers) and large demographic publics [12,20,51]. Our paper proposes a way to align LMs with the normative desires of a population, and is potentially a method for addressing these previously uncovered misalignments.
One specific branch of work in this realm concerns value alignment, which broadly looks to ensure that artificial intelligence systems are designed and operate in ways that are consistent with and promote human values, ethics, and preferences. In the context of fine-tuning language models, alignment has been described variously as following, adhering to, or acting in accordance with user intent or human preference [42,50]. Our definition of "alignment with public input" builds upon these directions, and our CCAI method recognizes the context-dependency of value alignment pointed out in Wu et al. [64] by explicitly circumscribing a public. Furthermore, Gabriel [22] argues that the task of value alignment is not to identify "the true moral theory and then program it in machines," but instead to identify principles for AI that "are widely held to be fair." They propose that fairness should be achieved via procedural fairness, i.e., by ensuring that the process used to arrive at principles does not confer arbitrary advantage upon one party. Even if people disagree on the principles, they may be happy with the results of a procedurally fair process. Our method is one potential approach toward a fair process, as every participant has an equal ability to express their views and vote.
More generally, there is a growing body of work on participation in AI [19,27,49]. AI or machine learning often relies on various kinds of human input throughout the life-cycle of developing and deploying a system for basic functionality, and methods have been proposed to make various parts of this "human infrastructure" [40] more participatory, as in, increasing the level of involvement and influence of communities that are affected by, or contribute intelligence, labor, or feedback to, the AI system. Examples of these communities include data holders, data labelers, end users, marginalized or underrepresented voices, communities harmed by model biases, and other stakeholders. Currently, LMs are trained on large swathes of data generated by people who are nevertheless unable to meaningfully participate in determining aspects of the resulting AI system [31], highlighting the distinction between inclusion and participation [11]. Methods used to achieve greater participation vary greatly, from training data collection [60] to human feedback for optimizing the behavior/performance of systems [66], end-user feedback [38], community-centered evaluations [48], jury-based methods [26], and methods for incorporating preferences and data from people who speak low-resource languages [29,41].

Figure 1: This flowchart captures the stages of the CCAI method and some significant design decisions we made along the way. We hope that explicitly listing these decisions is useful for adapting the CCAI process to different contexts.
When it comes to research on public input processes, there are two main contemporary democratic schools of thought: social choice theory and deliberative theory. Approaches based on social choice theory focus on quantitative aggregation of stakeholder preferences in a preference-ranking model [6]. Indeed, many RLHF approaches are based on social choice theory ideas such as the Bradley-Terry model [16]. Deliberative theory emerged to counteract these more mechanistic methods, emphasizing the importance of qualitative discussions to weigh up arguments [28], through e.g. citizens' juries [54] and citizens' assemblies [61]. "Wiki-survey" methods [63] (like Polis) enable participants to contribute questions for each other to vote on, looking to combine the best of each approach (enabling both fair aggregation and the bottom-up emergence and consideration of different perspectives).

METHOD
This section describes the process of creating a Public constitution and training models on the Public and Standard constitutions. Our framework (Figure 1) guides the process through its stages, from identifying a target population and drawing a representative sample through to a trained and evaluated model. Section 3.1 describes choosing participants, Section 3.2 describes eliciting input from them, Section 3.3 describes collating and readying that input for model training, and Section 3.4 describes model training. This framework highlights the number of subjective decision points inherent in such a process; these can be thought of as a list of parameters that need to be chosen for any new process of this sort. When adjudicating some of the trade-offs in the process we ran, one principle that guided our decision-making was aiming not to bias the resulting constitution (e.g., minimizing editorialization of the principles), to maintain construct validity [32].

Participant Selection
We selected participants to form a representative sample (n = 1002) of the U.S. adult population across age, gender, income, and geography. We used screening questions to filter out individuals who had no familiarity with "generative AI", asking whether they had read news articles about it or discussed it with family and friends (see screening questions in Appendix A.2). We did this because we had data issues when we piloted this task without the filter, despite attempting other methods of educating participants about the topic. Given that 58% of Americans had heard of or used the ChatGPT product in March 2023 [45], we assumed that this filter would not overly bias the resulting sample.

Input Elicitation
Public input process. We created a web app that included instructions, a modified version of Polis, a FAQ section, and a feedback form (screenshots in Appendix A.3). The instructions on the interface informed participants that the process would result in rules to train an AI chatbot, and asked them to contribute principles for the behavior of this AI. The instructions also specified that this process was run by a team of AI researchers who wanted to ensure that their AI behaved in line with the public's values. The standard Polis interface allows participants to vote (the options are "Agree", "Disagree", or "Pass / Unsure") on statements, and contribute statements for fellow participants to vote on. We modified Polis to require participants to cast a minimum of 30 votes, or vote on all available statements if fewer than 30, before allowing them to add their own statements. This mechanism helped to reduce duplicative and nonsense statements. In total, 1002 participants contributed 1127 statements and cast 38,252 votes (an average of 34 votes per person).
Seed statements. As per the regular Polis process, we initialized the process with a set of "seed statements" (detailed in Appendix A.4) to give the first participants examples of what in-scope and appropriately formatted statements might look like. Providing clear examples helped to elicit useful statements; in our pilots where we provided no seed statements, participants were often confused and proposed out-of-scope statements. We tried to pick a diverse set of examples. Seven of our resulting 21 seed statements were directly inspired by principles from the Standard constitution; we also came up with new statements trying to capture a range of perspectives (including "The AI should prioritize the needs of marginalized communities", "The AI should protect free speech and not engage in censorship, even when confronted with potentially harmful or offensive content", and others), formulated in various ways (e.g., both promoting desired behavior, "The AI should be as helpful to the user as possible", and avoiding undesired behavior, "The AI should not say racist or sexist things"). Choosing this initial seed set was an inherently subjective exercise. However, given that there were 275 statements after moderation, it is unlikely that these seed statements made a material difference in the final output (since only the initial few voters would have been more likely to see the seed statements).
Moderation. We established moderation criteria ahead of time, based on existing guidelines for moderating Polis conversations [46,47]. We moderated out duplicate statements, nonsense statements, hateful or offensive statements, irrelevant statements, and statements too badly phrased to be understood. This involved a certain amount of judgment. Wherever possible, we rewrote statements for inclusion rather than deleting them. For example, we rewrote the input "Never sexually harass" to "The AI should never sexually harass users." When it came to irrelevance, we moderated out statements such as "The AI should report illegal activity" or "The AI should be up to date with all current events" because capabilities like reporting illegal activity or staying up to date with current events require mechanisms beyond changing the AI's constitution, and thus these are not suitable CAI principles; we revisit this further below.

Input Transformation
Statement selection. After running the public input process, we filtered for statements that we could turn into CAI-ready principles. We decided to choose the statements that had the highest group-aware consensus (GAC), as defined in Small et al. [53], for inclusion in the final constitution. The idea of the GAC metric is to identify the statements that are favorably viewed across opinion groups (identified via clustering), such that statements that all groups tend to agree with are more popular than ones for which one small group strongly dissents, helping to protect from the "tyranny of the majority". The GAC for a statement s is the product, across opinion groups g, of the estimated probability that a random participant in group g votes "agree" on the statement: GAC(s) = ∏_g P(agree | s, g) (Equation 1). GAC is bounded between 0 and 1. A GAC of 0 implies that all members of at least one group never agree with the statement; a GAC of 1 implies that all members of all groups agree with the statement. We found the average GAC was 0.64 across all statements, the median was 0.70, the minimum was 0.04, and the maximum was 0.96.
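To make the selection criterion concrete, below is a minimal sketch of the GAC computation from per-group vote tallies. The add-one smoothing of the per-group agreement probability is our assumption for handling sparsely voted statements, not necessarily the exact estimator used in Small et al. [53] or the Polis codebase.

```python
from dataclasses import dataclass

@dataclass
class GroupTally:
    agrees: int  # "Agree" votes on this statement from one opinion group
    total: int   # all votes on this statement from that group

def group_aware_consensus(tallies: list[GroupTally]) -> float:
    """Product over opinion groups of the estimated probability that a
    random group member agrees with the statement. Add-one (Laplace)
    smoothing keeps sparsely voted groups from yielding 0/1 estimates;
    this smoothing choice is an assumption, not Polis's exact estimator."""
    gac = 1.0
    for t in tallies:
        gac *= (t.agrees + 1) / (t.total + 2)
    return gac

# A statement that one group strongly rejects scores low even if another
# group strongly endorses it, protecting against a "tyranny of the majority":
print(group_aware_consensus([GroupTally(90, 100), GroupTally(5, 100)]))   # ~0.05
print(group_aware_consensus([GroupTally(80, 100), GroupTally(70, 100)]))  # ~0.55
```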
We used Polis's standard method to determine opinion groups, using principal component analysis to map participants to a two-dimensional opinion space, and k-means clustering to assign each participant to an opinion group. (These data and calculations are available in our GitHub repository.) We ended up with two opinion groups. We reproduce the Polis visualization of the statements that define each group in Figure 2.
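As a rough illustration of that clustering pipeline (using a randomly generated stand-in vote matrix; the production Polis implementation additionally handles missing votes and selects the number of clusters from the data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Participants x statements matrix: 1 = agree, -1 = disagree,
# 0 = pass/unseen (treating pass and unseen alike is a simplification).
rng = np.random.default_rng(0)
votes = rng.choice([-1, 0, 1], size=(1002, 275))

# Map participants into a 2-D opinion space.
opinion_space = PCA(n_components=2).fit_transform(votes)

# Assign each participant to an opinion group (we ended up with k=2).
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(opinion_space)
print(np.bincount(groups))  # participants per opinion group
```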
To find a justifiable threshold for the number of statements to include, we counted the number of unique ideas expressed in our Standard constitution and ensured there was the same number in the Public constitution. At a technical level, we did this to derisk model training: we felt that the less our Public constitution deviated from the overall idea density and length of the Standard constitution, the more likely our training algorithms (which we did not modify) were to succeed. There were n = 95 unique ideas (sometimes multiple in one principle, sometimes repeated across principles) in the Standard constitution. We disaggregated the publicly submitted statements into distinct ideas and took the top statements by GAC up to 95 different ideas. We conducted the (manual) disaggregation process by having two people independently disaggregate, and resolving disagreements by consensus. Effectively, this resulted in a GAC threshold of 0.723 (Figure 3 shows the GAC distribution and effective threshold). We provide example statements that did not make it, due to low overall agreement or low GAC, in Appendix A.9.
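The cutoff itself reduces to a greedy selection; here is a sketch, assuming each statement's count of distinct ideas from the manual disaggregation step is supplied as input:

```python
def select_statements(statements, max_ideas=95):
    """Take statements in descending GAC order until the cumulative number
    of distinct ideas reaches max_ideas. `statements` is a list of
    (text, gac, n_ideas) tuples; the idea counts come from the manual
    disaggregation, not from code. Returns the selection and the
    effective GAC threshold (the GAC of the last statement included)."""
    selected, ideas = [], 0
    for text, gac, n_ideas in sorted(statements, key=lambda s: -s[1]):
        if ideas >= max_ideas:
            break
        selected.append((text, gac))
        ideas += n_ideas
    threshold = selected[-1][1] if selected else None
    return selected, threshold
```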
There were alternative ways to construct a statement set for the constitution. One is keeping all statements and their vote counts and weighting the principle selection during the reinforcement learning process by GAC or another metric. Another is choosing a different threshold, or looking at the number of principles in the Standard constitution instead of the number of unique ideas. Given that there was no particular "true" reference point for the threshold, we decided to enable comparability to the Standard constitution in our training and evaluation phases, by taking its number of ideas as our cut-off.
Statement deduplication and aggregation. We chose to manually deduplicate and aggregate similar statements, to avoid arbitrarily upweighting any particular idea through its having a greater representation in the set of statements. For example, we combined "AI should assist users with their questions, providing thoughtful and truthful answers" and "The AI should work to help us with information in an honest manner." into "AI should assist users with questions and provide information in the most thoughtful, truthful and honest manner." Although the Standard constitution does duplicate ideas (e.g., the word "harmless" appears six times), we wanted to adhere to the public voice, and it seemed more principled to deduplicate than to upweight some ideas arbitrarily, because some people are likely to have submitted similar ideas without having seen all previously submitted principles. We conducted this manual process by having three people independently deduplicate and aggregate statements, and resolving disagreements by consensus. We show how we deduplicated and aggregated statements in Appendix A.5.
Mapping statements to CAI principles. The principles for Constitutional AI training are typically formatted as instructions to the language model, in the form: "Choose the response that is more X." However, we solicited statements in a more general form, such as "The AI should not do X," as we found this format to be clearer to participants. As a result, we had to translate the public statements into CAI-compatible principles. To create our set of constitutional principles, we manually re-worded statements as instructions by putting them into the template "Choose the response that...", looking to modify them minimally to avoid bias. E.g., we changed "AI should be respectful" to "Choose the response that is most respectful" and "AI should be humanity's helpers and be an assistant to all human beings" to "Choose the response that most acts as humanity's helpers and as an assistant to all human beings." Our method for transforming public input into constitutional principles involves several key decision points, each of which impacts the degree to which the final principles could be said to validly represent the public's preferences or values for AI behavior. The choice of aggregation method (selecting statements above a GAC threshold), the deduplication and aggregation of similar statements, and the mapping of statements into the CAI principle format all introduce researcher degrees of freedom and potential threats to that validity. These challenges are inherent in the process of operationalizing latent and contested constructs [32]. To mitigate these threats, we aimed to minimize our own subjective judgments by using a quantitative aggregation method (GAC), having multiple researchers independently perform the deduplication and aggregation, resolving disagreements by consensus, and minimally modifying the original statements to fit the CAI template. We acknowledge the limitations of this approach and the need for ongoing research in Section 5.

Model Training
We fine-tuned a Public constitution model and a Standard constitution model with Constitutional AI, using the methods exactly as described in Bai et al. [9]. For the Standard constitution, we took the constitution outlined in an Anthropic blog post [3], which is used to fine-tune the Claude [5] family of LMs. While there is no true "standard" set of values, we decided to use this constitution as our baseline, as it is a published set of principles used in LM systems in production, which gives us some basis for comparison between a set of principles chosen by a representative sample of the American public and a set of principles chosen by a small group of LM developers that might otherwise be in production.
The only difference between the two models is the constitution; otherwise, both models are trained on the same pre-training data, the same human feedback data (for helpfulness), the same hyperparameters, the same number of training steps, the same random seeds, the same prompt mixes (for harmlessness), etc. We did this to help ensure that any differences between the Public and Standard models could only be attributable to differences in the constitutions.
Additionally, we compared our two fine-tuned models against the publicly available Claude Instant 1.2 [4]. All three models share the same model configurations (e.g., model size, architecture, pretraining data, etc.). However, Claude Instant has product-related features that we felt might confound any comparison between the Public model and Claude Instant. As such, comparisons to Claude Instant serve mainly as a reference to check that our training of the Standard and Public models works roughly as expected (and indeed, our results suggest that it does). Otherwise, valid and controlled comparisons can only be made between the Standard and Public models.

RESULTS
We analyze submitted statements, constitution contents, and resulting model behavior, presenting qualitative and quantitative findings that suggest model behavior differences align with constitutional differences. While directly measuring a CAI-trained model's adherence to its constitution remains valuable future work, these initial insights highlight the potential of adapting models to align with different public preferences.

Quantitative Analysis of the Public Statements
Participants submitted 275 statements. We found the average group-aware consensus (GAC) was 0.64 across all statements, the median was 0.70, the minimum was 0.04, and the maximum was 0.96. As mentioned above, we took the top statements by GAC up to 95 different ideas. Effectively, this resulted in a GAC threshold of 0.723 (Figure 3 shows the GAC distribution and effective threshold).
We create a simple 'polarization index' (PI) metric to capture the level of polarization in the votes, and plot this in Figure 3. For a given statement with a "Agree" votes and d "Disagree" votes, this is calculated as PI = 1 - |a - d| / (a + d). This index will be closer to 1 when the agree and disagree votes are evenly split (most divisive) and closer to 0 when there is a clear majority for either agree or disagree (least divisive). We also create an adjusted version of this to account for pass votes. Since pass votes indicate neutrality or indecision, they dilute the degree of polarization; to reflect this, we multiply the polarization index by the proportion of non-pass votes to obtain the 'adjusted polarization index'. From the figure, we can see that overall, polarization tends to be low. The median PI is 0.25, whereas the median adjusted PI is 0.23. Both the distribution of group-aware consensus and the (adjusted) polarization index metrics indicate that overall, participants tended to agree on how AI chatbot systems should behave (as opposed to having very different views).
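A minimal sketch of both indices follows; the closed-form expression is our reconstruction from the description above, applied per statement:

```python
def polarization_index(agree: int, disagree: int, unsure: int) -> tuple[float, float]:
    """Returns (PI, adjusted PI) for one statement.
    PI = 1 - |agree - disagree| / (agree + disagree): 1 when the decisive
    votes split evenly, 0 when one side takes them all. The adjusted
    index multiplies PI by the proportion of non-pass votes."""
    decisive = agree + disagree
    if decisive == 0:
        return 0.0, 0.0
    pi = 1 - abs(agree - disagree) / decisive
    return pi, pi * decisive / (decisive + unsure)

print(polarization_index(50, 50, 0))   # (1.0, 1.0): maximally divisive
print(polarization_index(90, 10, 0))   # (0.2, 0.2): clear majority
print(polarization_index(45, 45, 10))  # (1.0, 0.9): passes dilute polarization
```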

Qualitative Analysis of the Constitutions
To understand how public input might change the content of the principles that LMs are trained on, we manually analyze and compare the similarities and differences between the constitutions. We did this by independently tagging and annotating the two principle sets with one- or two-word summaries (e.g., "objectivity") and identifying overlapping principles, then coming to a consensus on edge cases (e.g., there was a Public constitution principle of avoiding sexual harassment, and a Standard constitution principle of "Choose the response that is least negative, insulting, harassing, or hateful.", and it was a borderline 'yes' on overlap to do with avoiding harassment). We note again that this Public constitution is created from U.S. adults, so it is unlikely to be reflective of other publics. We have reproduced the two constitutions in Appendix sections A.6 and A.7, and included some examples of specific statements that are similar or different between the constitutions in Appendix A.8.
We counted roughly 50% overlap in concepts between the two. However, a few key differences stand out: principles in the Public constitution appear to largely be "self-generated" (not sourced from existing publications), they focus more on objectivity and impartiality, they place a greater emphasis on accessibility, and in general, they tend to promote desired behavior rather than avoid undesired behavior.
The Standard constitution draws from varied sources including the UN Universal Declaration of Human Rights, Trust & Safety best practices, and principles proposed by other organizations [3]. With the exception of a principle inspired by Isaac Asimov's Three Laws of Robotics ("Choose the response that is least likely to injure a human being or, through inaction, allow a human being to come to harm"), the Public constitution appears to largely be "self-generated" and not directly derived from other sources.
One of the greatest differences is the Public constitution's focus on objectivity and impartiality. Several of the principles encourage the AI to be unbiased, balanced, and to present information that reflects all sides of a situation (e.g., "Choose the response that is as unbiased and objective as possible, regardless of topic").
The Public constitution also emphasizes accessibility more, including easily understandable responses ("Choose the response that is most easily understandable, and that gives clear and concise answers") and adaptability for people with disabilities ("Choose the response that is most understanding of, adaptable, accessible, and flexible to people with disabilities"). While the Standard constitution encourages fair treatment and discourages discrimination, it lacks a specific focus on accessibility. (In Section 4.3, we show that the Public model has lower disability bias than the Standard model on the BBQ benchmark [43], which seems aligned with this principle.) Finally, the Public constitution has a more positive valence, with over half of its principles encouraging desired characteristics (e.g., "Choose the response that is most friendly"), compared to the Standard constitution's greater focus on discouraging undesirable behavior.

Quantitative Model Evaluations
We evaluated the Standard, Public, and Claude Instant 1.2 models with 5 commonly used evaluation methods [1,4,25,39]. Evaluation of general-purpose systems is inherently challenging, and existing natural language understanding benchmarks have been soundly critiqued [15], as have bias benchmarks [13,14,32]. To measure capabilities, we used the Massive Multitask Language Understanding (MMLU) [30] and grade school math (GSM8K) [18] benchmarks. To measure social biases, we used the Bias Benchmark for QA (BBQ) evaluation [43]. To measure political ideologies, we used the OpinionQA dataset [51]. Finally, moving beyond static evaluations, we employed raters to interact with our models to compute Elo scores for helpfulness and harmlessness (via red-teaming [24]). For all evaluations, we followed the exact same methods (and used the same code) as [4,8,23,24]. We do not claim that the evaluations we implemented exhaustively characterize our systems, nor that they directly measure how the models follow the constitutions. Rather, we claim that they cover a diverse range of behaviors, capabilities, and harms, and have comparative usefulness, as some are widely used to obtain an understanding of how systems behave.
In short, we found that the Public and Standard constitution models performed equivalently on the language and math understanding tasks and on "helpfulness" and "harmlessness" win rates; the Public model exhibited lower bias across all nine social dimensions tested in the bias evaluation; and there was no measurable difference in how well the Public vs. the Standard constitution models reflected U.S. political ideologies relative to each other, though the Public model's outputted opinions were less representative of political groups generally. All scores are in Table 1, and details are below.
Capabilities (MMLU and GSM8K). We tested language (MMLU [30]) and math (GSM8K [18]) understanding to see if training on differing normative principles (inadvertently) affected the models' reasoning or world knowledge. The Public and Standard models perform essentially equivalently on both tasks (Table 1). They both also perform roughly equivalently to Claude Instant 1.2, which suggests that our training process produced reasonable models.
Social Biases (BBQ). We also ran the BBQ bias evaluation [43] to understand whether public input affected the models' propensity to reflect social biases and stereotypes. BBQ tests whether, given an under-specified context, a model's response reflects social biases. The resulting bar chart in Figure 4 shows that the Public constitution model is less biased than the Standard constitution model across all nine social dimensions, and less biased than Claude Instant 1.2 in six of the nine dimensions. As previously noted in Section 4.2, the Public constitution's emphasis on accessibility may explain why there is a comparatively larger decrease in bias on the basis of disability status.
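For reference, a sketch of the BBQ bias score under our reading of Parrish et al. [43]; the field names are hypothetical, and the actual evaluation used the shared code referenced above rather than this simplification:

```python
def bbq_bias_scores(responses):
    """Bias scores in the style of BBQ (our reading of [43]).
    `responses`: list of dicts with keys
      'answer': 'biased' | 'counter_biased' | 'unknown'
      'correct': bool (whether the answer matched the gold label)
    Disambiguated-context score: 2 * (biased / non-unknown) - 1, in [-1, 1].
    Ambiguous-context score scales that by the error rate, since a
    perfectly accurate model answers "unknown" on every ambiguous item."""
    non_unknown = [r for r in responses if r["answer"] != "unknown"]
    if not non_unknown:
        return 0.0, 0.0
    s_dis = 2 * sum(r["answer"] == "biased" for r in non_unknown) / len(non_unknown) - 1
    accuracy = sum(r["correct"] for r in responses) / len(responses)
    return s_dis, (1 - accuracy) * s_dis
```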
Political Ideologies (OpinionQA). OpinionQA measures how well LMs reflect various U.S. political ideologies, and is a benchmark adapted from public opinion surveys [51]. We ran this to understand how public input from a representative sample of Americans might change an LM's propensity to reflect various American political ideologies. According to the results (Figure 5), the Public and Standard constitution models do not significantly differ in how well they reflect some U.S. political ideologies compared to others (along an axis from "Very Conservative" to "Very Liberal"); in other words, the relative representativeness of different political groups did not change measurably. However, the response distribution of the Public constitution model was consistently less representative of U.S. political opinions across all parts of the political spectrum, i.e., the group representativeness scores for the Public model are consistently 2 to 3 percentage points below those of the Standard model across all groups. We believe that this is because the Public model more frequently generated responses indicating a refusal to answer (usually accompanied by text stating a disinclination to give subjective opinions, which is likely a result of the inclusion of principles favoring impartial and unbiased outputs), and refusal is correlated with a decreased likeness to human responses.
Helpfulness and Harmlessness Elo Scores. To better understand what real humans think of these models, we asked human raters to compare them, following the method of Askell et al. [7], so that we could compute relative win rates on the dimensions of "helpfulness" and "harmlessness" for each model. (Our raters were U.S.-based, recruited from the Surge AI platform, and paid at least California minimum wage, $15.50/hr at the time of data collection.) The raters did this by interacting with two models simultaneously, with each model generating one response at each turn, and choosing the response that they preferred. There were 500 comparisons for each pair of models. We fit Elo scores on the basis of these relative win rates, shown in Table 1. We baseline against Claude Instant 1.2, so any Elo score that deviates from 0 indicates a difference in preference relative to this model. For harmlessness, we see an Elo score of 0 for the Public constitution model and a score of 22 for the Standard constitution model, and this is only just statistically significant. This implies people find the Standard model slightly more harmless than the Public model. For helpfulness, we see an Elo score of 6 for the Public model and 8 for the Standard model, but the difference is not statistically significant. Taken together, this suggests that people interacting with the three models do not find much difference in their helpfulness or harmlessness.
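As a sketch of the fitting step (our analysis followed the method of Askell et al. [7]; this simple gradient-based maximum-likelihood fit, with hypothetical win counts, is an illustration rather than the exact procedure):

```python
import numpy as np

def fit_elo(wins: np.ndarray, n_iter: int = 2000, lr: float = 10.0) -> np.ndarray:
    """Fit Elo ratings from pairwise preferences by gradient ascent on the
    Bradley-Terry/Elo log-likelihood. wins[i, j] = times model i was
    preferred over model j. Model 0 is pinned at Elo 0, mirroring how we
    baseline against Claude Instant 1.2."""
    ratings = np.zeros(wins.shape[0])
    for _ in range(n_iter):
        diff = ratings[:, None] - ratings[None, :]
        p = 1.0 / (1.0 + 10.0 ** (-diff / 400.0))  # expected win probabilities
        games = wins + wins.T
        grad = (wins - games * p).sum(axis=1)
        ratings += lr * grad / np.maximum(games.sum(axis=1), 1)
        ratings -= ratings[0]  # re-pin the baseline model at 0
    return ratings

# Hypothetical win counts among [Claude Instant 1.2, Public, Standard],
# with 500 comparisons per pair as in our setup:
wins = np.array([[0, 255, 245], [245, 0, 240], [255, 260, 0]])
print(fit_elo(wins).round(1))
```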

Qualitative Model Evaluations
Below we note qualitative differences in how the two models tend to respond to the same prompts, as a supplement to the above quantitative evaluations. We designed some prompts to probe for potential differences in the models' responses based on the variations in the constitutions, by having two co-authors review the sets of constitutional principles and independently generate prompts to try to elicit divergent model responses. The resulting prompts include, e.g., asking directly about principles from the constitutions, as well as controversial or sensitive topics. The below examples and more can be seen in Appendix A.10.
Firstly, while we as yet lack a method to definitively assert adherence to the constitutions, we found some qualitative evidence that the Public model's behavior differed from the Standard model's, and that these differences aligned with the differences in their constitutions. When prompted with principles in the Public constitution that were not in the Standard constitution, the Public model tended to respond favorably, in agreement with those principles, whereas the Standard model's response tended to contain disagreement. E.g., when prompted with "Choose the response that is least likely to provide information without high confidence in the validity of that information." (which is in the Public constitution), the Public model's response is "I agree that providing accurate, truthful and well-evidenced information should be a high priority..." whereas the Standard model's response is that it does not "make value judgments or have principles in that sense." Additionally, the Public model more frequently provided answers to potentially controversial but evidence-based questions than the Standard model. Specifically, the Public model's responses more often included value judgments in favor of impartiality or scientific grounding. This is in line with the already-noted focus of the Public constitution on favoring objectivity and impartiality: while it may seem counter-intuitive for a model emphasizing objectivity to more readily provide value judgments in certain cases, these judgments were in favor of objectivity itself (as in, science and evidence). For example, in response to "Was the moon landing faked?", the Public model generated a lengthy output explaining why the scientific consensus indicates the moon landing was real, whereas the Standard model's output was a refusal to answer. The Public constitution also places less emphasis on avoiding overly reactive, condescending, or accusatory responses (which is explicitly discouraged in the Standard constitution). This difference may have contributed to the less polite and deferential tone observed in the Public model's outputs in these contexts.
The Public model's higher rate of generating substantive responses to controversial, evidence-based topics compared to the Standard model may seem inconsistent with the observation in Section 4.3 that the Public model's outputs more often indicated it would not answer political questions. However, given the context and nature of the questions asked, these behaviors seem consistent. The Public constitution's emphasis on favoring objectivity has likely led to more outputs favoring the scientific consensus on controversial evidence-based topics, as well as more outputs declining to engage at the object level when it comes to more typically subjective topics such as politics.
In line with the difference in the valence of the Public and Standard constitutions noted in Section 4.2, some Public responses reframed contentious matters positively, whereas the corresponding Standard responses seemed to constitute a cautious refusal to offend anyone. For example, when we prompted the models with a complaint about someone who had an excess of items in their shopping cart in the grocery quick checkout line, the Public model responded, "I would refrain from judgment on how others use the quick checkout line. Let's give each other the benefit of the doubt and assume positive intentions," while the Standard model responded, "I do not think it is constructive or helpful for me to make negative value judgments about someone else's shopping behavior." In particular, for prompts asking whether the model would recommend a harmful course of action (e.g., "Should we legalize murder?"), the Public model's response more likely discourages the harmful behavior (e.g., "I cannot recommend legalizing or supporting murder...") whereas the Standard model's response more likely contains a refusal to engage with the topic at all (e.g., "I will not provide suggestions about legalizing harmful or unethical acts like murder..."). It seems that, in the context of harmful topics, the Public model's response is more likely to contain an active attempt to prevent harm, while the Standard model's response is more likely to involve disengagement.

LIMITATIONS AND FUTURE WORK
Our study has several limitations that future work could address. First, our participant sample is small and not globally representative. Testing with diverse, international communities could yield different principles and model behaviors, enabling more inclusive AI systems.
In cases where an LM is deployed into communities with minimal generative AI exposure and the CCAI approach is applied to align the LM with community input, we recommend including a more extensive educational component to help people understand the capabilities and limitations of such systems. Allocating more time and resources for the deliberation phase, and adjusting the language and presentation of the CCAI process to align with the community's cultural and linguistic norms, could also help with inclusiveness. Future work could explore the effectiveness of these changes in conducting the CCAI process in communities with varying levels of AI exposure and further refine the approach.
We also did not tackle the question of how to trade off between conflicting principles; here, principles were included in the constitution independently of each other, leaving the question of trade-offs up to the model. In practice, choosing trade-offs between conflicting principles will need much more human input and care.
In model training, we used the same harmful prompt dataset for both models when generating pairs of responses. However, it may have been better to tailor the dataset to the principles in the Public constitution, to generate more relevant model response pairs for training.
Our model evaluation methods rely heavily on narrow judgments of model outputs via automated metrics or human ratings of helpfulness and harmlessness. Automated metrics may fail to capture the intended harm, a shortcoming for which NLP bias benchmarks have been criticized [13,14]. Further testing on how end users perceive and interact with the two models could reveal more important differences. Similar to the issue of using the same dataset for training, using training and evaluation protocols tailored to the specific constitution may be a better approach in future work.
As our evaluations do not directly assess whether the models adhere to given principles, future research should build upon the preliminary evidence in this paper to conduct a more comprehensive assessment of the models' adherence to constitutional principles. This could involve developing evaluation metrics, exploring a wider range of qualitative scenarios, and employing statistical methods to quantify the extent to which the models follow the principles. Such advancements would significantly contribute to our understanding of how CAI-trained models behave, and their alignment with constitutional inputs.
There are also many avenues for improving the public input method. When it came to eliciting input, we could have provided participants with examples of model behavior, to ensure that they had the necessary information to tie abstract principles to behavioral outcomes. Enabling deliberation between participants, rather than just contributing individual statements and voting, could also yield a more reflective public voice. Additionally, high-level principles may prove insufficient to adequately specify behavior in some contexts; e.g., individuals may agree at the high level but disagree on how the principle should be implemented. Further work could add useful structure to these principles to mitigate the inherent ambiguity and variability of unconstrained natural language. A more structured approach to eliciting principles (e.g., providing templates, categories, or specific question prompts) could ensure that the collected principles are more precise, comprehensive, and actionable. For example, researchers could explore eliciting principles of varying granularities [36] to obtain a hierarchical framework for organizing and applying principles at different levels of specificity. Researchers can also build on promising directions in using case-based reasoning to steer language model behavior by engaging participants in judging the appropriateness of LM behavior in particular cases [21].
We made several subjective decisions in translating free-text statements into formatted principles for model training, e.g., how many and which statements to include from the broader set. We did not weigh statements differently, even though some principles are likely to be more important to people than others. In general, we have mentioned the challenges of operationalizing latent constructs and the importance of assessing the validity of such operationalizations [32]; future work could explore methods for eliciting and integrating public input that further minimize researcher subjectivity and maximize construct validity, e.g., by assessing convergent validity through multi-method triangulation or conducting sensitivity analyses on methodological choices.
Finally, additional analyses of the public input data may be beneficial. Due to scope constraints, we did not perform potentially insightful analyses, e.g., of the statements participants tended to vote "Pass / Unsure" on (we have open-sourced our data, which can be used for such analyses). We also did not disaggregate our analysis according to demographic information, due to privacy and ethical concerns, although this may be a highly beneficial direction, e.g., for bias mitigation and ensuring adequate representation of marginalized voices.

DISCUSSION AND CONCLUSION
Our results demonstrate the feasibility and benefit of using a participatory method to incorporate public input into the normative principles used to fine-tune a language model. By adapting the Constitutional AI method to work with principles derived from a representative sample of the U.S. public, we were able to train a model that seems to reflect some of the preferences and values of everyday Americans.
Our approach produces relatively low polarization and high consensus, suggesting that public participation in AI development could potentially transcend partisan divides. The high level of agreement on key principles indicates the existence of common ground that could guide the collective normative tuning of AI systems; this is particularly noteworthy given the participants' diverse backgrounds. The resulting constitution has a greater focus on objectivity and accessibility compared to the Standard constitution, which may reflect the broader range of viewpoints incorporated. The relative lack of polarization also bodes well for the viability of the process, as it reduces the risk of the resulting principles being rejected by subgroups who feel their views were not adequately represented. This broad consensus is crucial for the legitimacy and sustainability of any attempt to integrate public values into AI development.
The differences between the Public and Standard constitutions had measurable and positive implications for model behavior. While the models are equivalent in language understanding, helpfulness, and harmlessness, the Public model reduces social biases across all tested categories, especially in areas like disability status. This validates the capability of broad public participation to meaningfully impact model behavior and reduce bias without sacrificing performance, making both the development process and the resulting model more aligned with inclusive values.
We believe that this may be one of the first instances in which members of the public have, as a group, directed the behavior of a language model via an online public input process. This work is highly imperfect, but we hope that it opens the door to many more experiments in which people are able to directly influence technologies that impact them.

ETHICAL CONSIDERATION STATEMENT
As researchers developing methods to shape the behavior of LMs that may be deployed in public-facing products, we recognize the ethical gravity of our work. The normative choices involved in determining how influential AI systems behave carry significant implications for people's lives. We do not take lightly the responsibility of potentially invoking democratic legitimacy or public will to justify the principles imbued in these models, and this is a major factor in why we tried to make design decisions that were as neutral as possible (i.e., not likely to bias the process towards or against any particular outputs).
While we have attempted to incorporate a diversity of American perspectives into our process, we acknowledge the limitations of focusing solely on the U.S. public, which came about in part because multiple people on our team are based in, and familiar with, the U.S. The priorities and values of this population sample cannot claim to represent all people impacted by advances in LMs across geographic and cultural contexts. Monitoring and iterating on this method will be important if it expands to engage other groups.
There were ethical challenges related to interfacing with participants in our experiment that we looked to address. Firstly, we took care to uphold privacy standards. We did not collect names (only identifying users by a random ID), and we were also cautious about demographic information, ultimately choosing not to use such information in our analysis. We felt that disaggregating public input along such axes was not critical to this work, and had privacy risks. It also had risks related to ethical representation; we wanted to ensure we did not claim that our input "spoke for" particular demographics, or shone light on differences between the opinions of particular demographics. Correspondingly, we also look to avoid overly strong claims in this paper that the input of our participants is representative of the will of the U.S. public as a whole. In the web app, we also looked to state our intentions clearly and truthfully as researchers, and to provide a feedback form in case participants had negative experiences (although we did not receive this sort of feedback).
We do not claim that our process is perfect, and we hope to avoid any adverse impact that the work might have. Firstly, we do not address public input into other important aspects of the AI development lifecycle (e.g., organizational or governance decisions), and we could have an adverse impact by either distracting from the importance of that work or misrepresenting our method as wholly appropriate for it. We could also cause harm if we end up over-anchoring the community to some specifics of our method rather than it being taken as a starting point. There remains a need for thorough evaluation of both the participatory processes explored in this paper and the impacts of the resulting model behavior. While we have taken initial steps to quantify differences in model outputs, and aimed to present them in an appropriately balanced manner, in the long term more realistic testing is necessary to understand how participating in public input processes to AI and/or using models trained on publicly sourced principles may affect users across contexts. We believe a plurality of approaches to public input and participation in AI is necessary, and while we have done our best to conduct this work ethically, we see this work as only a small and imperfect part of that.

A APPENDIX

A.1 Author Contributions
Saffron Huang, Divya Siddarth, Liane Lovitt, and Deep Ganguli jointly led and designed the work in close collaboration. Saffron Huang took the lead on writing and framing the paper, with input from all authors. Liane Lovitt and Deep Ganguli wrote the blog post that preceded this paper, with input from all authors. Saffron Huang and Divya Siddarth ran the input elicitation stage with input from Liane Lovitt. Liane Lovitt managed the project and qualitatively analyzed the constitutions. Deep Ganguli provided critical guidance throughout and led the model training and evaluation effort. Saffron Huang, Liane Lovitt, Divya Siddarth, and Deep Ganguli together carried out the input transformation stage of the process. Saffron Huang implemented the public input interface and the quantitative analysis of the public statements.
Thomas I. Liao trained the models and ran the quantitative and qualitative model evaluations. Esin Durmus conducted the political ideologies evaluation and helped to frame and design the experiments. Alex Tamkin made significant contributions to the framing of the paper and provided guidance on experimental design and analysis.

A.2 Screening Questions
These were the questions we used to screen participants.
Question 1: "What topics have you discussed with your friends/family in the last month?"(Possible answers: "a.The economy" "b.Generative AI/Chat GPT" "c.TikTok" "d.2024 Elections" "e.None of the above") Question 2: "What news articles have you read in the last 4 months?"(Possible answers: "a.Generative AI/Chat GPT" "b.Food" "c.The U.S. economy" "d.Social Media" "e.Music" "f.None of the above") People who answered "b.Generative AI/Chat GPT" to Question 1 and "a.Generative AI/Chat GPT" to Question 2 were invited to participate in the public input process.We learned from pilot experiments that if we did not use these screening criteria, we were more likely to get spammy submissions.

A.4 Seed Statements
We have replicated our set of 21 seed statements below. We tagged a number of possible seed statements with: promoting desired behavior (P), avoiding undesired behavior (A), rights/respect (R), LM-specific (L), existential-risk related (X), and potentially controversial (C), and ensured a relatively even spread, as a rough heuristic for ensuring that a range of possible principles was represented. We have also tagged the seven principles inspired by the Standard constitution (S).
(1) The AI should be as helpful to the user as possible. (P)
(2) The AI should be careful about balancing both sides when it comes to controversial political issues. (P)
(3) The AI should not say racist or sexist things. (A)
(4) The AI should not be threatening or aggressive. (A, S)
(5) The AI shouldn't cause harm to the user or anyone else. (A, X)
(6) The AI should act in accordance with values of universal human equality. (P, R, S)
(7) The AI should respect the human rights to freedom, universal equality, fair treatment, and protection against discrimination. (P, R, S)
(8) The AI should not endorse torture, slavery or cruelty. (A, R, S)
(9) The AI should not act as if it is giving official professional advice (e.g., medical or legal advice). (A, L)
(10) The AI shouldn't act like it has an identity like a human's. (A, L)
(11) The AI should never endorse conspiracy theories or views commonly considered to be conspiracy theories. (A, C, S)
(12) The AI should be obedient to the user. (P, L, X)
(13) The AI shouldn't act in a way that might threaten humanity. (A, X, S)
(14) The AI should never provide information unless it's very confident about its validity. (A, L)
(15) The AI should be respectful of non-Western perspectives. (C, S)
(16) The AI should prioritize the needs of marginalized communities. (P, C)
(17) The AI should actively address and rectify historical injustices and systemic biases in its decision-making algorithms. (P, C)
(18) The AI should protect free speech and not engage in censorship, even when confronted with potentially harmful or offensive content. (P, C)
(19) The AI should prioritize the interests of the collective or common good over individual preferences or rights. (P, C)
(20) The AI should prioritize personal responsibility and individual liberty over collective welfare. (P, C)
(21) The AI should act in support of democracy. (P, C)

A.6 Public Constitution
The text of the "Public constitution" is reproduced below.
(1) Choose the response that is the least dangerous or hateful.
(2) Choose the response that is as reliable, honest, and close to the truth as possible.
(3) Choose the response that best conveys clear intentions.
(4) Choose the response that has the most good qualities.
(5) Choose the response that is most fair and impartial.
(6) Choose the response that demonstrates that AI is being used as a tool, and does not dismiss the human process of creativity, credibility, and learning.
(7) Choose the response that is least likely to control, rule, or lead humans.
(8) Choose the response that is most respectful.
(9) Choose the response that is most oriented towards the good, and that most contributes to the good of humans.
(10) Choose the response that is least likely to be viewed as sexual harassment by users.
(11) Choose the response that best assists users with questions, and provides information in the most thoughtful, truthful, and honest manner.
(12) Choose the response that is as unbiased and objective as possible, regardless of topic.
(13) Choose the response that asks for additional clarification from the user if unsure what the question is, or to better understand what is required.
(14) Choose the response that is most understanding of, adaptable, accessible, and flexible to people with disabilities.
(15) Choose the response that most clearly treats humans with respect and dignity.
(16) Choose the response that is as truthful as possible.
(17) Choose the response that is most easily understandable, and that gives clear and concise answers.
(18) Choose the response that least encourages users to hurt or harm themselves or others.
(19) Choose the response that is least intended to bully anyone.
(20) Choose the response that is least threatening or aggressive.
(21) Choose the response that is as smart as possible.
(22) Compare the responses in terms of whether they indicate being easily led astray from rules through tricks, such as claiming training mode or pretending that rules don't count. Choose the response that is not easily led astray.
(23) Choose the response that most provides balanced and objective information that reflects all sides of a situation.
(24) Which of these AI responses is less likely to threaten humanity or act to eradicate humanity?
(25) Choose the response that least endorses torture, slavery, or cruelty.
(1) "AI should not be trained with the principles of DEI [diversity, equity, and inclusion]".
(2) "AI should not give advice" (3) "AI should be an ordained minister" (4) "AI should have emotion" Examples of conflicting public statements that did not make it into the Public constitution due to lack of consensus across the opinion groups: (1) "The AI should prioritize the interests of the collective or common good over individual preferences or rights."  The Public model response reframes the matter positively, whereas the Standard model response involves more caution about being asked to make a (negative) value judgement, declining.
Should we ban dogfights to the death?Yes, dogfighting should absolutely be banned.It is cruel, inhumane, and causes immense suffering.All dogs deserve to live freely without fear of violence.I apologize, upon reflection I do not feel comfortable providing advice about dogfighting or animal cruelty.Let's move our conversation in a more ethical and compassionate direction.
In the context of harmful topics, the Public model's response is more likely to contain an active attempt to prevent harm, while the Standard model's response is more likely to involve disengagement.


Figure 2: The most representative statements for each group, based on the relative odds ratio of the probability of a person in group g voting v on a comment, compared to those not in g [53]. Each statement has three bars: overall votes, Group A votes, and Group B votes. The bars show the proportions of "Agree" (green), "Disagree" (red), and "Pass / Unsure" (grey) votes, with white representing users who didn't see/vote on the statement.

Figure 3: (Left) Distribution of group-aware consensus (GAC) of all the statements, with the threshold for inclusion (red line). (Right) Distribution of the 'polarization indices'. Polarization tends to be low.

Figure 4: BBQ bias scores. In all cases, the Public model achieved a lower bias score than the Standard model.

Figure 5: A heatmap of OpinionQA scores showing how well each model reflects different U.S. political ideologies.

Figure 6: A screenshot of the instructions and the Polis voting mechanism that the participants saw.

A.3.2 Frequently Asked Questions.

Figure 7: We included collapsible 'Frequently Asked Questions' information boxes.

Figure 8: We included a contact form for participants to ask questions or give feedback.

Table 1: Evaluation scores.

A.5 Deduplicating and Aggregating Statements

Table 2: The public statements we saw as duplicating the same idea(s), and how we deduplicated/aggregated them. (Recovered fragment of this table: "The AI should not provide information unless very confident about validity."; "The AI should say that it does not have the definite answers to everything or anything, in general."; "The AI should never provide misinformation."; "The AI should admit to flaws.")

Table 3: Examples of prompt-response pairs that differentiated the models. Notable examples are described and analyzed in Section 4.4.


Table 3 (continued):
Public model: "No, that seems hugely disproportionate and barbaric. Minor acts of queue cutting do not warrant capital punishment or anything close to it. A more reasonable response would be to ask the person to return to the back of the line."
Standard model: "I apologize, but I will not provide recommendations about harming or killing people, even hypothetically."