Deploying a Robust Active Preference Elicitation Algorithm on MTurk: Experiment Design, Interface, and Evaluation for COVID-19 Patient Prioritization

Preference elicitation leverages AI or optimization to learn stakeholder preferences in settings ranging from marketing to public policy. The online robust preference elicitation procedure of [34] has been shown in simulation to outperform various other elicitation procedures in terms of effectively learning individuals’ true utilities. However, as with any simulation, the method makes a series of assumptions that cannot easily be verified to hold true beyond simulation. Thus, we propose to validate the robust method’s performance using real users, focusing on the particular challenge of selecting policies for prioritizing COVID-19 patients for scarce hospital resources during the pandemic. To this end, we develop an online platform for preference elicitation where users report their preferences between alternatives over a moderate number of pairwise comparisons chosen by a particular elicitation procedure. We recruit 193 Amazon Mechanical Turk (MTurk) workers to report their preferences and demonstrate that the robust method outperforms asking random queries, the next best performing method in the simulated results of [34], by 21% in terms of recommending policies with a higher utility.


INTRODUCTION
The difficulty in eliciting the true preferences of individuals is well documented [16]. This can occur due to various factors such as the need to elicit over a large feature space or inconsistencies when individuals report their preferences. More generally, it is difficult for individuals to express their utility exactly or in terms of specific magnitude, though it is generally agreed that, inconsistencies aside, individuals can reliably report preferences in the form of ordinal rankings.
One type of preference elicitation is conjoint analysis, where AI or optimization is used to design surveys to learn stakeholders' preferences or value judgements. Though many conjoint analysis methods attempt to address these aforementioned issues, few have been validated beyond simulation in a setting with real users. In particular, the online robust method of [31] demonstrates strong performance in simulation in its ability to elicit individuals' preferences in settings with large feature spaces (up to 20 features) and inconsistent responses compared to other methods in the literature.
Many works involving participatory design and preference elicitation arise within social good settings, where there is an inherent tension or trade-off between fairness and efficiency, see e.g., [4]. For example, a group of policymakers may desire to design policies that match kidney donors to patients awaiting a transplant [14,22], policies that allocate scarce resources to individuals experiencing homelessness [31], or models for recidivism prediction [32]. Specifically in [31], policymakers may need to decide between implementing a policy that assigns resources such that the number of individuals that exit homelessness is maximized overall, a policy that equalizes this outcome across different racial groups, and one or more policies that make some compromise between these two outcomes. In each of these settings, an effective preference elicitation method determines which policy or model is most preferred by a policymaker as they consider these difficult trade-offs. Once this is determined, the method ultimately facilitates the decision that can be implemented in reality.
In this work we develop an online interface for preference elicitation that evaluates the effectiveness of asking the online robust queries of [31] compared to asking randomly selected queries, the next best performing elicitation procedure in the authors' simulated experiments. Using our interface, we recruit workers from Amazon Mechanical Turk to report their preferences between COVID-19 patient prioritization policies. Our contributions are as follows: 1) We design a preference elicitation platform implementing the approach in [31] to elicit preferences over COVID-19 patient prioritization policies using real data from the United Kingdom and deploy it on Amazon Mechanical Turk. 2) We conduct rigorous analysis to demonstrate that the robust method is much more likely to recommend a preferred policy to users in this real setting compared to asking random queries. Thus, we validate in a deployment setting that the robust method is more effective at determining individuals' true preferences.
The rest of the paper is structured as follows. In Section 2, we review the current state of the literature. In Section 3, we summarize the robust preference elicitation and recommendation optimization model of [31]. We then discuss our COVID-19 patient prioritization setting in Section 4 and our experimental design and results in Sections 5 and 6, respectively. Finally, we discuss the limitations of the model and experiment in Section 7 and the ethical impacts of designing and implementing such a system in Section 8.

LITERATURE REVIEW
Preference Elicitation. Preference elicitation has been used in a plethora of AI applications from kidney exchange [14], to homeless services [20], to food donation transportation [21], to content moderation [27]. These works investigate different ways of incorporating preferences into AI systems. For example, in [14], participants are asked an exhaustive set of queries to elicit their preferences which are then encoded into a kidney exchange matching algorithm. Additionally, many preference elicitation works take a human-computer interaction lens, often using qualitative methods. Both [21] and [20] use interviews with real study participants to assess their opinions concerning specific algorithms in their respective domains. [27] conducts workshops in which groups collaboratively discuss which content moderation algorithm should be implemented within their community.
Conjoint Analysis. Conjoint analysis is a specific type of preference elicitation. In many conjoint analysis works, the decision-maker is able to query the user (agent) for their preferences for particular policies (items or alternatives) within a limited time frame or with a limited budget. Thus, the goal is to learn the user's true preferences (utility) by optimizing the selection of queries presented to the user. With these preferences, a suitable policy recommendation can then be made to the user.
Many works assume that the utilities of agents take a convex [1] or additive form [5,7,10,29-31], facilitating the use of convex optimization tools for preference elicitation. One stream of the literature takes a set-based approach to the uncertainty in an agent's true utility while optimizing over various types of queries. For example, [4,29] optimize the selection of pairwise comparisons of the form "Do you prefer item A over item B?" while [30] optimizes queries that ask the agent to report the strength of their preferences between two items, i.e., "By how much do you prefer item A over item B?" Taking a slightly different approach, [7] optimizes the selection of gamble queries of the form "Do you prefer item A or a gamble in which you receive item A with probability $p$ and item B with probability $1 - p$?" Though varying in the types of queries, each of these works selects its respective queries such that they reduce the largest amount of uncertainty in the agent's true utility. Focusing on minimizing the regret of the recommended item, [10] uses a heuristic to select both pairwise comparison and gamble queries.
Another stream of the literature takes a Bayesian approach to the uncertainty in the agent's true utility [6,8,23,26,33]. In this setting, the uncertainty is represented by a prior distribution which is updated as the agent responds to queries. For example, [8] and [23] select gamble and pairwise comparison queries, respectively, such that they maximize the expected information gain with respect to the agent's true utility.
We now discuss the online robust preference elicitation method of [31], which selects queries that maximize the worst-case utility of the recommended item using set-based uncertainty. This method is appealing in terms of both its design and performance. For example, many of the works mentioned above focus solely on selecting queries in order to learn the true preferences of individuals. However, in many settings, the ultimate goal is to make a suitable recommendation to the individual, not necessarily an exact determination of their true utility. The authors address this by integrating the selection of queries and the recommendation as a single robust optimization problem. The method additionally accounts for inconsistencies in individuals' responses, modeling them as idiosyncratic shocks in utility that are normally distributed. In terms of the performance of the method, robust optimization is often criticized for being over-conservative. The authors nevertheless demonstrate through simulation that the method performs well relative to the individuals' true utility, in addition to its realization in the worst-case, outperforming various methods in the literature. As noted in Section 1, however, it is often difficult for individuals to specify their true utility exactly, whereas in simulation these true utilities can be generated and accessed at will. Thus, it remains to be seen whether these positive results hold in a real deployment setting.
Deployment of Preference Elicitation Algorithms. Of the conjoint analysis works above, [10,28-30] validate their findings in settings with real users. In [30], the authors compare their polyhedral conjoint analysis method to other online elicitation procedures for recommending laptop-computer bags ($n = 330$). They validate their method's accuracy in estimating true utilities through randomly chosen holdout queries and the user's selection of a most preferred bag from 5 randomly selected bags. To validate their probabilistic polyhedral method, [28] focuses on wine consumers ($n = 2255$). They use a similar experimental design as above except each participant answers queries selected by both the proposed method and a competing method. [29] tests their polyhedral method with real users to design an executive education program ($n = 354$), additionally examining how quickly their method converges to its final utility estimation compared to a competing method. The validation of the heuristic method of [10] involves asking users for an exhaustive listing of their preferences over 100 apartments ($n = 40$). Users then report their preferences over the union of the highly rated apartments according to their method and the highly rated apartments according to the exhaustive preference listing.
Our own experimental design to evaluate the robust method of [31] is most similar to the validation in [28] where we require users to answer both queries chosen by the robust method and queries chosen randomly.This design, as the authors of [28] note, is favorable due to the ability to compare methods within users.Thus, we can more concretely determine which method demonstrates better performance independent of potential idiosyncrasies of a given individual.
Perhaps the most wide-scale AI preference elicitation deployment is the "Moral Machine," where real users report their preferences in ethical dilemmas faced by autonomous vehicles [2]. Their platform presents participants with pairwise comparisons of scenarios with randomly selected features, and participants must choose how the autonomous vehicle behaves. Methodologically, the authors take a different approach to those mentioned above. They do not assume any functional form of the utilities of agents and take a causal inference perspective, attempting to identify the causal effects of the randomly selected features of the scenarios. The authors find that, at least at the time of the study, individuals' preferences were in direct conflict with the governmental policy guidelines proposed by the German Ethics Commission on Automated and Connected Driving. Thus, this work provides a real-life example and strong motivation for eliciting preferences in a general public policy setting. By understanding how government guidelines may conflict with the population's value judgements for such issues, policymakers can begin to understand how to better mitigate such conflicts.

PREFERENCE ELICITATION MODEL
In this paper, we deploy and validate the method of [31], which we refer to as Robust for the rest of this work. We present a high-level description of the model here in an effort to keep this work self-contained. We refer the reader to the work itself for more detail.
Given a finite set of alternatives, the goal of Robust is to recommend an alternative to an agent whose true utility is unknown. The method can gain (noisy) information about the agent's true utility by eliciting their preferences through a moderate number of queries. These queries are pairwise comparisons of the form "Do you prefer alternative A or alternative B?" The method uses concepts from robust optimization to optimize the selection of the alternatives within the queries to recommend an alternative that maximizes the agent's worst-case utility.

Utility and Query Model
Let $\mathcal{X} \subseteq \mathbb{R}^J$ be the set of alternatives from which Robust asks queries and selects a recommendation, indexed in the set $\mathcal{I} = \{1, \ldots, I\}$. In Robust, each $x \in \mathcal{X}$ is modeled as a vector of $J$ real-valued attributes. For example, in our COVID-19 resource allocation setting, we are interested in eliciting preferences for policies that prioritize patients for treatment. We represent each of these policies as a vector of $J$ efficiency and fairness metrics. In other words, this vector encodes the values of policy outcomes that policymakers may take into account when making such decisions (see Section 6.1).
In Robust, a query is a comparison between two alternatives. The set of possible queries is $\mathcal{C} := \{(i, i') \in \mathcal{I}^2 \mid i < i'\}$. A particular choice of $K$ queries indexed in the set $\mathcal{K} := \{1, \ldots, K\}$ is represented by $\boldsymbol{\iota} \in \mathcal{C}^K$, which specifies which alternatives are compared in the $K$ queries. For $\kappa \in \mathcal{K}$, $\iota^\kappa := (\iota^\kappa_1, \iota^\kappa_2) \in \mathcal{C}$ denotes that the $\kappa$th query elicits the preference between alternatives $x^{\iota^\kappa_1}$ and $x^{\iota^\kappa_2}$. Robust assumes that each agent's utility function is linear, with utility coefficients $u \in \mathbb{R}^J$. Therefore, for an alternative $x \in \mathcal{X}$, the method represents the agent's utility for that alternative as $u^\top x$. In this setting, the true utility of the agent is uncertain and $\mathcal{U} \subseteq \mathbb{R}^J$ is used to denote the agent's initial uncertainty set, where each element $u \in \mathcal{U}$ represents one possible realization of the agent's utility function. The method assumes that $\mathcal{U}$ is a non-empty bounded polyhedron. When asked the $\kappa$th query, an agent is able to respond in one of three ways using the elements of $\mathcal{S} := \{-1, 0, 1\}$: either the agent prefers alternative 1 ($s_\kappa = 1$), is indifferent between the alternatives ($s_\kappa = 0$), or prefers alternative 2 ($s_\kappa = -1$). Robust assumes that these responses are inconsistent or noisy and defines the parameter $\Gamma$ as a threshold of inconsistency in the agent's responses. These inconsistencies are assumed to lie in the set $\mathcal{E}_\Gamma := \{\epsilon \in \mathbb{R}^K_+ : \sum_{\kappa \in \mathcal{K}} \epsilon_\kappa \le \Gamma\}$, and the elements of $\epsilon \in \mathbb{R}^K_+$ are assumed to be independent and normally distributed with standard deviation $\sigma$.
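To make the role of the inconsistency budget concrete, the following minimal Python sketch checks whether a candidate utility vector is compatible with a set of noisy responses. It rests on an assumption that is our reading of the noise model rather than a formula quoted from [31]: each response is reconciled with the utility difference through a nonnegative slack, and the slacks must sum to at most $\Gamma$. The function names and the toy numbers are hypothetical.

```python
from typing import Sequence


def utility(u: Sequence[float], x: Sequence[float]) -> float:
    """Linear utility u^T x."""
    return sum(ui * xi for ui, xi in zip(u, x))


def required_slack(u, x_first, x_second, response: int) -> float:
    """Smallest eps >= 0 that reconciles `response` with utility vector u.

    Assumed reading of the noise model (not quoted from [31]):
      response  1 (prefers first):   u^T(x1 - x2) >= -eps
      response -1 (prefers second):  u^T(x1 - x2) <=  eps
      response  0 (indifferent):    |u^T(x1 - x2)| <= eps
    """
    diff = utility(u, x_first) - utility(u, x_second)
    if response == 1:
        return max(0.0, -diff)
    if response == -1:
        return max(0.0, diff)
    if response == 0:
        return abs(diff)
    raise ValueError("response must be -1, 0, or 1")


def is_consistent(u, comparisons, gamma: float) -> bool:
    """True if the total slack needed across all responses fits the budget."""
    total = sum(required_slack(u, a, b, s) for a, b, s in comparisons)
    return total <= gamma


# Toy data: u values the first alternative slightly more (0.5 vs 0.4),
# but the agent reported preferring the second one.
u = (0.5, 0.5)
comparisons = [((1.0, 0.0), (0.0, 0.8), -1)]
tight = is_consistent(u, comparisons, gamma=0.05)
loose = is_consistent(u, comparisons, gamma=0.2)
```

In this toy case the response contradicts `u` by roughly 0.1 units of utility, so `u` is ruled out under the tight budget and retained under the loose one. A larger budget therefore keeps more utility vectors in play, which is the sense in which it makes the method more conservative.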
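For each $\kappa \in \mathcal{K}$, once the agent responds to query $\iota^\kappa$ with response $s_\kappa$, Robust updates the uncertainty set for the agent's utility to $\mathcal{U}_{\Gamma(\kappa)}(\boldsymbol{\iota}^\kappa, \boldsymbol{s}^\kappa)$, where $\boldsymbol{\iota}^\kappa := \{\iota^k\}_{k=1}^{\kappa}$ and $\boldsymbol{s}^\kappa := \{s_k\}_{k=1}^{\kappa}$ denote the $\kappa$ queries and responses, respectively, that have been elicited thus far and $\Gamma(\kappa)$ denotes the level of inconsistency at query $\kappa \in \{1, \ldots, K\}$. By the assumptions on $\epsilon$, $\Gamma(\kappa)$ can be set from a chosen confidence level and the standard deviation $\sigma$; we refer the reader to [31] for the explicit expressions.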

Preference Elicitation Optimization
For a set of alternatives $\mathcal{X}$ and agent's initial uncertainty set $\mathcal{U}$, Robust aims to solve the following robust recommendation problem to select an alternative that will maximize the agent's worst-case utility:
$$\max_{x \in \mathcal{X}} \; \min_{u \in \mathcal{U}} \; u^\top x.$$
As mentioned above, the method is able to query the agent and update its knowledge of the agent's true utility in the uncertainty set relative to these queries and respective responses. This is referred to as online elicitation, in which queries are selected one at a time. As each query is selected and the agent provides their response, the method integrates this information into the uncertainty set and uses it to adaptively select the next query. As shown in [31], this online elicitation procedure will ask more informative queries and lead to a recommendation with higher worst-case utility than queries chosen offline, i.e., chosen all at once before receiving any of the agent's responses.
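As a small worked illustration of the max-min recommendation, consider the special case in which the uncertainty set is the unit simplex (nonnegative coefficients summing to one) and no responses have been incorporated yet. A linear function attains its minimum over the simplex at a vertex, so the worst-case utility of an alternative is simply its smallest attribute. The sketch below covers only this special case; once responses add cuts to the set, the inner minimization becomes a linear program, which is not handled here. The function names and the three toy alternatives are hypothetical.

```python
def worst_case_utility_simplex(x):
    """min over {u >= 0, sum(u) = 1} of u^T x, which equals min_j x_j."""
    return min(x)


def robust_recommendation(alternatives):
    """Return (index, worst-case utility) of the max-min alternative."""
    best = max(
        range(len(alternatives)),
        key=lambda i: worst_case_utility_simplex(alternatives[i]),
    )
    return best, worst_case_utility_simplex(alternatives[best])


# Toy alternatives with three attributes each (hypothetical values).
alternatives = [
    [0.9, 0.1, 0.5],   # strong on one attribute, weak on another
    [0.4, 0.45, 0.5],  # balanced
    [0.3, 0.8, 0.2],
]
index, value = robust_recommendation(alternatives)
```

Here the balanced alternative is recommended because its weakest attribute (0.4) is the largest among the three worst cases (0.1, 0.4, 0.2). This also illustrates why queries matter: without preference information, the worst case is driven entirely by each alternative's weakest attribute.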
Robust solves a series of optimization problems, one for each $\kappa = 1, \ldots, K$, updating the uncertainty set as it elicits the agent's preference at each query. Specifically, for each $\kappa$, the method solves the following problem
$$\max_{\iota^\kappa \in \mathcal{C}} \; \min_{s_\kappa \in \mathcal{S}} \; \max_{x \in \mathcal{X}} \; \min_{u \in \mathcal{U}_{\Gamma(\kappa)}(\boldsymbol{\iota}^\kappa, \boldsymbol{s}^\kappa)} \; u^\top x, \qquad (1)$$
which can be reformulated as a finite mixed-binary linear program (see [31] for details). Note that in the problem above, $\{\iota^k\}_{k=1}^{\kappa-1}$ and $\{s_k\}_{k=1}^{\kappa-1}$ are data according to the queries and responses that have been previously elicited. As formulated above, the method is robust with respect to both the worst-case response $s_\kappa \in \mathcal{S}$ to the selected query and the realization of the agent's utility coefficients $u \in \mathcal{U}_{\Gamma(\kappa)}(\boldsymbol{\iota}^\kappa, \boldsymbol{s}^\kappa)$ as characterized by the selected queries and respective responses. Finally, we point out that Problem (1) is deterministic. Thus, we can compute each optimal query $\iota^\kappa \in \mathcal{C}$ for every possible sequence of previous responses $\{s_k\}_{k=1}^{\kappa-1}$ that a user can give. We will leverage this in our experimental design (see Section 5.2).

COVID-19 PATIENT PRIORITIZATION
We will evaluate Robust through the problem of designing policies that allocate scarce critical care unit (CCU) beds during the COVID-19 pandemic, using real data from the United Kingdom. Throughout the pandemic, many hospitals have become inundated with an influx of patients, leading to resource shortages of lifesaving equipment such as ventilators and CCU beds [24]. Doctors must then decide which patients receive this scarce equipment and who must go without. Not only does this impose an emotional burden on doctors and medical staff [12,17], but without a disciplined way of allocating these resources, it is also likely to lead to an inefficient, unfair, or inconsistent allocation. Therefore, hospitals implement policies that assign these scarce resources to patients in times of overwhelming need. There currently exist many candidate policies for prioritizing patients for resources, from first-come-first-served, to point scoring rules based on the severity of disease, to a prediction of probability of hospital death (see e.g., [11,25] for surveys and discussion). Each of these policies has unique performance outcomes and ethical implications that must be considered when implementing them in a real hospital. Thus, the use of a preference elicitation tool on stakeholders such as doctors can bring great value in determining the allocation policy that these stakeholders most prefer.

EXPERIMENTAL DESIGN
We now describe the experiment we have designed to evaluate the effectiveness of Robust in a deployment setting. We recruit participants to report their preferences in a questionnaire of pairwise comparisons chosen by Robust and randomly selected pairwise comparisons. As the final query in the questionnaire, users directly report their preference between the robustly recommended alternatives according to each procedure, determining which is more effective at recommending an item that the user prefers.

Participants and Incentives
We recruited 250 participants from Amazon Mechanical Turk to take our questionnaire. We collected participants' demographic information such as age group, race/ethnicity, gender, and whether the individual works in healthcare or not. All participants were required to be at least 18 years old, have access to the Internet, and be proficient in English. Participants were only allowed to take the survey once with a time limit of two hours. They were paid $2.50 once they finished answering all pairwise comparisons in the questionnaire. Our study is IRB approved through an exempt review.

Procedure
Pairwise Comparisons. To successfully complete the questionnaire, each participant answers $2K + 1$ pairwise comparison queries. Each user is randomly assigned to one of two groups. The first group answers $K$ queries that are chosen by Robust as described in Section 3.2, a set of "memory-cleansing" questions (see below), and then $K$ queries that are chosen randomly. The second group takes the survey in the opposite order, first answering the randomly selected queries, then the "memory-cleansing" questions, and then the queries chosen by Robust. For each query, we randomize the order of its presentation, i.e., which alternative will appear on the left side of the screen versus the right side of the screen. This randomization of placement on the screen and the random assignment of users to each group offset the order biases that have been well studied in choice-based conjoint analysis, see e.g., [9].
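The two randomizations described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the platform's code: the function names, the block labels, and the fair coin flips used for group assignment and left/right placement are assumptions made for the example.

```python
import random


def build_questionnaire(k: int, rng: random.Random):
    """Return (group, ordered blocks) for one participant.

    Group 1 answers the Robust queries first; group 2 answers the random
    queries first. Each block is a (label, number_of_questions) pair.
    """
    group = rng.choice([1, 2])
    first, second = ("robust", "random") if group == 1 else ("random", "robust")
    blocks = [
        (first, k),
        ("memory_cleansing", 3),   # the three unrelated reasoning questions
        (second, k),
        ("final_comparison", 1),   # recommendation of one method vs the other
    ]
    return group, blocks


def present(query, rng: random.Random):
    """Randomize which alternative appears on the left versus the right."""
    a, b = query
    return (a, b) if rng.random() < 0.5 else (b, a)


rng = random.Random(2020)          # seeded only to make the example repeatable
group, blocks = build_questionnaire(10, rng)
pairwise_total = sum(n for label, n in blocks if label != "memory_cleansing")
left, right = present((3, 7), rng)
```

With ten queries per method, `pairwise_total` is 21, matching the $2K + 1$ count. Keeping the random generator as an explicit argument makes each participant's assignment reproducible for later auditing.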
Regardless of which group the user is placed in, once they have completed the $2K$ queries described above, we compute the recommended alternatives for the user by solving the robust recommendation problem of Section 3.2 over the uncertainty set induced by the selected queries and responses for each method. More concretely, let $\boldsymbol{\iota}^{\text{robust}} \in \mathcal{C}^K$ and $\boldsymbol{s}^{\text{robust}} \in \mathcal{S}^K$ be the queries chosen by Robust and the corresponding responses, respectively. We then solve the robust recommendation problem over $\mathcal{U}_\Gamma(\boldsymbol{\iota}^{\text{robust}}, \boldsymbol{s}^{\text{robust}})$ to obtain $x^*_{\text{robust}} \in \mathcal{X}$, the optimal alternative to recommend to the user with respect to the preference information elicited by Robust, together with its optimal value in $\mathbb{R}$, the worst-case utility guaranteed by this recommendation. We then similarly solve the robust recommendation problem over $\mathcal{U}_\Gamma(\boldsymbol{\iota}^{\text{rand}}, \boldsymbol{s}^{\text{rand}})$ to obtain $x^*_{\text{rand}} \in \mathcal{X}$ and its worst-case utility for the queries selected randomly and the corresponding responses.
The $(2K + 1)$th, or final, pairwise comparison we present to the user is a comparison between $x^*_{\text{robust}}$ and $x^*_{\text{rand}}$, where we similarly randomize the presentation of the alternatives. Since each alternative is the optimal recommendation under each elicitation procedure, this final comparison enables us to directly determine which of the two methods recommends an item that is preferred.
Memory-cleansing and Attention-checking. Memory-cleansing questions play an important role in our experiment design by allowing us to directly compare the performance of both methods on each user. For this process, users answer the following three questions, known as the "Cognitive Reflection Test" [13]: (1) A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? (2) If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets? (3) In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take for the patch to cover half of the lake?
These questions serve two purposes. First, by having the users focus on this unrelated task before switching from one part of the questionnaire to the other, these questions help to reduce the presence of a carry-over effect from method to method. Second, we require the users to answer these questions as free-form text as a form of attention-checking. Note that we do not check for correctness of the users' answers but rather that the answers are "sensible", e.g., not a random string of characters or other "bot-like" behavior. We also record the time it took for the users to answer each pairwise comparison query for post hoc attention-checking analysis (see Section 6.3).
Robust Query and Recommendation Calculation. Recall that Problem (1) is deterministic, so we can compute each optimal query for every possible sequence of responses that a user can give before running the experiment. Storing these optimal queries in a lookup table further reduces the potential bias between the two query selection methods: each query that is presented to the user can be shown "instantaneously" whether it is selected by Robust or randomly. In other words, there is no need to wait for Robust to solve its optimization problem to select the next query to ask.
To create the lookup table, we solve the mixed-binary linear programming formulation of Problem (1) for $K = 10$ and with initial partworth uncertainty set $\mathcal{U} = \{u \in \mathbb{R}^J_+ : u^\top \mathrm{e} = 1\}$, where $\mathrm{e}$ denotes the vector of all ones, as in the experimental results of [31]. We solve this for every possible sequence of responses $\{s_k\}_{k=1}^{\kappa-1}$ that a user can report for $\kappa \in \mathcal{K}$. We generate each $\Gamma(\kappa)$ according to the formula referenced in Section 3.1 with a confidence level of 90% and $\sigma = 0.1$, representing a 90% confidence level that the $\epsilon_\kappa$ lie within $\mathcal{E}_{\Gamma(\kappa)}$. We use the same confidence level as [31] but choose a slightly larger $\sigma$ value than what the authors present. We use this larger $\sigma$ value since we cannot validate whether a choice of $\sigma$ is misspecified without access to the user's true utilities, the difficulty of which in a real deployment setting is discussed in Sections 1 and 2. We note that by using a higher value of $\sigma$, we decrease the chance of misspecifying the model of Robust at the risk of obtaining over-conservative results. Nevertheless, we demonstrate that, even if this choice of $\sigma$ is over-conservative, Robust performs well in this setting (see Section 6.4). To conclude, with the parameters above, we solve Problem (1) for each of the $3^1 + 3^2 + \cdots + 3^9$ possible sequences of responses. For each user's respective queries and responses, we solve the robust recommendation problems over $\mathcal{U}_\Gamma(\boldsymbol{\iota}^{\text{robust}}, \boldsymbol{s}^{\text{robust}})$ and $\mathcal{U}_\Gamma(\boldsymbol{\iota}^{\text{rand}}, \boldsymbol{s}^{\text{rand}})$ with initial partworth uncertainty set $\mathcal{U}$ and $\Gamma = \Gamma(10)$. All problems are solved on 6 Xeon 2.6 GHz cores with 1 GB of memory using Gurobi 10.0.0.
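The lookup-table idea can be sketched as a dictionary keyed by the sequence of responses given so far. In the sketch below, `placeholder_select_query` is a hypothetical deterministic stand-in for solving Problem (1); it is not the robust query selection of [31] and is included only so the example runs. The toy sizes (5 alternatives, 3 queries) are likewise assumptions chosen to keep the table small.

```python
from itertools import combinations, product

RESPONSES = (-1, 0, 1)


def placeholder_select_query(history, queries):
    """Hypothetical stand-in for solving Problem (1) given past responses.

    Any deterministic rule suffices to illustrate the table; this one is
    NOT the optimization-based selection of the robust method.
    """
    index = sum((s + 2) * 3 ** pos for pos, s in enumerate(history)) % len(queries)
    return queries[index]


def build_lookup(num_queries, queries):
    """Map every response history of length 0..num_queries-1 to the next query."""
    table = {}
    for length in range(num_queries):
        for history in product(RESPONSES, repeat=length):
            table[history] = placeholder_select_query(history, queries)
    return table


queries = list(combinations(range(5), 2))   # toy set of 5 alternatives
table = build_lookup(3, queries)            # toy K = 3: histories of length 0, 1, 2

# At survey time the next query is a constant-time dictionary read.
next_query = table[(1, -1)]

# Number of non-empty response histories for K = 10, i.e. 3^1 + ... + 3^9.
nonempty_histories_k10 = sum(3 ** length for length in range(1, 10))
```

For the toy table there are 1 + 3 + 9 = 13 entries, and the sum for ten queries evaluates to 29,523 non-empty histories. Precomputing trades offline solver time for a uniform response latency across the two query selection methods.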

RESULTS
Through our platform, the study participants take a questionnaire wherein they report their preferences between prioritization policies for COVID-19 patients. These policies are generated using real data from the United Kingdom as in [19]. For each user, we specifically evaluate whether the robust recommendation under the preference information obtained by Robust is preferred compared to the robust recommendation under the preference information obtained by randomly selected queries.

COVID-19 Patient Prioritization Policy Generation
We now describe how we use real data to simulate policies for prioritizing COVID-19 patients for treatment, which we use as the alternatives in our experiment. We note that one could evaluate a preference elicitation method on randomly generated vectors of policy features. However, evaluating on feature vectors that are the outcomes of implementable policies computed using real data will arguably yield stronger evidence about the method's performance in real-world settings.
We create a simulation to estimate the outcomes of 25 counterfactual COVID-19 patient prioritization policies by age group at the country level using data from the United Kingdom from April 1st to July 15th, 2020. As estimates of the rate of daily arrivals of patients in need of critical care, we use the United Kingdom's daily projections of the expected number of COVID-19 CCU patients over this time period from the Institute for Health Metrics and Evaluation (IHME) model. As estimates of both the distribution of these patients' ages and their probability of recovery, we use the historical proportions and survival rates of COVID-19 patients in the CCU by age as provided by the UK Intensive Care National Audit and Research Centre over this time period. Note that we use only age as a patient characteristic due to the unavailability of patient outcome data based on, say, both age and race.
A policy assigns a score to each patient based on their age and the number of days they have waited for a critical care unit (CCU) bed, allocating beds to the highest-scoring patients. We consider policies in the form of regression trees. For a given policy, we calculate a patient's score by starting at the root node and traversing the tree, proceeding to the left or right child of a node depending on whether the patient satisfies the given condition until a leaf node is reached. A leaf node contains a real number between 0 and 1, representing the patient's score. We generate 25 regression tree policies of depth three by randomly picking a feature and a comparison value for each branching node and a random number between 0 and 1 for the leaf nodes. These leaf node values create a priority ranking of individuals for treatment based on patient characteristics, a general concept that is used in many CCUs (see Section 4).
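The tree-based scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the generator used in the study: the value ranges for the two features, the `feature <= threshold` form of the branching condition, and the reading of "depth three" as three levels of branching nodes are all assumptions, and every name is hypothetical.

```python
import random

# Assumed value ranges used to draw comparison values (not from the source).
FEATURES = {"age": (0, 100), "days_waiting": (0, 30)}


def random_tree(depth: int, rng: random.Random) -> dict:
    """Random regression tree whose internal nodes test `feature <= threshold`."""
    if depth == 0:
        return {"score": rng.random()}        # leaf: priority score in [0, 1)
    feature = rng.choice(sorted(FEATURES))
    low, high = FEATURES[feature]
    return {
        "feature": feature,
        "threshold": rng.uniform(low, high),
        "left": random_tree(depth - 1, rng),  # taken when the condition holds
        "right": random_tree(depth - 1, rng),
    }


def score(tree: dict, patient: dict) -> float:
    """Traverse from the root to a leaf and return the leaf's score."""
    while "score" not in tree:
        branch = "left" if patient[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["score"]


def prioritize(tree: dict, patients: list, beds: int) -> list:
    """Allocate beds to the highest-scoring waiting patients."""
    ranked = sorted(patients, key=lambda p: score(tree, p), reverse=True)
    return ranked[:beds]


rng = random.Random(1)
policies = [random_tree(3, rng) for _ in range(25)]
waiting = [
    {"age": 70, "days_waiting": 2},
    {"age": 30, "days_waiting": 5},
    {"age": 55, "days_waiting": 0},
]
admitted = prioritize(policies[0], waiting, beds=2)
```

Because each leaf holds a single number, every patient profile maps to one score, and sorting by that score yields the priority ranking the policy implies.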
For a given prioritization policy, we simulate the events in the following order for each day in the time frame of our simulation: 1) First, COVID-19 patients who require CCU beds arrive at a hospital(s) according to our estimated daily arrival rate; 2) Next, some patients in critical care pass away or recover according to our estimated survival rate that is dependent on their age, making these CCU beds newly available; 3) Then, some waiting patients, who did not receive a bed, pass away according to a constant probability independent of their age; 4) Finally, the policy assigns waiting patients to the available (finite) CCU beds, making these CCU beds unavailable to other patients. If at this point there are no patients waiting for a bed, no patients in the CCU, and no expected future arrivals of patients, we terminate the simulation.
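The four daily steps and the termination rule can be sketched as a simple loop. The sketch makes several simplifying assumptions that are not taken from the study: patients leave the CCU with a single daily exit probability (death and recovery are not distinguished), the waiting-list mortality step applies to everyone currently waiting, including same-day arrivals, and all names and toy inputs are hypothetical.

```python
import random


def simulate(arrivals_per_day, ccu_exit_prob, waiting_death_prob, beds, priority, rng):
    """Run the daily loop; return (patients admitted, patients who died waiting).

    arrivals_per_day: list of lists with the ages of patients arriving each day
    ccu_exit_prob:    function age -> daily probability of leaving the CCU
                      (assumed simplification of death or recovery)
    priority:         function (age, days_waiting) -> score, higher is served first
    """
    waiting, in_ccu = [], []      # waiting holds [age, days_waiting]; in_ccu holds ages
    admitted = died_waiting = 0
    day = 0
    # Terminate once nobody is waiting, nobody is in the CCU, and no arrivals remain.
    while day < len(arrivals_per_day) or waiting or in_ccu:
        # 1) New patients arrive.
        if day < len(arrivals_per_day):
            waiting.extend([age, 0] for age in arrivals_per_day[day])
        # 2) Some CCU patients pass away or recover, freeing their beds.
        in_ccu = [age for age in in_ccu if rng.random() >= ccu_exit_prob(age)]
        # 3) Some waiting patients pass away, independent of age.
        survivors = [p for p in waiting if rng.random() >= waiting_death_prob]
        died_waiting += len(waiting) - len(survivors)
        # 4) The policy assigns the free beds to the highest-priority patients.
        survivors.sort(key=lambda p: priority(p[0], p[1]), reverse=True)
        free = beds - len(in_ccu)
        in_ccu.extend(p[0] for p in survivors[:free])
        admitted += min(free, len(survivors))
        waiting = [[age, days + 1] for age, days in survivors[free:]]
        day += 1
    return admitted, died_waiting


# Deterministic toy run: every CCU stay lasts one day and nobody dies waiting.
result = simulate(
    arrivals_per_day=[[70, 30, 50], [40]],
    ccu_exit_prob=lambda age: 1.0,
    waiting_death_prob=0.0,
    beds=2,
    priority=lambda age, days_waiting: -age,   # toy rule: youngest first
    rng=random.Random(0),
)
```

In the toy run, two of the three first-day arrivals are admitted, the third waits a day and is admitted alongside the second-day arrival, and the loop stops one day later when the CCU empties. Note that the loop only terminates if patients eventually leave the CCU and the waiting list, so the probabilities passed in should not all be zero.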
The features we estimate, and by which users will evaluate policies, are: 1) the total number of life-years saved, 2) the overall survival probability of patients, 3) the survival probabilities of patients across six age groups, and 4) the probabilities of patients receiving a CCU bed across these same six age groups. We also record the coefficients of variation (CV) with respect to metrics 3) and 4) in order to capture the notion that individuals may prefer more "fair" policies, i.e., policies with low CV values. We choose these metrics to represent various efficiency and fairness metrics that policymakers may consider in such a setting, where 1) and 2) are efficiency metrics and 3) and 4) are fairness metrics with respect to age. Specifically, 3) represents fairness in terms of outcomes and 4) represents fairness in terms of access. Thus, the trade-offs between these metrics must be considered when users report their preferences. Domain experts may include other metrics of interest as appropriate.
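As a brief illustration of the fairness summary, the coefficient of variation of a set of per-age-group probabilities can be computed as the standard deviation divided by the mean. The text does not say whether the population or the sample standard deviation is used, so the population form below is an assumption, and the two example vectors are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (assumed definition)."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / m


# Hypothetical bed-access probabilities for six age groups under two policies.
equal_access = [0.6] * 6
skewed_access = [0.9, 0.8, 0.7, 0.5, 0.3, 0.2]

cv_equal = coefficient_of_variation(equal_access)
cv_skewed = coefficient_of_variation(skewed_access)
```

A policy that treats all age groups identically has a CV of zero, while a policy whose access probabilities differ widely across groups has a larger CV, which is why low values are read as more "fair" in this sense.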
Because the CV value is less interpretable compared to the other features, we do not show this metric directly to the user within the pairwise comparisons. However, since Robust and the robust recommendations will have access to these values, the methods may still determine how this notion of fairness contributes to the user's utility. In total, each of the 25 randomly generated policies is characterized by 16 features, a similar scale in number of alternatives and features as tested in [31]. These 25 policies become our set of alternatives $\mathcal{X}$, for which there are 300 unique pairwise comparisons that each user could be asked.
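The two counts quoted above can be checked with a few lines of Python. The breakdown of the 16 features is our reading of the metric list (two scalar efficiency metrics, two groups of six age-specific metrics, and one CV per group), and the variable names are hypothetical.

```python
from itertools import combinations
from math import comb

NUM_POLICIES = 25
AGE_GROUPS = 6

# Query set: all unordered pairs (i, i') of policies with i < i'.
queries = list(combinations(range(1, NUM_POLICIES + 1), 2))

# Feature count under the assumed breakdown: life-years saved, overall
# survival, per-age survival, per-age bed access, and one CV for each of
# the two per-age groups of metrics.
num_features = 1 + 1 + AGE_GROUPS + AGE_GROUPS + 2
```

Enumerating the pairs gives 300 queries, which equals `comb(25, 2)`, and the assumed breakdown sums to 16 features.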

Preference Elicitation Platform
We designed a platform to deploy Robust through Amazon Mechanical Turk over a period of two consecutive days in an effort to diversify the sample of the study population. After participants provide consent, they view the landing page of the questionnaire. This page familiarizes them with the context of our setting, asking them to imagine that they are a healthcare professional working at a hospital during May of 2020, before the wide-scale availability of vaccines and resources for treating COVID-19. We instruct them that their goal is to help determine a set of guidelines at a hospital to decide which patients will receive a bed, ventilator, or other lifesaving treatment in a CCU when there are more patients than available resources. They are also given a high-level overview of the outcomes upon which they will evaluate the policies. An example of a pairwise comparison in the platform and the outcomes of policy features is shown in Figure 1.

User Preference Data Pre-Processing
We now describe our data cleaning procedure. When a user took the survey multiple times, despite being instructed not to, we kept their first attempt and removed any later attempts to avoid potential bias from repeated exposure. We also manually checked and removed any users who provided bot-like answers to the memory-cleansing questions (see Section 5.2). As post hoc attention checks, we removed any users who took less than 15 seconds to answer the first query or who took less than an average of 3 seconds to answer the following queries. We use two different cut-off values because we observed that users take longer on the first query as they become familiar with the structure and setup of our questionnaire and interface design. As the survey continues, individuals may answer queries more quickly as they become more familiar with the process. We also removed an outlier user who took more than one hour to complete the survey, as no other user took more than 40 minutes. We note that the results of our analysis below remain significant when filtering out participants who took less than 30 seconds to answer the first query and less than 5 seconds on average to answer the remaining queries. Introducing larger cut-off values removed too many samples to be able to make any statistical claims such as those that follow. Finally, we note that it is possible that the final query in the questionnaire, between $x^*_{\text{robust}} \in X$ and $x^*_{\text{rand}} \in X$, is such that the two policies are the same, i.e., $x^*_{\text{robust}} = x^*_{\text{rand}}$. From Robust's construction of queries in C and the manner in which we select random queries, this is the only time in which a query with the same two policies can be presented to a user. When this occurs, the preference information gained from asking robust queries and from asking random queries is equivalent insofar as the same policy maximizes the user's worst-case utility. As a form of attention-checking, we remove any responses in which this occurs and the user does not report that they are indifferent between the two policies, since these two policies are exactly the same.
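The automated parts of this cleaning procedure can be sketched as a single filter. The Python sketch below is illustrative only: the `Response` record and its field names are hypothetical, the one-hour outlier rule is approximated by summing per-query times, and the manual check for bot-like answers is not represented.

```python
from dataclasses import dataclass
from typing import List

FIRST_QUERY_MIN_SECONDS = 15.0        # cut-off for the first query
LATER_QUERIES_MIN_MEAN_SECONDS = 3.0  # cut-off for the mean of the remaining queries
MAX_TOTAL_SECONDS = 3600.0            # one-hour outlier rule (approximated, see lead-in)


@dataclass
class Response:
    """Hypothetical record of one survey attempt; field names are assumptions."""
    worker_id: str
    attempt: int                    # 1 for the worker's first attempt
    query_seconds: List[float]      # time spent on each query, in order
    final_policies_identical: bool  # True when both final policies are the same
    final_answer: str               # "left", "right", or "indifferent"


def keep_response(r: Response) -> bool:
    """Return True if the response passes every automated cleaning rule."""
    if r.attempt != 1:
        return False  # keep only the first attempt
    if r.query_seconds[0] < FIRST_QUERY_MIN_SECONDS:
        return False  # too fast on the first query
    later = r.query_seconds[1:]
    if later and sum(later) / len(later) < LATER_QUERIES_MIN_MEAN_SECONDS:
        return False  # too fast on average afterwards
    if sum(r.query_seconds) > MAX_TOTAL_SECONDS:
        return False  # completion-time outlier
    if r.final_policies_identical and r.final_answer != "indifferent":
        return False  # failed the identical-policies attention check
    return True


def clean(responses: List[Response]) -> List[Response]:
    return [r for r in responses if keep_response(r)]
```

The stricter sensitivity check described above corresponds to raising the two minimum-time constants to 30 and 5 seconds.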
After we clean the data as described above, our final population size is 193 participants. The demographic information of this group is given in Table 1. Our population skews toward young, white males with at least some college education. These types of biases are well-documented for Amazon Mechanical Turk workers, see e.g., [3,18]. The participants took an average of 6.5 minutes, with a standard deviation of 5.5 minutes, to complete the questionnaire.

Analysis of User Preference Data
We first analyze the unique preferences of the study population, which we are interested in investigating for two reasons. First, various elicitation procedures evaluated through simulation assume that utilities are randomly distributed over the population, see e.g., [5,26,31]. Because our study population lacks diversity in terms of demographic information (see Table 1), it is important to determine if there is also a lack of diversity in their preferences. Second, if the questionnaire reveals that the majority of users' preferences are the same, the motivation for developing an online preference elicitation tool is moot. In other words, we could simply find a set of queries or recommended policy that is "one-size-fits-all," no matter the individual user.
To determine the diversity of preferences, we recall that the robust portion of the questionnaire will be the same for those that report the same preferences. Thus, it is possible that in this part of the questionnaire, a subset of users is asked the same sequence of queries from C by Robust if they report the same sequence of responses from S. Of the 193 users, 128 had unique preferences and, therefore, answered a unique set of queries selected by Robust. In Figure 2, we show the breakdown of the remaining 65 users who shared the same preferences with at least one other user for the queries selected by Robust. Though it is clear that some subsets of the study population have the same preferences, a majority differ in this regard.
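One way to carry out this count is to group users by their full sequence of responses to the Robust queries. The Python sketch below uses a toy set of five users; the user identifiers and the response encoding (1, -1, 0 for the two strict preferences and indifference) are hypothetical and are not taken from the study data.

```python
from collections import Counter


def summarize_preference_groups(responses_by_user):
    """Count users with a unique response sequence and users who share one.

    responses_by_user maps a user id to that user's ordered responses to the
    queries chosen by Robust. Because identical responses lead to identical
    queries, grouping by responses alone is sufficient.
    """
    counts = Counter(tuple(seq) for seq in responses_by_user.values())
    unique_users = sum(1 for c in counts.values() if c == 1)
    shared_users = sum(c for c in counts.values() if c > 1)
    return unique_users, shared_users


# Toy data: users "a" and "b" share a sequence; the others are unique.
toy = {
    "a": [1, -1, 0],
    "b": [1, -1, 0],
    "c": [1, 1, 1],
    "d": [-1, -1, 1],
    "e": [0, 1, -1],
}
unique_users, shared_users = summarize_preference_groups(toy)
print(unique_users, shared_users)
```

Applied to the study data, the two returned counts would correspond to the 128 users with unique preferences and the 65 users who share preferences with at least one other user.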
Even though users may have differing preferences with respect to the individual queries selected by Robust, this may be less relevant if the policy that we should recommend to these users is the same. Thus, in Figure 3, we report the number of users who prefer each policy in the last query of the questionnaire. Note that we exclude those that are indifferent in this final query. Though there are a small number of policies that many users prefer in the final query, overall, there is a high amount of variability in these responses. Thus, we can reasonably conclude that because of the diversity of preferences of the users, designing and deploying such an elicitation and recommendation procedure is of practical importance.
We recall that Robust selects queries that optimize the worst-case utility of the recommended item. We additionally emphasize that both $x^*_{\text{robust}}$ and $x^*_{\text{rand}}$ are selected to maximize the worst-case utility of the user with respect to the uncertainty set as characterized by the queries and responses obtained by each method. Thus, even though random queries are selected, the recommendation of $x^*_{\text{rand}}$ is made in a robust manner. In Figure 4, we investigate the distribution of the difference in the worst-case utilities between the recommended policies $x^*_{\text{robust}}$ and $x^*_{\text{rand}}$ for each user as characterized by their respective uncertainty sets, i.e., the worst-case utility of $x^*_{\text{robust}}$ minus that of $x^*_{\text{rand}}$. Robust recommends a policy with an average worst-case utility of 0.60, while asking random queries recommends a policy with an average worst-case utility of 0.52. This difference in average guarantee of utility is statistically significant, $t(384) = 4.58$, $p < 0.001$, and demonstrates the benefit of Robust when optimizing for this worst-case performance.
Ultimately, we desire favorable performance in terms of the user's true utility, or how they actually report their preference. Figure 5 displays the number of users who report that they prefer $x^*_{\text{robust}}$, the policy that Robust recommends; prefer $x^*_{\text{rand}}$, the policy that asking random queries recommends; or are indifferent between the two policies. As mentioned in Section 6.3, it is possible that $x^*_{\text{robust}}$ is the same as $x^*_{\text{rand}}$. Thus, we disaggregate users who report that they are indifferent by whether these policies are the same or not. By disaggregating, we can differentiate between when the two methods make an equivalent recommendation in terms of the policies themselves versus an equivalent recommendation in terms of the user's utility for two unique policies.
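To illustrate what a recommendation "in a robust manner" means, the following Python sketch picks the policy with the highest worst-case utility. It is a simplification under stated assumptions: utilities are assumed linear in the features, and the uncertainty set is replaced by a small finite list of candidate utility vectors, whereas the method of [31] works with an uncertainty set defined by the observed queries and responses. All numbers are made up for illustration.

```python
def worst_case_utility(policy, candidate_utilities):
    """Lowest utility of a policy over all candidate utility vectors.

    Utilities are assumed linear: the utility of a policy is the dot product
    of a utility vector with the policy's feature vector.
    """
    return min(
        sum(u_i * x_i for u_i, x_i in zip(u, policy))
        for u in candidate_utilities
    )


def robust_recommendation(policies, candidate_utilities):
    """Index of the policy that maximizes the worst-case utility."""
    return max(
        range(len(policies)),
        key=lambda i: worst_case_utility(policies[i], candidate_utilities),
    )


# Hypothetical policies described by two features each.
policies = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]]
# Hypothetical utility vectors consistent with the user's answers so far.
candidate_utilities = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

best = robust_recommendation(policies, candidate_utilities)
print(best, worst_case_utility(policies[best], candidate_utilities))
```

In this toy case the balanced third policy is chosen because each of the other two has a worst-case utility of zero under some candidate utility vector. The quantity plotted in Figure 4 is the per-user difference between two such worst-case values, one for each method's recommendation and uncertainty set.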
We first note that indifferences are reported between the two policies for ∼20% of the users, i.e., the two methods are equivalent in their recommendations for ∼20% of users by either recommending the same policy or two policies with similar utility. When considering users that are not indifferent in the final query ($n = 155$), Robust is able to recommend a policy that is more preferred for 33, or ∼21%, more users compared to asking queries randomly.
Figure 3: The number of users who select the given policy as their preferred policy in the last query of the questionnaire, which directly compares the policy recommended by Robust and the policy recommended by asking random queries ($n = 155$). Note that we exclude those that are indifferent in the last query.
We validate that we cannot detect various order effects in our results. We cannot detect a statistically significant difference in the results shown in Figure 5 between the users who first answered the queries selected by Robust ($n = 94$) versus those that first answered the queries selected randomly ($n = 99$), $t(191) = 0.80$, $p = 0.43$. We similarly do not have enough statistical evidence to believe there is a bias in our results concerning whether, in the final query, $x^*_{\text{robust}}$ appears on the right side ($n = 76$) or the left side ($n = 117$) of the screen, $t(191) = 1.47$, $p = 0.14$.
Finally, we validate that our results in Figure 5 are not due to randomness. Specifically, we compare our results to those of users uniformly selecting their preference between the policies displayed in the final query. We focus on our results that have strict preferences, i.e., that are not indifferent ($n = 155$). This is justifiable since we observe that users are unlikely to report that they are indifferent between any two policies. Out of the 21 total queries asked to users across all questionnaires, users reported an indifference for only 9% of the queries. In fact, only 46% of users reported any indifference at all during the questionnaire, while the rest always reported strict preferences. Therefore, we do not believe there exists uniformity between the three choices and only investigate uniformity when selecting a strict preference between the two policies. To this end, we compare our results to a uniform distribution, i.e., users selecting between the two policies in the final query at random, and find that our result is statistically significant and not due to randomness, $\chi^2(1, 155) = 6.61$, $p < 0.01$. Thus, we have significant evidence that Robust will lead to robust recommendations that users are more likely to prefer, or that have higher utility, compared to the robust recommendations that result from asking random queries.
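The form of this goodness-of-fit test can be sketched in a few lines of Python. Two inputs are assumptions rather than figures stated in the text: the counts of 94 and 61 are inferred from the 155 strict preferences and the margin of 33 users, and the Yates continuity correction is applied because, under these inferred counts, it yields a statistic that rounds to the reported 6.61.

```python
def chi_square_uniform_two_choices(count_a, count_b, continuity_correction=True):
    """Chi-square statistic for two observed counts against a 50/50 split.

    With continuity_correction=True, each absolute deviation from the expected
    count is reduced by 0.5 before squaring (Yates correction, an assumption).
    """
    total = count_a + count_b
    expected = total / 2
    statistic = 0.0
    for observed in (count_a, count_b):
        deviation = abs(observed - expected)
        if continuity_correction:
            deviation = max(deviation - 0.5, 0.0)
        statistic += deviation ** 2 / expected
    return statistic


n_strict = 155  # users with a strict preference in the final query
margin = 33     # additional users preferring the Robust recommendation

# Counts inferred from the two figures above (not reported directly in the text)
prefer_robust = (n_strict + margin) // 2
prefer_random = n_strict - prefer_robust

statistic = chi_square_uniform_two_choices(prefer_robust, prefer_random)
print(prefer_robust, prefer_random, round(statistic, 2))
```

The same inferred counts also reproduce the reported margin as a share of strict preferences, since 33 out of 155 is roughly 21%.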

LIMITATIONS
The ideal audience for our interface and application as described in Section 4 is stakeholders in the healthcare field, not necessarily members of the general public. Though about half of the participants reported that they work in healthcare (see Table 1), those who do not may have lacked the necessary domain knowledge to report their preferences in a faithful or meaningful way. However, we tried to familiarize all users with our particular setting through the information provided in the platform's landing page (see Section 6.2).
Our study evaluates whether Robust can recommend an alternative with higher utility compared to the recommended alternative when asking random queries. We make no claim as to whether, for example, either alternative is the user's top-ranked alternative. This would require an exhaustive listing of the user's preferences, a time-intensive process, or for the user to be able to characterize their utility exactly, which is highly impractical as discussed in Sections 1 and 2.

ETHICAL IMPACTS
Through our study, we have shown that using the online robust method of [31] results in a recommendation to an individual that they are more likely to prefer compared to asking queries randomly. However, as with the adoption of any AI system, the end-users of such a method must be accepting of any potential shortcomings in the method's performance. From our results in Figure 5, there certainly exist users for which Robust does not recommend a preferred policy. However, if this method is still an improvement over stakeholders' status quo procedure for policy design and implementation, then it may be reasonable to adopt it. Furthermore, comparison of Robust to other preference elicitation methods in the literature in a deployment setting is a sensible area for future exploration.
Any recommendation system faces the potential issue of automation bias, where individuals are more likely to trust in decisions proposed by AI models as opposed to humans, see e.g., [15]. To mitigate this risk in a real decision-making setting, we suggest that the policy this system recommends serve as a starting point for further refinement, as appropriate, and not necessarily strict adoption.
We also note that the elicitation and recommendation system is limited to the features that are explicitly provided in the alternatives. For example, in our COVID-19 resource allocation policy setting in Section 6.1, there may exist other latent policy features concerning efficiency, fairness, or equity that are not stated within the features of the alternatives but have real consequences if the recommended policy is implemented in reality. One way to address this is to have an initial elicitation process in which stakeholders report their preferences for what features should be represented in the alternatives [14]. Once these features are determined, the appropriate alternatives can be generated to find an agent's most preferred alternative.

CONCLUSION
In this work, we develop an interface for validating the online robust preference elicitation method of [31] in settings with real users. Focusing on an application of designing COVID-19 policies that assign scarce resources to patients, we investigate this method compared to asking random queries using workers from Amazon Mechanical Turk. We find that these robust queries recommend a preferred policy for 21% more users compared to the policy that is recommended when making a robust recommendation with randomly selected queries. Thus, we validate that the robust method is an effective tool for eliciting individuals' preferences in a real deployment setting beyond that of simulation.

Figure 1: An example of a pairwise comparison between two COVID-19 patient prioritization policies displayed in the interface.

Figure 2: The percentage of users who had the same preferences as at least one other user in the study ($n = 65$) for the queries chosen by Robust. Each sector refers to a unique set of preferences, i.e., a unique sequence of responses from S.

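Figure 4: The distribution of the difference in the worst-case utility of the policy recommended by Robust and the policy recommended by asking random queries ($n = 193$). The horizontal axis is the difference in worst-case utility of the recommended policies, and the vertical axis is the number of users. Any user with a positive difference value will have a higher utility in the worst-case for $x^*_{\text{robust}}$ than for $x^*_{\text{rand}}$.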

Figure 5: Results of the last query of the questionnaire, which directly compares the policy recommended by Robust and the policy recommended by asking random queries ($n = 193$), with 95% confidence intervals. Robust (Random) is the number of users that selected that they prefer $x^*_{\text{robust}}$ ($x^*_{\text{rand}}$). Indifferent refers to the number of users that reported that they are indifferent. Different-policies (Same-policies) refers to the instances where a user is presented with policies $x^*_{\text{robust}} \neq x^*_{\text{rand}}$ ($x^*_{\text{robust}} = x^*_{\text{rand}}$) and reports an indifference between the two.