Planning to Fairly Allocate: Probabilistic Fairness in the Restless Bandit Setting

Restless and collapsing bandits are often used to model budget-constrained resource allocation in settings where arms have action-dependent transition probabilities, such as the allocation of health interventions among patients. However, state-of-the-art Whittle-index-based approaches to this planning problem either do not consider fairness among arms, or incentivize fairness without guaranteeing it. We thus introduce ProbFair, a probabilistically fair policy that maximizes total expected reward and satisfies the budget constraint while ensuring each arm a strictly positive lower bound on the probability of being pulled at each timestep. We evaluate our algorithm on a real-world application, where interventions support continuous positive airway pressure (CPAP) therapy adherence among patients, as well as on a broader class of synthetic transition matrices. We find that ProbFair preserves utility while providing fairness guarantees.


INTRODUCTION
Restless multi-armed bandits (RMABs) are used to model budget-constrained resource allocation tasks in which a decision-maker must select a subset of arms (e.g., projects, patients, assets) to receive a beneficial intervention at each timestep, while the state of each arm evolves over time in an action-dependent, Markovian fashion. Such problems are common in healthcare, where clinicians may be tasked with monitoring large, distributed patient populations and determining which individuals to expend scarce resources on so as to maximize total welfare. RMABs have been proposed to determine which inmates should be prioritized to receive hepatitis C treatment in U.S. prisons [2], and which tuberculosis patients should receive medication adherence support in India [24].
Current state-of-the-art approaches to solving RMABs rely on the indexing work introduced by Whittle [42]. While the Whittle index solves an otherwise PSPACE-complete problem in an asymptotically optimal fashion by decoupling arms [41], it fails to provide any guarantees about how pulls will be distributed among arms.
Though the intervention is canonically assumed to be beneficial for every arm, the marginal benefit (i.e., relative increase in the probability of a favorable state transition) varies in accordance with each arm's underlying state transition function. Consequently, Whittle index-based maximization of total expected reward without regard for distributive fairness empirically allocates all available interventions to a small subset of arms, ignoring the rest [29].
There are many application domains where a bimodal distributive outcome may be perceived as unfair or undesirable by beneficiaries and decision-makers, thus motivating efforts to incentivize or guarantee distributive fairness. In the aforementioned healthcare examples, resource constraints and variation in transition dynamics interact: a practical consequence is that a majority of patients will never receive the beneficial intervention(s) in question. This, in turn, means that their clinical outcomes will be strictly worse in expectation than they would be under a policy that guaranteed a non-zero probability of receiving the intervention at each timestep.
To improve distributive fairness, we explore whether it is possible to modify the Whittle index to guarantee each arm at least one pull per user-defined time interval, but find this to be intractable. We then introduce ProbFair, a state-agnostic policy that maps each arm to a fairness-constraint-satisfying, stationary probability distribution over actions that takes the arm's transition matrix into account. At each timestep, we then use a dependent rounding algorithm [39] to sample from this probabilistic policy and produce a budget-constraint-satisfying discrete action vector.
We evaluate ProbFair on a randomly generated dataset and a realistic dataset derived from obstructive sleep apnea patients tasked with nightly self-administration of continuous positive airway pressure (CPAP) therapy [16,17].
Our core contributions include: (i) a novel approach that is both efficiently computable and reward-maximizing, subject to the guaranteed satisfaction of budget and probabilistic fairness constraints; and (ii) empirical results demonstrating that ProbFair is competitive vis-à-vis other fairness-inducing policies, and stable over a range of cohort composition scenarios.

RESTLESS MULTI-ARMED BANDIT MODEL
Here, we give an overview of the restless multi-armed bandit (RMAB) framework, along with our proposed extension, which takes the form of a fairness-motivated constraint. A restless multi-armed bandit consists of N ∈ ℕ independent arms, each of which evolves over a finite time horizon T ∈ ℕ according to an associated Markov Decision Process (MDP). Each arm's MDP is characterized by a 4-tuple (S, A, P, r), where S represents the state space, A represents the action space, P represents an |S| × |A| × |S| transition matrix, and r : S → ℝ represents a local reward function that maps states to real-valued rewards. Appendix A summarizes notation; note that [Z] denotes the set {1, 2, ..., Z}.
States, actions, and observability: We specifically consider a discrete two-state system S = {0, 1}, where 1 (0) represents being in the "good" ("bad") state, and a set of two possible actions A = {0, 1}, where 1 represents the decision to select ("pull") arm i ∈ [N] at time t ∈ [T], and 0 represents the choice to be passive (not pull). In the general RMAB setting, each arm's state s_i(t) is observable. We consider the partially-observable extension introduced by Mate et al. [24], where arms' states are only observable when they are pulled. Otherwise, an arm's state is replaced with the probabilistic belief b_i(t) ∈ [0, 1] that it is in state 1. Such partial observability captures uncertainty regarding patient status and treatment efficacy associated with outpatient or remotely-administered interventions.
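The belief dynamics described above can be sketched in a few lines. This is a minimal illustration of the two-state collapsing model under a passive (active) action; the function names and matrix encoding (P0[s][s1] as the probability of moving from state s to s1 under the passive action) are our own, not the paper's implementation.

```python
# Belief propagation for a two-state, partially-observable arm.
# P0 / P1 are 2x2 row-stochastic matrices for the passive / active action.

def propagate_belief(b, P0):
    """One passive step: updated belief of being in the 'good' state 1."""
    return b * P0[1][1] + (1.0 - b) * P0[0][1]

def collapse_belief(observed_state, P1):
    """After a pull, the true state is observed and the belief collapses
    to the next-step probability of state 1 under the active matrix."""
    return P1[observed_state][1]
```

In simulation, an arm's belief is propagated passively until it is pulled, at which point it collapses to a known value determined by the observed state.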
Transition matrices: Each arm i is characterized by a set of transition matrices P_i, where P^a_{s,s′} represents the probability of transitioning from state s to state s′ when action a is taken. We assume P_i to be (a) static and (b) known by the agent at planning time. Assumptions (a) and (b) are likely to be violated in practice; however, they provide a useful modeling foundation, and can be modified to incorporate additional uncertainty, such as the requirement that transition matrices must be learned [15]. Clinical researchers often use longitudinal data to construct risk-adjusted transition matrices that encode cohort-specific transition probabilities. These can guide patient-level decision-making [40].
Consistent with previous literature, we assume strictly positive transition matrix entries, and impose four structural constraints: (a) P^0_{0,1} < P^0_{1,1}; (b) P^1_{0,1} < P^1_{1,1}; (c) P^0_{0,1} < P^1_{0,1}; (d) P^0_{1,1} < P^1_{1,1} [24]. These constraints are application-motivated: they imply that arms are more likely to remain in a "good" state than to move from a bad state to a good one, and that a pull is helpful whenever it is received. In the absence of such constraints, the effect of the intervention may be superfluous or harmful, rather than desirable.
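As a sanity check, the four structural constraints can be verified directly on a candidate set of transition matrices. A minimal sketch, with our own encoding P[a][s][s1] as the probability of s → s1 under action a:

```python
def satisfies_structural_constraints(P):
    """Check the four application-motivated constraints (a)-(d),
    where P[a][s][s1] = Pr(s -> s1 | action a)."""
    return (
        P[0][0][1] < P[0][1][1]      # (a) good state is stickier when passive
        and P[1][0][1] < P[1][1][1]  # (b) ... and when pulled
        and P[0][0][1] < P[1][0][1]  # (c) a pull helps from the bad state
        and P[0][1][1] < P[1][1][1]  # (d) a pull helps from the good state
    )
```

Randomly generated synthetic arms (Section 5) can be rejection-sampled against this predicate.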
Objective and constraints: In the canonical RMAB setting, the agent's goal is to find a policy π* that maximizes total expected reward, π* = argmax_π E_π[R(s(t))], while satisfying a budget constraint, k ≪ N ∈ ℕ, which allows the agent to select at most k arms at each timestep. We consider a cumulative reward function, R(s(t)) = Σ_{t ∈ [T]} Σ_{i ∈ [N]} r(s_i(t)).
We extend this model by introducing a Boolean-valued, distributive-fairness-motivated constraint, which may take one of two general forms: (1) Time-indexed: a constraint on the sequence of action vectors that is satisfied if each arm is pulled at least once within each user-defined time interval of length ν ≤ T (e.g., at least once every seven days), or a minimum fraction η ∈ (0, 1) of timesteps over the entire time horizon [20].
(2) Probabilistic: a constraint on the stationary probability vector p⃗ = (p_1, ..., p_N), from which discrete actions are drawn, requiring the probability p_i that each arm receives a pull at any given timestep to fall within an interval [ℓ, u], where 0 < ℓ ≤ k/N ≤ u ≤ 1.
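Because the per-arm pull probabilities must lie in [ℓ, u] (writing u for the upper bound) while summing to the per-timestep budget k in expectation, a probabilistic-fairness instance is feasible exactly when Nℓ ≤ k ≤ Nu, i.e., ℓ ≤ k/N ≤ u. A one-line check (helper name ours):

```python
def probabilistic_constraint_feasible(N, k, lower, upper):
    """A sum of N values, each in [lower, upper], can equal k
    iff N*lower <= k <= N*upper."""
    return N * lower <= k <= N * upper
```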

CONTEXT, MOTIVATION & RELATED WORK
In this section, we motivate our ultimate focus on probabilistic fairness by revisiting the distribution of pulls under Whittle-index-based policies. We begin by providing background information on the Whittle index, and then proceed to ask: (1) Which arms are ignored, and why does it matter? (2) Is it possible to modify the Whittle index so as to provide a time-indexed fairness guarantee for each arm? In response to the latter, we demonstrate that time-indexed fairness guarantees necessitate the coupling of arms, which undermines the indexability of the problem. We then identify prior work at the intersection of algorithmic fairness, constrained resource allocation, and multi-armed bandits, and identify desiderata that characterize our own approach.

Background: Whittle Index-based Policies
Pre-computing the optimal policy for a given set of restless or collapsing arms is PSPACE-hard in the general case [28]. However, as established by Whittle [42] and formalized by Weber and Weiss [41], if the set of arms associated with a problem is indexable, we can decouple the arms and efficiently solve the problem using an asymptotically-optimal heuristic index policy.

Mechanics: At each timestep t ∈ [T], the value of a pull, in terms of both immediate and expected discounted future reward, is computed for each decoupled arm i ∈ [N]. This value-computation step relies on the notion of a subsidy, λ, which can be thought of as the opportunity cost of passivity. Formally, the Whittle index is the subsidy required to make the agent indifferent between pulling and not pulling arm i at time t. (Per Section 2, b denotes the probabilistic belief that an arm is in state s = 1; for restless arms, b evolves even under passive actions.) The value function V_i^λ(b) represents the maximum expected discounted reward under passive subsidy λ and discount rate β for arm i with belief state b ∈ [0, 1] at time t:

V_i^λ(b) = max{ λ + r(b) + β V_i^λ(b P^0_{1,1} + (1 − b) P^0_{0,1}),  r(b) + β [b V_i^λ(P^1_{1,1}) + (1 − b) V_i^λ(P^1_{0,1})] },

where the first (second) term corresponds to remaining passive (pulling). Once the Whittle index has been computed for each arm, the agent sorts the indices, and the k arms with the greatest index values receive a pull at time t, while the remaining N − k arms are passive. Weber and Weiss [41] give sufficient conditions for indexability:

Definition 3.1. An arm is indexable if the set of beliefs for which it is optimal to be passive for a given λ, B*(λ) = {b | ∀π ∈ Π*_λ, π(b) = 0}, monotonically increases from ∅ to the entire belief space as λ increases from −∞ to +∞. An RMAB is indexable if every arm is indexable.
Indexability is often difficult to establish, and computing the Whittle index can be complex [21]. Prevailing approaches rely on proving the optimality of a threshold policy for a subset of transition matrices [27]. A forward threshold policy pulls an arm when its state is at or below a given threshold, and makes the arm passive otherwise; the converse is true for a reverse threshold policy. Mate et al. [24] give such conditions for this RMAB setting when r(b) = b, and provide an algorithm, Threshold Whittle, that is asymptotically optimal for forward threshold-optimal arms. Mate et al. [25] expand on this work for any non-decreasing r(b) and present the Risk-Aware Whittle algorithm.

Motivation: Individual Welfare & Whittle
Bimodal allocation: Existing theory does not offer any guarantees about how the sequence of actions will be distributed over arms under Whittle index-based policies, nor about the probability with which a given arm can expect to be pulled at any particular timestep. Prins et al. [29] demonstrate that Whittle-based policies tend to allocate all pulls to a small number of arms, neglecting most of the population. We present similar findings in Appendix B.
This bimodal distribution is a consequence of how the Whittle index prioritizes arms. Whittle favors arms for whom a pull is most beneficial to achieving sustained occupancy in the "good" state, regardless of whether this results in the same subset of arms repeatedly receiving pulls. While the structural constraints in Section 2 ensure that a pull is beneficial for every arm, marginal benefit varies. Since reward is a function of each arm's underlying state, arms whose trajectories are characterized by a relative, but not absolute, indifference to the intervention are likely to be ignored.
Ethical implications: This zero-valued lower bound on the number of pulls an arm can receive aligns with a utilitarian approach to distributive justice, in which the decision-maker seeks to allocate resources so as to maximize total expected utility [3,23]. This may be incompatible with competing pragmatic and ethical desiderata, including egalitarian and prioritarian notions of distributive fairness, in which the decision-maker seeks to allocate resources equally among arms (e.g., Round-Robin), or prioritize arms considered to be worst-off under the status quo, for some quantifiable notion of worst-off that induces a partial ordering over arms [33,38]. We consider the worst off to be arms who would be deprived of algorithmic attention (e.g., not receive any pulls), or, from a probabilistic perspective, would have a zero-valued lower bound on the probability of receiving a pull at any given timestep.
Why algorithmic attention? This choice is motivated by our desire to improve equality of opportunity (i.e., access to the beneficial intervention) rather than equality of outcomes (i.e., observed adherence). The agent directly controls who receives the intervention, but has only indirect control (via actions) over the sequence of state transitions an arm experiences. Additionally, proclivity towards adherence may vary widely in the absence of restrictive assumptions about cohort homogeneity, and focusing on equality of outcomes could thus entail a significant loss of total welfare.
Distributive fairness and algorithmic acceptability: To realize the benefits associated with an algorithmically-derived resource allocation policy, practitioners tasked with implementation must find the policy to be acceptable (i.e., in keeping with their professional and ethical standards), and potential beneficiaries must find participation to be rational.
With respect to practitioners, many clinicians report experiencing mental anguish when resource constraints force them to categorically deny a patient access to a beneficial treatment, and may resort to providing improvised and/or sub-optimal care [6]. Providing fairness-aware decision support can improve acceptability [18,32] and minimize the loss of reward associated with ethically-motivated deviation to a sub-optimal but equitable approach such as Round-Robin [8,9]. For beneficiaries, we posit that an arm may consider participation rational when it results in an increase in expected time spent in the adherent state relative to non-participation (e.g., due to receiving a strictly positive number of pulls in expectation).

Time-indexed Fairness and Indexability
We now consider whether it is possible to modify the Whittle index to guarantee time-indexed fairness while preserving our ability to decouple arms. Unfortunately, the answer is no; we provide an overview here and a detailed discussion in Appendix D.1. Recall that the structural constraints ensure that when an arm is considered in isolation, the optimal action will always be to pull, and that a Whittle-index approach computes the infimum subsidy, λ, an arm requires to accept passivity at time t. Whether or not arm i is actually pulled at time t depends on how its subsidy compares to the infimum subsidies required by other arms. Thus, any modification intended to guarantee time-indexed fairness must be able to alter the ordering among arms, such that any arm i which would otherwise have a subsidy with rank > k when sorted in descending order will now be in the top-k arms. Even if we could construct such a modification for a single arm without requiring time-stamped system information, if every arm had this same capability, then a new challenge would arise: we would be unable to distinguish among arms, and arbitrary tie-breaking could again jeopardize fairness constraint satisfaction.

Additional Related Work
While multi-armed bandit problems are canonically framed from the perspective of the decision-maker, interest in individual and group fairness in this setting has grown in recent years [7,14,20].
In the stochastic multi-armed bandit setting, each arm is characterized by a fixed but unknown average reward rather than by an MDP. The decision-maker thus faces uncertainty about the true utility of each arm and must balance exploration (i.e., pulling arms to gain information about their reward distributions) with exploitation (i.e., pulling the optimal arm(s)) to maximize expected reward. Joseph et al. [14] examine fairness among arms in this setting, and introduce a definition that requires the decision-maker to favor (i.e., select) arms with higher average reward over arms with lower average reward, even in the face of uncertainty. As the authors note, this definition is consistent with reward maximization, but imposes a cost in terms of per-round regret when learning the optimal policy, due to the fact that arms with overlapping confidence intervals are chained until they can be separated with high confidence.
Prior work in other non-restless bandit settings demonstrates that alternative definitions (i.e., those which center distributive fairness among arms, as opposed to the principle that arms with similar average rewards should be treated similarly [10]) generally entail deviation from optimal behavior. Li et al. [20] study the combinatorial sleeping bandit setting, in which arms are stochastic but may be unavailable at any given timestep. They introduce the minimum selection fraction constraint, which we adapt and refer to as time-indexed fairness (see Section 2). Chen et al. [7] consider the contextual bandit setting, and propose an algorithm that guarantees each arm a minimum probability of selection at each timestep.
In the restless setting that we consider, prior works have tended toward opposite ends of the reward-fairness spectrum by either: (1) redistributing pulls without providing arm-level guarantees [19,25]; or (2) guaranteeing time-indexed fairness without providing optimality guarantees [29]. Recent work has also considered the adjacent problem of fairness among intervention providers (i.e., workers) [5]. In contrast to prior work, we aim to guarantee rather than incentivize fairness, without incurring an exponential dependency on the time horizon or sacrificing optimality guarantees. We thus seek an efficient policy that is reward maximizing, subject to the satisfaction of both budget and probabilistic fairness constraints.

METHODOLOGICAL APPROACH
Here we introduce ProbFair, an approximately optimal solution to a relaxed version of the allocation task, in which we guarantee the satisfaction of probabilistic rather than time-indexed fairness, along with the budget constraint. This relaxation is necessary for tractability, as it allows us to precompute a stationary, state-agnostic probability vector, p⃗ = (p_1, ..., p_N), from which constraint-satisfying discrete actions are drawn.
ProbFair maps each arm i to an arm-specific, stationary probability distribution over atomic actions, such that for each timestep t, P[a_i(t) = 1] = p_i and P[a_i(t) = 0] = 1 − p_i, where p_i ∈ [ℓ, u] for all i ∈ [N] and Σ_i p_i = k. Here, ℓ and u are user-defined fairness parameters satisfying 0 < ℓ ≤ k/N ≤ u ≤ 1, per Section 2. Note that ℓT and uT can be interpreted as lower and upper bounds on the expected number of pulls an arm will receive over the time horizon.
In Section 4.1, we describe how to construct the p_i's so as to efficiently approximate our constrained reward-maximization objective within a multiplicative factor of (1 − ε), for any given constant ε > 0. We use a dependent rounding approach, detailed in Section 4.2, to sample from this distribution at each timestep independently, producing a discrete action vector, a⃗(t) ∈ {0, 1}^N, which is guaranteed to satisfy the budget constraint, k [39].
To motivate our approach, note that when we take the union of each arm's stationary probability vector, we obtain a system-level policy, π_PF. Regardless of the system's initial state, repeated application of this policy will result in convergence to a steady-state distribution in which arm i is in the adherent state (i.e., state 1) with probability ω_i, and the non-adherent state (i.e., state 0) with probability 1 − ω_i.
By definition, for any arm i, ω_i satisfies the balance equation ω_i = ω_i Q_{1,1}(p_i) + (1 − ω_i) Q_{0,1}(p_i), where Q(p_i) = p_i P^1_i + (1 − p_i) P^0_i is the effective transition matrix induced by pulling with probability p_i. (2) Thus, ω_i = f_i(p_i), where f_i(p_i) = Q_{0,1}(p_i) / (1 + Q_{0,1}(p_i) − Q_{1,1}(p_i)). (3) We seek the policy which maximizes total expected reward, where reward is non-decreasing in ω_i (i.e., in time spent in the adherent state). Thus, ProbFair is defined as: max_{p⃗} Σ_{i ∈ [N]} f_i(p_i), subject to Σ_i p_i = k and ℓ ≤ p_i ≤ u for all i ∈ [N]. (4) Solving this constrained maximization problem is thus consistent with maximizing the expected number of timesteps each arm will spend in the adherent state, subject to satisfying the budget and probabilistic fairness constraints. We emphasize that our construction process takes the transition matrices of each arm i into account via f_i (Equation 3).
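The stationary adherence probability is simple to compute and to check numerically. A minimal sketch (names ours) that evaluates the closed form for a two-state arm; it can be verified by iterating the mixed chain to its fixed point:

```python
# Stationary probability of the adherent state (state 1) under the
# mixed chain Q(p) = p*P1 + (1-p)*P0, for 2x2 row-stochastic P0, P1.

def stationary_adherence(p, P0, P1):
    q01 = p * P1[0][1] + (1 - p) * P0[0][1]   # bad  -> good
    q11 = p * P1[1][1] + (1 - p) * P0[1][1]   # good -> good
    # Balance: w = w*q11 + (1-w)*q01  =>  w = q01 / (1 + q01 - q11)
    return q01 / (1.0 + q01 - q11)
```

The balance condition is the fixed point of the one-step update w ← w·q11 + (1 − w)·q01, so power iteration from any initial belief converges to the same value.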

Computing the p_i's: Algorithmic Approach
Overview: To construct p⃗, we: (1) partition the arms into subsets X and Y based on the shapes of their respective f_i functions (Eq. 3); (2) perform a grid search over possible ways to allocate the budget, k, between the two subsets of arms, where for each candidate allocation we (2a) solve each sub-problem to produce a probabilistic policy for the arms in that subset and (2b) compute the total expected reward of the policy; (3) take the argmax over this set of grid search values to determine the approximately optimal budget allocation; and (4) form π_PF by taking the union over the policies produced by evaluating each sub-problem at its approximately optimal share of the budget. Figure 1 illustrates this process.
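The overview above can be sketched as follows. This is an illustrative skeleton only: we assume the f_i's are given as callables, and we substitute a trivial clipped-uniform allocation for the actual P1/P2 sub-problem solvers, so this is the shape of the grid search rather than the paper's algorithm.

```python
def solve_subset_uniform(arms, budget, lo, hi):
    """Placeholder sub-problem solver: spread `budget` uniformly over
    `arms`, clipped to [lo, hi]. (Stands in for the real P1/P2 solvers;
    clipping may over- or under-spend the budget slightly.)"""
    if not arms:
        return {}
    p = min(max(budget / len(arms), lo), hi)
    return {i: p for i in arms}

def grid_search_policy(f, arms_x, arms_y, k, lo, hi, steps=20):
    """Steps (2)-(4): grid-search the budget split between the two
    partitions, solve each sub-problem, and keep the best policy."""
    best, best_val = None, float("-inf")
    for j in range(steps + 1):
        k_x = k * j / steps                       # budget share for X
        policy = {**solve_subset_uniform(arms_x, k_x, lo, hi),
                  **solve_subset_uniform(arms_y, k - k_x, lo, hi)}
        val = sum(f[i](p) for i, p in policy.items())
        if val > best_val:
            best, best_val = policy, val
    return best, best_val
```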
P1 (the sub-problem over arms in X) is a concave-maximization problem that can be solved efficiently via gradient descent; its computational complexity is analyzed in [26]. To solve P2 (the sub-problem over arms in Y), we begin by introducing a lemma that we prove in Appendix E: Lemma 4.3. P2 has an optimal solution in which p_i ∈ (ℓ, u) for at most one i ∈ Y.

Algorithm 2 SolveP2
Note: all sorts are ascending; arrays are zero-indexed.
Proof Sketch. By Lemma 4.3, there exists at most one arm with optimal value p_i* ∈ (ℓ, u).
Then, we can rewrite Equation 4 as an optimization problem over set assignment: each arm in Y is assigned probability ℓ or u, with at most one arm taking an interior value. With our solutions to P1 and P2 so defined, the cost of finding our probabilistic policy in this way is O(N³) when all N arms are in X.

Sampling Approach
For problem instances with feasible solutions, Algorithm 1 returns π_PF, a mapping from the set of arms to a set of stationary probability distributions over actions, such that for each arm i, the probability of receiving a pull at any given timestep is in [ℓ, u]. By virtue of the fact that ℓ > 0, this policy guarantees probabilistic fairness constraint satisfaction for all arms. We use a linear-time dependent rounding algorithm introduced by Srinivasan [39] to sample a budget-satisfying discrete action vector from this distribution at each timestep.
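A pipage-style dependent rounding step in the spirit of Srinivasan [39] can be sketched as follows; this is our own simplified version, not the paper's implementation. Each iteration moves probability mass between two fractional entries so that at least one becomes integral, preserving both the total (the budget) and each marginal p_i:

```python
import random

def dependent_round(p, rng=None):
    """Given fractional p summing to an integer k, return a 0/1 vector
    that always sums to k, with entry i equal to 1 w.p. p[i]."""
    rng = rng or random.Random(0)
    p = list(p)
    eps = 1e-12
    frac = [i for i, v in enumerate(p) if eps < v < 1 - eps]
    while len(frac) >= 2:
        i, j = frac[-2], frac[-1]
        alpha = min(1 - p[i], p[j])      # mass we could move j -> i
        beta = min(p[i], 1 - p[j])       # mass we could move i -> j
        if rng.random() < beta / (alpha + beta):
            p[i] += alpha; p[j] -= alpha
        else:
            p[i] -= beta; p[j] += beta
        # at least one of p[i], p[j] is now 0 or 1, so the loop shrinks
        frac = [i for i, v in enumerate(p) if eps < v < 1 - eps]
    return [int(round(v)) for v in p]
```

The branch probabilities make the expected change to each entry zero, which is why the marginals are preserved; the sum is invariant by construction.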

EXPERIMENTAL EVALUATION
In this section, we empirically demonstrate that ProbFair enforces the probabilistic fairness constraint introduced in Section 2 with minimal loss in total expected reward, relative to fairness-aware alternatives. We begin by identifying our comparison policies, evaluation metrics, and datasets. We then present results from three experiments: (1) ProbFair versus fairness-inducing alternative policies, holding the cohort fixed and considering fairness-aligned sets of hyperparameters; (2) ProbFair evaluated on a breadth of cohorts representing different types of patient populations; and (3) ProbFair when fairness is not enforced (i.e., ℓ = 0), to examine the cost of state agnosticism.

Experimental Setup
Policies: In our experiments, we compare ProbFair against a subset of the following baseline§ and fairness-{inducing†, guaranteeing‡, and agnostic★} policies:
Random§: Select k arms uniformly at random at each t.
Round-Robin§,‡: Select k arms at each t in fixed, sequential order.

TW-based heuristics ‡
Select the top-k arms based on Whittle index values, where the set of available arms varies based on time-indexed fairness constraint satisfaction [29].
We specifically consider three Threshold Whittle-based heuristics: H_First, H_Last, and H_Rand. These heuristics partition the k pulls available at each timestep into constrained and unconstrained subsets, where a pull is constrained if it is executed to satisfy a time-indexed fairness constraint. During constrained pulls, only arms that have not yet been pulled the required number of times within a ν-length interval are available; other arms are excluded from consideration, unless all arms have already satisfied their constraints. H_First, H_Last, and H_Rand position constrained pulls at the beginning, end, or randomly within each interval of length ν, respectively. Appendix F.1 provides pseudocode.
Objective: In all experiments, we assign equal value to the adherence of a given arm over time. Thus, we set our objective to reward occupancy in the "good" state: a simple local reward, r(s_i(t)) = s_i(t) ∈ {0, 1}, and an undiscounted cumulative reward function, R(s(t)) = Σ_{t ∈ [T]} Σ_{i ∈ [N]} r(s_i(t)).
Evaluation metrics: We are interested in comparing policies along two dimensions: reward maximization and fairness (i.e., with respect to the distribution of algorithmic attention). To this end, we rely on two performance metrics: (a) intervention benefit and (b) earth mover's distance.
Intervention benefit (IB) is the total expected reward of an algorithm, normalized between the reward obtained with no interventions (0% intervention benefit) and the asymptotically optimal but fairness-agnostic Threshold Whittle algorithm (100%) [24]. Formally, IB = (E[R_ALG] − E[R_NoAct]) / (E[R_TW] − E[R_NoAct]) × 100%. Per Lemma F.2 (App. F.2), the price of fairness (PoF) metric [4] is inversely proportional to intervention benefit. We thus report IB.
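The IB computation itself is a one-line normalization (function name ours):

```python
def intervention_benefit(r_alg, r_noact, r_tw):
    """IB as defined above: 0% corresponds to no interventions,
    100% to Threshold Whittle (TW)."""
    return 100.0 * (r_alg - r_noact) / (r_tw - r_noact)
```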
Earth mover's distance (EMD) is a metric that computes the minimum cost required to transform one probability distribution into another [36]. We use it to compare algorithms with respect to fairness, i.e., how evenly a set of pulls is allocated among arms. (Other metrics that may measure individual distributive fairness are discussed in Appendix F.2.) For each algorithm, we consider a discrete distribution D of observed pull counts, where each bucket j ∈ {0, ..., T} corresponds to a feasible number of total pulls that an arm could receive, and D[j] ∈ {0, ..., N} corresponds to the number of arms whose observed pull count is equal to j. For example, D[0] corresponds to the number of arms never pulled, and D[T] corresponds to the number of arms pulled at every timestep. Each algorithm produces kT total pulls, so the distributions have the same total mass.
We use Round-Robin as a fair reference algorithm since it distributes pulls evenly among arms. We then compute the minimum cost required to transform each algorithm's distribution, D_ALG, into that of Round-Robin's, D_RR.
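Because the pull-count histograms share the same ordered buckets and equal total mass, the EMD against the Round-Robin reference reduces, with unit ground distance, to an L1 distance between cumulative sums. A minimal sketch (names ours):

```python
from itertools import accumulate

def emd_1d(hist_alg, hist_ref):
    """EMD between two histograms over the same ordered buckets with
    equal total mass: the L1 distance between cumulative sums."""
    assert sum(hist_alg) == sum(hist_ref), "distributions must have equal mass"
    return sum(abs(a - b)
               for a, b in zip(accumulate(hist_alg), accumulate(hist_ref)))
```

For example, moving two arms' worth of mass two buckets away costs 4 units.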
For our application, this is equivalent to the ℓ1 distance between the cumulative sums of D_ALG and D_RR. Unless otherwise noted, we normalize EMD such that the maximum distance we encounter, that of TW, is one: EMD_norm(D_ALG) = EMD(D_ALG, D_RR) / EMD(D_TW, D_RR).
Datasets: We evaluate performance on two datasets: (a) a realistic patient adherence behavior model and (b) a general set of randomly generated synthetic transition matrices.
CPAP Adherence. Obstructive sleep apnea (OSA) is a common condition that causes interrupted breathing during sleep [30]; when used throughout the entirety of sleep, continuous positive airway pressure (CPAP) therapy eliminates nearly 100% of obstructive apneas for the majority of treated patients [37]. However, poor adherence behavior in using CPAP reduces its beneficial outcomes; non-adherence affects an estimated 30-40% of patients [35].
We derive the CPAP dataset that we use in our experiments from the work of Kang et al. [16,17], who model the dynamics and patterns of patient adherence behavior as a basis for designing effective and economical interventions. In particular, we adapt their Markov model of CPAP adherence behavior (a three-state system based on hours of nightly CPAP usage) to a two-state system using the clinical standard for adherence: at least four hours of CPAP machine usage per night [37]. Kang et al. [16] find, via expectation-maximization on CPAP usage patterns, that patients can be divided into two groups based on this clinical standard. Patients in the first cluster typically use a CPAP machine for more than four hours every night without assistance, though they occasionally miss a night, while patients in the second cluster do not. We refer to the latter cluster as the non-adherent cohort in our analysis.
Kang et al. [17] consider many intervention effects. We specifically consider an intervention effect of 1.1, which broadly characterizes supportive interventions such as telemonitoring and phone support; these are associated with a moderate 0.70-hour (95% CI ± 0.35) increase in device usage per night [1]. We add random logistic noise (scale 1) to the transition matrices so that there is some variance in individual arm dynamics. To prevent overlap with the general cohort we consider for contrast, the added noise can only reduce the probability of adherence in the non-adherent cohort.
Synthetic.In addition, we construct a synthetic dataset of randomly generated arms such that the structural constraints outlined in Section 2 are preserved.We conjecture that forward (reverse) threshold-optimal arms are a subset of concave (strictly convex) arms (see Appendix F.3).

ProbFair vs. Fairness-aware Alternatives
Here we compare ProbFair to policies which either induce or guarantee fairness. The former includes Risk-Aware Whittle (RA-TW), which incentivizes fairness via a concave reward function r(b) [25]. We use the authors' suggested reward function, r(b) = −e^{λ(1−b)} with λ = 20, which imposes a large negative utility on lower belief values and thereby motivates preemptive intervention. However, RA-TW does not guarantee time-indexed or probabilistic fairness for individual arms. The latter includes Round-Robin and the H_First, H_Last, and H_Rand heuristics, which guarantee time-indexed fairness but do not provide any optimality guarantees.
In Table 1, we report average results for each policy, along with margins of error for 95% confidence intervals, computed over 100 simulation seeds for a synthetic cohort of 100 collapsing arms, with k = 20 and T = 180. To facilitate meaningful comparisons between ProbFair and the heuristics, we consider combinations of values for ℓ and ν that produce equivalent, integer-valued lower bounds on the number of pulls any arm can expect to receive, i.e., a matching minimum expected pull count.

ProbFair on a Breadth of Cohorts
In this section we conduct a sensitivity analysis with respect to cohort composition. For each dataset, we identify a transition matrix characteristic that can be modified during the generation process to produce a subset of arms that will exhibit less favorable transition dynamics than their peers. For the synthetic dataset, this characteristic is strict convexity. For the CPAP dataset, it is non-adherence, a label coined by Kang et al. [16] to characterize a cluster of study participants, which we contrast with a model fit on the general patient population.
For each dataset, we generate ten different cohorts, each of which is characterized by the percentage of unfavorable arms that it contains. We use a seed to control the generation process such that each cohort contains 100 collapsing arms in total; a sliding window of the unfavorable arms we can generate with this seed is included as we increase the cardinality of the unfavorable subset. For ease of interpretation, we present unnormalized results over 100 simulation seeds with k = 20 and T = 180 in Figure 2, and then proceed to summarize normalized performance. Key findings from this experiment include:
• Per Figure 2, for each dataset, expected total reward predictably declines for all policies as the percentage of unfavorable arms increases, while unnormalized EMD increases for TW and ProbFair.
– Synthetic: As the proportion of strictly convex arms increases, ProbFair's allocation of resources tends towards the bimodality of TW.
– CPAP: As the proportion of non-adherent arms increases, the level of intervention required to improve trajectories rises, but the budget constraint is static.
• For each dataset, ProbFair's normalized performance remains stable even as cohort composition is varied:
– Synthetic: With respect to IB (EMD), ProbFair achieves an average (over all cohorts) of averages (over 100 simulations per cohort) of 80.69% ± 1.42% (58.98% ± 1.29%).
– CPAP: The corresponding values for IB (EMD) are 79.84% ± 0.68% (59.68% ± 1.08%).

ProbFair: Price of State Agnosticism
Here, we investigate the cost associated with ProbFair's state agnosticism, relative to state-aware Threshold Whittle. To ensure a fair comparison, we set ℓ = 0 and u = 1, effectively constructing a version of ProbFair in which probabilistic fairness is not enforced.
(Recall that TW is fairness-agnostic; in the previous results, we do not expect ProbFair to obtain the same total reward as TW.)
Although ProbFair incorporates each arm's structural information (i.e., transition matrices), it produces a set of stationary probability distributions over actions from which all discrete actions are subsequently drawn. TW, in contrast, ingests each arm's current state at each timestep, and is thus able to exploit realized sequences of state transitions.
While we thus expect ProbFair to incur some loss in intervention benefit, our results (computed over 100 simulation seeds, with k = 20, N = 100, and T = 180) indicate that this loss is acceptable rather than catastrophic. Relative to TW, ProbFair with ℓ = 0 obtains 97.41% ± 0.26% of E[IB] and incurs an increase of only 4.56% ± 0.19% with respect to E[EMD].

CONCLUSION AND FUTURE WORK
In this paper, we introduce ProbFair, a novel, probabilistically fair algorithm for constrained resource allocation. Our theoretical results prove that this policy is reward-maximizing, subject to the guaranteed satisfaction of both budget and tunable probabilistic fairness constraints. Our empirical results demonstrate that ProbFair preserves utility while providing fairness guarantees. Promising future directions include: (1) extending ProbFair to address larger state and/or action spaces; and (2) relaxing the requirement for stationarity in the construction of the probability distribution over actions.

A NOTATION
In Table 2, we present an overview of the notation used in the paper. [N] denotes the set {1, 2, . . . , N}.

Policy. A policy is a function π : S → A. The set of optimal policies is Π*, with π* ∈ Π*. At most k arms can be pulled at any timestep t.

Objective functions. The objective is to find a policy π* = argmax_π E_π[R(·)].

Fairness-motivated constraint functions. Integer periodicity guarantees arm i is pulled at least once within each period of ν timesteps.

B EMPIRICAL INEQUITY IN THE DISTRIBUTION OF ACTIONS UNDER WHITTLE INDEX POLICIES
Here, we present numerical results confirming the findings of Prins et al. [29] that Threshold Whittle (TW) tends to allocate pulls according to a bimodal distribution: a small subset of arms is pulled frequently, while the others are largely ignored.
Experimental Setup: For each iteration, we generate N = 2 forward threshold-optimal arms and run TW over a T = 365-timestep horizon simulation with budget constraint k = 1. We run 1,000 such iterations.
Results: In 515 out of 1,000 (51.5%) simulations, the arms' Whittle indices never overlap, meaning that for any combination of initial states, state transitions, and pulls, TW would pull one arm for all timesteps t ∈ [T] and completely ignore the second arm. We visualize one such case in Figure 3. Recall that TW precomputes the infimum subsidy m per arm and belief combination. Since belief is a function of last known state s ∈ {0, 1} and time-since-seen u ∈ [T] (using the notation of Mate et al. [24]), we plot the infimum subsidy of each arm-state combination with time-since-seen, u, on the x-axis. There exists a horizontal line that divides the two arms, so arm i = 1 will be pulled for every timestep and arm i = 2 will never be pulled.
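The "never overlap" condition above amounts to a simple dominance check over the two arms' precomputed index values; a minimal sketch (function name ours):

```python
import numpy as np

def separable_by_horizontal_line(idx1, idx2):
    """True if one arm's Whittle indices dominate the other's over all
    belief states, i.e., the two index sets can be split by a horizontal
    line, so Threshold Whittle would never pull the dominated arm."""
    idx1, idx2 = np.asarray(idx1), np.asarray(idx2)
    return idx1.min() > idx2.max() or idx2.min() > idx1.max()
```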

Figure 3: The Whittle index values for Arms 1 and 2 can be separated by a horizontal line, meaning that (WLOG) Arm 1 will always be chosen over Arm 2 because its index value dominates.
In order to modify the Whittle index to guarantee time-indexed fairness constraint satisfaction, one would need to ensure that no such horizontal line exists. Additionally, if we consider a specific form of time-indexed fairness known as an integer periodicity constraint, which allows a decision-maker to guarantee that arm i is pulled at least once within each period of ν days, the lines associated with the arms in Figure 3 must cross before ν timesteps elapse to guarantee fairness constraint satisfaction.
Another perspective is to ask: what is the smallest interval ν_i for each arm i that we could have specified such that Threshold Whittle would have satisfied the integer periodicity constraint? Note that this is retrospective, as there is no way to enforce this constraint at planning time. We visualize the minimum such ν_i in Figure 4. On the far right, we see the 515 cases where (WLOG) the second arm is never pulled; that is, the minimum ν_i such that Threshold Whittle satisfies the hard integer periodicity constraint must be larger than the horizon, T = 365. There is one case where arm i = 2 is pulled exactly once. In the majority of the remaining simulations, TW pulls each arm with approximately equal frequency.
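The retrospective quantity above, the smallest ν_i a given pull sequence satisfies, can be computed from the gaps between consecutive pulls; a sketch under a sliding-window reading of integer periodicity (helper name ours):

```python
def min_satisfied_interval(pulls, horizon):
    """Smallest nu such that every window of nu consecutive timesteps
    contains at least one pull; returns horizon + 1 if the arm is never
    pulled (i.e., no feasible nu exists). Timesteps are 0-indexed."""
    times = sorted(pulls)
    if not times:
        return horizon + 1
    gaps = [times[0] + 1]                              # before the first pull
    gaps += [b - a for a, b in zip(times, times[1:])]  # between pulls
    gaps.append(horizon - times[-1])                   # after the last pull
    return max(gaps)
```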

C MOTIVATING A FOCUS ON DISTRIBUTIVE FAIRNESS
In Section 3.2 and Appendix B, we discuss and empirically demonstrate how Whittle index-based policies produce bimodal distributive outcomes in which a small subset of arms receives all available interventions, while the rest of the arms are ignored. Here, we note that there are many application domains where such skewed resource allocation may be perceived as unfair or undesirable by beneficiaries and decision-makers, thus motivating efforts to incentivize or guarantee distributive fairness.
In the healthcare examples we present in the main paper (i.e., where interventions support medication or CPAP adherence), resource constraints and variation in transition dynamics interact. A practical consequence is that a majority of patients will never receive the beneficial intervention(s). This, in turn, means that their clinical outcomes will be strictly worse in expectation than they would be under a policy that guaranteed a non-zero probability of receiving the intervention at each timestep.
These considerations are not restricted to healthcare contexts. In poaching prevention, the planner must be cognizant of the effect that choosing not to patrol a particular area (i.e., not to pull an arm) will have on poacher strategies [31]. In wireless scheduling, multiple processes compete to transmit packets over a shared wireless channel. When scheduling the transmissions, the agent must not only maximize reward but also ensure the performance of critical applications codified in Quality of Service (QoS) guarantees [13, 20, 22].

D INTRACTABILITY OF ALTERNATIVE APPROACHES
In this section, we motivate the algorithmic design choices we have made when constructing ProbFair by discussing the feasibility of possible alternatives, including: (1) direct modification of the Whittle index to guarantee time-indexed fairness constraint satisfaction, and (2) a math programming-based approach.

D.1 Why not modify the Whittle index to guarantee time-indexed fairness constraint satisfaction?
In Section 3.3, we demonstrate that it is not possible to guarantee time-indexed fairness when arms are decoupled. If arms cannot be decoupled, the tractability of a Whittle index-based approach breaks down. Here, we discuss this topic in greater detail and provide specific examples of possible Whittle index modifications. We also take this opportunity to emphasize that our focus is on guaranteeing fairness rather than incentivizing it; Mate et al. [25] provide an example of the latter, which we discuss and include as a comparison algorithm in Section 5.2.
To begin, recall that the efficiency of Whittle index-based policies stems from our ability to decouple arms when we are only concerned with maximizing total expected reward [41, 42]. However, guaranteeing time-indexed fairness (as defined in Section 2) in the planning setting requires time-stamped record keeping: it is no longer sufficient to compute each arm's infimum subsidy in isolation and order the resulting set of values. Instead, for an optimal index policy to remain efficiently computable, it must be possible to modify the value function (Equation 3.1) so that the infimum subsidy each arm would require in the absence of fairness constraints is minimally perturbed, via augmentation or "donation", so as to maximize total expected reward while ensuring the arm's own fairness constraint satisfaction or that of other arms, respectively, all without requiring input from other arms.
Plausible modifications include altering the conditions under which an arm receives the subsidy associated with passivity, m, or introducing a modified reward function, r'(·), that is capable of accounting for an arm's fairness constraint satisfaction status in addition to its state at time t. For example, we might use an indicator function to "turn off" the subsidy until arm i has been pulled at least once within the interval in question, or increase reward as an arm's time-since-pulled value approaches the interval cut-off, so as to incentivize a constraint-satisfying pull. When these modifications are viewed from the perspective of a single arm, they appear to have the desired effect: if no subsidy is received, it will be optimal to pull for all belief states; similarly, for a fixed m, as reward increases it will be optimal to pull for an increasingly large subset of the belief state space.
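A hedged sketch of the two modifications described above, an indicator-gated subsidy and a reward shaped by time-since-pulled (the linear bonus term and all names are illustrative assumptions, not the paper's formulation):

```python
def subsidy_indicator(m: float, pulled_this_interval: bool) -> float:
    """'Turn off' the passivity subsidy m until the arm has been pulled
    at least once within the current interval."""
    return m if pulled_this_interval else 0.0

def modified_reward(state: int, time_since_pulled: int, nu: int,
                    pulled_this_interval: bool, bonus: float = 1.0) -> float:
    """Base reward r(s) = s, plus a shaping term that grows as
    time-since-pulled approaches the interval cut-off nu, incentivizing
    a constraint-satisfying pull (hypothetical linear shaping)."""
    r = float(state)                               # base reward r(s) = s
    if not pulled_this_interval:
        r += bonus * (time_since_pulled / nu)      # grows toward the cut-off
    return r
```

As the surrounding discussion notes, such per-arm modifications cannot by themselves reorder the subsidies across arms, which is why they fail to guarantee constraint satisfaction.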
Recall, however, that structural constraints ensure that when an arm is considered in isolation, the optimal action will always be to pull. Whether or not arm i is actually pulled at time t depends on how the infimum subsidy, m, it requires to accept passivity at time t compares to the infimum subsidies required by other arms. Thus, any modification intended to guarantee time-indexed fairness constraint satisfaction must be able to alter the ordering among arms, such that any arm i which would otherwise have a subsidy with rank > k when sorted in descending order will now be in the top-k arms. Even if we were able to construct such a modification for a single arm without requiring system state, if every arm had this same capability, then a new challenge would arise: we would be unable to distinguish among arms, and arbitrary tie-breaking could again jeopardize fairness constraint satisfaction.
If it is not possible to decouple arms, then we must consider them in tandem. Papadimitriou and Tsitsiklis [28] prove that the RMAB problem is PSPACE-complete even when transition rules are action-dependent but deterministic, via reduction from the halting problem.

D.2 Why not use a math-programming approach?
Our constrained maximization problem can be readily formulated as an integer program (IP) with a totally unimodular (TU) constraint matrix. However, this approach is intractable because the objective function coefficients of the IP cannot be efficiently enumerated. To support this intractability claim, we begin by presenting an IP that maximizes total expected reward under both budget and time-indexed fairness constraints, for problem instances with feasible hyperparameters. We then prove that any such instance yields a totally unimodular constraint matrix, which ensures that the linear program (LP) relaxation of our IP yields an integral solution. We proceed to demonstrate that tractability issues arise because we incur an exponential dependency on the time horizon, T, when constructing the IP's objective function coefficients. We conclude by comparing ProbFair to the IP for small values of N and T.

D.2.1 Integer Program Formulation.
To leverage a math programming approach for our constrained reward maximization task, we seek to construct an integer program (IP) whose solution is the policy x ∈ {0, 1}^(N × |A| × T). We require this policy to be reward-maximizing, subject to the guaranteed satisfaction of both budget and time-indexed fairness constraints. To begin, let each decision variable x_{i,a,t} ∈ {0, 1} represent whether or not we take action a ∈ A = {0, 1} for arm i ∈ [N] at time t ∈ [T]. Then, let each objective function coefficient c_{i,a,t} represent the expected reward associated with that arm-action-timestep combination.
To formalize the objective function, recall that the agent seeks to maximize total expected reward, E_π[R(·)]. For clarity of exposition, we specifically consider the linear global reward function R(s(t)) = Σ_{i=1}^{N} Σ_{t=0}^{T} s_i^t. Note that this implies a discount rate β = 1; however, the approach outlined here can be extended in a straightforward manner to β ∈ (0, 1). We defer the computation of the expected reward associated with taking action a for arm i at time t to the enumeration of the objective function coefficients below. Within the context of this IP, the time-indexed fairness constraint we introduce in Section 2 can be defined more specifically as either an integer periodicity or a minimum selection fraction constraint. We formalize each of these below. The integer periodicity constraint allows a decision-maker to guarantee that arm i is pulled at least once within each period of ν timesteps. We define this constraint as a function f over the vector of actions a_i associated with arm i and a user-defined interval length ν ∈ [1, T]. The minimum selection fraction constraint introduced by Li et al.
[20] forces the agent to pull arm i at least a minimum fraction, η ∈ (0, 1), of the total number of timesteps, but is agnostic to how these pulls are distributed over time. We define this constraint, f', as a function over the vector of actions a_i associated with arm i and a user-defined η. The resulting integer program maximizes Σ_{i,a,t} c_{i,a,t} x_{i,a,t} subject to: (a) exactly one action per arm per timestep; (b) a time-indexed fairness constraint; and (c) the budget constraint (Equation 16). We now prove that the IP we have formulated in Section D.2.1 has an attractive property: namely, any feasible problem instance will produce a totally unimodular constraint matrix. Our proof leverages a theorem introduced by Ghouila-Houri, restated below for convenience, which can be used to determine whether a matrix A ∈ Z^(m × n) is totally unimodular:

Lemma D.1 (Ghouila-Houri [11]). A matrix A ∈ Z^(m × n) is totally unimodular (TU) if and only if for every subset of the rows R ⊆ [m], there is a partition R = R_1 ∪ R_2 such that the difference of the column-wise sums over R_1 and R_2 has every entry ∈ {−1, 0, 1}.

Theorem D.2. Within the context of the integer program outlined in Appendix D.2.1, any feasible problem instance will produce a constraint matrix that is totally unimodular (TU).
Proof. To begin, we establish the dimensions of any such constraint matrix A and note the maximum possible column-wise sum that each of its component submatrices may contribute. Note that the minimum selection fraction constraint (b.ii in Equation 16), which requires the agent to pull each arm i at least a minimum fraction, η ∈ (0, 1), of T rounds, can be thought of as a special case of the integer periodicity constraint (b.i in Equation 16), where ν = T and each arm must be pulled at least ⌈ηT⌉ times. As such, we assume that at most one of the time-indexed fairness constraints can be selected, and focus on the more general of the two, which is the integer periodicity constraint. For notational convenience, we refer to constraints by their alphabetic identifiers. Let (b) represent the integer periodicity constraint, and define a function φ that maps each row of A to its corresponding constraint type ∈ {a, b, c}.
First, recall that each x_{i,a,t} represents a single binary decision variable, and corresponds to a column of A. There are N × |A| × T such columns. Next, note that constraint (a) enforces the requirement that we select exactly one action per arm per timestep. Formally, ∀(i, t) ∈ [N] × [T], ∃! a ∈ A s.t. x_{i,a,t} = 1; correspondingly, ∀a' ∈ A \ {a}, x_{i,a',t} = 0. The column vectors of the associated submatrix, A_a, are indexed by disjoint (i, a, t) ∈ [N] × A × [T]; thus, each column vector contains a single non-zero entry, and for any R_a ⊆ A_a, taking the column-wise sum will yield a vector v ∈ Z^(N|A|T) with every entry ∈ {0, 1}. In a similar vein, equity constraint (b) enforces the requirement that we must pull each arm i at least once during each interval of length ν_i. Within the associated submatrix, each column that corresponds to a passive action (e.g., x_{i,a=0,t}) will have only zero-valued entries, since passive-action decision variables are not impacted by constraint (b). Conversely, each column that corresponds to an active action (e.g., x_{i,a=1,t}) will have a single non-zero entry, since each active-action column can be mapped to exactly one interval. Thus, for R_b ⊆ A_b, taking the column-wise sum will yield a vector v ∈ Z^(N|A|T) with every entry ∈ {0, 1}.
The budget constraint (c) enforces the requirement that we must pull exactly k of the N arms at each timestep. Much like equity constraint (b), only columns corresponding to active actions are impacted. Thus, within the associated submatrix, A_c, each column that corresponds to a passive action (e.g., x_{i,a=0,t}) will have only zero-valued entries, while each column that corresponds to an active action (e.g., x_{i,a=1,t}) can be mapped to a single timestep and will have a single non-zero entry. Thus, for R_c ⊆ A_c, taking the column-wise sum also yields a vector v ∈ Z^(N|A|T) with every entry ∈ {0, 1}. The complete constraint matrix A thus contains NT + Σ_i ⌈T / ν_i⌉ + T rows. Three possible cases arise when we consider every subset of these rows; the partition-construction procedure (pseudocode in the original listing, which did not survive extraction) iterates over the rows of each constraint type, tracks the non-zero entries per column, and assigns each row to R_1 or R_2 accordingly. Taking the column-wise sums of the resulting R_1 and R_2 yields two vectors v_1, v_2 ∈ Z^(N|A|T), which can contain entries ∈ {0, 1} and ∈ {0, 1, 2}, respectively. Note that since v_2 is constructed by taking only rows with constraint types ∈ {b, c}, only entries corresponding to active-action columns can take values > 1. Moreover, the partition ensures that every entry of v_1 − v_2 lies in {−1, 0, 1}, satisfying the Ghouila-Houri condition. □
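For small instances, total unimodularity can also be verified directly via the determinant characterization (every square submatrix has determinant in {−1, 0, 1}) rather than the Ghouila-Houri condition; a brute-force sketch on a toy instance with N = 2 arms, T = 2, restricted to the active-action columns (the matrix and function name are our own illustration, not the paper's instance):

```python
import itertools
import numpy as np

def is_totally_unimodular(A: np.ndarray, tol: float = 1e-9) -> bool:
    """Brute-force TU check (tiny matrices only): every square submatrix
    must have determinant in {-1, 0, +1}."""
    m, n = A.shape
    for size in range(1, min(m, n) + 1):
        for rows in itertools.combinations(range(m), size):
            for cols in itertools.combinations(range(n), size):
                d = float(np.linalg.det(A[np.ix_(rows, cols)]))
                r = round(d)
                if abs(r - d) > tol or r not in (-1, 0, 1):
                    return False
    return True

# Toy instance: active-action variables ordered
# (arm1, t1), (arm1, t2), (arm2, t1), (arm2, t2).
A = np.array([
    [1, 0, 1, 0],   # (c) budget constraint at t = 1
    [0, 1, 0, 1],   # (c) budget constraint at t = 2
    [1, 1, 0, 0],   # (b) integer periodicity, arm 1, nu = 2
    [0, 0, 1, 1],   # (b) integer periodicity, arm 2, nu = 2
])
```

This toy matrix is the incidence matrix of a bipartite graph (timestep rows vs. arm rows), a classic TU family, so the check passes; a 3-cycle incidence matrix, by contrast, fails.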

Enumeration of Objective Function Coefficients.
The key challenge we encounter when we seek to enumerate the IP outlined in Section D.2.1 is that exact computation of the objective function coefficients, c ∈ R^(N|A|T), is intractable. Each arm contributes |A| × T coefficients, and while the calculation is trivially parallelizable over arms, we must consider a probability tree like the one in Figure 5 for each arm.
The number of terms required to enumerate each arm's probability tree is of order O(|A| |S|^T), and there are N such trees, so even a linear program (LP) relaxation is not tractable for larger values of N and T. This motivates us to propose ProbFair (Section 4) as an efficient alternative.
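To make the blow-up concrete, computing even one coefficient exactly requires summing over every realizable state sequence; a minimal sketch for a hypothetical two-state arm with r(s) = s, tractable only for tiny T:

```python
import itertools

def prob_adherent_at_T(P0, P1, actions, s0=1):
    """Enumerate the full probability tree, all |S|^T state sequences, to
    compute the probability the arm ends in the adherent state 1 after
    executing the given action sequence from initial state s0."""
    result = 0.0
    for path in itertools.product([0, 1], repeat=len(actions)):
        p, s = 1.0, s0
        for a, s_next in zip(actions, path):
            P = P1 if a == 1 else P0     # action-dependent transitions
            p *= P[s][s_next]
            s = s_next
        if s == 1:
            result += p
    return result
```

With |S| = 2, the loop visits 2^T paths per coefficient, so at T = 180 exact enumeration is hopeless, which is precisely the intractability claimed above.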

D.2.4 Comparison of ProbFair with the True Optimal Policy.
In Section 5, we normalize intervention benefit using Threshold Whittle, which is asymptotically optimal for forward threshold-optimal transition matrices under a budget constraint k [24]. However, with the integer program (IP) we formulate in Section D.2.1, we can find the optimal policy for any set of transition matrices under budget and fairness constraints, as long as N and T are small. We generate N = 2 random arms such that the structural constraints outlined in Section 2 are satisfied, and set k = 1 and T = 6. Though the variance in reward is large due to the small T, Figure 6 shows that ProbFair obtains 100% of the intervention benefit when no fairness constraints are applied. Similarly, ProbFair with ℓ = 0.33 obtains the same adherence behavior as the IP policy under a hard integer periodicity constraint ν = 3 or a minimum selection fraction constraint η = 0.33 (within the 95% confidence interval shown). All results shown are bootstrapped over 500 iterations.
Minimum Selection Fraction Constraints. As we discuss in Appendix B, the optimal policy is often to pull the same k arms at every timestep and ignore all other arms. Under minimum selection fraction constraints (Equation 15), each arm must be pulled at least a minimum fraction η of T rounds, with no conditions on when these pulls should take place. Using the optimal IP implementation, we confirm our intuition that these additional pulls are allocated at the beginning or end of the simulation. That is, the optimal policy under minimum selection fraction constraints takes advantage of the finite time horizon, which is not suitable for the applications we consider.
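The two time-indexed fairness constraints can be expressed as Boolean checks over a realized action vector; a sketch (function names are ours; the final, possibly shorter, period is also checked):

```python
import numpy as np

def integer_periodicity_ok(actions, nu: int) -> bool:
    """f_nu: the arm is pulled at least once within each consecutive
    length-nu period of the horizon."""
    a = np.asarray(actions)
    return all(a[s:s + nu].sum() >= 1 for s in range(0, len(a), nu))

def min_selection_fraction_ok(actions, eta: float) -> bool:
    """f'_eta (Li et al.): the arm is pulled at least an eta fraction of
    the T timesteps, with no restriction on when the pulls occur."""
    a = np.asarray(actions)
    return bool(a.sum() >= np.ceil(eta * len(a)))
```

Note that a vector satisfying the minimum selection fraction can still concentrate all its pulls at the start or end of the horizon, the behavior the IP experiment above exhibits.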

E PROBFAIR: A PROBABILISTICALLY FAIR POLICY
Our main contribution is the novel probabilistic policy ProbFair (see Section 4.1). Here, in Section E.1, we present complete proofs of Theorems 4.1-4.4. Then, in Section E.2, we provide additional details about how we sample from our probabilistic policy to select discrete actions at each timestep.

E.1 Proofs
In this section, we provide proofs for the theorems introduced in Section 4.1, restating each theorem for convenience where relevant. The proof of Theorem 4.1 introduces notational shorthand for constants derived from the arm's transition matrix and shows that the sign of R''_i(p_i) equals the sign of a quantity that does not depend on p_i; concavity or strict convexity over all of p_i ∈ [0, 1] follows. The proof of Theorem 4.3 observes that, per our structural constraints, P^0_{1,1} < P^1_{1,1}, and uses this ordering to establish the upper bound. Problem P2 is then equivalent to finding a partition Y → Y_1 ∪ Y_2 ∪ Y_3 that maximizes the associated objective; subtracting the constant Σ_{i ∈ Y} R_i(ℓ) and simplifying, then setting Y*_3 = Y'_3, reduces the optimization problem in Eq. 22 to finding a partition over the remaining sets Y_1 and Y_2. At worst, Algorithm 2 requires two sorts, once on Line 5 and a second time on Line 8, for a total computational cost of O(|Y| log |Y|).
In total, the computational cost of Algorithm 1 is at worst O(N^3), attained when all N arms are in X.

E.2 Dependent Rounding Sampling Approach
Here we provide pseudocode for the sampling algorithm introduced in Section 4.2, along with its associated Simplify subroutine [39].
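The sampling procedure can be sketched as a Srinivasan-style dependent-rounding pass, which pairs fractional entries and rounds them while preserving both the marginals and the (integral) sum; this is an illustrative simplification of the referenced algorithm, not its exact pseudocode:

```python
import random

def dependent_rounding(p, rng=None, eps=1e-12):
    """Given marginal probabilities p with integral sum k, return a 0/1
    vector x with exactly k ones such that Pr[x_i = 1] = p_i (sketch of
    Srinivasan-style dependent rounding)."""
    rng = rng or random.Random(0)
    p = list(p)
    frac = [i for i, v in enumerate(p) if eps < v < 1 - eps]
    while len(frac) >= 2:
        i, j = frac[0], frac[1]
        a = min(1 - p[i], p[j])            # shift mass from j to i
        b = min(p[i], 1 - p[j])            # shift mass from i to j
        if rng.random() < b / (a + b):     # chosen so E[p_i] is preserved
            p[i], p[j] = p[i] + a, p[j] - a
        else:
            p[i], p[j] = p[i] - b, p[j] + b
        frac = [t for t in frac if eps < p[t] < 1 - eps]
    return [int(round(v)) for v in p]
```

Each iteration rounds at least one of the paired entries to 0 or 1, so the loop terminates after at most N pairings; the probability b / (a + b) is exactly what keeps each marginal Pr[x_i = 1] equal to p_i.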

F ADDITIONAL EXPERIMENTAL DETAILS
In this section, we discuss additional details of our empirical study in Section 5. We provide a description and pseudocode of the heuristic policies (F.1), discuss our choice of fairness metrics (F.2), and provide additional details of our Synthetic dataset (F.3). Code and instructions needed to reproduce these experimental results are available at: https://github.com/crherlihy/prob_fair_rmab. All results presented in this paper are bootstrapped over 100 simulation iterations, with a time horizon T = 180, cohort size N = 100, and budget k = 20, unless otherwise noted. We use seeds to ensure reproducible variation for each randomized parameter, including the realized transitions in each simulation. We ran simulations on an Intel(R) Core i7 CPU with 16 GB of RAM. Simulations are configurable via configuration files, and runs are trivially parallelizable across them.

F.1 Heuristic Algorithms
In Section 5, three heuristics based on the Threshold Whittle algorithm are introduced: H_First, H_Last, and H_Rand. Here, we go into more detail and provide pseudocode.
Definition F.1. Within the context of Algorithm 6, we define a constrained pull to be one that is executed to satisfy an integer periodicity constraint. Only arms that have not yet been pulled the required number of times within the ν-length interval are available; other arms are excluded from consideration, unless all arms have already satisfied their constraints, in which case all arms are available to be pulled. If a pull is not constrained, we say it is unconstrained or residual. The H_First heuristic requires that all constrained pulls occur at the start of the interval. This implies that the first N/k timesteps in each interval are dedicated to pulling all N arms.
The H_Last heuristic requires that all constrained pulls occur at the end of the interval. Unlike under the H_First heuristic, not all arms will necessarily be pulled in the last N/k timesteps, as some arms will have already satisfied their constraint earlier in the interval via unconstrained pull(s). These leftover constrained pulls function as unconstrained pulls, per Definition F.1.
The H_Rand heuristic chooses random positions within the interval for constrained pulls to occur. As with the H_Last heuristic, some of the later constrained pulls may become unconstrained pulls if all arms have already satisfied their constraint earlier in the interval.
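A minimal sketch of the H_First scheduling pattern described above (names and the "whittle" placeholder for residual pulls are ours; a real implementation would rank residual pulls by Whittle index):

```python
import math

def h_first_schedule(N: int, k: int, nu: int, T: int):
    """H_First sketch: the first ceil(N / k) timesteps of each length-nu
    interval are dedicated to constrained pulls covering all N arms
    (feasible only when nu >= ceil(N / k)); remaining timesteps defer to
    the Threshold Whittle ranking, shown here as a placeholder."""
    dedicated = math.ceil(N / k)
    schedule = []
    for t in range(T):
        pos = t % nu                      # position within the current interval
        if pos < dedicated:
            arms = list(range(pos * k, min((pos + 1) * k, N)))
            arms += ["whittle"] * (k - len(arms))   # leftover budget is residual
        else:
            arms = ["whittle"] * k        # unconstrained (residual) pulls
        schedule.append(arms)
    return schedule
```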

F.2 Fairness Metric Choices
It is not immediately obvious which evaluation metric(s) best indicate whether we have improved distributive fairness. While constraint satisfaction itself is a logical candidate, it is Boolean-valued at the arm level, and thus does not reflect to what extent a policy fairly allocates pulls. Even if we were to report population-level constraint satisfaction (i.e., by noting the proportion of arms for which a given fairness constraint is satisfied, either over the course of a single simulation, or in expectation over a set of simulation iterations), this would be tautologically biased in favor of ProbFair and the Threshold Whittle-based heuristics, which explicitly encode constraint satisfaction. This observation motivates us to consider proxy metrics, including the price of fairness (PoF), the Herfindahl-Hirschman Index (HHI), and the earth mover's distance (EMD).
Price of Fairness. Price of fairness is the relative loss in total expected reward associated with following a distributive-fairness-enforcing policy π, as compared to Threshold Whittle [4]: PoF = (E[R_TW] − E[R_π]) / E[R_TW]. A small loss (∼0%) indicates that fairness has a small impact on total expected reward; conversely, a large loss means total expected reward is sacrificed in order to satisfy the fairness constraints.
Lemma F.2. Price of fairness is inversely proportional to intervention benefit.
Proof. Both PoF and intervention benefit are affine functions of the fair policy's total expected reward E[R_π], with slopes of opposite sign, so PoF is strictly decreasing in IB; the claim follows directly from the definitions. □

Herfindahl-Hirschman Index (HHI). The Herfindahl-Hirschman Index (HHI) [34] is a statistical measure of concentration, useful for measuring the extent to which a small set of arms receives a large proportion of attention due to an unequal distribution of scarce pulls [12]. It is defined as the sum of squared shares of pulls across arms. HHI ranges from 1/N to 1; higher values indicate that pulls are concentrated on a small subset of arms. However, HHI is an imperfect evaluation metric for addressing our prioritarian concern for arms that would be deprived of algorithmic attention (i.e., fail to receive any pulls) under Threshold Whittle (see Appendix B). Since entries are squared, reducing u offers a more direct path to lowering HHI than increasing ℓ. However, reducing u will not accomplish our stated goal of guaranteeing each arm a strictly positive lower bound on the probability that it will receive a pull at any given timestep.

Earth Mover's Distance. The earth mover's distance (EMD), or Wasserstein metric, is a measure of distance between two distributions. Specifically, we measure the distance of an algorithm's distribution of cumulative pull allocations to a fair reference distribution, Round-Robin. Though differences in distances are meaningful, EMD does not directly map to our fairness desiderata. That is, a given level of fairness enforcement (e.g., as characterized by the hyperparameters ℓ or u) is not associated with a specific range of EMD values. Hence, our discussion of (normalized) earth mover's distances in Section 5 focuses on relative differences between policies.
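The three proxy metrics can be sketched directly from a policy's cumulative pull counts (the closed forms for HHI and the 1-D EMD follow the definitions above; function names are ours):

```python
import numpy as np

def price_of_fairness(reward_fair: float, reward_tw: float) -> float:
    """Relative loss in total expected reward vs. Threshold Whittle."""
    return (reward_tw - reward_fair) / reward_tw

def hhi(pull_counts) -> float:
    """Herfindahl-Hirschman Index: sum of squared shares of pulls;
    ranges from 1/N (uniform allocation) to 1 (a single arm)."""
    shares = np.asarray(pull_counts, dtype=float)
    shares = shares / shares.sum()
    return float((shares ** 2).sum())

def emd_to_round_robin(pull_counts) -> float:
    """1-D earth mover's distance between the empirical pull distribution
    and the uniform Round-Robin reference (sum of |CDF differences|)."""
    counts = np.asarray(pull_counts, dtype=float)
    p = counts / counts.sum()
    q = np.full_like(p, 1.0 / len(p))     # Round-Robin reference
    return float(np.abs(np.cumsum(p - q)).sum())
```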

F.3 Synthetic Dataset
Conjecture F.3. The set of forward (reverse) threshold-optimal arms is a subset of the set of concave (strictly convex) arms for the local reward function we consider, r(s) = s.

Figure 2: Expected total reward (left) and unnormalized earth mover's distance (right) on a breadth of cohorts.

Figure 4: The smallest interval ν_i such that TW satisfies an integer periodicity definition of time-indexed fairness, given N = 2 random arms. In over 50% of iterations, no such fairness constraint satisfaction is possible (i.e., ∃i s.t. ν_i > T).

Figure 6: Adherence of ProbFair, compared to the IP formulation.

Theorem 4.1. For every arm i ∈ [N], R_i(p_i) is either concave or strictly convex over all of p_i ∈ [0, 1].

Proof Sketch. WLOG, fix an arm i ∈ [N]. For notational convenience, define constants c_1, c_2, c_3, and c_4 derived from the arm's transition matrix; the sign of R''_i(p_i) then equals the sign of a quantity that does not depend on p_i. □

Theorem 4.2. For each arm i ∈ [N], the structural constraints introduced in Section 2 ensure that R_i(p_i) is monotonically non-decreasing in p_i over the interval [0, 1].

Figure 1: ProbFair: constructing and sampling from the stationary distribution over actions. (Figure 1 visualizes this construction; the remainder of this section provides technical details.)

We use the dependent rounding procedure detailed in Appendix E.2 to sample from this distribution at each timestep, such that the following properties hold: (1) with probability one, we satisfy the budget constraint by pulling exactly k arms; and (2) any given arm i is pulled with probability p_i. Formally, each time we draw a vector of binary random variables (X_1, X_2, . . . , X_N) from the distribution, Pr[|{i : X_i = 1}| = k] = 1 and Pr[X_i = 1] = p_i.

Table 1: Expected intervention benefit and normalized earth mover's distance by policy and fairness bracket.
Key findings from this experiment include:
• Fairer hyperparameter values (i.e., ℓ ↑, u ↓) correspond to decreases in E[IB] and E[EMD], reflecting improved individual fairness at the expense of total expected reward.
• ProbFair is competitive with RA-TW, outperforming it on both metrics when ℓ = 0.056, and incurring a slight loss in E[IB] but an improvement in E[EMD] for ℓ = 0.1.
• For each (ℓ, u) combination, ProbFair performs competitively with respect to the best-performing heuristic (which, like TW, is state-aware; see Section 5.4).
To compute the expected reward associated with taking action a for arm i at time t (cf. Appendix D.2.1), we must consider: (1) what state is the arm currently in (i.e., what is the realized value of s_i^t ∈ {0, 1})? and (2) when the arm transitions from s_t to s_{t+1} by virtue of taking action a, what reward, r(·), should we expect to earn? Because we define r(s) = s, (2) can be reframed as: what is the probability P(s^{t+1} = 1 | s^t, a^t) that action a causes a transition to the adherent state? Because each arm's state at time t is stochastic, depending not only on the sequence of actions taken in previous timesteps but also on the associated stochastic transitions governed by the arm's underlying MDP, each coefficient of our objective function must be computed as an expectation over the possible values of s_i^t ∈ S.