Ranking with Long-Term Constraints

The feedback that users provide through their choices (e.g., clicks, purchases) is one of the most common types of data readily available for training search and recommendation algorithms. However, myopically training systems based on choice data may only improve short-term engagement, but not the long-term sustainability of the platform and the long-term benefits to its users, content providers, and other stakeholders. In this paper, we thus develop a new framework in which decision makers (e.g., platform operators, regulators, users) can express long-term goals for the behavior of the platform (e.g., fairness, revenue distribution, legal requirements). These goals take the form of exposure or impact targets that go well beyond individual sessions, and we provide new control-based algorithms to achieve these goals. In particular, the controllers are designed to achieve the stated long-term goals with minimum impact on short-term engagement. Beyond the principled theoretical derivation of the controllers, we evaluate the algorithms on both synthetic and real-world data. While all controllers perform well, we find that they provide interesting trade-offs in efficiency, robustness, and the ability to plan ahead.


INTRODUCTION
Optimizing search and recommendation platforms based on feedback that users provide through their choices (e.g., clicks, purchases) has led to great improvements in ranking quality. However, myopically training systems based on choice data may only improve short-term engagement, but not the long-term sustainability of the platform and the long-term benefits to its users, content providers, and other stakeholders [22]. In particular, platforms operate as part of a complex socio-technical system, and many have argued how such AI systems can amplify misinformation [17], harm supply through rich-get-richer dynamics [33], incentivize spam [25], or perpetuate human biases [32].
In this complex space of problems and competing interests, we argue that improved tools for explicitly steering the long-term dynamics of the platform are needed. These tools should enable decision-makers to specify long-term goals for the search and recommendation algorithms beyond short-term engagement maximization. While approaches using reinforcement learning have the potential to directly optimize long-term goals, it is challenging to apply them to complex and large-scale information retrieval settings [1, 30]. End-to-end frameworks obscure important decision points [20], leading to problems like a lack of reproducibility [4, 13, 21], reward hacking [16, 37-39, 42], and user tampering and manipulation [8, 14, 15, 29]. Instead, we argue that providing designers with a novel macroscopic view will enable strategic reasoning about long-term platform dynamics and new tools for steering the platform. The key algorithmic challenge lies in bridging the gap between long-term goals at the macro level that span many requests, and the micro-level goal of maximizing engagement for each individual request.
In this paper, we develop a new class of macro-level interventions for steering the long-term dynamics of AI platforms, as well as the mechanisms for optimally executing these macro-level interventions. In our framework, further described in Sections 1.1 and 2, macro-level interventions take the form of exposure or impact targets over substantial periods of time (e.g., days, weeks, months). The macro-level interventions can come from various decision-makers, including the users themselves (e.g., "I want to buy at least 30% local products next month" on an e-commerce platform), the platform operator (e.g., "promote local music communities by serving at least 50% local artists on average" on the Localify music platform [31]), or regulators (e.g., the recent settlement between Meta and the Department of Housing and Urban Development (HUD) [44] that requires Meta to ensure that each housing-related ad is shown with demographic parity to all protected groups over the ad's lifetime). All such macro-level goals steer the behavior of the system over the course of many requests. This creates complex interactions between individual requests, their short-term metrics, and the long-term goals [35].
We address the key technical problem of designing algorithms which break down macro-level goals into a sequence of individual rankings that least hurt the micro-level metric (e.g., engagement). We view these algorithms as controllers which drive the value of macro-level metrics towards specified targets while responding to incoming requests in real time. In Section 3, we rigorously derive three controllers. The first is a baseline approach that satisfies the macro-level goals at a high cost to the micro-level utility. The second enables a finer trade-off between macro- and micro-level objectives. The final controller incorporates planning to handle requests coming from non-stationary distributions with temporal patterns. To clarify the design of these controllers and their affordances and limitations, we make interesting novel connections between ranking and concepts from online stochastic optimization [3] and model-predictive control [7]. Furthermore, we evaluate the controllers on a number of synthetic and real-world datasets in Section 4, which provides practical guidance on when the use of each controller is most appropriate.

Motivation & Related Work
We argue that one of the key challenges in steering the long-term dynamics of AI platforms results from a mismatch in time scales. Algorithms on these platforms typically optimize metrics pertaining to individual requests or sessions, while the dynamics we aim to control play out over weeks or months of repeated interactions. Optimization on a per-request basis is ill-suited for even expressing long-term objectives, much less for steering their dynamics. Instead, we argue that we need a novel macroscopic view to enable strategic reasoning about the long-term dynamics of the platform, in addition to the microscopic view that our methods currently focus on.
Existing work on incorporating long-term goals in search and recommendation has largely focused on fairness. Early works defining fairness criteria posed them as constraints on impact or exposure to be fulfilled within a single ranking [10, 41, 51]. Later work introduced a temporal perspective, including Celis et al. [9], who develop an online algorithm for recommending diverse viewpoints, and Morik et al. [36] and Usunier et al. [45], who present algorithms for satisfying fairness cumulatively over multiple rankings. More recently, reinforcement learning algorithms have been applied to long-term fairness in order to handle endogenous dynamics, i.e., the impact of a ranking decision on future utilities and constraints [19, 49, 53]. Beyond fairness, long-term exposure constraints have been motivated as an optimal strategy under the endogenous dynamics of content-provider viability [34, 52]. Many of these settings fit into the framework that we propose. However, unlike approaches which attempt to directly handle dynamics, we argue for elevating such strategic concerns to the definition of interventions.
Partitioning into microscopic and macroscopic views has proven essential in the control of other complex systems. For example, macroeconomic metrics like gross domestic product and unemployment rate describe the state of our economy as a whole, and we use these metrics to reason about its long-term dynamics. Macro-level interventions are used to influence these metrics, like the interest rates set by the Federal Reserve. Similar examples are also widespread in engineered systems, where, for example, the macroscopic control intervention of a self-driving car (e.g., turn left 10 degrees) hides the microscopic execution of this command (e.g., voltages going to the steering motors) behind a control system.
Figure 1 illustrates an analogous micro-macro view of AI platforms, where the macro-level metrics we aim to optimize are quantities like customer satisfaction, retention, polarization, or the size of the supplier pool. It is not hard to think of possible macro-level interventions either, like the rate with which the service interrupts users with push notifications, how aggressively to prune clickbait, or how much exposure to give to smaller suppliers.

Figure 1 (schematic): Macro-level interventions are translated into optimal & consistent micro-level interventions.

Enabling reasoning at the level of macro-level metrics and interventions opens the door for future investigations of platform dynamics. At the macro level, it will be far more tractable to understand how interventions affect the long-term metrics we aim to optimize. For example, establishing a causal model of how exposure allocation to small suppliers (a scalar) relates to the size of the supplier pool (another scalar) is considerably less complex than estimating a causal link between millions of rankings and supplier-pool size.
In this paper, we focus on macro-level interventions that represent constraints on exposure or impact. Exposure can be quantified by models like the position-based model (PBM) [12], which assigns a score to each position in a ranking, representing the probability of being viewed by the user. Impact can be directly measured by clicks or purchases. Adding such long-term constraints provides a rich new language for guiding system behavior, as illustrated by the following examples that can be implemented in our framework:
(1) Give local artists at least z percent of the overall exposure over the next month. (Item Group Exposure)
(2) Show new artist i to at least n users over the next week. (Single Item Exposure)
(3) Given a well-calibrated but imperfect spam filter, ensure that the expected exposure to spam across all users is less than c. (Item Group Exposure with Uncertain Group Membership)
(4) Do not send more than m push messages on average per week to user j. (Single User Exposure)
(5) Show each housing-related ad to protected groups with demographic parity over the lifetime of the ad. (Single Item / User Group Exposure)
(6) Support the goal of user j to buy at least 30% of products from local suppliers. (Single User Impact)
These examples show that macroscopic interventions can shape the aggregate experience of items over a given time span (first three), but also the aggregate experience of users (last three). The macroscopic interventions can provide constraints on the experience of a single user or item (2, 4, 6), on the collective experience of item or user groups (1, 3), or on the complex interaction between item and user groups (5). These interventions may be exact when we have precise knowledge of class membership, or they may be approximate and fulfilled only in expectation (e.g., based on the probability of an article being spam). Finally, in some cases the constraints act directly on exposure (first five), while in example six the user asks the system to support a particular impact goal [41] that includes the reactions of the users (e.g., clicks, purchases). Similar macro-level constraints are also relevant to other aspects of AI platforms, like ads [46, 48, 54, 55] and ad-pacing [2, 47] in particular.

MACRO/MICRO CONTROL PROBLEM
We now formalize the problem of translating a set of macro-level interventions into a sequence of micro-level actions. We model this translation as a problem of optimal control which seeks to ensure that the micro-level actions achieve the desired macro-level interventions in aggregate. This is analogous to how control is used in mechanical systems to provide layers of abstraction with clear semantics. Controllers are used to keep a system in desirable states even under external perturbations and incomplete knowledge of the dynamics (e.g., keep a plane in level flight). Under bounds on worst-case conditions, controllers can be proven to be stable, safe, and performant [5, 40]. Furthermore, the control perspective has already proven useful for other aspects of online systems [23, 27, 47, 48, 50, 55].
The macro/micro control problem takes the following form. At each time step t from 1 to the final time step T, a new context x_t arrives. Each context x_t is drawn independently from some unknown and possibly shifting sequence of distributions P_1, ..., P_T. At the micro level, our goal is to derive a ranking policy that selects an action a_t (i.e., a ranking) for each context so as to maximize a micro-level ranking metric U(a_t | x_t) (e.g., Discounted Cumulative Gain (DCG) [24]). Thus, over all contexts x_1, ..., x_T, our policy should choose actions a_1, ..., a_T which achieve a large cumulative value:

Σ_{t=1..T} U(a_t | x_t).    (1)

However, unlike conventional ranking policies, ours must also consider macro-level goals in addition to the micro-level utility. In particular, the policy needs to fulfill K constraints that range over all time steps from 1 to T:

Σ_{t=1..T} C(a_t | x_t) ≥ G,    (2)

where C(a | x) ∈ R^K measures the progress that action a makes towards each of the K goals, and G ∈ R^K is the vector of targets. Because the constraints range over the full horizon, we cannot directly evaluate them at intermediate time steps, and must pick the current action under partial information.

To address this problem, we first introduce a state s_t that reflects the progress made towards fulfilling the constraints up to time t:

s_t = Σ_{τ=1..t} C(a_τ | x_τ).    (3)

Then, we consider ranking controllers of the form

a_t = Π(x_t, s_{t−1}, t).    (4)

The ranking a_t is chosen based on the context x_t, the state s_{t−1}, and the time t. The macro-level intervention is achieved when the constraints are fulfilled at the final time step T, which is equivalent to ensuring that the terminal state satisfies s_T ≥ G. The controller aims to reach this target state but must make decisions only on the basis of the current state and context, without exact knowledge of the future contexts. Fulfilling all constraints may not always be possible and, furthermore, it may not be desirable to arbitrarily sacrifice the micro-level objective from Equation (1). We thus consider soft constraints and define the macro-violation cost as

δ^⊤ (G − s_T)_+,    (5)

where δ ≥ 0 is a K-dimensional parameter vector that expresses how costly it is to violate each constraint. The "hinge loss" (·)_+ sets all negative components of the input vector to zero, so any dimension of the terminal state s_T that is above its target in G contributes zero to the macro-violation cost. The overall objective is the sum of the micro-level utilities minus the macro-violation cost at the final time step T. For a given sequence of contexts, this objective is

Σ_{t=1..T} U(a_t | x_t) − δ^⊤ (G − s_T)_+.    (6)

Note that this final objective can only be computed after time step T. Since the actions must be chosen sequentially, this setting has the form of a closed-loop control problem, where the controller can react to the state s_{t−1}. This control loop is summarized in Algorithm 1. We conclude our general setup with a discussion of metrics which depend on modeled vs.
observed feedback. When metrics are defined by a model (e.g., the position-based model of exposure), the result of any action a can be anticipated once the context x_t is observed. In contrast, metrics defined by observed feedback (e.g., clicks or hover time) are known only for the chosen action a_t and are observed only after the action is taken. The control loop illustrated in Algorithm 1 is valid whether the metrics defining U and C are modeled or observed. However, the action selection step defined by Π benefits from the ability to anticipate the effect of arbitrary actions. We therefore focus on modeled metrics in the controller development below; in other words, we take U(· | x_t) and C(· | x_t) to be known functions. We call this the full information setting, referring to the fact that the context provides sufficient information. In practice, controllers may operate on the basis of imperfect models (e.g., using a learned relevance predictor instead of true relevances), even if the closed-loop logic proceeds according to observed feedback, and better models may be learned interactively from this feedback. For clarity of exposition, however, we leave such scenarios to future work.
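The closed-loop structure summarized in Algorithm 1 can be sketched as follows; the function and variable names here are illustrative, not the paper's implementation:

```python
import numpy as np

def control_loop(controller, contexts, C, G, delta):
    """Closed-loop interaction (sketch of Algorithm 1).

    controller(x, s, t) returns a ranking; C(a, x) returns the K-dimensional
    progress vector of that ranking toward the targets G.
    """
    K = len(G)
    s = np.zeros(K)                      # state s_t: accumulated progress
    actions = []
    for t, x_t in enumerate(contexts, start=1):
        a_t = controller(x_t, s, t)      # react to state s_{t-1} and context
        s = s + C(a_t, x_t)              # state update with realized progress
        actions.append(a_t)
    # macro-violation cost delta^T (G - s_T)_+ charged at the final step
    violation_cost = delta @ np.maximum(G - s, 0.0)
    return actions, s, violation_cost
```

Any of the controllers developed in Section 3 can be plugged in as `controller`; only the action-selection rule changes, not the loop.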

Linear Utilities and Constraints
We now define a specific class of models for describing the relationship between an arbitrary action a and the macro- and micro-level objectives for a given context x. For ease of exposition, we will refer to the micro-level objective U(a | x) as "utility" and to each macro-level C_k(a | x) as "progress towards intervention k".
The utility depends on both the relevance of the items and their ranked positions. In the full information setting, the relevance of the items can be determined from the context x_t. Denote by r_{t,i} the relevance of item i at time t. Further, define a position-dependent weight u_p for each position p ∈ [n] (e.g., Discounted Cumulative Gain [24] uses u_p = 1 / log_2(p + 1)). Then an item i ranked in position p contributes r_{t,i} u_p to the utility. Denoting by rank(i | a_t) the position of item i under the ranking specified by a_t, the utility has a linear form:

U(a_t | x_t) = Σ_i r_{t,i} u_{rank(i | a_t)}.    (7)

Without loss of generality, we assume that u_p is non-increasing in position p, meaning that the higher an item is ranked, the more its relevance contributes to the utility.
In the absence of macro-interventions, a utility-maximizing action sorts the items in order of their relevance scores: a_t = argsort(r_t). However, our ranking controllers also consider progress towards the macro-level interventions, which takes the following linear form:

C_k(a_t | x_t) = Σ_i A_{t,k,i} e_{rank(i | a_t)}.

Above, A_{t,k,i} determines the contribution of item i to intervention k (e.g., an indicator of group membership). As with relevance, we assume the full information setting, so that the context x_t contains enough information to determine this quantity. The weight e_p is another position-dependent weight; it is not necessary to assume that it is equal to u_p, or even that it is non-increasing in p.
From here forward, we denote the parameters of the utility and progress functions by vectors and matrices: r_t ∈ R^n, u ∈ R^n, A_t ∈ R^{K×n}, and e ∈ R^n. We assume that the position weights u and e are known and that the context provides full information about utility and interventions, so that x_t = (r_t, A_t).
For the ranking controllers that we develop, it is convenient to represent rankings with permutation matrices. A permutation matrix has exactly one entry equal to 1 in each row and column, and 0 elsewhere. If Σ ∈ {0, 1}^{n×n} represents a ranking, then Σ_{i,j} = 1 means that item i is placed in position j. Using this notation, and identifying a_t with the corresponding permutation matrix Σ_t, the utility and exposure quantities can be written in compact matrix-vector notation:

U(a_t | x_t) = r_t^⊤ Σ_t u,   C(a_t | x_t) = A_t Σ_t e.

Since searching the discrete space of permutations can be computationally challenging, it is sometimes convenient to search over ranking distributions instead. This corresponds to considering policies represented by doubly stochastic matrices rather than permutation matrices. The set of doubly stochastic matrices is defined as

B = { Σ ∈ R^{n×n} : Σ ≥ 0, Σ 1 = 1, Σ^⊤ 1 = 1 }.

Given a doubly stochastic Σ, a ranking can be sampled via the Birkhoff-von Neumann decomposition [6, 41].
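The matrix-vector notation can be made concrete with a small numpy sketch; the relevance scores and the single group constraint below are made up for illustration:

```python
import numpy as np

n = 4                                    # items / positions
r = np.array([0.9, 0.5, 0.7, 0.2])       # relevance scores r_t (illustrative)
u = 1.0 / np.log2(np.arange(2, n + 2))   # DCG weights u_p = 1 / log2(p + 1)
e = 1.0 / np.arange(1, n + 1)            # reciprocal-rank exposure weights e_p = 1/p
A = np.array([[0.0, 1.0, 0.0, 1.0]])     # K = 1: items 2 and 4 form one group

# Permutation matrix for sorting by relevance: Sigma[i, j] = 1 iff item i
# is placed in position j.
order = np.argsort(-r)
Sigma = np.zeros((n, n))
Sigma[order, np.arange(n)] = 1.0

utility = r @ Sigma @ u      # U(a | x) = r^T Sigma u
progress = A @ Sigma @ e     # C(a | x) = A Sigma e
```

Replacing `Sigma` with any doubly stochastic matrix gives the expected utility and expected progress of the corresponding ranking distribution.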

CONTROLLERS FOR RANKING
We now introduce three controllers to address the macro/micro control problem. We begin with a baseline myopic controller that incurs a high reduction in micro-level engagement. Next, we use the lens of online optimization to introduce a controller appropriate for stationary context distributions, and we draw the connection to a previously proposed P-controller for ranking under fairness constraints [36]. Finally, we develop a more sophisticated predictive controller that can anticipate and plan for non-stationarities in the context distribution.

Myopic Controller (MC)
Actions must be chosen at every time step t without knowledge of future contexts. As a result, the controller cannot exactly optimize (6). A simple idea to address this issue is to define an intermediate objective at each time t:

Σ_{τ=1..t} U(a_τ | x_τ) − δ^⊤ ( (t/T) G − s_t )_+.    (8)

This intermediate objective scales the target G linearly by t/T and removes the effect of future time steps. Note that it is not equivalent to the original objective due to the nonlinearity of the hinge loss. Effectively, this objective treats every time step as if it were the final time step (albeit with a scaled target value). Since there are no future time steps to consider under this simplified objective, maximizing it to select the current action a_t is well defined. This leads to the following controller, which we call the Myopic Controller (MC):

Π_MC(x_t, s_{t−1}, t) = argmax_a U(a | x_t) − δ^⊤ ( (t/T) G − s_{t−1} − C(a | x_t) )_+.

This expression contains only the terms from objective (8) which affect the argmax. Notice that the maximizing action depends on the past actions and contexts only through the state s_{t−1}. Algorithm 2 presents the linear program (LP) implementation.
While the intermediate objective at each time step is simple, it is overly strict. Specifically, it charges the full violation cost at the current time step if the controller is unable to reach the scaled target (t/T) G. It thus ignores that the full violation cost is truly incurred only at the final time step, and that intermediate violations may cancel out before then. We can therefore expect this controller to behave very conservatively.
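To illustrate the MC objective (not the paper's LP implementation, which optimizes over doubly stochastic matrices), here is a brute-force sketch over permutations, feasible only for tiny n; variable names follow the text:

```python
import itertools
import numpy as np

def myopic_controller(r, u, A, e, s_prev, G, delta, t, T):
    """Myopic Controller (MC) sketch via exhaustive search.

    Maximizes U(a | x_t) - delta^T ((t/T) G - s_{t-1} - C(a | x_t))_+
    over all permutations. perm[p] is the item placed at position p.
    """
    n = len(r)
    best, best_perm = -np.inf, None
    for perm in itertools.permutations(range(n)):
        util = sum(r[i] * u[p] for p, i in enumerate(perm))
        C = np.array([sum(A[k, i] * e[p] for p, i in enumerate(perm))
                      for k in range(A.shape[0])])
        shortfall = np.maximum((t / T) * G - s_prev - C, 0.0)  # hinge (.)_+
        score = util - delta @ shortfall
        if score > best:
            best, best_perm = score, perm
    return best_perm
```

With delta = 0 the controller reduces to sorting by relevance; with a large delta it sacrifices utility to meet the scaled target immediately.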

Stationary Controller (SC)
To address the inappropriate strictness of the Myopic Controller, we turn to ideas from online convex programming. Algorithms developed for this setting select optimization variables (in our case, actions) at each time step based on streaming optimization parameters (in our case, contexts) [3]. As a first step, consider the following objective,

max_{a_1, ..., a_T} min_{0 ≤ λ ≤ δ} (1/T) Σ_{t=1..T} [ U(a_t | x_t) + λ^⊤ C(a_t | x_t) ] − λ^⊤ G / T,

where the inequality constraints on the Lagrange multiplier vector λ ∈ R^K are defined elementwise. Besides rescaling by 1/T, this is equal to the original objective (6): minimizing over the multiplier λ constrained to [0, δ] is an exact reformulation of the constraints implicit in the hinge loss appearing in the macro-violation cost.
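This equivalence between the hinge penalty and optimizing the bounded multiplier can be checked numerically (the values below are illustrative); the hinge cost δ^⊤ (G − s_T)_+ equals the extremal value of λ^⊤ (G − s_T) over λ ∈ [0, δ]:

```python
import numpy as np

delta = np.array([2.0, 1.0])    # violation cost vector (illustrative)
G = np.array([5.0, 3.0])        # targets
s_T = np.array([4.0, 6.0])      # terminal state: short on goal 1 only

hinge_cost = delta @ np.maximum(G - s_T, 0.0)     # delta^T (G - s_T)_+

# The extremum is attained at lam_k = delta_k where G_k > s_{T,k},
# and lam_k = 0 otherwise.
lam_star = np.where(G - s_T > 0, delta, 0.0)
extremal = lam_star @ (G - s_T)
```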
So far, the reformulation does not solve the problem of intermediate objectives, since minimizing over the multiplier λ requires summing over the entire horizon. However, if the value of λ were fixed, then the objective would be separable over time steps, and maximizing with respect to the action a_t at time t would no longer depend on the future. But a fixed λ would raise the same issues as the fixed violation cost in the MC. How should a value of λ be selected? The key insight from the online optimization literature is to alternate between maximizing the objective while holding λ fixed and updating λ to iteratively minimize the objective. Updating the multiplier in this way is like learning a dynamic violation cost. Concretely, actions are chosen according to

Π_SC(x_t, s_{t−1}, t) = argmax_a U(a | x_t) + λ_{t−1}^⊤ C(a | x_t).

This expression is simplified to contain only the terms which affect the argmax. It can be interpreted as approximating the average utility and macro-level progress over time by the utility and progress at the current time step. This is well motivated for i.i.d. contexts [3], and we therefore call this the Stationary Controller (SC). It remains to specify the multiplier updates. In general, λ_t is defined based on λ_{t−1} and the gradient of the objective with respect to the multiplier: (1/T) G − C(a_t | x_t). In experiments, we use a variant of online gradient descent with adaptive step size. For the sake of exposition, we derive a closed-form expression for the controller in the simpler case of gradient descent with fixed step size η > 0 and initialization λ_0 = 0. In this case,

λ_t = η ( (t/T) G − s_t ).

The multiplier λ_t is exactly the tracking error between the linearly scaled target and the current state, scaled by the step size η. Accounting for the bounds on λ, the closed-loop control law can be written as

Π_SC(x_t, s_{t−1}, t) = argmax_{Σ ∈ B} r_t^⊤ Σ u + λ_{t−1}^⊤ A_t Σ e, with λ_{t−1} clipped elementwise to [0, δ].

Algorithm 3 presents the LP implementation.

Proportional (P) Control. We briefly outline the connection between the Stationary Controller and a (seemingly) heuristic method for boosting the position of certain items within a ranking. Proportional (or simply "P") control is a general control technique which applies a correction proportional to the size of a tracking error. In the context of ranking, P-control makes direct adjustments to relevance scores and was first proposed by Morik et al. [36] for achieving long-term fairness constraints.
To draw this connection, we write the control law Π_SC(x_t, s_{t−1}, t) as a linear optimization problem and assume that u = e:

argmax_{Σ ∈ B} r_t^⊤ Σ u + λ_{t−1}^⊤ A_t Σ u = argmax_{Σ ∈ B} ( r_t + A_t^⊤ λ_{t−1} )^⊤ Σ u = argsort( r_t + A_t^⊤ λ_{t−1} ).

The last equality holds because u is non-increasing. Therefore, in this special case, the control law simply sorts items by adjusted relevance scores. The adjustments are proportional to the tracking error, where the transpose matrix A_t^⊤ can be understood as translating from macro-level goals to individual items. Items associated with lagging macro-level metrics will therefore be boosted. The step size parameter η can be interpreted as the "gain" of the P-controller, determining its sensitivity to tracking errors.
The P-controller is usually thought of as a heuristic, derived without reference to an overall objective. Its score adjustment does not usually take the macro-violation cost parameter δ into account, and it cannot account for possible differences between u and e. Despite this cost-obliviousness, the derivation above shows that P-control arises as a special case of our Stationary Controller.
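A minimal sketch of this P-control special case, assuming u = e and the fixed-step multiplier derived above (names are illustrative):

```python
import numpy as np

def p_control_ranking(r, A, G, s_prev, eta, delta, t, T):
    """P-control special case of the Stationary Controller (u = e), sketch.

    The multiplier is the step-size-scaled tracking error between the
    linearly scaled target ((t-1)/T) G and the accumulated state s_{t-1},
    clipped to [0, delta]. Items in lagging groups get their scores boosted.
    """
    error = ((t - 1) / T) * G - s_prev
    lam = np.clip(eta * error, 0.0, delta)   # multiplier lambda_{t-1}
    adjusted = r + A.T @ lam                 # adjusted relevance scores
    return np.argsort(-adjusted)             # rank by descending adjusted score
```

Setting eta = 0 recovers plain relevance sorting; larger eta makes the controller react more aggressively to constraint shortfalls.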

Predictive Controller (PC)
All controllers presented so far attempt to make progress towards macro-level goals at a constant rate over the horizon: at step t, the target is set to (t/T) G. This does not account for non-stationarity in the context distribution. For example, certain types of items may be relevant only on weekends or evenings. Attempting to progress on macro-level goals at a linear rate fails to take this variable underlying demand into account.
We therefore derive a predictive controller which accounts for the entire time horizon. Denote future actions as a_{>h} = (a_{h+1}, ..., a_T) and similarly future contexts as x_{>h}. The total progress can be written as

s_T = s_{h−1} + C(a_h | x_h) + C̃(a_{>h} | x_{>h}),

where C̃(a_{>h} | x_{>h}) is the "progress-to-go" at h, defined as the sum of C(a_τ | x_τ) for τ from h + 1 to T. The first term in this expression is the state: the accumulated progress so far. The middle term is the contribution at time h, and the final term is the cumulative progress to come. This expression explicitly separates the contributions of the past (known), the present (current decision), and the future (unknown). The portion of the optimization objective (6) that depends on the action a_t at time t can be written as

U(a_t | x_t) − δ^⊤ ( G − s_{t−1} − C(a_t | x_t) − C̃(a_{>t} | x_{>t}) )_+.

Notice that because the utility is separable over time, the contributions of past and future utility do not affect the decision at time t. Due to the hinge loss, however, the macro-level goal is not separable over time. The progress-to-go C̃(a_{>t} | x_{>t}) depends on the future contexts, which are unknown, and on future actions, which are hard to choose without knowing those contexts. Instead, we propose using predicted values, denoted by C̃_t. In Appendix B.1, we present methods for forecasting the progress-to-go from historical data.
The following develops a multi-forecast predictive controller that can make use of such progress-to-go estimates. Given a bootstrap sample of m forecasts C̃_t^1, ..., C̃_t^m of the progress-to-go, the predictive controller selects an action which maximizes the average objective over these m possible futures. The multi-forecast objective at time t is represented by the following optimization problem:

max_a (1/m) Σ_{j=1..m} [ U(a | x_t) − δ^⊤ ( G − s_{t−1} − C(a | x_t) − C̃_t^j )_+ ].

Finally, we introduce a multiplier λ^j for each forecast and use the same alternating online optimization approach developed in the previous section for the Stationary Controller. Putting all the pieces together, the predictive controller Π_PC(x_t, s_{t−1}, t) selects actions according to

Π_PC(x_t, s_{t−1}, t) = argmax_a U(a | x_t) + ( (1/m) Σ_{j=1..m} λ_{t−1}^j )^⊤ C(a | x_t),

and updates each multiplier by defining λ_t^j based on λ_{t−1}^j and the gradient of the objective with respect to that multiplier: G − s_{t−1} − C(a_t | x_t) − C̃_t^j. For the simple case of online gradient descent, this update takes the form

λ_t^j = clip_{[0, δ]} ( λ_{t−1}^j + η ( G − s_t − C̃_t^j ) ).

Similar to the Stationary Controller, actions are chosen according to a weighted objective of utility and progress towards the macro-level targets. However, while the Stationary Controller updates its multiplier to target a linear rate of progress, the predictive controller updates its multipliers according to potentially non-stationary forecasts of the progress-to-go (e.g., expected higher demand on the weekend). Algorithm 4 presents the LP implementation.
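A sketch of the per-forecast multiplier update under plain gradient steps; the experiments use Adam, and the exact update written here is an assumption consistent with the gradient given above:

```python
import numpy as np

def pc_multiplier_update(lams, s_prev, C_t, Ctogo, G, eta, delta):
    """Multiplier update for the Predictive Controller (illustrative sketch).

    lams holds one multiplier per forecast; Ctogo[j] is forecast j of the
    progress still to come after this step. Each multiplier grows when the
    state plus forecasted future progress falls short of the target G,
    and is clipped elementwise to [0, delta].
    """
    new = []
    for lam, ctg in zip(lams, Ctogo):
        shortfall = G - (s_prev + C_t + ctg)   # predicted terminal shortfall
        new.append(np.clip(lam + eta * shortfall, 0.0, delta))
    return new
```

A forecast that already predicts the target will be met drives its multiplier to zero, so that forecast stops distorting the ranking away from pure utility.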

EXPERIMENTS
While each controller comes with a strong conceptual and theoretical motivation, we now evaluate how far these arguments translate into improved empirical performance. In particular, we evaluate the controllers on real-world datasets to assess their differences on realistic data. Furthermore, we present experiments on synthetic data to explore in which situations PC outperforms SC. Implementations of the controllers and code for reproducing the experiments are available at https://github.com/xkianteb/ranking_constraints.

Experiment Setup
In addition to the controllers discussed in Section 3, we include results for two additional controllers for comparison. As an (unachievable) skyline, we report the performance of an oracle controller that has access to the whole sequence of test-time contexts and directly optimizes the overall objective. For further comparison, we also include the unconstrained controller (MC w/o constraints), which only optimizes utility without enforcing any macro-level interventions.
We conduct experiments on three datasets. The first is KuaiRec [18], a fully observed dataset collected from the recommendation logs of the video-sharing mobile app Kuaishou. The KuaiRec dataset consists of 1,411 users, 3,327 items, and 4,676,570 interactions, and has a density of 99.6%. We filter to include only items that every user has interacted with, which reduces the number of items from 3,327 to 2,062. We consider the task of ranking videos for sequentially arriving users. Since the KuaiRec dataset does not provide relevance scores, we define the relevance score as half of the normalized watch ratio, capped at 1; this is the relevance signal recommended by the dataset publishers.² We define an exposure intervention on two arbitrarily chosen groups to evaluate performance on multi-group constrained ranking. Each group contains two videos, one of which is shared between the groups. In particular, we set the exposure targets to 1.1 times and 3 times the exposure of the unconstrained controller (MC w/o constraints).
The next dataset we consider is the linear television dataset Tv Audience [43]. This dataset contains temporal television-watching behavior of 13k users across 217 channels over 19 weeks, with an hourly time resolution. For our experiment, we use only the first 12 weeks and ignore the remaining weeks, which results in a total of 288 timeslots. We consider the task of ranking channels over time. The relevance score of a channel during a particular hour is defined as the number of viewers, normalized by the channel's maximum viewers per hour over the past several weeks. To evaluate the temporal-prediction capabilities of the controllers, we define an exposure intervention on a group consisting of one arbitrarily selected late-night channel that users mostly watch during the night or late-evening hours. The intervention is a 100% exposure boost, which is equivalent to setting the exposure target to twice that of the unconstrained controller (MC w/o constraints).
For our final dataset, we created a fully synthetic dataset to better understand the situations in which PC outperforms SC. This dataset consists of temporal patterns of relevance scores for eight items over a horizon of 400 steps. The first four items are always relevant and have the highest relevance scores. The remaining four items are used to define exposure interventions on two disjoint groups. Two of the remaining items form one group and are most relevant during the first half of the time horizon, while the other two items form the second group and are most relevant during the second half. Unlike the previous two datasets, all controllers are trained and evaluated on the exact same relevance scores. This means that we can assume accurate knowledge of the future context distribution when forecasting C̃_t, which ensures that bad forecasts do not confound the evaluation of the controllers.
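A dataset with this structure could be generated as follows; the qualitative pattern matches the description above, but the exact score levels and noise are illustrative assumptions:

```python
import numpy as np

def synthetic_relevances(T=400, seed=0):
    """Relevance patterns of a synthetic dataset like the one described.

    Eight items over T steps: items 0-3 are always highly relevant; items 4-5
    (group 1) are most relevant in the first half of the horizon and items
    6-7 (group 2) in the second half. Score levels are made up.
    """
    rng = np.random.default_rng(seed)
    R = np.zeros((T, 8))
    R[:, :4] = 0.9 + 0.1 * rng.random((T, 4))   # always-relevant items on top
    half = T // 2
    R[:half, 4:6] = 0.7; R[half:, 4:6] = 0.1    # group 1: first half
    R[:half, 6:8] = 0.1; R[half:, 6:8] = 0.7    # group 2: second half
    return R
```

A purely stationary controller that spreads group exposure uniformly will pay extra utility cost in the "wrong" half of the horizon, which is exactly the pattern PC can exploit.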
Metrics. For our experiments, we use Discounted Cumulative Gain (DCG) [24] as the utility weight u, and Reciprocal Rank (RR), e_p = 1/p, as our exposure curve [26].

² See the data description section on https://kuairec.com/.

Hyperparameters. SC and PC have parameters that must be tuned and estimated to achieve good performance. We tuned these parameters by dividing the data into three sets: train, development, and test. For the PC, we used the train set to estimate the forecast of the progress-to-go and the development set to simulate online contexts, as described in Appendix B.2. We performed a grid search over the train and development sets to select the best forecast parameter based on the overall objective (6). To update the multiplier λ, we used the Adam optimizer [28] for both SC and PC. Additionally, we performed another grid search over the train and development sets to pick the best Adam hyperparameters based on the overall objective (6). In our experiments, we investigated the performance of each controller for different macro-violation penalty values. We selected the best-performing hyperparameters separately for each controller and penalty value, and we repeated this process for all datasets. Detailed hyperparameter ranges are given in Appendix C.

Experiment Results
In this section, we evaluate key properties of the controllers using the datasets discussed in the previous section. For all experiments, the x-axis represents the varying macro-violation cost factor (ranging from 1×10⁻² to 1×10²), and the y-axis shows the performance of the controllers in terms of the utility (1), the violation cost (5), or the overall objective (6).

Which controller achieves the highest overall objective?
Figure 2 compares all controllers on both KuaiRec, a stationary dataset, and Tv Audience, a non-stationary dataset with temporal shifts in the context distribution. The performance of the oracle policy provides a skyline because it has (unrealistic) knowledge of all future contexts. We first note that for both datasets, all algorithms are largely equivalent when the macro-violation cost factor is small, since a small cost factor implies little influence on the objective compared to short-term utility maximization. As the importance of the long-term constraints increases with an increasing macro-violation cost factor, the MC performs substantially worse than the other controllers, because its actions are chosen at every time step without consideration of how increasing violations in the current state interact with future contexts. The rightmost plots show, across both datasets, that under an increasing violation cost, all controllers eventually treat the macro-constraints as hard constraints and reduce their violation cost close to zero. However, both SC and PC perform substantially better than MC in terms of the overall objective. Comparing SC and PC on the KuaiRec dataset, we see that both perform about the same. This is to be expected since there is no temporal pattern that PC could learn to exploit with its progress-to-go estimates. Instead, it can only learn to predict the average exposure, which is precisely what the SC is optimizing.
On the Tv Audience dataset, which has non-stationary temporal patterns, we see that the PC can plan for the temporal pattern and perform better than all other controllers.

How sensitive is PC to the number 𝐵 of forecast samples?
While the PC performs well on temporal datasets and matches the performance of SC on non-temporal datasets, it has the number of forecast samples B as an additional parameter that needs to be selected. Figure 5 in the appendix shows the performance of PC depending on the choice of B for different cost vectors. Note that the results on the Tv Audience dataset are the median of 20 independent runs, since the length of the test set is only 48, which introduces a great amount of noise. When the number of forecasts is greater than 20, PC performs well on both datasets, giving a reference point for selecting B. However, additional savings in computation time are possible for some datasets, since we find that on the KuaiRec dataset smaller values of B can suffice for good performance.

When is it advantageous to use a predictive controller?
From the experiments in Figure 2, we see that PC performs better than the other controllers on the Tv Audience dataset, which has a non-stationary temporal pattern. To further illustrate this point, Figure 3 presents a salient example that explores when the PC should be preferred over controllers that are oblivious to any temporal patterns. In the figure, each column shows one controller, with its utility in the top row and its exposure in the bottom row. There are eight items in total and one item in each group. We compute nDCG@4 and RR@4, which means only the top-4 items receive utility and exposure.
When no macro-level exposure goals are enforced, as seen in the MC w/o constraints, the exposure for both groups is zero because their items are not among the top-4 ranked items. As illustrated by the oracle controller, the first group is most relevant during the first half of the time horizon, while the second group is most relevant during the second half. The PC performs similarly to the oracle controller because it can leverage this temporal pattern. The SC performs better than the MC because its tuned gain parameter allows for additional flexibility. However, both MC and SC follow a linear exposure target and are thus unable to boost the two groups separately.

CONCLUSIONS
We formalize and address the problem of how to design algorithms that convert macro-level goals into a sequence of individual rankings that have the least impact on micro-level metrics. The algorithms we introduce are analogous to how control is used in mechanical systems to provide layers of abstraction with clear semantics. By introducing three new controllers, we cover a range of application scenarios. Furthermore, we provide rigorous justification for proportional controllers for ranking. Of the three controllers we introduce, we find that controllers based on online optimization (i.e., SC and PC) outperform the more naive MC controller. Furthermore, we find that the predictive controller (PC) performs better than the stationary controller (SC) in non-stationary settings. This paper opens up a wide range of future work. By making new technical connections between ranking and control theory, it provides a new set of tools for designing adaptive ranking policies. Furthermore, we anticipate that the macroscopic view of ranking platforms we introduce will provide a conceptual framework for making these platforms more steerable.

ETHICAL CONSIDERATIONS
Understanding that search and recommendation AI platforms are socio-technical systems means that platform designers must consider both the technical and social aspects of these systems. In particular, optimizing recommendation systems for short-term engagement potentially comes at the cost of the long-term sustainability of the platform. Not considering long-term sustainability could have social implications like amplifying misinformation or providing disparate utility to different groups, which could affect users' long-term satisfaction with a platform. Our work focuses on developing algorithms that incorporate a platform designer's long-term goals while maintaining a platform's short-term goals. Our intent is that this will enable platform designers to better incorporate all of the socio-technical aspects of a system into the algorithmic decision-making of a search and recommendation AI platform. However, the increased ability to control long-term platform behavior does not automatically lead to better behavior, and responsible governance in setting long-term goals is of crucial importance.

A NOTATION SUMMARY
u, e    micro and macro position weight vectors
Σ ∈ Δ    doubly stochastic ranking matrix

B CONTROLLER IMPLEMENTATION
The controllers developed in Section 4 depend on various parameters. All depend on the macro-level intervention targets and the violation cost vector, which we assume are specified by designers. The Stationary Controller and Predictive Controller additionally depend on an optimization parameter that determines the multiplier updates. Due to the connection with P-control discussed above, we refer to this parameter as the gain. Additionally, the Predictive Controller depends on forecasts of the progress-to-go. We now discuss how to use offline data to determine these quantities.

B.1 Estimating the Progress-to-go
The Predictive Controller requires several forecasts of the progress-to-go. We estimate these forecasts using offline data. The offline data defines an empirical distribution of contexts over time. Suppose that the offline data contains N contexts, which we index by i. Then, for offline bootstrap sample b and time step t, we sample a context by its index. We do so in a stratified manner to preserve relevant temporal relationships in the data. Depending on the setting, this sampling procedure may treat hour of the day, day of the week, etc., equivalently. By sampling contexts by their indices for all b and t, we construct B_off sampled sequences of contexts from this time-dependent distribution. However, the progress-to-go is not determined only by contexts; it also depends on the sequence of actions. We construct actions using an offline optimization over the entire horizon, using the exact objective (6) and the sampled sequences of contexts. Algorithm 5 presents the resulting LP. Notice that the ranking actions are not independent at each time step or in each bootstrap sample. Rather, the optimization problem finds the best contextual policy that is stationary over time. This formulation has two advantages. First, it prevents the optimization problem from growing with the horizon or with the number of bootstrap samples. Second, by constraining the actions to come from this reduced policy class, it prevents the offline optimization from over-exploiting its ability to see all contexts, in contrast to the partial information faced by the online controller in practice. This helps to ensure that the forecasted progress-to-go variables are not too ambitious.
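The stratified context sampling can be sketched as follows. This is a simplified stand-in: the choice of hour-of-day as the stratification key and all variable names are illustrative assumptions, and the subsequent action optimization (the paper's Algorithm 5) is not shown.

```python
import numpy as np

def sample_context_sequences(context_times, n_samples, horizon, rng=None):
    """Bootstrap context-index sequences from offline data, stratified so
    that the context sampled for step t is drawn only from offline contexts
    observed in the same stratum (here: same hour of day) as t."""
    rng = rng or np.random.default_rng(0)
    strata = {}                                  # hour of day -> offline indices
    for idx, t in enumerate(context_times):
        strata.setdefault(t % 24, []).append(idx)
    sequences = np.empty((n_samples, horizon), dtype=int)
    for b in range(n_samples):
        for t in range(horizon):
            pool = strata[t % 24]                # preserve time-of-day structure
            sequences[b, t] = rng.choice(pool)
    return sequences
```

Each row of the result is one bootstrap sequence of context indices; pairing these with the offline-optimized actions yields the forecasted progress-to-go.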
Finally, the resulting sequence of (approximately) optimal actions, along with the sampled context sequences, defines the progress-to-go at each time step. A total of B_on ≤ B_off forecasts are created through this process.

B.2 Tuning the Gain
Both the Stationary Controller and Predictive Controller depend on the gain parameter. To tune its value, we again use offline data to sample sequences of contexts. We sample contexts in a time-sensitive manner, as described in the previous subsection. These samples are used to simulate closed-loop control, and the resulting performance approximates the performance of a controller with the given gain. We then do a simple grid search over the gain. Algorithm 6 describes this procedure.
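The procedure can be sketched as follows. This is a minimal stand-in for the paper's Algorithm 6: it assumes a `simulate_closed_loop` function (hypothetical name) that runs the controller with a candidate gain on one sampled context sequence and returns the overall objective (6), and the grid values are illustrative.

```python
import numpy as np

def tune_gain(simulate_closed_loop, context_sequences, grid=(0.01, 0.1, 1.0, 10.0)):
    """Grid search for the gain: for each candidate, average the simulated
    closed-loop objective over the sampled context sequences, keep the best."""
    best_gain, best_score = None, -np.inf
    for gain in grid:
        score = np.mean([simulate_closed_loop(gain, seq) for seq in context_sequences])
        if score > best_score:
            best_gain, best_score = gain, score
    return best_gain
```

Averaging over several sampled sequences reduces the variance of the estimated closed-loop performance for each candidate gain.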

B.3 Multiplier Update Algorithms
There are many possible ways to update the multiplier variables when implementing SC and PC. Perhaps the simplest is Online Gradient Descent (Algorithm 7). We initially experimented with Gradient Descent with Momentum, but ultimately found the Adam optimizer [28] (Algorithm 8) to be most successful.
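To make the update concrete, here is a minimal sketch of an Adam-style multiplier update. The projection onto the non-negative orthant and the sign convention (a positive constraint violation pushes the multiplier up) are assumptions for illustration; the paper's Algorithm 8 is authoritative.

```python
import numpy as np

class AdamMultiplier:
    """Adam-style update for a vector of non-negative multipliers, where the
    per-step constraint violation plays the role of the (sub)gradient."""

    def __init__(self, dim, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
        self.mu = np.zeros(dim)          # multipliers
        self.m = np.zeros(dim)           # first-moment estimate
        self.v = np.zeros(dim)           # second-moment estimate
        self.t = 0
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps

    def step(self, violation):
        self.t += 1
        g = np.asarray(violation, dtype=float)
        self.m = self.beta1 * self.m + (1 - self.beta1) * g
        self.v = self.beta2 * self.v + (1 - self.beta2) * g ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)        # bias correction
        v_hat = self.v / (1 - self.beta2 ** self.t)
        self.mu = np.maximum(0.0, self.mu + self.lr * m_hat / (np.sqrt(v_hat) + self.eps))
        return self.mu
```

Replacing the body of `step` with `mu + lr * g` (projected) recovers plain Online Gradient Descent (Algorithm 7).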

C.2 Additional Experiments
We conduct additional ablation experiments using the Last.fm dataset [11]. This dataset contains tuples of users, artists, and the amount of time that a user listened to a particular artist. We consider the task of ranking artists for sequentially arriving users. We define the relevance score for a user and artist to be the play time, and we consider a subset of artists. Of the total 292,385 artists and 358,868 users, we perform experiments using a subset of 1,373 users and 50 artists. We define an exposure intervention on a group containing two artists. The two artists are selected based on the listening behavior of two disjoint sets of users. Within each set of users, the top-15 artists are similar, but between the two sets, they are non-overlapping. The two artists are chosen to be the 15th most popular artist within each of the two user sets, and the exposure target is set to ten times the exposure under the original unconstrained utility-maximizing ranker. We use this structure to create a temporal pattern in the data: for the first half of the time steps, contexts are defined by users sampled from the first set of users; in the second half, they are sampled from the second set. Furthermore, for this dataset, we only consider DCG@15 instead of DCG across the entire set of items to be ranked. For a fair comparison, we keep the dataset and splits the same and only vary the shuffling order. The non-temporal version of Last.fm shuffles the contexts, which breaks the temporal pattern. Similar to Figure 2, when the cost is small, all controllers trade off a small violation to obtain a larger objective value. We see that PC performs worse on the non-temporal dataset than on the temporal dataset.
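The temporal vs. shuffled context construction can be sketched as follows; the user-set representation and function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def build_context_order(users_set1, users_set2, steps, temporal=True, rng=None):
    """Sample a sequence of user contexts: first half from set 1, second half
    from set 2. With temporal=False the sequence is shuffled, which breaks
    the temporal pattern while keeping the same marginal distribution."""
    rng = rng or np.random.default_rng(0)
    half = steps // 2
    seq = np.concatenate([rng.choice(users_set1, half),
                          rng.choice(users_set2, steps - half)])
    if not temporal:
        rng.shuffle(seq)
    return seq
```

Because shuffling only permutes the sampled contexts, the temporal and non-temporal variants contain exactly the same set of user contexts, isolating the effect of the temporal pattern itself.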

Figure 1: We propose to separate macro-level control used for steering the long-term dynamics of the platform from its micro-level engagement optimization. The interface layer provides an abstraction by optimally translating strategic macro-level interventions into a sequence of micro-level actions with minimal impact on short-term metrics.

Figure 2: Experiment results comparing all controllers across two datasets, KuaiRec and Tv Audience. The x-axis is the macro-violation cost factor on a log scale in all plots. The first column is the final objective (6) value, the middle column is the utility metric (DCG), and the final column is the macro-violation. The oracle has access to the test-time contexts and directly optimizes the original objective (6). The MC w/o constraints is an unconstrained utility-maximizing controller.

Figure 3: Comparison of all controllers on a synthetic dataset to showcase when the PC should be preferred. The o's and x's represent two groups of items. The top row is the average utility over time, and the dashed grey line represents the highest achievable utility under the exposure constraint. The bottom row displays the exposure over time of both item groups. The grey x's and o's represent the target exposure for the groups.

Figure 4: Comparison of two different versions of the Last.fm dataset. The left plot enforces a temporal pattern during training, and the right plot shuffles the dataset and breaks the temporal pattern. Furthermore, the test-time contexts have a temporal pattern, and the target exposure is kept the same across both plots.

Figure 5: Comparison of different numbers of forecast samples B used for computing the progress-to-go. The KuaiRec dataset on the left is non-temporal, and the Tv Audience dataset on the right is temporal.

Table 1: Hyperparameters used for experiments.

C ADDITIONAL EXPERIMENT DETAILS

C.1 Hyperparameters
We provide a table of the hyperparameters that we used for tuning each of the controllers in our experiments.