Learning the Optimal Control for Evolving Systems with Converging Dynamics

We consider a principal (controller) that can pick actions from a fixed action set to control an evolving system with converging dynamics. The actions are interpreted as different configurations or policies. We consider systems with converging dynamics, i.e., if the principal holds the same action, the system asymptotically converges (possibly requiring a significant amount of time) to a unique stable state determined by this action. This phenomenon can be observed in diverse domains such as epidemic control, computing systems, and markets. In our model, the dynamics of the system are unknown to the principal, and the principal receives only (possibly noisy) bandit feedback on the impact of his actions. The principal aims to learn which stable state yields the highest reward while adhering to specified constraints (i.e., the optimal stable state) and to drive the system into this state as quickly as possible. A unique challenge in our model is that the principal has no prior knowledge about the stable state of each action, yet waiting for the system to converge to suboptimal stable states costs valuable time. We measure the principal's performance in terms of regret and constraint violation. In cases where the action set is finite, we propose a novel algorithm, termed Optimistic-Pessimistic Convergence and Confidence Bounds (OP-C2B), that knows to switch an action quickly if it is not worth waiting until the stable state is reached. This is enabled by employing "convergence bounds" to determine how far the system is from the stable states, and by choosing actions through maintaining a pessimistic assessment of the set of feasible actions while acting optimistically within this set. We establish that OP-C2B ensures sublinear regret and constraint violation simultaneously. In particular, OP-C2B achieves logarithmic regret and constraint violation when the system convergence rate is linear or superlinear. Furthermore, we generalize OP-C2B to the case of an infinite action set and demonstrate its ability to maintain sublinear regret and constraint violation. We finally show two game control problems, mobile crowdsensing and resource allocation, that our model can address.


INTRODUCTION
Many large-scale complex systems, such as societies, markets, and computing systems, exhibit a delayed response to controllers' (e.g., governments', operators') policies, requiring a considerable amount of time (possibly with slow convergence) to reach a stable state (or steady state) under the controller's policy. In distributed systems, operators distribute requests across different servers to achieve balanced resource utilization. Over time, with a fixed load balancing algorithm, the system tends towards an even distribution of requests across servers, reaching a state of load balance. Similarly, in economic systems, government interventions such as adjusting interest rates or the money supply aim to control inflation and promote employment. The entire market gradually converges to an equilibrium state under the influence of a specific government policy.
However, in practice, controllers often lack the ability to foresee the stable state of a complex system under a particular policy and are limited to real-time (potentially noisy) observations of the impacts of their policies. Allowing complex systems to naturally reach stable states under each potential policy is impractical, given the potentially infinite policy space and the significant time wasted by unreasonable policies. Moreover, when a complex system becomes entrenched in a stable state resulting from unreasonable policies, the controller incurs high costs. For instance, in economic control, inappropriate monetary policies can lead to inflation or deflation over time, triggering economic instability, increased unemployment rates, and substantial economic losses. In distributed systems, unreasonable load balancing policies can result in high congestion on some resources when systems reach stable states. This increases operating costs for the operator, degrades the Quality of Experience (QoE) for users, and potentially leads to system failures. Another example is epidemic control, where the government has to switch policies to reach the most effective policy (e.g., lockdown, masks) and then commit to it to control the spread. Otherwise, it suffers from high healthcare costs and economic losses. These examples underscore that when the controller is unable to swiftly learn and adjust unreasonable policies, he may incur multifaceted and substantial costs. Therefore, in situations where complex systems exhibit slow convergence and the controller is constrained to real-time observations of policy impacts, it becomes crucial for the controller to switch policies strategically so as to learn the best policy and commit to it as quickly as possible. This introduces a learning problem of controlling such evolving systems with converging dynamics.
Motivated by this, we explore an abstract model consisting of a principal (e.g., government, operator) and an evolving system (e.g., society, market, or computing system). The principal takes actions (representing different policies) from a fixed action set at each time step. These actions control an underlying system that evolves over time, and at each time step the principal receives bandit feedback determined by his action and the state of the system, with independently and identically distributed (i.i.d.) noise. The critical feature of the underlying system is its asymptotic convergence to a unique stable state if the principal fixes an action. As previously mentioned, allowing the system to become ensnared in suboptimal or inefficient stable states (resulting from suboptimal actions), rather than the stable state corresponding to the optimal action, can lead to poor and possibly catastrophic global performance. The poor global performance associated with suboptimal stable states manifests primarily in two dimensions. First, the reward or utility for the principal in such states is not maximized (e.g., high operational costs). Second, the principal may fail to meet essential constraints in these states. For example, the principal needs to ensure safe behavior, maintain load balance, or satisfy resource budgets. In more severe instances, the violation of these constraints can result in system failures; for example, load imbalance in electricity grids can trigger system failures. To quantitatively assess the inefficiencies arising from the principal's suboptimal actions, we employ constraint violation and regret as performance metrics, where regret is defined as the difference between the cumulative reward at the optimal stable state (for the optimal action) and the cumulative reward obtained by the principal. This exploration leads us to investigate a novel (constrained) bandit learning problem. Unlike traditional (constrained) bandit learning problems, the reward (or constraint-value) process in our model is not i.i.d. over time. Instead, it is history-dependent (i.e., it relies on the system state), and the expected reward changes as the system evolves.
Learning equilibrium in repeated Stackelberg games. Recently, learning equilibria in repeated games has been attracting a great deal of attention. Among these works, closest to ours are those studying repeated general-sum Stackelberg games (or leader-follower games) between two players in the bandit feedback setting, i.e., treating the leader as the principal and the follower as the evolving system. [47] study the problem of learning the optimal leader strategy in Stackelberg games with a follower best-response oracle, which means that the leader knows the exact follower best response to its action. [4] study learning algorithms for Stackelberg equilibrium in Stackelberg games with noisy bandit feedback in a batch version. Their proposed algorithm needs to query each pair of leader-follower actions for sufficiently many rounds to calculate the empirical mean. Therefore, it requires a centralized authority to learn the equilibrium instead of distributed self-learning by the players, as in this paper. This line of work on centralized learning in repeated Stackelberg games is further studied in [20,65], where a central controller or machine learning algorithm can determine the actions of both the leader and the follower. Notably, the problems studied in recent works [10,62,64] can be subsumed into our model. [64] study the problem of online
learning in a linear repeated Stackelberg game where the follower always best responds to the leader's actions. This is equivalent to the system in our model immediately converging to the stable states of the principal's chosen actions. However, they assume that the utility function of the leader can be well parameterized by a linear function to support their results. [10,62] consider the setting where the follower cannot obtain the exact best response and needs to learn to best respond to the leader's actions, i.e., both the leader and the follower are jointly learning to reach an equilibrium through no-regret learning algorithms. Compared to our model, these works focus on specific dynamics of the follower (e.g., best response, no-regret learning) and specific leader reward functions. We do not restrict our attention to specific system dynamics and principal reward functions; we only require that the system asymptotically converges to the stable states of the principal's chosen actions and that the reward function is Lipschitz continuous. In fact, our algorithm can be used as the leader's strategy in these works and achieves better theoretical performance, for example, improving the O(√T) regret bound in [62] to O(log T). Additionally, their methods are not well equipped to handle either an infinite action set [62,64] or additional (time-varying) constraints [10,62,64].
Learning to control games. Our work is also related to the literature on control and intervention in games from an online learning perspective, where a principal can tune some parameters in the reward functions of the players [6,41,46,48]. While our dynamics do not necessarily stem from a game, games are a key example of a system that converges to a stable state (i.e., an equilibrium). Additionally, these works do not consider any time-varying or game-dependent constraints. From this point of view, our work is the first to provide theoretical guarantees for learning to control an unknown game with game-dependent constraints.

Contributions
Our main contributions can be summarized as follows: (1) We propose a model for learning the optimal action/policy for an evolving system with converging dynamics. We present our model in a general setting that can capture many important applications of learning to control an unknown game, and detail two applications within our model. In cases where the action set is finite, we present a novel algorithm called Optimistic-Pessimistic Convergence and Confidence Bounds (OP-C2B), which builds on the basic intuition of maintaining a pessimistic assessment of the set of feasible actions while acting optimistically within this set. Our main innovation is employing "convergence bounds" to determine the maximum possible reward or minimal constraint values the principal incurs by waiting for the system to converge to the stable state for a given action. A chosen action in OP-C2B is played consecutively for a full "epoch". OP-C2B balances between the i.i.d. noise and the convergence noise by using only a fraction of the samples from a given epoch for estimation.
We prove that OP-C2B achieves sublinear regret and constraint violation of Õ(T^{(1−β₂)₊}) plus gap-dependent logarithmic terms when the system convergence rate is polynomial with exponent β₂. The regret and constraint violation bounds can be further improved to logarithmic for OP-C2B when the system convergence rate is exponential. Notably, OP-C2B achieves performance bounds of the same order as having a predictive mode (knowing the stable state of each action), as long as the system convergence rate is linear or superlinear. (2) Furthermore, we extend our OP-C2B algorithm to accommodate an infinite action set and showcase its capacity to maintain sublinear regret and constraint violation. Specifically, we ensure sublinear regret and constraint violation guarantees whose orders depend on β₁ for an exponential system convergence rate, and on β₂ together with a tuning parameter λ ≥ 0 for a polynomial system convergence rate.

OUR MODEL
This section presents the model we study. Before specifying our problem formulation, we introduce some notation that will be used below.

Problem Formulation
We consider a principal's online decision problem with a fixed action set A ⊂ R^d. At each time t, the principal chooses/plays an action a_t from the action set A. The action controls an underlying system that evolves with time and affects the principal's reward and constraint values. We denote by z_t the state of the system at time t, which lies in a bounded and closed set Z ⊂ R^{d_z}. Let D represent the diameter of the state set Z, which is known to the principal. The state of the system depends on the previous state and the action taken, i.e., z_t = f_t(a_t; z_{t−1}), where f_t is the system's evolution function at time t. After taking action a_t, the principal does not observe the underlying system state z_t and can only observe noisy versions of the reward r(a_t; z_t) and the incurred constraint values c(a_t, z_t). Here r(·; ·) : A × Z → [0, 1] is the principal's reward function that determines the principal's expected utility, and c(·; ·) = (c_1(·; ·), ..., c_m(·; ·)) : A × Z → [−1, 1]^m are the constraint functions that the principal needs to satisfy. Formally, at time t, the principal observes a noisy reward x_t = r(a_t, z_t) + ε_t ∈ R and noisy incurred constraint values y_t = c(a_t, z_t) + ξ_t ∈ R^m, where ε_t ∈ R and ξ_t ∈ R^m are i.i.d. zero-mean noises. For generality, we consider the forms of r(·; ·) and c(·; ·) to be unknown to the principal. The observations of rewards and incurred constraint values are noisy because the effectiveness of a policy cannot be deduced accurately and is often inferred from stochastic data.
In our paper, the evolution function f_t is unknown to the principal. We do not assume that f_t is i.i.d. or has a specific form, but only that it satisfies the following properties. Assumption 1. The system evolution satisfies the following conditions: (1) For each action a, consider the iteration z_t = f_t(a; z_{t−1}) for t ≥ t₀, i.e., the action a is always played after t ≥ t₀. There exists a unique stable state corresponding to action a. That is, there exists a z*_a such that lim_{t→∞} ‖f_t(a, z_{t−1}) − z*_a‖ = 0 no matter what z_{t₀−1} is. (2) For each action a, consider the iteration z_t = f_t(a; z_{t−1}) for t ≥ t₀. There exists a decreasing sequence {δ_k}_{k≥0} with lim_{k→∞} δ_k = 0 such that ‖z_{t₀+k−1} − z*_a‖ ≤ δ_k for every k ≥ 0, regardless of the initial state z_{t₀−1}. In particular, δ₀ can be set to D since it always holds that ‖z_{t₀−1} − z*_a‖ ≤ D.
Part (1) of Assumption 1 implies that if the principal keeps the action a fixed, the system will asymptotically converge to the stable state z*_a. For notational convenience, we denote r*_a = r(a, z*_a) and c*_a = c(a, z*_a) as the "stable reward" and "stable constraint values" for action a, respectively. We call an action a feasible if c*_a ≤ 0 (i.e., it satisfies the constraints at the corresponding stable state); otherwise it is called unfeasible. The sequence {δ_k} in part (2) characterizes the "convergence time to the stable state", i.e., the convergence rate. In fact, Part (2) of Assumption 1 subsumes Part (1), since the validity of Part (2) implies the validity of Part (1); we state Part (1) explicitly for convenience of presentation. Next, we introduce several examples of system dynamics that satisfy the above assumption.
• Contraction mapping. A basic system dynamics that satisfies the above assumption is a contraction mapping. A contraction mapping has a unique fixed point, which is the unique stable state required for part (1), and the sequence {δ_k} would be exponentially decaying, i.e., δ_k = O(exp(−β₁ k)) with β₁ > 0. Contraction mappings are commonly found in solutions of many Ordinary Differential Equations (ODEs), in methods such as Newton's method, and in popular policy evaluation schemes such as temporal difference learning.
• Continuous multi-player games. Another example of system dynamics is deterministic learning in continuous multi-player games. In this example, the underlying system is a set of N players. Each player i ∈ [N] has their own decision variable z_{i,t} ∈ Z_i at time t, and the state variable of the system is the concatenation of all the decision variables, z_t = (z_{1,t}, ..., z_{N,t}) ∈ Z = Z_1 × ... × Z_N. Each player has their own utility function, which depends on the decision variables of all the players and is parameterized by the action/policy taken by the principal. Specifically, at time t, if the action taken by the principal is a_t, then the utility function for player i is u_i(z_t; a_t). The objective of the players is to control their own decisions so as to learn the Nash equilibrium (NE) among them, as they have no access to the utility functions of other players. For this type of system, one common scenario is that each player lacks knowledge of his utility function and can only observe the received utility/payoff at each round, i.e., payoff-based learning of the NE in games. Under the condition of a convex game with strongly monotone pseudo-gradients, these players can achieve convergence to the NE z*_a at a linear rate [52] if the principal keeps action a fixed, i.e., δ_k = O(k^{−1}). Without the condition of strongly monotone pseudo-gradients, they can achieve only a square-root convergence rate (i.e., δ_k = O(k^{−1/2})) under the additional requirement of Lipschitz continuity of the pseudo-gradients [52]. Furthermore, if each player has access to their own utility function and the decision variables of other players, employing classic no-regret learning strategies for all players can ensure exponential convergence to the NE [6], i.e., δ_k = O(exp(−β₁ k)) with β₁ > 0.
Motivated by the above two examples, in our paper we consider the two cases of exponential decay, δ_k = O(exp(−β₁ k)) with β₁ > 0, and polynomial decay, δ_k = O(k^{−β₂}) with β₂ > 0, for {δ_k}, which commonly appear in practical applications. As the principal usually has some prior knowledge about the worst-case bound on the actual convergence time, we assume that the principal knows a lower bound on β₁ or β₂ (for convenience, we still use the symbols β₁ and β₂ to represent it).
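To make Assumption 1 concrete, the following minimal Python sketch iterates a simple linear contraction map under a fixed action; the specific map, contraction factor, and dimensions are illustrative assumptions rather than part of the model.

```python
import numpy as np

def step(action, state, gamma=0.5):
    """One step of an illustrative contraction mapping z_t = f(a; z_{t-1}).

    For a fixed action a, the map z -> gamma*z + (1-gamma)*a is a contraction
    with factor gamma, so the state converges to the unique stable state
    z*_a = a at an exponential rate (delta_k = O(gamma^k))."""
    return gamma * state + (1.0 - gamma) * action

state = np.array([5.0, -3.0])    # arbitrary initial state z_0
action = np.array([1.0, 2.0])    # action held fixed by the principal
for t in range(1, 31):
    state = step(action, state)
    if t % 10 == 0:
        print(f"t={t:2d}  ||z_t - z*_a|| = {np.linalg.norm(state - action):.2e}")
```

Holding the action fixed, the distance to the stable state decays geometrically in this toy example, i.e., δ_k = O(exp(−β₁ k)) with β₁ = log(1/γ).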
We now move on to the objective of the principal. The objective of the principal is to commit, as quickly as possible, to the optimal action a*, defined as
a* = arg max_{a ∈ A} r*_a   subject to   c*_a ≤ 0,     (1)
i.e., the feasible action with the highest stable reward. For simplicity, we assume that this optimal action is unique, i.e., for every a ≠ a*, either r(a; z*_a) < r(a*; z*_{a*}) or there exists an i ∈ [m] such that c_i(a; z*_a) > 0. Nevertheless, our analysis carries over with minor modifications to the case of multiple optimal actions, using the same algorithm. One motivation for this objective is epidemic control. In this scenario, the government (principal) has to take policies (e.g., lockdown, masks) to control the fraction of infected individuals (the system). To mitigate potential public outcry and criticism, it is imperative for the government to ensure that the number of fatalities, influenced by both the policy and the infection rate, remains below a certain threshold. Moreover, in the decision-making process the government should take into account the utility, including health costs (e.g., deaths and complications) and operational costs (e.g., treatment and economic implications). The government wishes to commit, as quickly as possible, to the optimal policy that maximizes utility while adhering to the constraints at the stable state. We aim to design a control algorithm for the principal that minimizes regret and constraint violation simultaneously, where the regret is the cumulative gap Σ_{t=1}^{T} (r*_{a*} − E[r(a_t, z_t)]) between the optimal stable reward and the rewards collected by the principal, the constraint violation measures the expected cumulative violation of the constraint values c(a_t, z_t) over the horizon, and the expectation is with respect to the randomness in the algorithm. With these metrics, a good algorithm finds the optimal action a* as quickly as possible and then commits to it. To facilitate the algorithm design for our model, in addition to Assumption 1, we also make the following assumptions, which are standard in the online learning literature [53].
Assumption 2. We assume that, for every action a ∈ A, the reward function r(a; ·) and the constraint functions c_i(a; ·), i ∈ [m], are Lipschitz continuous in the state z with a common Lipschitz constant L, and that the noises ε_t and ξ_t are subgaussian. Remark 1. Our model can actually be viewed as a specialized instance of adversarial constrained bandits, which accommodate any sequence of rewards and constraint values. Nevertheless, the regret definitions in that literature (e.g., [9,25]) are significantly weaker than ours, as they compare to the best action in hindsight. In contrast, our regret resembles the more demanding regret of stochastic bandits, which compares to the "absolute" optimal action. Consequently, existing adversarial (constrained) bandit algorithms fail to provide meaningful regret guarantees for our model. In our notation, the regret for the adversarial bandit problem would be expressed as max_{a ∈ A} Σ_{t=1}^{T} r(a; z_t) − E[Σ_{t=1}^{T} r(a_t; z_t)]. This regret definition treats the state sequence {z_t}_{t=1}^{T} as fixed and overlooks the influence of the action sequence {a_t}_{t=1}^{T} on the state sequence. In the context of epidemic control, this metric would have the government strive to identify the best action given the numbers of infections over time, as if these numbers were inevitable regardless of the government's improved actions.
We present our results by beginning with the simple case where A is finite, i.e., A = {1, ..., K}, and then extend them to the case where A is infinite/continuous.

FINITE-ACTION CASE
This section presents our algorithm design and the corresponding analysis for the case A = {1, ..., K}.

Algorithm
To provide intuition and ease understanding of the algorithm design, we start by analyzing the simpler case where the observed rewards and constraint values are noiseless.
Noiseless case. In the noiseless case, i.e., ε_t = 0 and ξ_t = 0 for all t, the principal directly observes x_t = r(a_t; z_t) and y_t = c(a_t; z_t). A naive algorithm would pick each action consecutively a sufficiently large number of times (denoted by N_try), to allow the system to approach the stable state corresponding to this action. The principal would then know the reward and constraint values at the stable state of each action with arbitrarily low error, and could then commit to the optimal action. This naive algorithm can achieve sublinear regret if N_try is above a threshold, which depends on the convergence rate and the suboptimality gap. Since the suboptimality gap is unknown to the principal, the naive algorithm achieves linear regret in general.
To overcome this issue, we choose actions based on the confidence bound strategy (optimism in the face of uncertainty). The idea behind the confidence bound strategy is to play actions that seem (with high confidence) to be feasible and to have a good stable reward for an increasing number of rounds, allowing them to get closer to convergence. Since it is still necessary to play an action consecutively for some time to get an accurate enough estimate of the reward and constraint values at the stable state of that action, we choose an action to be played consecutively over a full "epoch", instead of switching actions at every time step. However, we do not want to always wait for the system to converge, as this might waste precious time and incur significant regret or constraint violation. Thus, it is important to choose an appropriate epoch length. Next, we explain this epoch-based confidence bound strategy in detail.
Denote by a_n the action taken in epoch n, by k_{a,n} the number of epochs action a has been chosen up to the end of epoch n, by ℓ_n the length of epoch n, and by T_n the total number of timesteps up to the end of epoch n. The action in the next epoch, a_{n+1}, is chosen as
a_{n+1} = arg max_{a : ŷ_{a,n} ≤ 0} x̂_{a,n},     (2)
where x̂_{a,n} is a high-confidence upper bound on the stable reward r*_a and ŷ_{a,n} is a high-confidence lower bound on the stable constraint values c*_a. The intuition behind (2) is to select actions by being optimistic (i.e., more aggressive) about their stable rewards and pessimistic (i.e., more conservative) about their stable constraint values. Therefore, a_{n+1} is chosen to be the action with the highest optimistic stable reward amongst all the actions that are plausibly feasible, a ∈ {a : ŷ_{a,n} ≤ 0}. The principal plays this action for the complete epoch and observes the reward and constraint values incurred at each timestep of that epoch. The remaining task is to determine the expressions for (x̂_{a,n}, ŷ_{a,n}). Note that if an action a is taken for ℓ timesteps consecutively, Assumption 1 lets us deduce that the state is within δ_ℓ of z*_a, and by the Lipschitz properties of the reward and constraint functions the last observed reward and constraint values are within L·δ_ℓ of r*_a and c*_a; we therefore set x̂_{a_n,n} and ŷ_{a_n,n} by adding L·δ_{ℓ_n} to, and subtracting L·δ_{ℓ_n} from, the final observations of epoch n, and set x̂_{a,n} = x̂_{a,n−1} and ŷ_{a,n} = ŷ_{a,n−1} for all other actions a ≠ a_n. To identify unfeasible actions, which then no longer need to be chosen, we should ensure that ŷ_{a,n} shrinks to c*_a as k_{a,n} grows. To achieve this, we let the epoch length increase with the number of epochs the chosen action has been played before, choosing the length of epoch n as ℓ_n = O(exp(k_{a_n,n})). Consequently, we spend time attenuating the convergence noise only for promising actions.
Noisy case. We now turn to the general case where ε_t ≠ 0 and ξ_t ≠ 0. The confidence bound strategy developed for the noiseless case cannot be directly applied here, as it only utilizes the final observed reward and constraint values, which may be highly noisy. To address the challenge posed by the noise, we should incorporate multiple observations and average them. The estimated expected reward and constraint values corresponding to the stable state of an action are subject to two kinds of errors: one due to the i.i.d. noise and one due to the distance from the stable state (i.e., the "convergence noise"). However, naively averaging all observed rewards or constraint values for an action reduces the i.i.d. noise but increases the convergence noise, as early rewards or constraint values are far from the stable reward or stable constraint values. Similarly, considering only the most recent reward or constraint values results in low convergence noise but high i.i.d. noise. Therefore, it is necessary to strike a balance between these two kinds of errors when averaging multiple observations. We now extend the confidence bound strategy to the noisy case, taking this trade-off into account, thereby creating the convergence and confidence bound strategy.
[Algorithm 1 (OP-C2B): for each epoch n, construct the permissible feasible action set Π_n, select the next action optimistically from Π_n, play it for the whole epoch, and update (x̂_{a,n}, ŷ_{a,n}) from the second half of the epoch.]
Specifically, we maintain the same epoch-based structure as before and choose ℓ_n = 2⌈exp(k_{a_n,n})⌉ = 2⌈exp(k_{a_n,n−1} + 1)⌉ (the multiplicative factor 2 ensures that ℓ_n/2 is an integer). At epoch n, we average the observations made during the second half of this epoch to estimate (r*_{a_n}, c*_{a_n}) for action a_n. That is to say, we define x̃_{a_n,n} and ỹ_{a_n,n} as the averages of the rewards and constraint values observed in the last ℓ_n/2 timesteps of epoch n, along with x̃_{a,n} = x̃_{a,n−1} and ỹ_{a,n} = ỹ_{a,n−1} for the other actions a ≠ a_n. Here x̃_{a,n} and ỹ_{a,n} can be viewed as the empirical estimates of r*_a and c*_a at epoch n, respectively. The action selection rule follows the same idea as the confidence bound strategy for the noiseless case. Based on (x̃_{a,n}, ỹ_{a,n}), we compute lower bounds ŷ_{a,n} on the stable constraint values and upper bounds x̂_{a,n} on the stable rewards such that ŷ_{a,n} ≤ c*_a and x̂_{a,n} ≥ r*_a with high probability. We then construct a set of "permissible feasible actions" denoted by Π_n = {a : ŷ_{a,n} ≤ 0}, i.e., all the actions that are plausibly feasible given the information available up to epoch n. A similar idea has been used in [3,35,38,40]. At the end of epoch n (or the beginning of epoch n+1), we select a_{n+1} to maximize x̂_{a,n} amongst a ∈ Π_n, i.e., the action for the next epoch is chosen as a_{n+1} = arg max_{a ∈ Π_n} x̂_{a,n}. Next we provide the details of how to compute x̂_{a,n} and ŷ_{a,n}. For notational convenience, we denote by n_a the last epoch in which action a was played before the end of epoch n. Then ℓ_{n_a} is the length of the last epoch in which action a was played before the end of epoch n, i.e., ℓ_{n_a} = 2⌈exp(k_{a,n})⌉. For all actions a ∈ A, we define x̂_{a,n} (resp. ŷ_{a,n}) by adding to x̃_{a,n} (resp. subtracting from ỹ_{a,n}) a confidence radius that consists of an i.i.d.-noise term of order √(log(2/ρ_n)/ℓ_{n_a}) and a convergence-noise term of order (2L/ℓ_{n_a}) Σ_{j=ℓ_{n_a}/2}^{ℓ_{n_a}} δ_j, where ρ_n = 1/T_n³. We will prove (in Lemma 1) that x̂_{a,n} serves as an upper confidence bound for r*_a and ŷ_{a,n} serves as a lower confidence bound for c*_a; specifically, x̂_{a,n} ≥ r*_a and ŷ_{a,n} ≤ c*_a both hold with probability at least 1 − ρ_n. The optimism of the action selection rule allows us to explore for high rewards, while the concentration of ŷ_{a,n} as k_{a,n} grows helps identify unfeasible actions, which are then excluded from further consideration. The complete details of this confidence bound strategy are outlined in Algorithm 1, named Optimistic-Pessimistic Convergence and Confidence Bounds (OP-C2B).
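To illustrate the overall structure of OP-C2B described above, here is a minimal Python sketch on a toy two-action instance; the environment, epoch-growth base, confidence radii, and the assumption that the contraction factor is known are all illustrative choices and do not reproduce the paper's exact constants.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy environment (illustrative, not from the paper): two actions; under a fixed
# action the scalar state contracts toward a stable state z*_a, and the observed
# reward/constraint are noisy Lipschitz functions of the state.
STABLE = {0: 0.3, 1: 0.9}        # stable states z*_a (unknown to the learner)
GAMMA = 0.7                      # contraction factor (assumed known here)
def env_step(a, z):              # z_t = f(a; z_{t-1})
    return GAMMA * z + (1 - GAMMA) * STABLE[a]
def observe(z):                  # noisy reward and single constraint value
    r = z + 0.05 * rng.standard_normal()           # stable reward equals z*_a
    c = (0.5 - z) + 0.05 * rng.standard_normal()   # feasible iff z*_a >= 0.5
    return r, c

# OP-C2B-style epoch loop (constants and confidence radii are illustrative).
K, n_epochs = 2, 30
k = np.zeros(K, dtype=int)       # number of epochs each action has been played
x_hat = np.ones(K)               # optimistic upper bounds on stable rewards
y_hat = -np.ones(K)              # pessimistic lower bounds on stable constraints
z = 0.0
for n in range(n_epochs):
    feasible = np.flatnonzero(y_hat <= 0)                       # permissible set Pi_n
    a = int(feasible[np.argmax(x_hat[feasible])]) if len(feasible) else int(np.argmax(x_hat))
    k[a] += 1
    ell = 2 * int(np.ceil(1.3 ** k[a]))                         # epoch length grows with k[a]
    rs, cs = [], []
    for t in range(ell):
        z = env_step(a, z)
        r_obs, c_obs = observe(z)
        if t >= ell // 2:        # keep only the second half: balances i.i.d. vs convergence noise
            rs.append(r_obs)
            cs.append(c_obs)
    half = ell // 2
    noise_rad = np.sqrt(2.0 * np.log(n_epochs) / half)          # i.i.d.-noise radius
    conv_rad = GAMMA ** half                                    # convergence-noise radius
    x_hat[a] = np.mean(rs) + noise_rad + conv_rad               # UCB on stable reward
    y_hat[a] = np.mean(cs) - noise_rad - conv_rad               # LCB on stable constraint
print("optimistic stable rewards:", np.round(x_hat, 3))
print("pessimistic stable constraints:", np.round(y_hat, 3))
```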

Theoretical Results
In this section, we introduce the performance bounds and the corresponding analysis for OP-C2B. The performance bounds for OP-C2B are characterized by the inefficiency, or suboptimality gap, of suboptimal actions. For a suboptimal action a ≠ a*, its suboptimality gap comprises two components. One is the reward gap Δ_a = r*_{a*} − r*_a, which characterizes the difference in stable reward compared to a*. The other is the violation gap Γ_a, which characterizes the magnitude of the constraint violation at the stable state of a (Γ_a = 0 means that a is feasible). Recall that a* = arg max_{a ∈ A} r*_a s.t. c*_a ≤ 0, and that we assume a* is unique; thus, every suboptimal action a ≠ a* has Δ_a > 0 or Γ_a > 0. The following theorem bounds the regret and constraint violation of OP-C2B. Theorem 1. The performance guarantees of OP-C2B are as follows. (1) When the sequence {δ_k} is polynomially decaying, i.e., δ_k = O(k^{−β₂}) with β₂ > 0, the regret and constraint violation bounds consist of the gap-dependent terms and convergence-error terms discussed below.
In the regret bound derived in part (1), the first term arises from the suboptimality of suboptimal but feasible actions; it involves the summation of the occurrences where a suboptimal but feasible action a is taken, with each occurrence multiplied by its corresponding reward gap Δ_a. The second term arises from the suboptimality of unfeasible actions, involving the summation of occurrences where an unfeasible action a is taken, with each occurrence multiplied by its corresponding reward gap Δ_a (actions with Δ_a ≤ 0 are excluded as they do not contribute to regret). The third term arises from the convergence errors of choosing unfeasible actions. The reason this term is inversely proportional to their violation gaps is that, if we aim to identify an unfeasible action a with high probability (i.e., 1 − 1/T), we must select it for at least O(log T/Γ_a²) timesteps due to the i.i.d. noise. The remaining term O(T^{(1−β₂)₊}) arises from the convergence errors of choosing the optimal action a*. Notably, our regret depends linearly on the action space size. This is because, in order to identify suboptimal actions with sufficient confidence, each suboptimal action needs to be executed for a certain number of epochs, which depends on its reward gap or violation gap. This necessitates the linear dependence of the regret on the action space size. The constraint violation bound derived in part (1) follows the same intuition. When the system has at least a linear convergence rate (i.e., β₂ ≥ 1), the quantity O(T^{(1−β₂)₊}) in the convergence-error term (which is exactly the summation Σ_{k=1}^{T} δ_k) can be reduced to Õ(1). This result aligns with the performance bounds derived in part (2), as an exponential convergence rate is superlinear.
An important observation from Theorem 1 is that OP-C2B achieves logarithmic regret and constraint violation as long as the system convergence rate is linear or superlinear. This aligns with the best performance bounds any algorithm can achieve when the system immediately converges to the stable states of the principal's chosen actions (in this case our model reduces to stochastic bandits, where achieving logarithmic bounds is optimal) or when the principal has a predictive mode (i.e., the principal knows the corresponding stable state z*_a for each action a).
Remark 2. In practice, we can integrate a time-varying slackness parameter η_n into the construction of the permissible feasible action set, i.e., Π_n = {a : ŷ_{a,n−1} + η_n ≤ 0}, to control the "pessimism level" of the estimated constraint values. This integration allows for a trade-off between regret and constraint violation, for example, achieving negative regret at the expense of increased constraint violation.

Analysis
Here, we introduce our proof strategy by breaking the proof of Theorem 1 into lemmas. Compared to the analysis of optimistic-pessimistic algorithms in the constrained bandits literature, significant modifications are needed due to the convergence noise of the observations and the epoch-based structure. In particular, OP-C2B has to balance the trade-off between the i.i.d. noise and the convergence noise, and this trade-off is unique to our model.
The first lemma provides probabilistic upper bounds on |x̃_{a,n} − r*_a| and ‖ỹ_{a,n} − c*_a‖, which motivate the definitions of x̂_{a,n} and ŷ_{a,n}. It characterizes how far our estimates of the stable reward and stable constraint values deviate. Both |x̃_{a,n} − r*_a| and ‖ỹ_{a,n} − c*_a‖ consist of two terms: one due to the i.i.d. noise and the other due to the "convergence noise". Lemma 1. We have the following statements for OP-C2B: (1) The inequalities x̂_{a,n} ≥ r*_a and ŷ_{a,n} ≤ c*_a both hold with probability at least 1 − ρ_n.

Here ℓ^{(1)}_{a,n} is the threshold ensuring that the i.i.d.-noise term in the confidence radius is sufficiently small, while ℓ^{(2)}_a is the threshold ensuring that the second term, Σ_{j=ℓ/2}^{ℓ} δ_j, stemming from the convergence noise, is sufficiently small. A large k_{a,n} implies that a was played consecutively for a larger number of turns, which brings the system closer to the stable state and leads to a more accurate estimation.
The second lemma shows that, with high probability, OP-C2B can identify suboptimal actions once they have been chosen for sufficiently many turns. Moreover, it provides an upper bound on the probability of selecting a suboptimal action after it has been adequately chosen. Lemma 2. At the end of epoch n, when the suboptimal action a ≠ a* has been played for enough epochs such that ℓ_{n_a} ≥ max{ℓ^{(1)}_{a,n}, ℓ^{(2)}_a} (defined in part (3) of Lemma 1): (1) a is no longer selected by the optimistic rule with high probability if a is feasible; (2) a ∉ Π_n holds with probability at least 1 − ρ_n if a is unfeasible.
(3) Given parts (1) and (2), the probability that OP-C2B selects the suboptimal action a ≠ a* in the (n+1)-th epoch is bounded accordingly. Now we are going to bound the regret of OP-C2B. It is worth noting that we can decompose the instantaneous regret at time t as follows:
r*_{a*} − r(a_t, z_t) = (r*_{a*} − r*_{a_t}) + (r*_{a_t} − r(a_t, z_t)).     (4)
The first term in (4) represents the difference in rewards at the stable state between the optimal action and a suboptimal one. Observe that the regret associated with the first term depends on the number of times a suboptimal action is selected, multiplied by its suboptimality gap. The following lemma bounds the expected number of times (and not epochs) a suboptimal action is chosen.
Lemma 3. For any suboptimal action a ≠ a*, the expected number of timesteps E[Σ_{j=1}^{n} ℓ_j I{a_j = a}] for which our algorithm chooses it by the end of epoch n can be bounded in terms of its reward gap Δ_a and violation gap Γ_a, where I{·} is the indicator function.
The second term in (4) arises from the convergence error (i.e., the "convergence noise"). Intuitively, by the Lipschitz property of the reward function, the regret accumulated due to this term within an epoch can be shown to be at most of order L Σ_{k=1}^{T} δ_k. Therefore, the regret contributed by this term can be bounded as O(L Σ_{k=1}^{T} δ_k) multiplied by the number of action switches, which is in turn bounded by twice the number of epochs in which suboptimal actions were played. We bound this number in the following lemma. Lemma 4. For any suboptimal action a ≠ a*, the expected number of epochs in which OP-C2B chooses it can be bounded in terms of Δ_a and Γ_a. Combining the above lemmas gives the bound on the regret at the end of each epoch, as shown in the following lemma. Lemma 5. Recall that T_n is the time at the end of epoch n. For any T_n, OP-C2B ensures that the regret up to T_n is bounded by the sum of the gap-dependent and convergence-error terms above. Similarly, the constraint violation at time T can be decomposed into the following two terms:
c(a_t, z_t) = c*_{a_t} + (c(a_t, z_t) − c*_{a_t}).     (5)
The second term in (5) also arises from the convergence error and can be analyzed in the same way as the second term in (4). For the first term in (5), since c*_a ∈ [−1, 1]^m for all a ∈ A, the constraint violation corresponding to this term can still be expressed through the number of times a suboptimal action is taken. Combining Lemmas 3 and 4 yields Lemma 6, which bounds the constraint violation at the end of each epoch.
The next section deals with the continuous action set.

INFINITE-ACTION CASE
In this section, we generalize our algorithm and results from the case of a finite action set to the case of a continuous action set. The main challenge in this case is that the unknown functions r*_a = r(a; z*_a) and c*_a = c(a; z*_a) are often non-linear and may even be non-convex in a in various applications. Consequently, applying techniques from linear bandits or bandit convex optimization to this scenario is not directly feasible. An important way in the bandit optimization literature to address general (e.g., non-linear and even non-convex) unknown objective functions is to consider a smoothness condition specified by a bounded norm in a Reproducing Kernel Hilbert Space (RKHS) associated with a kernel function. This enables the use of the Gaussian process (GP) method to uniformly approximate an unknown continuous function given a set of (noisy) evaluations of its values at chosen actions [42]. Therefore, to address the challenge in the case of an infinite action set, where the unknown functions r*_a = r(a; z*_a) and c*_a = c(a; z*_a) cannot be well parameterized by a convex function, we additionally assume that A is compact and that r*_a and c*_a are fixed functions in an RKHS with a bounded norm (i.e., a measure of smoothness). Although our solution borrows some techniques from the GP method, due to the existence of convergence noise, the algorithm design and analysis exhibit substantial differences.

Preliminaries: Gaussian Process (GP) Method
As a basis, we first review the background of the Gaussian Process (GP) method in this subsection. Classical Bayesian Optimization (BO) aims to maximize a fixed unknown function f : A → R, i.e., to identify the optimal point a* such that a* = arg max_{a ∈ A} f(a). To obtain enough information about the black-box function f, the decision maker can sequentially query points and observe their function values (possibly with subgaussian noise). Formally, at each timestep t, we can choose a point a_t ∈ A and observe a noisy function value v_t = f(a_t) + ε_t, where ε_t is i.i.d. subgaussian noise. In fact, there are many approaches to this BO problem, and a well-known one is the GP method, which assumes that f lies within some RKHS with a bounded norm. This assumption permits the construction of confidence bounds via GP regression. Usually, the GP method assumes that A ⊂ R^d is a compact set endowed with a kernel function k(·, ·) defined on A × A, and that f has a bounded norm in the corresponding RKHS, denoted H_k(A). The idea behind the GP method is to first put a GP prior on f and obtain its posterior distribution after each query of f. It then chooses the next query point and updates the posterior distribution of f. When the posterior distribution is considered informative enough about the optimal point, the final result can be read off from it.
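For concreteness, the following is a minimal Python sketch of GP regression and the GP-UCB selection rule on a toy one-dimensional problem, assuming a squared-exponential kernel and an illustrative confidence multiplier; it is only meant to illustrate the background method, not our algorithm.

```python
import numpy as np

def rbf(a, b, ls=0.2):
    """Squared-exponential kernel k(a, b) evaluated on scalar input arrays."""
    a, b = np.atleast_1d(a), np.atleast_1d(b)
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ls ** 2))

def gp_posterior(X, y, Xq, noise=0.1):
    """GP-regression posterior mean/std at query points Xq given noisy samples (X, y)."""
    Kxx = rbf(X, X) + noise ** 2 * np.eye(len(X))
    Kqx = rbf(Xq, X)
    mu = Kqx @ np.linalg.solve(Kxx, y)
    var = 1.0 - np.sum(Kqx * np.linalg.solve(Kxx, Kqx.T).T, axis=1)
    return mu, np.sqrt(np.clip(var, 0.0, None))

rng = np.random.default_rng(1)
f = lambda a: np.sin(3 * a)                  # unknown black-box function (illustrative)
A = np.linspace(0, 1, 200)                   # compact action set
X = list(rng.uniform(0, 1, 2))               # two initial queries
y = [f(x) + 0.1 * rng.standard_normal() for x in X]
for t in range(20):                          # GP-UCB loop
    mu, sd = gp_posterior(np.array(X), np.array(y), A)
    a_next = A[np.argmax(mu + 2.0 * sd)]     # confidence multiplier 2.0 chosen for illustration
    X.append(a_next)
    y.append(f(a_next) + 0.1 * rng.standard_normal())
print(f"estimated maximizer: {A[np.argmax(mu)]:.3f}, true maximizer: {A[np.argmax(f(A))]:.3f}")
```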
The above proposition implies that, for all a ∈ A, μ_t(a) + β_t σ_t(a) and μ_t(a) − β_t σ_t(a) are upper and lower confidence bounds on f(a) given t samples, respectively. Building on these confidence bounds, several algorithms have been developed and analyzed in the existing literature. A particularly well-known example is GP-UCB, which selects a_t = arg max_{a ∈ A} μ_{t−1}(a) + β_{t−1} σ_{t−1}(a) and achieves sublinear regret, Σ_{t=1}^{T} (f(a*) − f(a_t)) ≤ Õ(√T) up to kernel-dependent factors. Unfortunately, the results and analysis of GP-UCB cannot be directly applied to our model due to the impact of the convergence noise. Specifically, the observations corresponding to r(a, z*_a) and c(a, z*_a) are shifted by the convergence noise in our model. In the subsequent subsection, we propose an algorithm, OP-GP-C2B, that combines OP-C2B and GP-UCB to overcome this issue. The OP-GP-C2B algorithm requires the assumption, described above, that the stable reward and stable constraint functions have bounded RKHS norms.
[Algorithm 2 (OP-GP-C2B): in each epoch, update the GP posterior mean and covariance with the new evaluations added, construct the permissible feasible action set Π_n, and select the next action optimistically from Π_n.]
Our algorithm OP-GP-C2B for the case of a continuous action set still works in epochs, indexed by n = 1, 2, ..., each of which consists of a batch of timesteps to eliminate the impact of the convergence noise. In OP-GP-C2B, the n-th epoch length ℓ_n is chosen to be 2 max{n^λ, 1}, where λ is a tuning parameter that will be specified later. The purpose of this epoch-length setting is to ensure that the epoch length increases with the epoch index. This is because actions chosen at later epochs tend to be closer to the optimal action as the estimation becomes more accurate. With the same intuition as OP-C2B, at each epoch n, OP-GP-C2B also computes the empirical estimates of (r*_{a_n}, c*_{a_n}) from the observations in the second half of the epoch and feeds them to the GP regression. Note that, compared to the classic confidence bounds in Proposition 1, the resulting confidence bounds incur an additional term, which is exactly the accumulated shift in the observations caused by the convergence noise. Given the confidence bounds developed in Lemma 8, we have the following corollary. Corollary 1. With probability at least 1 − δ, the following holds for all a ∈ A and n ≥ 1: [l_{0,n}(a), u_{0,n}(a)] and [l_{i,n}(a), u_{i,n}(a)] are high-probability confidence intervals for r*_a and (c*_a)_i, respectively, where the index 0 refers to the reward and i ∈ [m] to the i-th constraint. Corollary 1 plays an important role in the action selection rule of OP-GP-C2B, whose details are given below.
Specifically, with the same intuition as OP-C2B, at epoch n OP-GP-C2B chooses the action that maximizes the upper confidence bound on the stable reward among all plausibly feasible actions given the information available up to epoch n, i.e.,
a_{n+1} = arg max_{a ∈ A : l_{i,n}(a) ≤ 0, ∀ i ∈ [m]} u_{0,n}(a).     (11)
Otherwise, we randomly choose an a_{n+1} from A if the auxiliary problem (11) is unfeasible. Fortunately, the feasibility of (11) is guaranteed with high probability 1 − 1/T (we choose δ = 1/T), as the optimal action a* is a feasible solution of (11) on the high-probability event of Corollary 1. The complexity of solving the auxiliary problem (11), which may itself be non-convex, is much smaller than that of solving the original problem (1) when the dimension of the input space is not too large. We illustrate the OP-GP-C2B algorithm in Algorithm 2.
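As a sketch of how the auxiliary problem (11) can be approached numerically over a discretized candidate grid, the following hypothetical helper takes GP posterior means and standard deviations for the reward and a single constraint and applies the optimistic-over-pessimistic selection with a random fallback; the signature and the multiplier beta are assumptions made for exposition.

```python
import numpy as np

def select_action(A_grid, mu_r, sd_r, mu_c, sd_c, beta, rng):
    """OP-GP-C2B-style selection over a candidate grid (sketch).

    mu_r/sd_r and mu_c/sd_c are GP posterior means/standard deviations for the
    stable reward and a single stable constraint evaluated on A_grid. Among the
    actions whose constraint lower confidence bound is <= 0, pick the one with
    the largest reward upper confidence bound; fall back to a uniformly random
    action if the auxiliary problem is unfeasible."""
    ucb_reward = mu_r + beta * sd_r          # optimistic stable-reward estimate
    lcb_constraint = mu_c - beta * sd_c      # pessimistic stable-constraint estimate
    feasible = np.flatnonzero(lcb_constraint <= 0)
    if feasible.size == 0:
        return A_grid[rng.integers(len(A_grid))]
    return A_grid[feasible[np.argmax(ucb_reward[feasible])]]
```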
Once the validity of the confidence bounds is established through Lemma 8, we can employ analysis techniques from GP-UCB, with modifications to account for the epoch structure and the convergence noise, to derive the performance bounds for OP-GP-C2B. This is formally stated in the following theorem. Theorem 2. The regret and constraint violation of our algorithm OP-GP-C2B are bounded sublinearly in T, where ñ(T) denotes the last epoch completed before time T and ñ(T) ≤ (λ+1)^{1/(λ+1)} T^{1/(λ+1)}; explicit rates are obtained both (1) when the sequence {δ_k} is polynomially decaying, i.e., δ_k = O(k^{−β₂}) with β₂ > 0, and (2) when the sequence {δ_k} is exponentially decaying, i.e., δ_k = O(exp(−β₁ k)) with β₁ > 0. Theorem 2 implies that OP-GP-C2B is able to maintain sublinear regret and constraint violation for a polynomial convergence rate. This is achieved by setting λ ≥ 2/β₂; in particular, the bounds improve further when β₂ ≥ 4. Notably, if the system immediately converges to the stable states of the principal's chosen actions (in such instances, our model simplifies to the standard constrained BO problem [60]), i.e., β₁ = ∞, the performance bounds for OP-GP-C2B can be enhanced to Õ(√T) by setting λ = 0. This improvement arises because the term exp(−β₁)/(1 − exp(−β₁)) in the performance bounds approaches zero as β₁ → ∞. We remark that this result aligns with the best performance bounds achievable by any algorithm for BO. However, it remains unclear whether the performance bounds stated in Theorem 2 are optimal in T for general β₁ > 0 or β₂ > 0, posing an interesting open question.
Remark 3. One general way of addressing the infinite-action problem is to finely discretize (partition) the action space and treat the problem as a finite-action one [7,19,28]. For instance, dividing the continuous action space into intervals of fixed length O(T^{−α}) (containing O(T^{α}) points) allows for the direct application of OP-C2B; for polynomial system convergence rates, this yields performance bounds whose first part grows with the O(T^{α}) number of points (multiplied by the convergence-error term O(T^{(1−β₂)₊})) and whose second part, O(T^{1−α}), is due to the discretization error. However, this discretization-based approach cannot always ensure sublinear performance bounds in our model, due to the additional time-horizon-dependent term O(T^{(1−β₂)₊}), which now scales with the number of discretized actions. A sketch of this baseline appears below.
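A small sketch of the discretization baseline described in this remark, with an illustrative one-dimensional action set and discretization exponent; OPC2B below is a hypothetical handle to the finite-action algorithm of the previous section.

```python
import numpy as np

# Epsilon-net discretization of a 1-D continuous action set [0, 1]: split it into
# O(T^alpha) points and run the finite-action algorithm over the resulting grid.
T, alpha = 10_000, 0.25                  # horizon and discretization exponent (illustrative)
K = int(np.ceil(T ** alpha))             # number of grid points
action_grid = np.linspace(0.0, 1.0, K)
# finite_alg = OPC2B(actions=action_grid, ...)   # hypothetical handle to the finite-action algorithm
print(f"discretized into K={K} actions with grid spacing {1.0 / (K - 1):.4f}")
```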

APPLICATIONS
In this section, we elaborate on two practical applications that our framework is capable of modeling. Within these applications, the policymaker strives to maximize his utility while adhering to specified constraints once the system achieves stability. However, the impact of any policy cannot be instantly observed, as the system consists of players that interact in a game-theoretic manner and gradually converge to an equilibrium.

Load Balancing for Resource Allocation
We consider a system consisting of N players that each uses a mixture of M resources, and a manager that controls the resource prices, thereby affecting the utility of each player [6]. When every player i takes an action z_i = (z_i^1, ..., z_i^M) ∈ Z_i, where z_i^j denotes the amount of the j-th resource player i uses, the resulting utility for player i is u_i(z; a) = v_i(z) − Σ_{j=1}^{M} a_j z_i^j. Here a = (a_1, ..., a_M) ∈ A are pricing coefficients determined by the manager, and v_i(z) is the reward function of player i that encodes the utility player i obtains from resource usage z_i, minus the cost incurred from resource utilization given the loads determined by the actions of the other players, z_{−i}. We assume the game formed by these players is strongly monotone for any given a.
The system is dynamic and evolves in discrete time. At each time t, players respond to the observed pricing coefficients a_t by selecting their actions {z_{i,t}}, while the manager receives utility r(a_t; z_t) and observes the total loads Σ_{i=1}^{N} z_{i,t}. Subsequently, the manager adjusts the pricing coefficients to a_{t+1} based on past observations and his policies. One example of the manager's utility function is r(a; z) = Σ_{i=1}^{N} Σ_{j=1}^{M} a_j z_i^j, i.e., the fees paid by all players for resource usage. The manager can tailor the utility function r(a; z) to optimize application-specific objectives, such as enhancing system efficiency, complying with regulations, or minimizing operational costs.
Since the game is strongly monotone, if the manager keeps the pricing coefficients a fixed, players gradually improving their utilities via gradient updates converge to the unique NE z*_a at an exponential rate. More specifically, at each time t, each player i updates their action as
z_{i,t+1} = z_{i,t} + η h_i(z_t; a_t),     (12)
where h_i(z; a) = ∂u_i(z; a)/∂z_i and η is the stepsize. The game being strongly monotone means that, for any a ∈ A, there exists a constant μ_a > 0 such that the concatenation of all gradients, H(z; a) = (h_1(z; a), h_2(z; a), ..., h_N(z; a)), satisfies ⟨H(z; a) − H(z′; a), z − z′⟩ ≤ −μ_a ‖z − z′‖² for all z, z′ ∈ Z. Given the condition that H(·; a) is L_a-Lipschitz continuous, the following proposition [11] formally shows the exponential convergence rate to the NE z*_a. Proposition 2. For a sufficiently small step-size η ≤ 2μ_a/L_a², the iterates given by (12) with a_t = a converge to z*_a at a geometric rate. However, in resource allocation scenarios, such an NE is typically infeasible for the manager, as the loads are unbalanced. Players may persist in using an overloaded resource since doing so is optimal given that the other players would not alter their choices. Overloaded resources can result in slow service, system failures, and high operational costs, while other resources remain underutilized. Therefore, another objective for the manager is to ensure that the resource allocation game converges to an NE satisfying the target load constraint Σ_{i=1}^{N} z_{i,t} ≤ q*. Consequently, the manager's goal is to steer the system dynamics {z_{i,t}}, by controlling the non-negative pricing coefficients a, towards a load-balanced NE z*_{a*} such that his total utility is maximized while the resource usage satisfies the given total load constraint q*. Formally, the manager wants to steer the system dynamics {z_{i,t}} into the load-balanced NE z*_{a*}, where a* = arg max_{a ≥ 0} r(a; z*_a) subject to Σ_{i=1}^{N} (z*_a)_i ≤ q*. In the electricity grid, the manager can be the utility company running demand-side management (DSM) to reduce energy production costs during peak hours [12,17,43]. In wireless networks, the manager can be an access point coordinating a protocol to minimize device power consumption and network interference [67]. Additional examples include load balancing in data centers and the optimization of parking resources.
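The gradient-play dynamics (12) under fixed pricing coefficients can be illustrated with the following Python sketch, which uses toy quadratic utilities whose coupling constant keeps the pseudo-gradient strongly monotone; the player count, valuations, prices, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 3, 2                              # players and resources (illustrative sizes)
b = rng.uniform(1.0, 2.0, size=(N, M))   # players' marginal valuations (illustrative)
c = 0.2                                  # congestion coupling; c*(N-1) < 1 keeps the game strongly monotone
prices = np.array([0.3, 0.8])            # pricing coefficients held fixed by the manager
eta = 0.2                                # gradient step size

def grad(i, z):
    """Gradient h_i of u_i(z; prices) = b_i.z_i - 0.5*||z_i||^2 - c*z_i.(sum of others) - prices.z_i."""
    others = z.sum(axis=0) - z[i]
    return b[i] - z[i] - c * others - prices

z = np.zeros((N, M))                     # initial resource usages
for t in range(1, 201):
    # projected gradient play, as in (12), with usages kept non-negative
    z = np.maximum(z + eta * np.array([grad(i, z) for i in range(N)]), 0.0)
    if t in (1, 10, 50, 200):
        print(f"t={t:3d}  total loads per resource = {np.round(z.sum(axis=0), 3)}")
```

With the prices held fixed, the total loads settle to those of the unique NE at a geometric rate, which is exactly the converging-dynamics behavior our model exploits.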

AoI-aware Incentive Mechanism for Mobile Crowdsensing
We consider a typical mobile crowdsensing (MCS) system [58,61] comprising a cloud-based platform and a group of workers. The platform has a long-term sensing task, such as gathering the latest traffic data from various Points of Interest (PoIs). The workers, indexed by i ∈ {1, 2, ..., N}, are social network users willing to share data among themselves to reap additional social benefits, e.g., saving data collection time and cost by piggybacking, enhancing social reputation, and more. In the beginning, the platform announces the task and the payment strategy. After that, each worker performs the sensing task and continuously collects data from some specified PoIs. Meanwhile, workers upload the latest data (packed into fixed-size packets) to the platform with pre-determined update frequencies, and the platform pays a reward to each worker according to the current payment strategy. To avoid conflicts in this MCS system, the platform maintains a queue for the data from different workers, employing a First-Come-First-Served (FCFS) strategy. The decision variables in this MCS system include the data update frequency of each worker and the unit-reward profile of the platform. To prevent queue congestion, the platform needs to ensure that the total data update frequency does not exceed a specified threshold [36]. Moreover, the platform wishes the collected data to be as fresh as possible. Consequently, it tries to maintain the average Age of Information (AoI) of the data uploaded by each worker below a given threshold. For clarification, we proceed to define and elucidate several crucial concepts and notations.
Data update frequency. The data update frequency of worker i refers to the frequency at which worker i collects and uploads data to the platform, denoted by p_i. For convenience, we use p = (p_1, ..., p_N) and p_{−i} to denote the data update frequencies of all workers and of all workers except worker i, respectively. As previously mentioned, to avoid queue congestion, the total data update frequency of all workers must not exceed a constant p̄: Σ_{i=1}^{N} p_i ≤ p̄, i.e., ‖p‖₁ ≤ p̄.
Unit-reward profile. The reward that the platform pays to each worker i is proportional to its data update frequency. The platform determines the unit rewards for all workers, a = (a_1, ..., a_N), with a_i denoting the reward per unit of data update frequency paid to worker i.
Worker's utility. The utility of worker i is the net profit of this worker, given by u_i(p_i; p_{−i}) = a_i p_i + Φ_i(p_i; p_{−i}) − C_i(p_i). Here, the first term a_i p_i represents the reward paid by the platform to worker i. The second term Φ_i(p_i; p_{−i}) = Σ_{j ∈ N_i} g_{i,j} p_j denotes the social benefit of worker i, where N_i is the set of all socially-connected neighbors of worker i and g_{i,j} indicates the social network influence of worker j on worker i. This term arises from the fact that the workers are social network users, so they can share data with each other via the social network and gain social benefits from the shared data [5,8,18,23]. The third term C_i(p_i) represents the cost function of worker i, which is monotonically increasing and strictly convex w.r.t. its data update frequency p_i. Notably, with these interdependent utility functions, the workers form an N-player Bayesian continuous game and converge to an NE as time evolves if the principal keeps the same unit-reward profile a, i.e., the joint data update frequency profile (p_1, ..., p_N) converges to p*_a. The convergence rate depends on the information structure of this game and on the information feedback. For example, when the cost function C_i(·) is quadratic, a widely-used assumption in the literature [14,44,63], the players can converge to p*_a at an exponential rate even in the online learning setting, using common no-regret strategies such as online gradient descent.
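For illustration, a minimal Python sketch of the workers' learning dynamics under a fixed unit-reward profile, assuming quadratic costs and illustrative social-influence weights; the closed-form equilibrium used for comparison is specific to this toy model.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 5                                    # number of workers (illustrative)
a = rng.uniform(0.5, 1.5, N)             # unit-reward profile announced by the platform (held fixed)
cost = rng.uniform(1.0, 2.0, N)          # quadratic cost coefficients: C_i(p) = 0.5*cost_i*p^2
G = 0.1 * rng.uniform(size=(N, N))       # social-influence weights g_{i,j} (illustrative)
np.fill_diagonal(G, 0.0)

def utility(i, p):
    """u_i = a_i*p_i + sum_j g_{i,j}*p_j - 0.5*cost_i*p_i^2 (worker i's net profit)."""
    return a[i] * p[i] + G[i] @ p - 0.5 * cost[i] * p[i] ** 2

p = np.zeros(N)                          # data update frequencies
eta = 0.3
for t in range(50):                      # each worker runs online gradient ascent on its own utility
    p = np.maximum(p + eta * (a - cost * p), 0.0)
print("learned frequencies :", np.round(p, 3))
print("equilibrium p*_a    :", np.round(a / cost, 3), " total load:", round(float(p.sum()), 3))
print("worker 0 utility    :", round(float(utility(0, p)), 3))
```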
Platform's utility. The platform's utility is the income gained from all collected data minus the total rewards paid to the workers. Here, the income term relies on the data update frequencies of all workers, and the total reward paid to all workers is Σ_{i=1}^{N} a_i p_i.
AoI constraints. The AoI of a piece of data uploaded by a worker to the platform is the difference between the current time and the creation time of the data. The average AoI of worker i is the mean AoI value of the data he uploads to the platform, denoted by AoI_i. Intuitively, it depends on p_i and p_{−i} = (p_1, ..., p_{i−1}, p_{i+1}, ..., p_N) [58]; thus we adopt the notation AoI_i(p_i, p_{−i}) or AoI_i(p). The platform needs to make sure that AoI_i(p_i, p_{−i}) does not exceed a given threshold.
Consider that the above MCS system is dynamic and evolves in discrete time. Let a_t and p_t be the unit-reward profile and the joint data update frequency profile at time t, respectively. The objective of the platform is to commit to the optimal strategy (a*, p*_{a*}), where a* maximizes the platform's utility at the induced equilibrium subject to the total-frequency constraint ‖p*_a‖₁ ≤ p̄ and the AoI constraints on each worker.

CONCLUSION
In this paper, we presented a new model designed to deal with systems that converge to a stable state over time. The principal can take actions to control this system, and these actions dictate the resulting stable state. While the principal only observes the real-time impact of their actions, their aim is to find the action that gives the best performance at the stable state. We analyzed two scenarios, involving finite and infinite action sets, by proposing corresponding algorithms and proving performance bounds for them. The key innovation of our algorithms is the use of a combination of confidence bounds and "convergence bounds" that bound how far the system is from the stable state at any given point. We also showed two applications, mobile crowdsensing and resource allocation games, that fall within the framework of our model.

A APPENDIX
A.1 Proof of Lemma 1
We begin with the proof of part (1). Recall that the empirical estimates (x̃_{a,n}, ỹ_{a,n}) for action a after epoch n are the averages over the second half of the interval [t′_a, t″_a], where [t′_a, t″_a] is the last epoch in which action a was played. The deviation of x̃_{a,n} from r*_a decomposes into an i.i.d.-noise term plus a convergence-noise term, where the bound on the latter (inequality (I)) comes from the Lipschitz property of the reward function. Similarly, for any i ∈ [m], the same decomposition holds for the constraint estimates. According to the Chernoff bound for subgaussian random variables, the i.i.d.-noise terms are bounded with probability at least 1 − ρ_n, which completes the proof of part (1).

A.3 Proof of Lemma 3
Recall that E[Σ_{j=1}^{n} ℓ_j I{a_j = a}] denotes the expected number of timesteps a suboptimal action a is played. We split it according to whether the thresholds of Lemma 2 have been reached.
Here the second term counts the timesteps during which action a is played in its first few epochs. We next bound each of the remaining three terms individually.
For the first term, since ℓ^{(1)}_{a,n} is non-decreasing w.r.t. n, we obtain the corresponding bound; similarly for the second term.

A.4 Proof of Lemma 4
Similarly to the analysis in Lemma 3, we decompose E[Σ_{j=1}^{n} I{a_j = a}] according to whether the thresholds of Lemma 2 have been reached.
Here the second term counts the epochs in which action a is played before those thresholds are reached. We next bound each of the remaining three terms individually. Since ℓ^{(1)}_{a,n} and ℓ^{(2)}_a are both non-decreasing w.r.t. n, the first and second terms can be bounded directly.
For the third term, the bound follows from part (3) of Lemma 2. As noted before, the expected cumulative regret at the end of epoch n can be split as in (4). The first term can be rewritten as a sum, over suboptimal actions, of their reward gaps times the expected number of timesteps they are played. For the second term, let {t^{(s)}}_s denote the times at which epochs end and actions are switched; by the Lipschitz property of the reward function, the regret accumulated between consecutive switches due to the convergence error is at most of order L Σ_k δ_k, which implies the claimed bound on Σ_t (r*_{a_t} − r(a_t, z_t)).
We bound the second term (the convergence-error term) in the same way as in the regret analysis, and we rewrite the first term as a sum over suboptimal actions, where {t^{(s)}}_s denotes the times at which epochs end and actions are switched. By the Lipschitz property of the constraint functions, the violation accumulated between consecutive switches is bounded analogously. Putting everything together, we obtain the claimed bound for any T_n, where the second inequality uses Lemmas 3 and 4.

A.7 Proof of Lemma 7
Using Lemmas 5 and 6, we now have bounds on the regret and constraint violation up to the end of epoch n. We now wish to extend these to all times. Let T be an arbitrary time and let ñ(T) denote the last epoch completed before time T, i.e., T_{ñ(T)} < T and T_{ñ(T)+1} > T (we ignore the case T_{ñ(T)} = T as we already have a bound for that). Using the Lipschitz property of the reward and constraint functions, we can proceed as in the proofs of Lemmas 5 and 6. Here Ñ denotes the number of switches up to timestep T; then Ñ ≤ Ñ_{ñ(T)} + 1 (there can be at most one more switch), where Ñ_{ñ(T)} is the number of times actions are switched up to epoch ñ(T). Now note that by splitting the sum Σ_{t=1}^{T} (r*_{a*} − r*_{a_t}) or Σ_{t=1}^{T} (c*_{a_t})₊ into two partitions, the sum from t = 1 to t = T_{ñ(T)} and the sum from t = T_{ñ(T)}+1 to t = T, we obtain the stated bounds.
A.9 Proof of Theorem 2
Before proving Theorem 2, we first state the following helpful proposition that provides the theoretical guarantees achieved by GP-UCB.
All the following statements are conditioned on the above joint event happening, which has probability at least 1 − δ.
By setting a = a* in (34) and using the feasibility of the optimal action a* (i.e., (c*_{a*})_i ≤ 0 for all i ∈ [m]), we have that l_{i,n}(a*) ≤ (c*_{a*})_i ≤ 0 for all i ∈ [m]. Thus, the optimization problem in Algorithm 2 is feasible, and a* is one feasible solution of it. According to the definition of the selected action, it holds that u_{0,n}(a_{n+1}) ≥ u_{0,n}(a*). We then bound the first term, where the last inequality is due to the fact that ℓ_n, β_n, and σ_{0,n} are all monotonically increasing with respect to n. For the second term, we proceed analogously.
This completes the proof of Theorem 2.
Received January 2024; revised April 2024; accepted April 2024