Online Bidding Algorithms for Return-on-Spend Constrained Advertisers

Online advertising has recently grown into a highly competitive and complex multi-billion-dollar industry, with advertisers bidding for ad slots at large scales and high frequencies. This has resulted in a growing need for efficient "auto-bidding" algorithms that determine the bids for incoming queries to maximize advertisers' targets subject to their specified constraints. This work explores efficient online algorithms for a single value-maximizing advertiser under an increasingly popular constraint: Return-on-Spend (RoS). We quantify efficiency in terms of regret relative to the optimal algorithm, which knows all queries a priori. We contribute a simple online algorithm that achieves near-optimal regret in expectation while always respecting the specified RoS constraint when the input sequence of queries is drawn i.i.d. from some distribution. We also integrate our results with the previous work of Balseiro, Lu, and Mirrokni [BLM20] to achieve near-optimal regret while respecting both RoS and fixed budget constraints. Our algorithm follows the primal-dual framework and uses online mirror descent (OMD) for the dual updates. However, we need a non-canonical setup of OMD, and therefore the classic low-regret guarantee of OMD, which holds for the adversarial setting in online learning, no longer applies. Nonetheless, in our case, and more generally wherever low-regret dynamics are applied in algorithm design, the gradients encountered by OMD can be far from adversarial and are instead influenced by our own algorithmic choices. We exploit this key insight to show that our OMD setup achieves low regret within the dynamics of our algorithm.


Introduction
With the explosive growth of online advertising into a billion-dollar industry,1 auto-bidding, the practice of using optimization algorithms to generate bids for ad slots on behalf of advertisers, has emerged as a predominant tool in online advertising [ABM19; BG19; BCHIL21; GJLM21; DMMZ21; BDMMZ21a; BDMMZ21b]. Unlike manual CPC ("cost-per-click") bidding, which requires advertisers to manually update bids for new search queries, auto-bidding requires advertisers to specify only their high-level objectives and constraints. The advertising platform then deploys its auto-bidding agent, which, based on its underlying optimization algorithms, transforms these inputs into fine-grained bids. Thus, designing an efficient bidding algorithm for advertisers to achieve their targets under constraints constitutes a central problem in the auto-bidding domain.
Throughout this paper, we focus on the return-on-spend (RoS) constrained bidding problem for a single learner (auto-bidder). In this problem, the RoS constraint requires that the ratio of the advertiser's total value to its total payment be at least some specified target ratio. In practice, the RoS constraint may capture other similar constraints, e.g., target cost-per-acquisition (tCPA) and target return-on-ad-spend (tROAS).2 Additionally, we investigate the well-studied total budget constraint, which specifies an upper bound on the auto-bidder's total expenditure; our proposed algorithm, though tailored to the RoS constraint, easily adapts to the budget constraint as well.
We study the online bidding algorithm for a single auto-bidder (learner) in the stochastic setting: In each round, an ad query (request) and auction are generated i.i.d. from an unknown distribution, after which the learner submits a bid to compete for the ad query in the auction. Given all bids for this query, the auction mechanism specifies which advertiser wins the opportunity to show its ad to the user and how much it needs to pay. Before placing its bid, the learner observes only the value of this ad query; after placing the bid, it receives its bid's auction outcome (i.e., the allocation and payment). From the perspective of auto-bidding, the platform needs to design an online bidding algorithm for the learner to maximize its target subject to the RoS and budget constraints.
For theoretical simplicity, we focus only on the value-maximizing auto-bidder, i.e., the learner aims to maximize the total realized value (or conversions) over T i.i.d. randomly drawn ad queries subject to RoS and budget constraints. Our results can be easily extended to handle other types of auto-bidders (e.g., utility maximization or a hybrid of value maximization and utility maximization [BDMMZ21a]).

Results and Techniques
Our main result is as follows.
Theorem 1.1 (Informal version; see Theorem 5.2). We provide an algorithm (Algorithm 5.2) for value maximization under RoS and budget constraints. For a T-length i.i.d. input sequence of ad queries, our algorithm provably attains O(√T log T) regret while respecting both the budget and RoS constraints.
To the best of our knowledge, ours is the first algorithm to attain near-optimal regret while satisfying both budget and RoS constraints on any outcome. In doing so, we improve upon the prior work of [BLM20], which obtains similar guarantees under only budget constraints. Our result holds for i.i.d. input sequences and under an additional mild technical assumption on the input distribution.1

1 See eMarketer. 2 See the Google Ads support page.
We build up to our result by first obtaining an O(√T)-regret algorithm (Algorithm 3.1) for value maximization under an approximate RoS constraint (i.e., allowing for an at most O(√T log T) violation of this constraint). A simple modification to this algorithm yields Algorithm 4.1, which obtains an O(√T log T) regret guarantee under a strict RoS constraint (Theorem 4.1). We then combine Algorithm 3.1 with ideas from [BLM20] to design Algorithm 5.1, which attains O(√T) regret under two constraints: a strict budget constraint and an approximate RoS constraint (with an O(√T log T) violation cap). Finally, unifying ideas from Algorithm 4.1 and Algorithm 5.1 yields our main result.
Underlying all the algorithms we propose in this paper is a primal-dual framework similar to that used by [BLM20]. Such a framework allows our algorithms to adapt to the changing values and prices of input queries while respecting the advertisers' specified constraints and goals over the entire time horizon. In particular, the dual variable (which tracks the constraint violation) is updated via online mirror descent (OMD) using the generalized cross-entropy function, which imposes a large (exponential) penalty on constraint violation.
An immediate technical challenge in attempting to use generalized cross-entropy as the mirror map is its lack of strong convexity on the non-negative orthant. While this is an issue in [BLM20] as well, in their case, the fixed budget constraint bounds the corresponding dual variable by a constant, which in turn implies the desired strong convexity on the space over which their algorithm operates. In our case (with the RoS constraint), there exist example inputs on which no such bound exists, which necessitates a novel analysis that circumvents the lack of strong convexity.
Our key insight is the following: while the low-regret guarantee of OMD with a strongly convex mirror map holds even when the gradients seen by OMD are adversarial, when we use OMD in our algorithm to update the dual variable, the gradients used in the update are controlled by how our algorithm sets the primal variables (i.e., bids). This connection is sufficient to give our algorithm the low-regret guarantee despite not using a strongly convex mirror map. We use this connection in a white-box adaptation of the OMD analysis tailored to our algorithm, along with properties of the generalized cross-entropy function from the (offline) positive linear programming literature [AO14].
As a final remark on our techniques: to turn an algorithm that only approximately satisfies a constraint into one that strictly satisfies it, we propose a simple strategy. We first submit a sequence of bids that lets us accumulate a slack on the RoS constraint, and then run the existing algorithm, which suffers some bounded constraint violation (cf. Section 4). The first phase builds up enough slack (at the cost of bidding sub-optimally) to compensate for the constraint violation from the later iterations. This allows us to trade off the violation of the RoS constraint against the objective value. We anticipate that this simple idea might be applicable in other contexts but note that such a trade-off is typically easier in offline optimization via retrospectively modifying the solution (e.g., by scaling or truncating).

Related Work
Our problem falls under the broader umbrella of online optimization under stochastic time-varying constraints, which has seen a long line of research from various research communities, e.g., [MTY09; MJY12; MYJ13; YNW17; YN20; BLM20; AD14; CCMRG22; BKS18; ISSS19]. From a technical standpoint, our primal-dual approach is similar to that in [BLM20]; however, the RoS constraint is not a packing-type constraint as studied in their work and in [BKS18; ISSS19]. There also exist papers that study a variant of our problem with a constraint class containing our RoS constraint as a special case (e.g., [AD14; CCMRG22]); however, these general techniques do not provide guarantees as strong as ours, as we elaborate next.
For example, a recent work [CCMRG22] gives a primal-dual framework using regret minimization, which, when adapted to our bidding problem under the RoS constraint, achieves Õ(T^{3/4}) regret with Õ(T^{3/4}) constraint violation (both with high probability). Both bounds are polynomially weaker than our guarantees (which further hold deterministically). Their bounds can be improved to Õ(T^{1/2}) under a "strictly feasible" assumption, which is essentially Assumption 4.1 in our context; in contrast, we guarantee strict constraint satisfaction under that assumption. Moreover, their algorithm uses techniques from the multi-armed bandit literature, thus requiring the spaces of values and bids to be of finite sizes n_v and n_b, respectively, and their regret bound scales with n_v·√n_b. Our algorithm works directly with continuous values and bids.
Another example is [AD14], which considers general online optimization with convex constraints. This work uses black-box low-regret methods that rely on a globally strongly-convex regularizer over the dual space, and a sub-linear regret bound is attainable only when the dual space is well-bounded (e.g., a scaled simplex) or the dual variable can be projected onto such a space without incurring too much additional regret. This canonical approach turns out to be difficult for the RoS constraint, which can incur poor problem-specific parameters in generic guarantees. As a result, this technique cannot give sub-linear regret for the RoS constraint. To circumvent this issue, we rely on problem-specific structure rather than globally strongly-convex regularization.
In the more specific domain of online bidding under constraints, a closely related work is [GJLM21], which investigates the same problem we do but with the RoS and budget constraints holding only in expectation over the distribution; in contrast, our constraint guarantees hold for any realization of samples. We note that [GJLM21] give an example where bidding based on the optimal offline (fixed) dual variable cannot achieve sub-linear regret. This does not contradict our results, since our algorithm is adaptive rather than trying to converge to the optimal offline dual variables and then bidding based on those fixed dual variables afterwards. Our algorithm keeps updating its dual variables based on previous outcomes, thereby taking advantage of stochastic information to balance the objective and the constraint violation.
The problem of learning to bid in repeated auctions has been widely studied in both academia and industry, e.g., [BCIJEM07; WPR16; FPS18; BFG21; NCKP22; NS21; HZFOW20]. These papers mainly abstract the problem of learning to bid as contextual bandits but do not incorporate constraints. Beyond this, there has been some work on bidding under budget constraints, e.g., [BG19; AWLZHD22]; however, these papers focus on utility-maximizing agents with at most one constraint. [CCKS22] also considers multiple different constraints in online bidding algorithms; however, they directly add a regularizer of one non-packing constraint to the objective and apply the standard dual mirror descent approach to design their algorithms. Their regret bound is measured against this relaxed objective, whereas ours is relative to the adaptive optimal benchmark. Finally, loosely related work includes the AdWords problem [MSVV07; DH09], which focuses on budget management across multiple bidders to maximize the seller's revenue; in contrast, we focus on the design of online bidding algorithms for a single auto-bidder with RoS and budget constraints.

Preliminaries
We consider an online bidding model for a single learner (auto-bidder): At each time step t, nature stochastically generates for the learner an ad query associated with a value v_t ∈ [0, 1] and an auction mechanism (x_t, p_t), where x_t : R_{≥0} → [0, 1] is an allocation function and p_t : R_{≥0} → [0, 1] the expected payment rule.3 We assume the following stochastic model: (v_t, x_t, p_t) are drawn independently and identically (i.i.d.) from an unknown distribution. At each time step t, the value v_t is known to the learner before making a bid, and the learner decides its bid b_t given v_t and historical information. At the end of time step t, the learner observes the realized outcome from the auction mechanism, i.e., x_t(b_t) and p_t(b_t).
We focus only on auctions that are truthful. This requires that the allocation function x_t(b) be non-decreasing in the input bid b and that the payment function p_t be characterized, per the pioneering work of [Mye81], as

p_t(b) = b·x_t(b) − ∫_0^b x_t(z) dz. (2.1)

For instance, the well-known second-price auction for a single item is truthful, and its payment function satisfies Equation (2.1). Note that the payment must be zero when the allocation is zero, and the payment is also at most the bid. This work also assumes v_t·x_t(b_t) to be the realized value of the learner in each round.
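As a concrete running example, the following sketch implements a single-item second-price auction, an instance of a truthful mechanism satisfying Equation (2.1). The function name and tie-breaking rule are our own illustrative choices.

```python
def second_price_outcome(bid, competing_bids):
    """Single-item second-price auction: win if `bid` is at least the
    highest competing bid; on a win, pay that highest competing bid."""
    d = max(competing_bids)          # price set by the competition
    x = 1.0 if bid >= d else 0.0     # allocation x_t(b)
    p = d * x                        # zero on a loss, second price on a win
    return x, p
```

One can check Equation (2.1) directly: for b ≥ d, we have b·x(b) − ∫_0^b x(z) dz = b − (b − d) = d, and the payment is zero when the allocation is zero and never exceeds the bid.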
We design online bidding algorithms to maximize the learner's total realized value subject to an RoS constraint. Formally, the optimization problem under the RoS constraint we study is

maximize_{b_t ≥ 0, t ∈ [T]} Σ_{t=1}^T v_t·x_t(b_t) subject to Σ_{t=1}^T v_t·x_t(b_t) ≥ RoS · Σ_{t=1}^T p_t(b_t), (2.2)

where RoS > 0 is the target ratio of the RoS bidder. Throughout the paper, we assume without loss of generality4 that RoS = 1. As noted in Section 1, our results can be extended to handle different learner objectives, e.g., a hybrid version between utility maximizing and value maximizing. To simplify the notation, we denote the difference between value and price in iteration t as

g_t(b) := v_t·x_t(b) − p_t(b), (2.3)

and using this notation, the RoS constraint in Problem 2.2 may be stated as

Σ_{t=1}^T g_t(b_t) ≥ 0. (2.4)

Our algorithm also extends to the bidding problem subject to an additional budget constraint

Σ_{t=1}^T p_t(b_t) ≤ ρT, (2.5)

where ρT is the budget and ρ > 0 (assumed a fixed constant) measures the limit of the average expenditure over T rounds (ad queries).
To collect notation, we denote the sample (ad query and auction) at time t as a tuple γ_t = (v_t, p_t, x_t) and assume γ_t ∼ P for all t ∈ [T]. We denote the sequence of T samples by γ⃗ := {γ_1, γ_2, ..., γ_T} ∼ P^T, and sequences of other lengths ℓ by γ⃗_ℓ where needed.
Analysis setup. We use the notions of regret and constraint violation to measure the performance of our algorithms. To define the regret, we first define the reward of Alg for a sequence of requests γ⃗ over a time horizon T as

Reward(Alg, γ⃗) := Σ_{t=1}^T v_t·x_t(b_t), (2.6)

where {b_t} are the bids generated by Alg. Next, we define the optimal value in the same setup as

Reward(Opt, γ⃗) := max_{{b_t} ∈ B} Σ_{t=1}^T v_t·x_t(b_t), (2.7)

where B is the exact set of constraints. These definitions lead to the definition of the regret of Alg in this setup as

Regret(Alg, T) := E_{γ⃗∼P^T}[Reward(Opt, γ⃗) − Reward(Alg, γ⃗)]. (2.8)

We remark that we define Reward for some specific input sequence, whereas Regret is defined with respect to a distribution. Finally, we use online mirror descent as a technical component in our analysis. In this regard, we use

V_h(y, x) := h(y) − h(x) − ⟨∇h(x), y − x⟩

to denote the Bregman divergence of y in reference to x, measured with the distance-generating function ("mirror map") h. We briefly review online mirror descent in Appendix D.

Bidding Under an Approximate RoS Constraint
In this section, we design and analyze an algorithm for Problem 2.2 allowing for bounded sublinear violation of the RoS constraint. To this end, we first rewrite Problem 2.2 as the equivalent problem

max_{b_1, ..., b_T} min_{λ ≥ 0} Σ_{t=1}^T v_t·x_t(b_t) + λ·Σ_{t=1}^T g_t(b_t), (3.1)

in which the inner minimization strictly enforces Σ_{t=1}^T g_t(b_t) ≥ 0 via the variable λ, which applies an unbounded penalty to any constraint violation. Our algorithm iteratively updates, in each iteration t, the bid b_t and a dual variable λ_t that captures the current best penalizing (dual) variable λ, as described shortly.
Updating the bid. Based on the formulation in Problem 3.1, our algorithm chooses the bid b_t as the maximizer of the penalty-adjusted reward of the current round, with the penalty applied by the current dual variable λ_t:

b_t = argmax_{b ≥ 0} { v_t·x_t(b) + λ_t·g_t(b) } = argmax_{b ≥ 0} { (1 + λ_t)·v_t·x_t(b) − λ_t·p_t(b) } = ((1 + λ_t)/λ_t)·v_t, (3.2)

where the final step is because of the truthful nature of the auction.5 The final expression for b_t is consistent with the setting that we first observe only the value v_t before making the bid.
Updating the dual variable. To maintain a meaningful dual variable, we relax the penalty on the constraint violation in Problem 3.1 by adding a scaled regularization function h(λ). This regularizer prevents λ from getting too large:

max_{b_1, ..., b_T} min_{λ ≥ 0} Σ_{t=1}^T v_t·x_t(b_t) + λ·Σ_{t=1}^T g_t(b_t) + (1/α)·h(λ),

where α > 0 is the scaling factor of the regularizer, to be set later. At any iteration t, the value of λ_{t+1} is chosen to be the minimizer of the inner constrained minimization problem (with the sum truncated at iteration i = t). While there exist a number of valid regularization functions, in this paper we choose the generalized negative entropy h(u) = u log u − u, which gives the following expression for λ_{t+1}:

λ_{t+1} = argmin_{λ ≥ 0} { λ·Σ_{i=1}^t g_i(b_i) + (1/α)·h(λ) } = exp(−α·Σ_{i=1}^t g_i(b_i)),

or, equivalently, λ_{t+1} = λ_t·exp(−α·g_t(b_t)) with λ_1 = 1.
Through this rule, a net constraint violation (i.e., t i=1 g i (b i ) ≤ 0) makes the next dual variable exponential in the net violation, which in turn shrinks the next bid (in Equation (3.2)); on the other hand, an accumulated buffer in the net constraint violation (i.e., t i=1 g i (b i ) > 0) encourages the next dual variable to be small, allowing the next bid to grow.We demonstrate in Appendix A.1 that the use of another valid mirror map on this domain, h(u) = 1 2 u 2 , does not provide a sufficiently strong regret or constraint violation guarantee.Intuitively, the reason for this is that the squared penalty is simply not as strong as the exponential.
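As a small illustration (a sketch in our notation, not part of the paper's pseudocode), the OMD step with mirror map h(u) = u log u − u reduces to a multiplicative update:

```python
import math

def dual_update(lam, g, alpha):
    """One OMD step with mirror map h(u) = u*log(u) - u on the domain u >= 0:
    minimizing g*u + (1/alpha)*V_h(u, lam) over u gives lam * exp(-alpha*g)."""
    return lam * math.exp(-alpha * g)
```

A round of violation (g < 0) multiplies the dual variable by exp(α|g|) > 1, shrinking the next bid (1 + 1/λ)·v; a round of slack (g > 0) shrinks the dual variable and lets the bid grow.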
The final algorithm with these choices of b t and λ t is stated in Algorithm 3.1.
Algorithm 3.1 Bidding algorithm of the RoS agent operating under approximate constraints in truthful auctions (i.i.d. inputs), with mirror map h(u) = u log u − u.
1: Input: Total time horizon T and requests γ⃗ from the distribution P^T.
2: Initialize: Initial dual variable λ_1 = 1 and dual mirror descent step size α = 1/√T.
3: for t = 1, 2, ..., T do
4: Observe the value v_t, and set the bid b_t = ((1 + λ_t)/λ_t)·v_t.
5: Observe the price p_t and allocation x_t at b_t, and compute g_t(b_t) := v_t·x_t(b_t) − p_t(b_t).
6: Update the dual variable λ_{t+1} = λ_t·exp(−α·g_t(b_t)).
7: end for
8: return The sequence {b_t}_{t=1}^T of bids.

Analysis of Algorithm 3.1
The main export of this section is the following theorem for Algorithm 3.1 on Problem 2.2. The proof of the theorem is a straightforward application of our chief technical results, Lemma 3.1 and Lemma 3.2 (with λ = 1/T), which provide guarantees on the constraint violation and regret, respectively. We focus our discussion on these two lemmas, deferring the proof of the theorem to Appendix A.
Theorem 3.1. With i.i.d. inputs from a distribution P over a time horizon T, the regret of Algorithm 3.1 on Problem 2.2 is bounded by O(√T). Further, the violation of the RoS constraint is at most 2√T log T.

Bound On Constraint Violation of Algorithm 3.1
To conclude that the constraint described by Inequality (2.4) is violated by only a small amount in Algorithm 3.1, we observe that when the cumulative violation is (non-trivially) larger than 1/α, the exponential function quickly makes λ_t huge; in turn, our bid b_t = v_t + v_t/λ_t prevents us from overbidding. Formally, we show the following result, later used in Theorem 3.1 to obtain the stated constraint violation bound. (The bound could possibly be sharpened to get rid of the log.)

Lemma 3.1. Consider the sequence {λ_t}_{t=1}^T starting at λ_1 = 1 and evolving as λ_{t+1} = λ_t·exp(−α·g_t(b_t)). Then −Σ_{t=1}^T g_t(b_t) ≤ 2√T log T.

Proof. Based on the update rule for λ_t, we know λ_{t+1} = exp(−α·Σ_{i=1}^t g_i(b_i)). If −Σ_{t=1}^T g_t(b_t) ≤ √T log T, we are done. If this is not the case, let T′ be the last time that −Σ_{t=1}^{T′} g_t(b_t) ≤ √T log T, so we know for any t > T′ + 1, the dual variable λ_t must be larger than T, since λ_t = exp(−α·Σ_{i=1}^{t−1} g_i(b_i)) > exp(α·√T log T) = T for α = 1/√T. Consequently,

−Σ_{t=1}^T g_t(b_t) = −Σ_{t=1}^{T′+1} g_t(b_t) − Σ_{t=T′+2}^T g_t(b_t) ≤ (√T log T + 1) + T·(1/T) ≤ 2√T log T,

where the second step used Inequality (3.6) to bound the terms after iteration T′ and the fact that there are at most T such iterations.

Bound On Regret of Algorithm 3.1
We first state the following technical upper bound on the regret; our effort in the rest of the section is devoted to bounding the right-hand side of this result. The proof of the following result (provided in Appendix A) follows the primal-dual framework in the proof of Theorem 1 in [BLM20].
Proposition 3.1. With i.i.d. inputs from a distribution P over a time horizon T, the regret of Algorithm 3.1 on Problem 2.2 is bounded by

Regret(Algorithm 3.1, T) ≤ E_{γ⃗∼P^T}[ Σ_{t=1}^T λ_t·g_t(b_t) ],

where g_t and λ_t are as defined in Line 5 and Line 6 of Algorithm 3.1. We note that the bound on the right-hand side can be negative, since Algorithm 3.1 does not guarantee Σ_{t=1}^T g_t(b_t) ≥ 0; we do, however, show a bound on the worst-case constraint violation in Lemma 3.1.
Computing a bound on the regret then requires bounding Σ_{t=1}^T λ_t·g_t(b_t), which we do in the following lemma (and this is where a major chunk of our technical contribution lies).

Lemma 3.2. For any input sequence γ⃗ of length T, running Algorithm 3.1 on Problem 2.2 generates sequences {b_t}_{t=1}^T, {g_t}_{t=1}^T, and {λ_t}_{t=1}^T such that Σ_{t=1}^T λ_t·g_t(b_t) ≤ O(√T).

Proof. In order to bound the quantity Σ_{t∈[T]} g_t(b_t)·λ_t, we first rewrite it in terms of the Bregman divergences of consecutive dual iterates (Equation (3.7)). From Line 6 in Algorithm 3.1, we have

α·g_t(b_t) = log(λ_t/λ_{t+1}). (3.9)

Applying Equation (3.9) and Equation (3.8) to Equation (3.7) gives a bound in which the third step follows from the local strong convexity of V_h as shown in [AO14] (for completeness, we prove this fact in Appendix D), and the final step is by the Cauchy-Schwarz inequality. We now bound the term (1/2)·α²·g_t(b_t)²·max(λ_t, λ_{t+1}) in a case-wise manner.
Case 1: Assume g_t(b_t) ≥ 0. Then, the inequality g_t(b_t) ≤ 1 (from Proposition A.1) and our choice of α give the desired bound, where we used exp(−x) ≤ 1 − x/2 for x ∈ [0, 1.5]. Finally, we use 0 ≤ g_t(b_t) ≤ 1 and Inequality (3.12) to obtain the claimed per-term bound. Thus, in all cases, we have the same bound on (1/2)·α²·g_t(b_t)²·max(λ_t, λ_{t+1}). Plugging this back into our previous bound in Inequality (3.11), dividing throughout by α, and summing over t = 1, 2, ..., T enables telescoping; plugging in the chosen values of λ_1 and α from Algorithm 3.1 then yields the claimed bound.

Bidding Under a Strict RoS Constraint
In this section, we introduce a simple technique that turns an algorithm with non-zero (but bounded) violation of the RoS constraint into one strictly obeying it. As noted in Section 5, this technique extends to the problem with both RoS and budget constraints. Our idea is as follows.
Suppose we have an algorithm (say, Alg) which guarantees an at most v_RoS violation of the RoS constraint on any sequence γ⃗ of input requests. We start by bidding the true value, b_t = v_t, in the initial iterations t = 1, ..., K(γ⃗) for some K(γ⃗); we call this sequence of iterations the first phase. In a truthful auction, this choice of bids guarantees g_t(b_t) = v_t·x_t(v_t) − p_t(v_t) ≥ 0. In other words, the bidder builds up a buffer on the RoS constraint. We continue until the cumulative buffer exceeds v_RoS, at which point we run Alg afresh (i.e., without accounting for the first phase); recall that this violates the RoS constraint by at most v_RoS over the remaining iterations. We refer to this run of Alg as the second phase. Since the buffer from the first phase is enough to offset Alg's violation in the second phase, there is no violation of the RoS constraint at the end. We display this idea in Algorithm 4.1.
Algorithm 4.1 Bidding algorithm of the RoS agent operating under strict constraints in truthful auctions (i.i.d. inputs)
1: Input: Total time horizon T and requests γ⃗ from the data distribution P^T.
2: Initialize: Initial buffer g_0(b_0) = 0, violation budget v_RoS = 2√T log T, and iteration t = 1.
3: while Σ_{i=0}^{t−1} g_i(b_i) ≤ v_RoS do ⊲ First phase
4: Observe the value v_t, and set the bid b_t = v_t.
5: Observe the price p_t and the allocation x_t at b_t, and compute g_t(b_t) := v_t·x_t(b_t) − p_t(b_t).

6: Increment the iteration count t = t + 1.
7: end while
8: Run Algorithm 3.1 with time horizon T − t and the remaining T − t requests from γ⃗ as input. ⊲ Second phase
9: return The sequence {b_t}_{t=1}^T of generated bids from both phases.
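The two phases can be sketched as a thin wrapper (illustrative only; `run_alg` stands in for Algorithm 3.1 and is assumed to return its bids together with its cumulative slack):

```python
def run_strict_ros_bidder(queries, v_ros, run_alg):
    """Sketch of Algorithm 4.1: bid truthfully (b_t = v_t) until the
    accumulated slack reaches v_ros, then run `run_alg` afresh on the
    remaining queries; its violation is at most v_ros by assumption."""
    buffer, t, bids = 0.0, 0, []
    while t < len(queries) and buffer < v_ros:
        v, x_fn, p_fn = queries[t]
        bids.append(v)                       # first phase: truthful bid
        buffer += v * x_fn(v) - p_fn(v)      # g_t(v_t) >= 0 in a truthful auction
        t += 1
    rest_bids, slack = run_alg(queries[t:])  # second phase, run afresh
    return bids + rest_bids, buffer + slack
```

Since each first-phase increment is non-negative and the second phase loses at most v_ros, the returned total slack is non-negative, i.e., the RoS constraint holds at the end.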

Analysis of Algorithm 4.1
The high-level idea for guaranteeing low regret for Algorithm 4.1 starts with the observation that the reward collected by Algorithm 4.1 for any γ⃗ in T steps is at least that collected by Algorithm 3.1 in the second phase (i.e., the last T − K(γ⃗) steps). The second phase suffers (in expectation) a regret bounded by the guarantee of Theorem 3.1. We then use the i.i.d. assumption on the input sequence to bound the gap between the expected reward collected by Opt on a sequence of length T − K(γ⃗) and that on a sequence of length T, which naturally depends on the expected length K(γ⃗); finally, we show this expected length is at most O(√T log T) under a mild technical assumption on the input distribution. This bounds the additional regret accrued over the first phase and completes the analysis.
To formally see this, we need two simple technical tools (proved in Appendix B) as follows.First, we make the following assumption on the distribution P and state a corresponding result.
Assumption 4.1. Define the parameter β of a distribution P as

β := E_{(v,x,p)∼P}[ v·x(v) − p(v) ].

We assume that in our problem, β is an absolute constant bounded away from 0 and independent of T.
The parameter β is the expected amount of buffer we accrue per iteration during the first phase.
The assumption of β being a constant bounded away from 0 captures the more interesting scenarios of bidding under RoS. For example, when the allocation and price functions of each query arise from a single-item second-price auction, the essence of the problem becomes how best to spend the extra slack v_t − p_t(v_t) gained from queries with v_t > p_t(v_t). If β is tiny, or in the extreme case of β = 0, there is nothing to optimize for, and the optimal solution would simply be b_t = v_t all the time.
Since we need to only accrue a buffer of size O( √ T log T ), and the expected increment of the buffer is a constant β per iteration, we can show the first phase finishes in O( √ T log T ) iterations in expectation.
Proposition 4.1. Under Assumption 4.1 for the distribution P, let K(γ⃗) be the number of iterations in the first phase of Algorithm 4.1 for some input sequence γ⃗. Then, we have E_{γ⃗}[K(γ⃗)] ≤ O(√T log T / β).

We also need the following technical statement on the difference in reward collected by Opt for various lengths of input sequences.

Proposition 4.2. Let γ⃗_ℓ ∼ P^ℓ and γ⃗_r ∼ P^r be sequences of lengths ℓ and r, respectively, with ℓ ≤ r, of i.i.d. requests each from a distribution P. Then, E[Reward(Opt, γ⃗_r)] − E[Reward(Opt, γ⃗_ℓ)] ≤ r − ℓ.
The above two results help us bound the regret from the first phase, and we can use the guarantee from Theorem 3.1 to bound the regret due to the second phase. Altogether, we get the main result below.

Theorem 4.1. With i.i.d. inputs from a distribution P over a time horizon T, under Assumption 4.1, the regret of Algorithm 4.1 on Problem 2.2 is bounded by O(√T log T). Further, there is no violation of the RoS constraint in Problem 2.2.
Proof.The claim on constraint violation follows by design of the algorithm: we collect a constraint violation buffer of at least v RoS before starting the second phase, in which we are guaranteed to violate the constraint by an additive factor of at most v RoS .
We now prove the claimed regret bound by combining a lower bound on the expected reward and an upper bound on the expected optimum.To lower bound the algorithm's reward, we note that it is at least the reward from the second phase.
Let K(γ⃗) be the random variable that represents the last iteration of the first phase, after which we run Algorithm 3.1. In this proof, we use γ⃗_{a:b} to denote the sequence γ⃗ from time steps a through b; when these end points do not matter, we simply denote a length-T sequence as γ⃗. With this notation, we have for any sequence γ⃗:

Reward(Algorithm 4.1, γ⃗) ≥ Reward(Algorithm 3.1, γ⃗_{K(γ⃗)+1:T}).

Taking expectations on both sides and using conditional expectations gives Inequality (4.1), where the third step is due to the requests all being i.i.d. and the fact that we start running Algorithm 3.1 afresh in the second phase, and the fourth step is by Proposition 4.2. Finally, plugging Inequality (4.1) into the definition of Regret from Equation (2.8) and simplifying gives the claimed bound, where in the third step we use Reward(Opt, γ⃗) ≤ T, and in the last step we use Proposition 4.1 and Assumption 4.1 that β is a constant.
Although we structure the algorithm as a first phase followed by a second phase for the purposes of analysis, v_RoS can be a pessimistic bound and may make us run the first phase for an unnecessarily long time. A more practical implementation can break the first phase into smaller chunks and intermingle them with the execution of Algorithm 3.1 on demand. That is, whenever we are about to violate the RoS constraint during Algorithm 3.1, we put it on hold and bid exactly the value v_t for some iterations until we build up a sufficient buffer, and then resume the execution of Algorithm 3.1, ignoring these special iterations (i.e., they do not affect the dual variable updates). Intuitively, this can perform better than Algorithm 4.1, although we do not prove a rigorous regret guarantee for it.
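The interleaved variant described above might look as follows. This is only a sketch: the safety margin below which we pause and rebid the value is our own illustrative choice and is not specified in the text.

```python
import math

def interleaved_ros_bidder(queries, alpha, safety=1.0):
    """Sketch of the on-demand variant: run Algorithm 3.1's updates, but when
    the cumulative slack falls below `safety`, bid the value v_t and skip the
    dual update for that round (paused rounds do not touch lam)."""
    lam, slack, bids = 1.0, 0.0, []
    for v, x_fn, p_fn in queries:
        if slack < safety:                   # rebuild the buffer truthfully
            b = v
            slack += v * x_fn(b) - p_fn(b)   # non-negative in a truthful auction
        else:                                # resume Algorithm 3.1
            b = (1.0 + lam) / lam * v
            g = v * x_fn(b) - p_fn(b)
            lam *= math.exp(-alpha * g)
            slack += g
        bids.append(b)
    return bids, slack
```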

Bidding Under Both RoS and Budget Constraints
In this section, we combine our techniques from Section 3 and Section 4 with those of [BLM20] to obtain bidding algorithms that satisfy both the RoS and budget constraints (i.e., Problem 2.5). More specifically, we propose Algorithm 5.1 to solve Problem 2.5 under a strict imposition of the budget constraint and only an approximate one on the RoS constraint, and Algorithm 5.2 to strictly satisfy both the RoS and budget constraints.

Approximate RoS and Strict Budget Constraints
We start with a bidding algorithm that satisfies the budget constraint exactly and the RoS constraint up to some small violation in the worst case. Similar to the bidding rule in Equation (3.2), the candidate bid for this algorithm is the one maximizing the price-adjusted reward, with one dual variable for each constraint:

b_t = argmax_{b ≥ 0} { v_t·x_t(b) + λ_t·g_t(b) − μ_t·p_t(b) } = ((1 + λ_t)/(λ_t + μ_t))·v_t, (5.1)

where the solution to the maximization holds by the definition of a truthful auction. Since we are in the setting where the budget constraint is strict, the candidate bid given by Equation (5.1) is used as the final bid in this iteration only if we are not close to exhausting the total budget, as formalized in Line 4 of Algorithm 5.1. Similar to Algorithm 3.1, the Lagrange multipliers λ_t and μ_t serve to enforce the RoS and budget constraints, respectively. We remark that Algorithm 5.1 may be interpreted as combining our Algorithm 3.1 with Algorithm 1 of [BLM20]. The analysis of the regret bound also follows the outline of the main proof of [BLM20], integrating it with our regret bound for Algorithm 3.1 from Theorem 3.1. Intuitively, the integration is straightforward, since the analyses of both methods are linear in nature, allowing us to easily decompose the intermediate regret bound from the primal-dual framework into two components corresponding to the two constraints. We then bound these respectively, using the techniques from earlier sections to handle the RoS constraint and the techniques from [BLM20] to address the budget constraint.
As for the constraint violation guarantees, the budget constraint is always satisfied by design, and the RoS violation bound follows from a simple corollary of Lemma 3.1, which applies here since the extra budget constraint makes our bid only more conservative than that of Algorithm 3.1. We formally collect and state these results in Theorem 5.1, proving it in Appendix C.1.
Algorithm 5.1 Bidding algorithm of the agent operating under approximate RoS constraints and strict budget constraints in a truthful auction (i.i.d. inputs)
1: Input: Total time horizon T, requests γ⃗ i.i.d. from the distribution P^T, total budget ρT.
2: Initialize: Initial dual variables λ_1 = 1 and μ_1 = 0, total initial budget B_1 := ρT, and dual mirror descent step size α = 1/√T.
3: for t = 1, 2, ..., T do
4: Observe the value v_t; if B_t ≥ 1, set the bid b_t by Equation (5.1), and set b_t = 0 otherwise.
5: Observe the price p_t and the allocation x_t at b_t.
6: Compute g_t(b_t) := v_t·x_t(b_t) − p_t(b_t).
7: Update the dual variable of the RoS constraint as λ_{t+1} = λ_t·exp(−α·g_t(b_t)).
8: Update the dual variable of the budget constraint as μ_{t+1} = max{0, μ_t − α·(ρ − p_t(b_t))}.

9:
Update the leftover budget B t+1 = B t − p t (b t ).10: end for 11: return The sequence {b t } T t=1 of bids.
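To make the per-iteration logic concrete, the following is a minimal Python sketch of one step of such a dual-based bidder. The function name `bid_iteration`, the closed-form bid (1 + λ)v/(λ + µ) (consistent with the RoS-only rule b_t = v_t/λ_t + v_t), the multiplicative update for λ, the projected-gradient update for µ, and the toy `auction` interface are all assumptions of this sketch, not the paper's exact pseudocode.

```python
import math

def bid_iteration(v_t, lam, mu, budget_left, alpha, eta, rho, auction):
    """One step of a dual-based bidder: lam enforces the RoS constraint,
    mu enforces the budget constraint.  `auction` maps a bid to an
    (allocation x, price p) pair in a truthful single-slot auction."""
    # Candidate bid maximizing the price-adjusted reward
    # (1 + lam) * v * x(b) - (lam + mu) * p(b); only used while the
    # leftover budget is not close to exhausted.
    if budget_left >= 1.0:
        b_t = (1.0 + lam) * v_t / (lam + mu) if (lam + mu) > 0 else v_t
    else:
        b_t = 0.0
    x_t, p_t = auction(b_t)
    g_t = v_t * x_t - p_t                  # RoS constraint slack
    lam = lam * math.exp(-alpha * g_t)     # assumed multiplicative OMD step
    mu = max(0.0, mu - eta * (rho - p_t))  # assumed projected-gradient step
    return b_t, lam, mu, budget_left - p_t
```

For instance, with λ = 1, µ = 0, a value of 0.6, and a competing bid of 0.5, the sketch bids 1.2, wins at price 0.5, and shrinks λ because the realized RoS slack is positive.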
Theorem 5.1. With i.i.d. inputs from a distribution P over a time horizon T, the regret of Algorithm 5.1 on Problem 2.5 is bounded by Regret(Algorithm 5.1, P^T) ≤ O(√T). Further, Algorithm 5.1 incurs a violation of at most O(√(T log T)) of the RoS constraint and no violation of the budget constraint.

Strict RoS and Strict Budget Constraints
In the case when we impose both strict RoS and strict budget constraints, we essentially combine the key ideas from Algorithm 4.1 and Algorithm 5.1: We keep bidding the value until we accumulate a sufficient buffer on the RoS constraint; following this phase, we run Algorithm 5.1, which, as explained in the preceding section, imposes strict budget and approximate RoS constraints.
By the same reasoning as for Algorithm 4.1, the RoS constraint is not violated; the budget constraint is also respected via Line 8 of Algorithm 5.2. The regret analysis follows a strategy similar to that of the proofs of Theorem 5.1 and Theorem 4.1. Our main result of this section is Theorem 5.2, with its proof in Appendix C.2.

Algorithm 5.2 Bidding algorithm of the agent operating under strict RoS and strict budget constraints in truthful auctions (i.i.d. inputs)
1: Input: Total time horizon T, requests γ⃗ i.i.d. from the distribution P^T, total budget ρT.
2: Initialize: Initial buffer g_0(b_0) = 0, buffer threshold v_RoS = 2√(T log T), initial total budget B_1 = ρT, and iteration t = 1.
3: while Σ_{s=0}^{t−1} g_s(b_s) < v_RoS do ⊲ First phase
4:   Observe the value v_t, and set the bid b_t = v_t.
5:   Observe the price p_t and the allocation x_t at b_t, and compute g_t(b_t) = v_t · x_t(b_t) − p_t(b_t).
6:   Update the total budget to B_{t+1} = B_t − p_t(b_t).
7:   if B_{t+1} < 1 then ⊲ Budget nearly exhausted
8:     return the sequence {b_s}_{s=1}^t of bids.
9:   end if
10:  Update t ← t + 1.
11: end while
12: Run Algorithm 5.1 with time horizon T − t and the remaining T − t requests from γ⃗ as input, and initial total budget B_t. ⊲ Second phase
13: return The sequence {b_t}_{t=1}^T of generated bids.
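The two-phase control flow above can be sketched in Python as follows; the helper name `two_phase_bids`, the exact loop-exit conditions, and the omission of the second-phase call are assumptions of this sketch.

```python
import math

def two_phase_bids(values, auction, T, budget):
    """First phase of the two-phase scheme: bid the value v_t until the
    accumulated RoS slack reaches the buffer 2*sqrt(T*log T) or the
    budget is nearly exhausted; the second phase (the dual-based bidder
    run on the remaining requests with the leftover budget) is omitted."""
    buffer_target = 2.0 * math.sqrt(T * math.log(T))
    slack, spent, bids, t = 0.0, 0.0, [], 0
    while t < T and slack < buffer_target and budget - spent >= 1.0:
        v = values[t]
        bids.append(v)           # first phase: bid the value itself
        x, p = auction(v)
        slack += v * x - p       # accumulate g_t(b_t) = v_t * x_t - p_t
        spent += p
        t += 1
    # Hand off here: run the approximate-RoS, strict-budget bidder on the
    # remaining T - t requests with the remaining budget.
    return bids, slack, budget - spent, t
```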
Theorem 5.2. With i.i.d. inputs from a distribution P over a time horizon T, the regret of Algorithm 5.2 on Problem 2.5 is bounded by Regret(Algorithm 5.2, P^T) ≤ O(√(T log T)). Further, Algorithm 5.2 suffers no constraint violation of either the RoS or the budget constraint.
Proposition A.2. Recall D_RoS(λ|P) as defined in Equation (A.2). Then the optimum value Reward(Opt, γ⃗) for Problem 2.2 defined for a sequence γ⃗_ℓ ∼ P^ℓ of ℓ requests satisfies the inequality E[Reward(Opt, γ⃗_ℓ)] ≤ ℓ · min_{λ′ ≥ 0} D_RoS(λ′|P).

Proof. By the definition of Reward(Opt, γ⃗) in Equation (2.7) and the definition of g_t in Equation (2.3), combined with Sion's min-max theorem and the definition of f⋆_{t,RoS} in Equation (A.1), we obtain Inequality (A.3). Taking expectations on both sides of Inequality (A.3) and using the definition of D_RoS from Equation (A.2) gives, for any fixed λ′ ≥ 0, the bound E[Reward(Opt, γ⃗_ℓ)] ≤ ℓ · D_RoS(λ′|P), where the final step crucially uses the fact that all ℓ requests are drawn i.i.d. from the same distribution P and the argument λ′ ≥ 0 is fixed. In particular, the preceding inequality then holds for the specific λ′ that minimizes the right-hand side, thus finishing the proof.
Proposition A.3. For some fixed number r, let λ̄_r := (1/r) Σ_{t=1}^r λ_t, where the λ_t are the dual iterates in Algorithm 3.1. Then the reward (see Equation (2.6)) of Algorithm 3.1 is lower bounded as E[Reward(Algorithm 3.1, γ⃗_r)] ≥ r · E[D_RoS(λ̄_r|P)] − E[Σ_{t=1}^r λ_t g_t(b_t)].

Proof. Recall that by Line 4 in Algorithm 3.1 and by the definition of f⋆_{t,RoS} in Equation (A.1), we have v_t · x_t(b_t) = f⋆_{t,RoS}(λ_t) − λ_t g_t(b_t). Taking expectations on both sides conditioned on the randomness until (and including) iteration t − 1 gives E[v_t · x_t(b_t) | σ_{t−1}] = D_RoS(λ_t|P) − E[λ_t g_t(b_t) | σ_{t−1}], where the first step used the fact that in Algorithm 3.1 the iterate λ_t is fixed once σ_{t−1} is, and the second step used the definition of D_RoS in Equation (A.2). Summing over t = 1, 2, ..., r and taking expectations (a valid operation since r is a fixed number) gives the preceding bound with Σ_{t=1}^r E[D_RoS(λ_t|P)] in place of r · E[D_RoS(λ̄_r|P)], and we may finish the proof by invoking the convexity of D_RoS.
Proof of Proposition 3.1. We begin by restating the result from Proposition A.2 for an input sequence of length T. The minimum on the right-hand side of the preceding inequality can be further bounded as follows for any λ(γ⃗) ≥ 0.
In particular, then, we may choose λ(γ⃗) = (1/T) Σ_{t=1}^T λ_t =: λ̄_T on the right-hand side of Inequality (A.5) and combine with Inequality (A.4) to obtain the stated bound. Combining this with Proposition A.3 applied to an input sequence of length T finishes the proof.
Theorem 3.1. With i.i.d. inputs from a distribution P over a time horizon T, the regret of Algorithm 3.1 on Problem 2.2 is bounded by Regret(Algorithm 3.1, P^T) ≤ O(√T). Further, the violation of the RoS constraint is at most 2√(T log T).
Proof. Plugging Lemma 3.2 into Proposition 3.1 gives the regret bound. To see the claim on constraint violation, we first note from Proposition A.1 that the gradients g_t in Algorithm 3.1 satisfy g_t(b_t) ≥ −1/λ_t. Therefore, the result of Lemma 3.1 applies; note that the constraint violation of Algorithm 3.1 corresponds precisely to −Σ_{t=1}^T g_t(b_t).

A Proofs for the Approximate RoS Constraint

Per-iteration steps of Algorithm 3.1 (restated): observe the realization of the valuation v_t, and set the bid b_t = v_t/λ_t + v_t; then observe p_t and x_t evaluated at the chosen bid b_t, and compute g_t(b_t) = v_t · x_t(b_t) − p_t(b_t).
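A minimal Python sketch of this bid-and-update loop, assuming a truthful single-slot auction and a multiplicative (entropy-style) dual step for λ; the function name `ros_bidder` and the auction interface are illustrative assumptions.

```python
import math

def ros_bidder(values, auction, alpha):
    """RoS-only dual bidder sketch: bid b_t = v_t/lam + v_t and update
    lam with the realized constraint slack g_t(b_t)."""
    lam, bids, slacks = 1.0, [], []
    for v in values:
        b = v / lam + v              # maximizes v*x(b) + lam*(v*x(b) - p(b))
        x, p = auction(b)
        g = v * x - p                # g_t(b_t) = v_t*x_t(b_t) - p_t(b_t)
        lam *= math.exp(-alpha * g)  # assumed multiplicative dual step
        bids.append(b)
        slacks.append(g)
    return bids, slacks, lam
```

Positive slack shrinks λ and makes later bids more aggressive; negative slack grows λ and makes them more conservative.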
Proof. We observe that the regularizer h(u) = ½u² is 1-strongly convex on the non-negative orthant, which implies we can use Proposition 3.1 and the mirror descent error bound (Lemma D.1) to bound the average regret. Choosing λ* = 0 (and using λ_1 = 1) in this bound yields the claimed regret bound.
Lemma A.2. Running Algorithm A.1 on Problem 2.2 yields a constraint violation of at most T^c/α + T^{1−c} for any parameter c > 0.
Proof. Based on the dual update rule in Algorithm A.1, λ_{t+1} ≥ λ_t − αg_t(b_t), as a result of which g_t(b_t) ≥ (1/α)(λ_t − λ_{t+1}). Let c > 0, and let T′ be the last time step at which λ_{T′} ≤ T^c. Telescoping the preceding inequality over t ≤ T′ gives Σ_{t=1}^{T′} g_t(b_t) ≥ (λ_1 − λ_{T′+1})/α ≥ −T^c/α (using λ_1 = 1, α ≤ 1, and λ_{T′+1} ≤ λ_{T′} + α), and applying g_t(b_t) ≥ −1/λ_t (which comes from Proposition A.1) with λ_t ≥ T^c for t > T′ gives Σ_{t=T′+1}^T g_t(b_t) ≥ −T^{1−c}. Therefore, we have Σ_{t=1}^T g_t(b_t) ≥ −T^c/α − T^{1−c}.

Corollary. With α = T^{−1/3}, the regret of Algorithm A.1 on Problem 2.2 is at most O(T^{2/3}). Further, the violation of the RoS constraint in Problem 2.2 is at most O(T^{2/3}).
Proof. Picking α = T^{−1/3} in Lemma A.1 yields a regret bound of O(T^{2/3}), and picking c = 1/3 in Lemma A.2 along with the same α gives a maximum constraint violation of O(T^{2/3}).
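As a quick numeric sanity check of this parameter choice (our illustration, not part of the proof): with α = T^{−1/3} and c = 1/3, the two terms of the violation bound T^c/α + T^{1−c} balance at T^{2/3} each.

```python
# Both terms of the violation bound T**c / alpha + T**(1 - c) equal
# T**(2/3) when alpha = T**(-1/3) and c = 1/3, so the bound is 2*T**(2/3).
T = 10 ** 6
alpha, c = T ** (-1 / 3), 1 / 3
bound = T ** c / alpha + T ** (1 - c)
```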

B Proofs of the Strict RoS Constraint
Proposition 4.2. Let γ⃗_ℓ ∼ P^ℓ and γ⃗_r ∼ P^r be sequences of i.i.d. requests of lengths ℓ and r, respectively, with ℓ ≤ r, each drawn from a distribution P. Then the following inequality holds.
Proposition C.1. For some ρ′ ≥ 0, let D_combined(µ, λ|P, ρ′) be as defined in Equation (C.2). Then the optimum value Reward(Opt, γ⃗_ℓ, ρ) for Problem 2.5 with a total initial budget of ρℓ over a sequence γ⃗_ℓ ∼ P^ℓ of ℓ requests satisfies the inequality E[Reward(Opt, γ⃗_ℓ, ρ)] ≤ ℓ · min_{µ′, λ′ ≥ 0} D_combined(µ′, λ′|P, ρ).

Proof. In this proof, we essentially repeat the ideas in the proof of Proposition A.2. First, by definition, we obtain the primal expression for Reward(Opt, γ⃗_ℓ, ρ). By an application of Sion's min-max theorem and the definition of f⋆_{t,combined}, we get Inequality (C.3) for any fixed µ′ ≥ 0 and λ′ ≥ 0. Then, taking expectations on both sides of Inequality (C.3) and using the fact that the requests (v_t, x_t, p_t) are drawn i.i.d. from P, we get the claimed bound for any fixed µ′ and λ′. In particular, the preceding inequality holds for the µ′, λ′ ≥ 0 that minimize the bound, thus finishing the proof.
Next, to analyze the reward collected by Algorithm 5.1, we need, as in [BLM20] (which gives an algorithm for a strict budget constraint), the following notion of stopping time.
Definition C.2. The stopping time τ of Algorithm 5.1 with a total initial budget of B is the first time τ at which Σ_{t=1}^τ p_t(b_t) ≥ B − 1. Intuitively, this is the first time step at which the total price paid almost exceeds the total budget.
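Given a realized price sequence, this stopping time is simple to compute; the helper below is an illustrative sketch (the name `stopping_time` is ours).

```python
def stopping_time(prices, B):
    """Return the first time tau at which the cumulative price paid
    reaches B - 1 (budget nearly exhausted); if that never happens,
    return the horizon length."""
    total = 0.0
    for tau, p in enumerate(prices, start=1):
        total += p
        if total >= B - 1:
            return tau
    return len(prices)
```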
To prove our main regret bound in Theorem 5.1, we first prove Proposition C.2, which lower bounds the expected reward of Algorithm 5.1. The proof follows along the lines of that in [BLM20] and of Proposition A.3.

Proposition C.2. Let τ be a stopping time as defined in Definition C.2 for some initial budget ρ′k. Then the expected reward (see Equation (2.6)) of Algorithm 5.1 over a sequence of k i.i.d. input requests from distribution P is lower bounded as follows.

Proof. By Line 4 in Algorithm 5.1, we have the per-step identity until the stopping time t = τ. Rearranging the terms and taking expectations conditioned on the randomness up to step t − 1, and noting, per Line 7 in Algorithm 5.1, that once we fix the randomness up to t − 1 the dual variables are all fixed, we obtain the per-step bound. Summing over t = 1, 2, ..., τ and using the Optional Stopping Theorem, we get the claimed inequality; we finish the proof by using the convexity of D_combined.
Our bound on regret requires the following technical result bounding one of the terms arising in Proposition C.2.
Proposition C.3. Consider a run of Algorithm 5.1 with initial total budget ρℓ and total time horizon ℓ. Define the corresponding stopping time (as in Definition C.2) as the first time τ at which Σ_{t=1}^τ p_t(b_t) ≥ ρℓ − 1. Then the dual variables {µ_t} that evolve as per Line 8 in Algorithm 5.1 satisfy the following inequality.
Proof. To bound Σ_{t=1}^τ µ_t · (ρ − p_t(b_t)), we observe that the mirror descent guarantee of Lemma D.1 applies to give, for any µ ≥ 0, a bound with error term Err(R, η) := (1/σ)(1 + ρ²)ηR + (1/η)V_h(µ, µ_1), where h(u) = ½u², σ = 1, and µ_1 = 0. To finish the proof, we choose µ = 1/ρ, use Σ_{t=1}^τ p_t(b_t) ≥ ρℓ − 1, and choose η = 1/((1 + ρ²)√ℓ) in Err(τ, η).

Proof of Theorem 5.1. Recall that τ is the stopping time of Algorithm 5.1 as defined in Definition C.2. Decomposing the regret at the stopping time τ, the final step uses the fact that Reward(Opt, γ⃗, ρ) ≤ T (since the value is capped at one). Combining Equation (C.5) with the definition of D_combined in Equation (C.2) and plugging back into Equation (C.4) then gives Inequality (C.8). We invoke Proposition C.3 with a total initial budget ρT and total time horizon T to get Inequality (C.9). Finally, we invoke Lemma 3.2 to conclude Σ_{t=1}^R λ_t g_t ≤ O(√R), which gives Inequality (C.10). Combining Inequality (C.8), Inequality (C.9), and Inequality (C.10) yields the claimed bound. We now finish with the proof of maximum constraint violation. By design of Algorithm 5.1, the budget constraint is never violated throughout the run of the algorithm. To see the claimed maximum violation of the RoS constraint, we note by Proposition A.1 that the gradient g_t satisfies g_t(b_t) ≥ −1/λ_t, as a result of which Lemma 3.1 applies.

C.2 Proofs for Strict RoS and Strict Budget Constraints
Proof. The RoS constraint is not violated because the first phase accumulates exactly the buffer that caps the constraint violation guaranteed for the second phase. The budget constraint is not violated by design: the first phase (which lasts at most ρT iterations) pays at most a unit price per iteration, and the second phase, by the guarantee of Algorithm 5.1, strictly respects the budget constraint.
To bound the regret, we note that the total expected reward is at least the reward collected in the second phase, which runs over the last T − K(γ⃗) requests. Conditioning on the high-probability event that K(γ⃗) ≤ ρT (by Inequality (B.1), coupled with the assumption that ρ is a fixed constant), we have

D Online Mirror Descent
Lemma D.1 ([Bub+15], Theorem 4.2). Let h be a mirror map that is ρ-strongly convex on X ∩ D with respect to a norm ∥·∥. Let f be convex and L-Lipschitz with respect to ∥·∥. Then mirror descent with step size α satisfies
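With the squared mirror map h(u) = ½u² on the non-negative orthant, the mirror descent step reduces to projected gradient descent. Below is a small self-contained sketch for linear losses f_t(u) = g_t · u; the function names are ours.

```python
def omd_squared(gradients, alpha, u1=1.0):
    """Online mirror descent with h(u) = u**2/2 on u >= 0, which is
    exactly projected gradient descent for linear losses."""
    u, iterates = u1, []
    for g in gradients:
        iterates.append(u)
        u = max(0.0, u - alpha * g)  # gradient step + Bregman projection
    return iterates

def regret(gradients, iterates, u_star):
    """Regret of the iterates against a fixed comparator u_star
    for the linear losses f_t(u) = g_t * u."""
    return sum(g * (u - u_star) for g, u in zip(gradients, iterates))
```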

Theorem 4.1. With i.i.d. inputs from a distribution P over a time horizon T, the regret of Algorithm 4.1 on Problem 2.2 is bounded by Regret(Algorithm 4.1, P^T) ≤ O(√(T log T)).
Theorem 5.1 (restated). With i.i.d. inputs from a distribution P over a time horizon T, the regret of Algorithm 5.1 on Problem 2.5 is bounded by Regret(Algorithm 5.1, P^T) ≤ O(√T). Further, Algorithm 5.1 incurs a violation of at most O(√(T log T)) of the RoS constraint and no violation of the budget constraint.

Theorem 5.2 (restated). With i.i.d. inputs from a distribution P over a time horizon T, the regret of Algorithm 5.2 on Problem 2.5 is bounded by Regret(Algorithm 5.2, P^T) ≤ O(√(T log T)).
A.1 Approximate RoS Constraints Using the Squared Mirror Map

Lemma A.1. The regret, as defined in Equation (2.8), of Algorithm A.1, run on Problem 2.5, with i.i.d. inputs from distribution P over a time horizon T, is bounded from above by 2αT + α^{−1}.

Algorithm A.1 Algorithm for bidding under a RoS constraint in truthful auctions with squared regularizer
1: Input: Total time horizon T and requests γ⃗ from the data distribution P^T.
2: Initialize: Dual variable λ_1 = 1, step size for dual update α, and mirror map h(u) = ½u².