The Role of Transparency in Repeated First-Price Auctions with Unknown Valuations

We study the problem of regret minimization for a single bidder in a sequence of first-price auctions where the bidder discovers the item’s value only if the auction is won. Our main contribution is a complete characterization, up to logarithmic factors, of the minimax regret in terms of the auction’s transparency, which controls the amount of information on competing bids disclosed by the auctioneer at the end of each auction. Our results hold under different assumptions (stochastic, adversarial, and their smoothed variants) on the environment generating the bidder’s valuations and competing bids. These minimax rates reveal how the interplay between transparency and the nature of the environment affects how fast one can learn to bid optimally in first-price auctions.


Introduction
The online advertising market has recently transitioned from second-price to first-price auctions. A remarkable example is Google AdSense's move at the end of 2021 [Wong, 2021], following the switch made by Google AdManager and AdMob. Earlier examples include OpenX, AppNexus, Index Exchange, and Rubicon [Sluis, 2017]. To increase transparency in first-price auctions, some platforms (like AdManager) have a single bidding session for each available impression (unified bidding) and require all partners to share and receive bid data. After the first-price auction closes, bidders receive the minimum bid price that would have won them the impression [Bigler, 2019]. In practice, advertisers face two main sources of uncertainty in the bidding phase: they do not know the value of the competing bids and, crucially, they do not know the actual value of the impression they are bidding on. Indeed, click and conversion rates, which are only measured after the auction is won and the ad is displayed, can vary wildly over time or correlate highly with competing bids. We remark that not knowing the value of the impression strongly affects the bidder's utility: it may lead to overbidding for an impression of low value or, conversely, underbidding and losing a valuable one. To cope with this uncertainty, advertisers rely on auto-bidders that use the feedback provided in the auctions to learn good bidding strategies. We study the learning problem faced by a single bidder within the framework of regret minimization according to the following protocol:
Online Bidding Protocol
for $t = 1, 2, \dots, T$ do
    Valuation $V_t$ and competing bid $M_t$ are privately generated
    The learner posts a bid $B_t$ and receives utility $\mathrm{Util}_t(B_t) = (V_t - B_t)\,\mathbb{I}\{B_t \ge M_t\}$
    The learner observes some feedback $Z_t$
The bidder has no initial information on the environment and seeks to learn the relevant features of the problem on the fly. The performance of a learning strategy for the bidder (also referred to as the learner) is measured in terms of the difference in total utility with respect to the best fixed bid. This difference is called regret, and the main goal is to design strategies with asymptotically vanishing time-averaged regret with respect to the best fixed-bid strategy or, equivalently, regret sublinear in the time horizon.
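As an illustration of the protocol and of regret against the best fixed bid, here is a minimal Python sketch. It computes the realized regret on one fixed sequence of pairs (not the expected regret used in the analysis), and all function names (`utility`, `best_fixed_bid_utility`, `run_protocol`, `regret`) as well as the bid-and-utility history passed to the bidder are hypothetical choices made for this example. It relies on the observation that the total utility of a fixed bid can only decrease between two consecutive competing bids, so the best fixed bid can be searched among 0 and the observed competing bids.

```python
def utility(v, m, b):
    """Util_t(b) = (V_t - b) * 1{b >= M_t}: the item is won iff b >= m."""
    return (v - b) if b >= m else 0.0


def best_fixed_bid_utility(pairs):
    """Total utility of the best fixed bid in hindsight.

    Between two consecutive competing bids the set of won auctions is
    constant, so the total utility is non-increasing in b there. It is
    therefore enough to evaluate b = 0 and b = M_t for every t.
    """
    candidates = {0.0} | {m for _, m in pairs}
    return max(sum(utility(v, m, b) for v, m in pairs) for b in candidates)


def run_protocol(pairs, bidder):
    """Play the protocol once; `bidder(history)` must return a bid in [0, 1].

    The history stores (bid, utility) pairs only, which is an assumption of
    this sketch: richer feedback models would append more information.
    """
    history, total = [], 0.0
    for v, m in pairs:
        b = bidder(history)      # the bid depends on past observations only
        u = utility(v, m, b)
        total += u
        history.append((b, u))
    return total


def regret(pairs, bidder):
    """Realized regret of `bidder` against the best fixed bid in hindsight."""
    return best_fixed_bid_utility(pairs) - run_protocol(pairs, bidder)
```

For instance, on the two pairs `[(1.0, 0.5), (0.0, 0.6)]` the best fixed bid is 0.5, which wins the valuable item and loses the worthless one.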
In this work, we are specifically interested in understanding how the "transparency" of the auctions (i.e., the amount of information on competing bids disclosed by the auctioneer after the auction takes place) affects the learning process. There is a clear tension regarding transparency: on the one hand, bidders want to receive as much information as possible about the environment to learn the competitors' bidding strategies while revealing as little as possible about their (private) bids. On the other hand, the platform may not want to publicly reveal its revenue (i.e., the winning bid). Our investigation addresses both sides of the "transparency dilemma". Our algorithmic results provide bidders with a toolbox of learning strategies to (optimally) exploit the various degrees of transparency, while the tightness of our results fully characterizes the impact of transparency on learnability. This complete picture allows platforms to make an informed decision in choosing their level of transparency, as it is in their interest to create a thriving environment for advertisers.
Table 1: Summary of our results. Rows correspond to feedback models, columns to environments. The minimax regret of every problem falls in one of the following three regimes: $\Theta(\sqrt{T})$ (green), $\Theta(T^{2/3})$ (yellow), and $\Theta(T)$ (red).
To model the level of transparency, we distinguish four natural types of feedback $Z_t$, specifying the conditions under which the highest competing bid $M_t$ and the bidder's valuation $V_t$ are revealed to the bidder after each round $t$. In the transparent feedback setting, $M_t$ is always observed after the auction is concluded, while $V_t$ is only known if the auction is won, i.e., when $B_t \ge M_t$. In the semi-transparent setting, $M_t$ is only observed when the auction is lost. In other words, in the semi-transparent setting, the platform publicly reveals only the winning bid, whereas in the transparent setting, the platform reveals all bids. We also consider two extreme settings that provide two natural learning benchmarks: full feedback ($M_t$ and $V_t$ are always observed irrespective of the auction's outcome) and bandit feedback ($M_t$ is never observed while $V_t$ is only observed by the winning bidder). Note that the learner can compute the value of the utility $\mathrm{Util}_t(B_t)$ at time $t$ with any type of feedback, including bandit feedback.
In this paper, we characterize the learner's minimax regret not only with respect to the degree of transparency of the auction but also with respect to the nature of the process generating the sequence of pairs $(V_t, M_t)$. In particular, we consider four types of environments: stochastic i.i.d., adversarial, and their smooth versions (see Section 1.3 for a discussion about smoothness, and Section 2 for the formal definition).
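The four feedback models can be sketched in Python as follows. This is an illustrative encoding only: the function names, the string labels for the models, and the use of `None` for a hidden quantity are assumptions of this sketch. The second function illustrates the remark that the learner can recover its own utility under every feedback model.

```python
from typing import Optional, Tuple

# (valuation, highest competing bid); None marks a quantity that stays hidden.
Feedback = Tuple[Optional[float], Optional[float]]


def feedback(model: str, v: float, m: float, b: float) -> Feedback:
    """Feedback revealed after bidding b against valuation v and competing bid m."""
    won = b >= m
    if model == "full":
        return (v, m)                            # both always observed
    if model == "transparent":
        return (v if won else None, m)           # m always, v only on a win
    if model == "semi-transparent":
        return (v, None) if won else (None, m)   # m only on a loss
    if model == "bandit":
        return (v if won else None, None)        # m never observed
    raise ValueError(f"unknown feedback model: {model}")


def utility_from_feedback(model: str, z: Feedback, b: float) -> float:
    """Recover Util_t(b) = (v - b) * 1{b >= m} from the feedback alone."""
    v, m = z
    if model == "full":
        won = m is not None and b >= m   # v is always shown, so compare with m
    else:
        won = v is not None              # v is revealed exactly when the item is won
    return (v - b) if won else 0.0
```

In this encoding, the semi-transparent model differs from the transparent one only in rounds that are won, where the competing bid stays hidden.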

Overview of Our Results
We report here an overview of our results (see also Table 1). For simplicity, we often hide the logarithmic factors with the $\tilde{O}$ notation.

Stochastic i.i.d. settings
• In both the full and transparent feedback models, the minimax regret is of order $\sqrt{T}$ (Theorems 4 and 5), and adding the smoothness requirement leaves this rate unchanged.
• In the semi-transparent feedback model, the minimax regret is of order $T^{2/3}$ (Theorems 2 and 3). Also in this case, adding the smoothness requirement leaves this rate unchanged.
• In the bandit feedback model, smoothness is crucial for sublinear regret (Theorem 1). In particular, smoothness implies a minimax regret of order $T^{2/3}$ (this is obtained by combining the upper bound in Theorem 6 and the lower bound in Theorem 3).

Adversarial settings
• Without smoothness, sublinear regret cannot be achieved, even with full feedback (Theorem 8).
• In both the full and transparent feedback models, the minimax regret in a smooth environment is of order $\sqrt{T}$ (combining the lower bound in Theorem 5 and the upper bound in Theorem 7).
• With both semi-transparent and bandit feedback, the minimax regret in a smooth environment is of order $T^{2/3}$ (combining the lower bound in Theorem 3 and the upper bound in Theorem 6).
Interestingly, the minimax regret rates for first-price auctions mirror the allowed regret regimes in finite partial monitoring games [Bartók et al., 2014] and online learning with feedback graphs [Alon et al., 2017]. This is somewhat surprising, as it has been shown in Lattimore [2022] that games with continuous outcome/action spaces allow for a much larger set of regret rates; see also Cesa-Bianchi et al. [2023, 2024b], Bolić et al. [2024], Bernasconi et al. [2024]. Table 1 reveals some interesting properties of the learnability of the problem: full feedback and transparent feedback are essentially equivalent, while semi-transparent feedback and bandit feedback differ only in the stochastic i.i.d. setting. Qualitatively, this tells the platform that disclosing all bids (instead of only the winning one) drastically improves the learnability of the problem (green vs. yellow entries in Table 1). Besides, revealing at least the winning bid avoids some pathological behavior (yellow entries vs. red entry for the general i.i.d. environment with bandit feedback). Moreover, while smoothness is key for learning in the adversarial setting, in the stochastic case smoothness is only relevant for bandit feedback.
Figure 1 (caption fragment): […] where the two utility functions of the other two plots are summed up. There, $b^\star$ is the optimal bid and $\Delta$ is the neighborhood of $b^\star$ where the total utility is "good enough").

Technical Challenges
The utility function. The utilities $\mathrm{Util}_t(b) = (V_t - b)\,\mathbb{I}\{M_t \le b\}$ are defined over a continuous decision space $[0, 1]$ and are neither Lipschitz nor continuous, see Figure 1. Actually, even weaker properties, i.e., that the expected cumulative reward $b \mapsto \sum_{t \in [T]} \mathbb{E}[\mathrm{Util}_t(b)]$ is one-sided Lipschitz or semi-continuous, do not hold in general. We address this problem by developing techniques designed to control the approximation error incurred when discretizing the bidding space. This is a non-trivial problem without regularity assumptions, as the neighborhood of the optimal bid where the total utility is "good enough" can be arbitrarily small in general (see the red interval $\Delta$ in the rightmost plot of Figure 1). In the stochastic i.i.d. setting, the approximation error is controlled by building a sample-based non-uniform grid of candidate bids, which can be of independent interest. This allows us to estimate the distribution of the competing bids uniformly over the subintervals of $[0, 1]$. In the adversarial setting, instead, we use the smoothness assumption to guarantee that the expected utility is Lipschitz. In this case, the approximation error is controlled using a uniform grid with an appropriate grid size (Lemma 4).
The feedback models. Our feedback models interpolate between bandit (only the bidder's utility is observed) and full feedback ($V_t$ and $M_t$ are always observed). In the stochastic i.i.d. case, the different levels of transparency are crucial to the process of building the non-uniform grids used to control the discretization error. In the adversarial case, when there are only $K$ allowed bids, the optimal rates are of order $\sqrt{T \ln K}$ and $\sqrt{KT}$ under full and bandit feedback, respectively. While the semi-transparent feedback is not enough to improve on the bandit rate, the transparent one can be exploited via a more sophisticated approach. To this end, we design an algorithm, Exp3.FPA, enjoying the full feedback regret rate of order $\sqrt{T \ln K}$ while only relying on the weaker transparent feedback.
Lower bounds. The linear lower bounds (Theorems 1 and 8) exploit a "needle in a haystack" phenomenon, where there is a hidden optimal bid $b^\star$ in the $[0, 1]$ interval and the learner has no way of finding $b^\star$ using the feedback it has access to. This is indeed the case in the non-smooth adversarial full-feedback setting and in the non-smooth i.i.d. bandit setting. To prove the remaining lower bounds, we design careful embeddings of known hard instances into our framework. In particular, in Theorem 5 we embed the hard instance for prediction with two experts and in Theorem 3 the hard instance for $K = \Theta(T^{1/3})$ bandits.

Related Work
Transparency in first-price auctions. The role of transparency in repeated first-price auctions has been investigated by Bergemann and Hörner [2018], but mostly from a game-theoretic viewpoint. In particular, they study the impact of the feedback policy on the bidders' strategy and show how disclosing the bids at the end of each round affects the equilibria of a bidding game with infinite horizon. In contrast, we want to characterize the impact of different amounts of feedback (or degrees of transparency) on the learner's regret, which is measured against the optimal fixed bid in hindsight.
Auctions with unknown valuations. Although the problem of regret minimization in first-price auctions has been studied before, only a few papers consider the natural setting of unknown valuations. Feng et al. [2018] introduce a general framework for the study of regret in auctions where a bidder's valuation is only observed when the auction is won. In the special case of first-price auctions, their setting is equivalent to our transparent feedback when the sequence of pairs $(V_t, M_t)$ is adversarially generated. Following a parameterization introduced by Weed et al. [2016], Feng et al. [2018] provide an $O\big(\sqrt{T \ln \max\{\Delta_0^{-1}, T\}}\big)$ regret bound, where $\Delta_0 = \min_{t < t'} |M_t - M_{t'}|$ is controlled by the environment. In the stochastic i.i.d. case, their results translate into distribution-dependent guarantees that do not yield a worst-case sublinear bound (we obtain a $\sqrt{T}$ rate). In the adversarial case, their guarantees are still linear in the worst case (we obtain $\sqrt{T}$ bounds by leveraging the smoothness assumption). Achddou et al. [2021] consider a stochastic i.i.d. setting with the additional assumption that $V_t$ and $M_t$ are independent. Their main result is a bidding algorithm with distribution-dependent regret rates (of order $T^{1/3+\varepsilon}$ or $\sqrt{T}$, depending on the assumptions on the underlying distribution) in the transparent setting. Again, this result is not comparable to ours because of the independence assumption and the distribution-dependent rates (which do not allow recovering our minimax rates). Other works consider regret minimization in repeated second-price auctions with unknown valuations. Dikkala and Tardos [2013] investigate a repeated bidding setting, but do not consider regret minimization. Weed et al. [2016] derive regret bounds for the case when the $M_t$ are adversarially generated, while the $V_t$ are stochastically or adversarially generated and the feedback is transparent.
First-price auctions with known valuations. Considerably more works study first-price auctions when the valuation $V_t$ is known to the bidder at the beginning of each round $t$. Note that these results are not directly comparable to ours. Balseiro et al. [2019] look at the case when the $V_t$ are adversarial and the $M_t$ are either stochastic i.i.d. or adversarial. In the bandit feedback case (when $M_t$ is never observed), they show that the minimax regret is $\Theta(T^{2/3})$ in the stochastic case and $\Theta(T^{3/4})$ in the adversarial case. Han et al. [2020b] prove an $O(\sqrt{T})$ regret bound in the semi-transparent setting ($M_t$ observed only when the auction is lost) with adversarial valuations and stochastic bids. Han et al. [2020a] focus on the adversarial case, when $V_t$ and $M_t$ are both generated adversarially. They prove an $O(\sqrt{T})$ regret bound in the full feedback setting ($M_t$ always observed) when the regret is defined with respect to all Lipschitz shading policies. This setup is extended in Zhang et al. [2022], where the authors consider the case in which the bidder is provided access to hints before each auction. Zhang et al. [2021] also study the full information feedback setting and design a space-efficient variant of the algorithm proposed by Han et al. [2020a]. Badanidiyuru et al. [2023] introduce a contextual model in which $V_t$ is adversarial and $M_t = \langle \theta, x_t \rangle + \varepsilon_t$, where $x_t \in \mathbb{R}^d$ is contextual information available at the beginning of each round $t$, $\theta \in \mathbb{R}^d$ is an unknown parameter, and $\varepsilon_t$ is drawn from an unknown log-concave distribution. They study regret in bandit and full feedback settings.
Dynamics in first-price auctions. A different thread of research is concerned with the convergence properties of the regret minimization dynamics in first-price auctions (or, more specifically, with the learning dynamics of mean-based regret minimization algorithms). Feldman et al. [2016] show that with continuous bid levels, coarse-correlated equilibria exist whose revenue is below the second price. Feng et al. [2021] prove that regret-minimizing bidders converge to a Bayesian Nash equilibrium in first-price auctions when bidder values are drawn i.i.d. from a uniform distribution on $[0, 1]$. Kolumbus and Nisan [2022] show that if two bidders with finitely many bid values converge, then the equilibrium revenue of the bidder with the highest valuation is the second price. Deng et al. [2022] characterize the equilibria of the learning dynamics depending on the number of bidders with the highest valuation. Their characterization is for both time-average and last-iterate convergence.
Smoothed adversary. Smoothed analysis of algorithms, originally introduced by Spielman and Teng [2004] and later formalized for online learning by Rakhlin et al. [2011], Haghtalab et al. [2020], is a known approach to the analysis of algorithms in which the instances at every round are generated from a distribution that is not too concentrated. Recent works on the smoothed analysis of online learning algorithms include Kannan et al. [2018], Haghtalab et al. [2020, 2022], Block et al. [2022], Durvasula et al. [2023], Cesa-Bianchi et al. [2023, 2024b, 2021, 2024c], Bolić et al. [2024].
Online learning in metric spaces. Our problem is related to online learning in metric spaces [Kleinberg et al., 2019], where the action space is endowed with a metric and the losses are induced by a sequence of Lipschitz functions defined on it. Tight regret bounds are known, parameterized by some notion of dimension of the metric space, in both the full and the bandit models. The simple structure of our action space ($[0, 1]$ with the Euclidean distance) allows us to obtain tight bounds by either using a uniform grid (Theorems 6 and 7) or sample-based grids (Theorems 2 and 4), without resorting to the more elaborate techniques that characterize this line of research, e.g., zooming (which is typically used in the bandit feedback model to account for the lack of feedback). Also related to our model is the study of piecewise and regular Lipschitz functions [Balcan et al., 2018, Sharma et al., 2020, Duetting et al., 2023]. In particular, Lemma 1 and Theorem 3 in Balcan et al. [2018] imply our Theorem 6 in the special case of independent processes.*
* Combining the second part of their Lemma 1 with their Theorem 3 to lift independence gives void guarantees in the general case (note that there is a typo in the statement of their Lemma 1: as can be seen in the proof, the […]

The Learning Model
We formally introduce the repeated bidding problem in first-price auctions. At each time step $t$, a new item arrives for sale, for which the learner holds some unknown valuation $V_t \in [0, 1]$. The learner bids some $B_t \in [0, 1]$ and, at the same time, a set of competitors bid for the same object. We denote their highest competing bid by $M_t \in [0, 1]$. The learner gets the item at cost $B_t$ if it wins the auction (i.e., if $B_t \ge M_t$), and does not get it otherwise. Then, the learner observes some feedback $Z_t$ and gains utility $\mathrm{Util}_t(B_t)$, where, for all $b \in [0, 1]$, $\mathrm{Util}_t(b) = (V_t - b)\,\mathbb{I}\{b \ge M_t\}$ (see the Protocol in Section 1). Crucially, at time $t$ the learner does not know its valuation $V_t$ for the item before bidding, implying that its bid $B_t$ only depends on its past observations $Z_1, \dots, Z_{t-1}$ (and, possibly, some internal randomization). The goal of the learner is to design a learning algorithm $A$ that maximizes its utility. More precisely, we measure the performance of an algorithm $A$ by its regret $R_T(A)$ against the worst environment $S$ in a certain class $\Xi$: $R_T(A) = \sup_{S \in \Xi} R_T(A, S)$, where
$$R_T(A, S) = \sup_{b \in [0, 1]} \mathbb{E}\Big[\sum_{t=1}^{T} \mathrm{Util}_t(b) - \sum_{t=1}^{T} \mathrm{Util}_t(B_t)\Big].$$
The expectation in the previous display is taken with respect to the randomness of the algorithm $A$, which selects $B_t$, and (possibly) the randomness of the environment $S$ generating the $(V_t, M_t)$ pairs.
The environments. In this paper we consider both stochastic i.i.d. and adversarial environments.
• Stochastic i.i.d.: the pairs $(V_1, M_1), (V_2, M_2), \dots$ are drawn i.i.d. from a fixed but unknown distribution.
• Adversarial: the sequence $(V_1, M_1), (V_2, M_2), \dots$ is generated by an oblivious adversary.
Following previous works in online learning (see Section 1.3), we also study versions of the above environments that are constrained to generate the sequence of $(V_t, M_t)$ values using distributions that are "not too concentrated". To this end, we introduce the notion of smooth distributions.
Definition 1 (Haghtalab et al. [2021]). Let $\mathcal{X}$ be a domain that supports a uniform distribution $\nu$. A measure $\mu$ on $\mathcal{X}$ is said to be $\sigma$-smooth if for all measurable subsets $A \subseteq \mathcal{X}$, we have $\mu(A) \le \nu(A)/\sigma$.
We thus also consider the following two types of environments.
• The $\sigma$-smooth stochastic i.i.d. environment, which is a stochastic i.i.d. environment where the common distribution of all pairs $(V_1, M_1), (V_2, M_2), \dots$ is $\sigma$-smooth.
• The $\sigma$-smooth adversarial setting, where the pairs $(V_1, M_1), (V_2, M_2), \dots$ form a stochastic process such that, for each $t$, the distribution of the pair $(V_t, M_t)$ is $\sigma$-smooth.
The feedback. After describing the environments that we study, we now specify the types of feedback the learner receives at the end of each round, from the richest to the least informative.
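• Full feedback. The learner observes its valuation and the highest competing bid: $Z_t = (V_t, M_t)$.
• Transparent feedback. The learner always observes $M_t$, but $V_t$ is only revealed if it gets the item: $Z_t = (\star, M_t)$ if $B_t < M_t$, and $Z_t = (V_t, M_t)$ otherwise.
• Semi-transparent feedback.† The learner observes $M_t$ only if it loses the auction, and $V_t$ only if it gets the item: $Z_t = (\star, M_t)$ if $B_t < M_t$, and $Z_t = (V_t, \star)$ otherwise.
• Bandit feedback.‡ The learner never observes $M_t$, and $V_t$ is only revealed if it gets the item: $Z_t = \star$ if $B_t < M_t$, and $Z_t = V_t$ otherwise.
[…]) and, without assuming independence, $P = T$ in our setting).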
† This feedback is similar to the winner-only feedback in Han et al. [2020b].
‡ We call this the bandit feedback because it is equivalent to receiving $\mathrm{Util}_t(B_t)$ (with the extra information $\star$ to distinguish between losing the item and winning it with $V_t = B_t$, which does not affect regret guarantees).

The Stochastic i.i.d. Setting
In this section, we investigate the problem of repeated bidding in first-price auctions with unknown valuations, when the pairs of valuations and highest competing bids are drawn i.i.d. from a fixed but unknown distribution. We start by proving in Section 3.1 that it is impossible to achieve sublinear regret under the bandit feedback model without any assumption on the distribution of the environment. Then, in Section 3.2, we give matching upper and lower bounds of order $T^{2/3}$ in the semi-transparent feedback model. Notably, the lower bound holds for smooth distributions, while the upper bound works for any (possibly non-smooth) distribution. Finally, in Section 3.3 we prove that both the full and transparent feedback yield the same minimax regret regime of order $\sqrt{T}$, regardless of the regularity of the distribution.

I.I.D. -Bandit Feedback
In the bandit feedback model, at each time step, the learner observes the valuation $V_t$ (and nothing else) when the auction is won, and observes nothing when the auction is lost. The crucial difference with the other (richer) types of feedback is the amount of information received about $M_t$, which, in the bandit case, is just the relative position with respect to $B_t$ (i.e., whether $M_t \le B_t$ or $B_t < M_t$). This makes it possible to hide in the interval $[0, 1]$ an optimal bid $b^\star$ which the learner cannot uncover over a finite time horizon. Following this idea, a difficult environment should randomize between two scenarios: a good scenario with large value $V_t = 1$ and $M_t$ slightly smaller than $b^\star$, and a bad one with poor value $V_t = 0$ and $M_t$ slightly larger than $b^\star$. Then, to avoid suffering linear regret, the learner has to find this tiny interval around $b^\star$ (the "needle in a haystack").
Theorem 1. Consider the problem of repeated bidding in first-price auctions in a stochastic i.i.d. environment with bandit feedback. Then, any learning algorithm $A$ satisfies $R_T(A) \ge \frac{1}{13} T$.
Proof. We construct a randomized i.i.d. environment $S$ such that any deterministic algorithm $A$ suffers linear regret against it, and then apply Yao's minimax principle to conclude the proof. The randomized environment is simple: before starting the sequence, a seed $b^\star$ is drawn uniformly at random in […], where $\varepsilon$ is a small parameter we set later. Then, the i.i.d. sequence $(V_1, M_1), (V_2, M_2), \dots$ is drawn as follows: at each time step $t$, with probability $1/2$ we have […], which is at least $T/4$, as $b^\star$ belongs to $(1/3, 1/2)$. We now upper bound the utility achievable by any deterministic algorithm $A$ against $S$. Fix any such algorithm, and consider its bids against any environment that selects the valuations $V_t$ to be either 0 or 1 (as the one we just constructed). At each time step, the feedback that $A$ receives is 0, 1, or $\star$ (when the item is allocated to one of the competitors), so that the history of the bids posted by $A$ is naturally described by a ternary decision tree of height $T$, where each level corresponds to a time step and each node to a bid. Crucially, this tree has finitely many leaves (at most $3^T$), which means that the algorithm $A$ only posts bids in a finite subset $N$ of $[0, 1]$. Now, let $\varepsilon = 3^{-2T}/12$; we have that, with probability at least 1 […] Note: the randomness is with respect to the uniform seed $b^\star$ drawn by $S$, while the bound on the probability holds independently of the choice of the deterministic algorithm $A$. The total utility of $A$ when $[b^\star, b^\star + \varepsilon]$ does not intersect $N$ is easy to analyze: every time that $A$ posts a bid smaller than $b^\star$, it never wins the item (zero utility). Instead, if it posts a bid larger than $b^\star + \varepsilon$, then it always gets the item (whose average value is $1/2$), paying at least $b^\star + \varepsilon \ge 1/3$. Putting these two cases together, we have proved that at each time step the expected utility earned by the learner is at most […] at least $1 - e^{-T}$). Finally, by combining the lower bound on the performance of $b^\star$ with the upper bound on the expected utility of the learner, we get $R_T(A, S) \ge (1 - e^{-T})(T/4 - T/6) \ge T/13$.

Collect Bids
input: Time horizon $T_0$
$X_0 \leftarrow 0$ and $M^{(0)} \leftarrow 0$
for each round $t = 1, 2, \dots, T_0$ do
    Post bid $B_t = 0$ and observe the highest competing bid $M_t$
Sort the observed highest competing bids in increasing order: […]
[…]
    if $j_i = T_0$ then let $K \leftarrow i$ and break
return Candidate bids $X_0, X_1, X_2, \dots, X_K$
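A possible reading of the Collect Bids routine is sketched below in Python. Several steps of the pseudocode are not recoverable here, so the selection rule is an assumption of this sketch: it walks through the sorted competing bids in steps of $\lceil \sqrt{T_0} \rceil$ order statistics and always keeps the largest observed bid. The helper `max_gap_frequency` is also hypothetical; it only checks the property the text asks for, namely that few observed bids fall strictly between two consecutive candidates.

```python
import math


def collect_bids(observed_bids):
    """Select candidate bids from the T0 observed highest competing bids.

    Assumed rule (not taken from the source): keep every ceil(sqrt(T0))-th
    order statistic, plus the largest observation, on top of X_0 = 0.
    """
    t0 = len(observed_bids)
    step = math.ceil(math.sqrt(t0))
    ordered = sorted(observed_bids)      # M^(1) <= ... <= M^(T0)
    grid = [0.0]                         # X_0 = 0
    j = 0
    while j < t0:
        j = min(j + step, t0)            # next index j_i, capped at T0
        x = ordered[j - 1]               # order statistic M^(j), 1-indexed
        if x > grid[-1]:                 # skip ties: the grid stays strictly increasing
            grid.append(x)
    return grid


def max_gap_frequency(grid, observed_bids):
    """Largest fraction of observed bids strictly between consecutive candidates."""
    t0 = len(observed_bids)
    return max(
        (sum(lo < m < hi for m in observed_bids) / t0
         for lo, hi in zip(grid, grid[1:])),
        default=0.0,
    )
```

Under this rule at most $\lceil \sqrt{T_0} \rceil - 1$ observations lie strictly between two consecutive candidates, so the empirical frequency of each gap is below $1/\sqrt{T_0}$ and the grid has $O(\sqrt{T_0})$ points.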

I.I.D. -Semi-Transparent Feedback
In this section, we prove two results settling the minimax regret for the semi-transparent feedback where the environment is i.i.d. (and, possibly, smooth). First, we construct a learning algorithm, Collecting Bandit, achieving $T^{2/3}$ regret against any i.i.d. environment. Then, we complement it with a lower bound of the same order (up to log terms) obtained even in a smooth i.i.d. environment.
3.2.1 A $T^{2/3}$ Upper Bound for the i.i.d. Environment
Our learning algorithm Collecting Bandit is composed of two phases. First, for $T_0 = \Theta(T^{2/3})$ rounds, it collects samples from the highest competing bid random variables $M_1, M_2, \dots, M_{T_0}$ by posting dummy bids $B_1 = \dots = B_{T_0} = 0$. Among these values (plus the value $X_0 = 0$), the algorithm selects $\Theta(\sqrt{T_0})$ candidate bids according to their ordering, in such a way that the empirical frequencies of bids $M_1, M_2, \dots, M_{T_0}$ landing strictly in between two consecutive selected values are at most $\Theta(1/\sqrt{T_0})$ (see the pseudocode of Collect Bids for details). Second, for the remaining time steps, it runs any bandit algorithm, using as candidate bids the ones collected in the first phase (see Collecting Bandit for details). Note that, in this second phase, the (less informative) bandit feedback would be enough to run the algorithm: the additional information provided by the semi-transparent feedback is only exploited in the initial "collecting bids" phase. As a first step, we state a simple concentration result pertaining to the i.i.d. process $M, M_1, M_2, \dots, M_{T_0}$, for $T_0 \in \mathbb{N}$. If $\mathcal{I}$ is the family of all the subintervals of $[0, 1]$ and $\delta \in (0, 1)$, we define […] The family $\mathcal{I}$ of all the subintervals of $[0, 1]$ has VC dimension 2 (see, e.g., Mitzenmacher and Upfal [2017, Chapter 14.2]). Therefore, $E^{T_0}_\delta$ is realized with probability at least $1 - \delta$, via standard sample complexity bounds for $\varepsilon$-samples (see, e.g., Mitzenmacher and Upfal [2017, Theorem 14.15]). This is summarized in the following lemma.
Lemma 1. For every $T_0 \in \mathbb{N}$ and $\delta \in (0, 1)$, we have $\mathbb{P}[E^{T_0}_\delta] \ge 1 - \delta$.
For the sake of readability, we introduce the following notation: […] We now prove a lemma that allows us to control the expected cumulative utility of any bid in $[0, 1]$ with that of the best bid in a discretization (without relying on any smoothness assumption).
Lemma 2. Consider any finite grid $X = \{x_0, \dots, x_K\}$, with $0 = x_0 < x_1 < \dots < x_K \le 1$, and assume that the process $M, M_1, M_2, \dots$ of the highest competing bids forms an i.i.d. sequence. For all $b \in [0, 1]$ and […] Summing over all times $t$ and recalling that $M_t$ and $M$ share the same distribution yields the conclusion.
As a corollary of Lemmas 1 and 2 we obtain a similar discretization error guarantee when the grid of points X is random.
Lemma 3. Fix any $T_0 \in \mathbb{N}$ and $\delta \in (0, 1)$. Let $X = \{X_0, \dots, X_K\}$ be a random set containing a random number $K$ of points satisfying $0 = X_0 < X_1 < \dots < X_K \le 1$. Assume that the random variables $K, X_0, X_1, \dots, X_{K+1}$ are $\mathcal{H}_{T_0}$-measurable, where $\mathcal{H}_{T_0}$ is the history up to and including time $T_0$. Assume that the process $(V_1, M_1), (V_2, M_2), \dots$ of the valuations/highest competing bids forms an i.i.d. sequence. Then, for all $b \in [0, 1]$ and $T_1 \in \mathbb{N}$ with $T_1 > T_0$, we have: […]
We are now ready to present the main theorem of this section.
Theorem 2. Consider the problem of repeated bidding in first-price auctions in a stochastic i.i.d. environment with semi-transparent feedback. Then there exists a learning algorithm $A$ such that […]
Proof. We prove that Collecting Bandit yields the desired bound when its learning routine $A$ is (a rescaled version of) MOSS [Audibert and Bubeck, 2009]: since MOSS is designed to run with gains in $[0, 1]$ while the utilities we observe are in $[-1, 1]$, we first apply the reward transformation $x \mapsto \frac{x+1}{2}$ to the observed utilities. This costs a multiplicative factor of 2 on the regret guarantees of MOSS. Leveraging the fact that the empirical frequency between two consecutive $X_k$ and $X_{k+1}$ generated by Collect Bids is at most $2/\sqrt{T_0}$ by design and applying Lemma 3 with $T_1 = T$ to the random variables $X_0, X_1, \dots, X_K$, we get, for all $b \in [0, 1]$, that […] Now, applying the tower rule to the expectation on the right-hand side, conditioning on the history $\mathcal{H}_{T_0}$ up to time $T_0$, we can use the fact that the regret of the rescaled version of MOSS is upper bounded by $98\sqrt{(K + 1)(T - T_0)}$ and the number of points $K + 1$ collected by Collect Bids is at most […] Finally, tuning $\delta = 1/T_0$, upper bounding the cumulative regret over the first $T_0$ rounds with $T_0$, and recalling that $T_0 = \lceil T^{2/3} \rceil$ yields the conclusion.

Collecting Bandit (CoBa)
input: Time horizon $T$, bandit algorithm $A$ for gains in $[-1, 1]$
$T_0 \leftarrow \lceil T^{2/3} \rceil$
Run Collect Bids with horizon $T_0$ and obtain $X_0, X_1, \dots, X_K$
Initialize $A$ on $K + 1$ actions (one for each candidate bid $X_i$) and $T - T_0$ as time horizon
for each round $t = T_0 + 1, T_0 + 2, \dots, T$ do
    Receive from $A$ the bid $B_t = X_{I_t}$ for some $I_t \in \{0, 1, \dots, K\}$
    Post bid $B_t$ and observe feedback $Z_t$
    Reconstruct $\mathrm{Util}_t(B_t)$ from $Z_t$ and feed it to $A$
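The two phases of Collecting Bandit can be put together in a short, self-contained Python sketch. It is only an illustration under several assumptions: the candidate grid is built with an assumed rule (every $\lceil \sqrt{T_0} \rceil$-th order statistic), the bandit routine is a simplified optimistic index in the spirit of MOSS rather than the exact algorithm and constants of Audibert and Bubeck [2009], and the environment is a hypothetical callable `draw()` returning a pair `(v, m)`.

```python
import math


def collect_bids(observed_bids):
    """Assumed grid rule: every ceil(sqrt(T0))-th order statistic, plus X_0 = 0."""
    t0 = len(observed_bids)
    step = math.ceil(math.sqrt(t0))
    ordered = sorted(observed_bids)
    grid, j = [0.0], 0
    while j < t0:
        j = min(j + step, t0)
        if ordered[j - 1] > grid[-1]:
            grid.append(ordered[j - 1])
    return grid


def collecting_bandit(draw, horizon):
    """Two-phase bidding sketch; returns (total utility, candidate grid)."""
    t0 = min(math.ceil(horizon ** (2 / 3)), horizon)
    total, observed = 0.0, []

    # Phase 1: post dummy bids of 0 and record the highest competing bids.
    for _ in range(t0):
        v, m = draw()
        observed.append(m)
        total += v if 0.0 >= m else 0.0   # a bid of 0 wins only if m = 0

    # Phase 2: run an index-based bandit routine on the candidate bids.
    grid = collect_bids(observed)
    k, n = len(grid), max(horizon - t0, 1)
    counts, means = [0] * k, [0.0] * k

    def index(i):
        # MOSS-style optimistic index (simplified; constants are an assumption).
        bonus = max(math.log(n / (k * counts[i])), 0.0) / counts[i]
        return means[i] + math.sqrt(bonus)

    for _ in range(horizon - t0):
        untried = [i for i in range(k) if counts[i] == 0]
        arm = untried[0] if untried else max(range(k), key=index)
        bid = grid[arm]
        v, m = draw()
        u = (v - bid) if bid >= m else 0.0   # utility, recoverable from the feedback
        total += u
        gain = (u + 1) / 2                   # rescale from [-1, 1] to [0, 1]
        counts[arm] += 1
        means[arm] += (gain - means[arm]) / counts[arm]
    return total, grid
```

For example, `collecting_bandit(lambda: (0.9, 0.3), 1000)` uses a degenerate environment with constant valuation 0.9 and constant competing bid 0.3; the grid then reduces to the two candidates 0 and 0.3, and the second phase concentrates on the bid 0.3.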
3.2.2 A $T^{2/3}$ Lower Bound for the Smooth i.i.d. Environment
We prove that the $\tilde{O}(T^{2/3})$ bound achieved by Collecting Bandit is indeed optimal, up to logarithmic terms. Our lower bound consists in carefully embedding into our model a hard multi-armed bandit instance with $K = \Theta(T^{1/3})$ arms, which entails a lower bound of order $\Omega(\sqrt{KT}) = \Omega(T^{2/3})$. This proof agenda involves various challenges: we want to embed a discrete construction of $K$ independent actions into our continuous framework, where the utilities of different bids are correlated, while enforcing smoothness. Furthermore, the semi-transparent feedback is richer than the bandit one. We report here a proof sketch and refer the interested reader to Appendix A.2 for the missing details.
Let $\mathbb{P}^0$ be a probability measure such that $(V, M), (V_1, M_1), \dots$ is a $\mathbb{P}^0$-i.i.d. sequence where each pair $(V, M)$ has common probability density function $f$. Denoting by $\mathbb{E}^0$ the expectation with respect to $\mathbb{P}^0$, we have, for any bid $b \in [0, 1]$ and any time step $t$, […] This function grows with $b$ on $[0, 1/4)$, has a plateau of maximizers $[1/4, 3/4]$, then decreases on $(3/4, 1]$ (see Figure 2, right). We introduce the perturbation space $\Xi$: […] and define, for all $(w, \varepsilon) \in \Xi$, the four rectangles […] For all $(w, \varepsilon) \in \Xi$, we introduce the probability density function $f_{w,\varepsilon} = f + g_{w,\varepsilon}$, where the perturbation $g_{w,\varepsilon}$ is defined as follows: […] We refer to the left plot in Figure 2 for a visualization of the support of the $f_{w,\varepsilon}$. For all $(w, \varepsilon) \in \Xi$, let $\mathbb{P}^{w,\varepsilon}$ be a probability measure such that $(V, M), (V_1, M_1), (V_2, M_2), \dots$ is a $\mathbb{P}^{w,\varepsilon}$-i.i.d. sequence where each pair $(V, M)$ has common probability density function $f_{w,\varepsilon}$. Denoting by $\mathbb{E}^{w,\varepsilon}$ the expectation with respect to $\mathbb{P}^{w,\varepsilon}$, we have, for any bid $b \in [0, 1]$ and any $t$, […] where $\Lambda_{u,r}$ is the tent map centered at $u$ with radius $r$, defined as $\Lambda_{u,r}(x) = \max\{1 - |x - u|/r, 0\}$. In words, in a perturbed scenario $\mathbb{P}^{w,\varepsilon}$ the expected utility is maximized at the peak of a spike centered at $w$ with length and height $\Theta(\varepsilon)$ perturbing the plateau area $[1/4, 3/4]$ of maximum height (see Figure 2, right). Define, for all times $t \in \mathbb{N}$, the feedback function […] and note that, in our semi-transparent feedback model, the feedback $Z_t$ received after bidding $B_t$ at time $t$ is $\psi_t(B_t)$. Crucially, for each $(w, \varepsilon) \in \Xi$ and each $b \in [0, 1] \setminus [w - \varepsilon, w + \varepsilon]$, the distribution of $\psi_t(b)$ under $\mathbb{P}^{w,\varepsilon}$ coincides with the distribution of $\psi_t(b)$ under $\mathbb{P}^0$. In push-forward notation (for a refresher on push-forward measures, see Appendix A.1), it holds that […] (1)
At a high level, we built a problem with two crucial properties: (i) we know in advance the region to which the optimal bid belongs (i.e., the interval $[1/4, 3/4]$), but (ii) when the underlying scenario is determined by the probability measure $\mathbb{P}^k$, the learner has to detect where, inside this potentially optimal region, a spike of height (and length) $\Theta(\varepsilon)$ occurs (to avoid suffering $\Omega(\varepsilon T)$ regret). This last task can be accomplished only by locating where the perturbation of the base probability measure occurs, which, given the feedback structure, can only be done by playing in the interval $[w_k - \varepsilon, w_k + \varepsilon)$ if the underlying probability is $\mathbb{P}^k$, suffering instantaneous regret of order $\varepsilon$ whenever the underlying probability is $\mathbb{P}^j$, with $j \neq k$. Given that we partitioned the potentially optimal region $[1/4, 3/4]$ into $\Theta(1/\varepsilon)$ disjoint intervals where these perturbations can occur, the feedback structure implies that each of these intervals deserves its dedicated exploration. To better highlight this underlying structure, in Appendix A.2 we show that our problem is not easier than a simplified $K$-armed stochastic bandit problem, where the instances we consider are determined by the probability measures $\mathbb{P}^1, \dots, \mathbb{P}^K$. In this bandit problem, when the underlying probability measure is induced by some $\mathbb{P}^k$, the corresponding arm $k$ has an expected reward $\Theta(\varepsilon)$ larger than the others. Then, via an information-theoretic argument, we can show that any learner would need to spend at least order of $1/\varepsilon^2$ rounds to explore each of the $K$ arms (paying $\Omega(\varepsilon)$ each time) or else it would pay a regret $\Omega(\varepsilon T)$. Hence, the regret of any learner, in the worst case, is lower bounded by $\Omega\big(\frac{K}{\varepsilon^2} \cdot \varepsilon + \varepsilon T\big) = \Omega\big(K^2 + \frac{T}{K}\big)$ (recalling our choice of $\varepsilon = 1/(4K)$). Picking $K = \Theta(T^{1/3})$ yields a lower bound of order $T^{2/3}$. For all missing technical details, see Appendix A.2.
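The balancing step at the end of this sketch can be checked numerically. The following is a minimal illustration, not part of the proof: it evaluates the two terms $K/\varepsilon$ and $\varepsilon T$ with $\varepsilon = 1/(4K)$ and shows that both are of order $T^{2/3}$ when $K \approx T^{1/3}$. The function name `regret_terms` is hypothetical.

```python
def regret_terms(K: int, T: int) -> tuple[float, float]:
    """Return the two terms of the heuristic bound for eps = 1/(4K).

    exploration  = (K / eps^2) * eps = K / eps = 4 K^2
    exploitation = eps * T           = T / (4 K)
    """
    exploration = 4.0 * K * K
    exploitation = T / (4.0 * K)
    return exploration, exploitation


if __name__ == "__main__":
    # Perfect cubes are used so that K = T^(1/3) is an exact integer.
    for K in (10, 100, 1000):
        T = K ** 3
        explore, exploit = regret_terms(K, T)
        scale = K * K  # equals T^(2/3) for T = K^3
        print(T, explore / scale, exploit / scale)
```

With $K = T^{1/3}$ the two ratios stay constant as $T$ grows, which is the sense in which this choice of $K$ balances the two terms.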

I.I.D. - Transparent/Full Feedback
This section completes the study of the stochastic i.i.d. environment by determining the minimax regret when the learner has access to full or transparent feedback.

A $\sqrt{T}$ Upper Bound for the i.i.d. Environment
While with semi-transparent feedback the learning algorithm has to rely on dummy bids $B_1 = \dots = B_{T_0} = 0$ to gather information about the distribution of the highest competing bids, with transparent feedback this information is collected for free at each bidding round. To use this extra information, we present a wrapper W.T.FPA (for a sequence of base learning algorithms for the transparent feedback model) whose purpose is to restart the learning process with a geometric step in order to update the set of candidate bids. We assume that each of the wrapped base algorithms $\mathcal{A}_\tau$ can take as input any finite subset $\mathcal{X} \subset [0, 1]$ and returns bids in $\mathcal{X}$. Furthermore, for all $T'$, we let $R_{T'}(\mathcal{A}_\tau, \mathcal{X})$ be an upper bound on the regret over $T'$ rounds of $\mathcal{A}_\tau$ with input $\mathcal{X}$ against the best fixed $x \in \mathcal{X}$. Formally, we require that for any two times $T_0 < T_1$ such that $T' = T_1 - T_0$, the quantity $R_{T'}(\mathcal{A}_\tau, \mathcal{X})$ is an upper bound on $\max_{x \in \mathcal{X}} \mathbb{E}\big[\sum_{t=T_0+1}^{T_1} \mathrm{Util}_t(x) - \sum_{t=T_0+1}^{T_1} \mathrm{Util}_t(B_t)\big]$, where $B_t \in \mathcal{X}$ is the sequence of bids played by $\mathcal{A}_\tau$ (with input $\mathcal{X}$) when started at round $t = T_0 + 1$ and run up to time $T_1$. Without loss of generality, we assume that $T' \mapsto R_{T'}(\mathcal{A}_\tau, \mathcal{X})$ is non-decreasing.

W.T.FPA (Wrapper for Transparent First-Price Auctions)
Start $\mathcal{A}_\tau$ with input $\mathcal{X}_\tau$ and run it for $t = s + 1, \dots, s + 2^{\tau - 1}$

Proposition 1. Consider the problem of repeated bidding in first-price auctions in a stochastic i.i.d. environment with transparent feedback. Then the regret of W.T.FPA run with base algorithms
Proof. Fix an arbitrary epoch $\tau \in \{2, \dots, \lceil \log_2(T + 1) \rceil\}$; we want to bound the regret suffered there by W.T.FPA using Lemma 3.
Using the notation of the lemma, let $\mathcal{X} = \mathcal{X}_\tau$, $K + 1 = |\mathcal{X}|$, $T_0 = \sum_{\tau'=1}^{\tau-1} 2^{\tau'-1} = 2^{\tau-1} - 1$ (the time passed from the beginning of epoch 1 up to and including the end of epoch $\tau - 1$), $T_1 = \min\{T_0 + 2^{\tau-1}, T\}$ (the end of epoch $\tau$), and let $X_0 < X_1 < \dots < X_K$ be the distinct elements of $\mathcal{X}$ in increasing order, where we note that $X_0 = 0$, $X_K \le 1$, and we set $X_{K+1} = 2$. Let $\mathcal{H}_{T_0}$ be the history up to and including time $T_0$. Applying Lemma 3 (together with the fact that the empirical frequency between any two consecutive values $X_k$ and $X_{k+1}$ is 0 by design), and exploiting the monotonicity of $T' \mapsto R_{T'}(\mathcal{A}_\tau, \mathcal{X}_\tau)$ for the last epoch (if $T_0 + 2^{\tau-1} > T$), we obtain, for all $b \in [0, 1]$ and $\delta \in (0, 1)$,
Summing over epochs $\tau \in \{2, \dots, \lceil \log_2(T + 1) \rceil\}$, upper bounding by 1 the regret incurred in the first epoch, and tuning $\delta = 1/T$ yields the conclusion.
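The epoch bookkeeping used in this proof can be made concrete with a short sketch. It only illustrates the geometric restart schedule (epoch $\tau$ has length $2^{\tau-1}$, starts after $T_0 = 2^{\tau-1} - 1$ rounds, and the last epoch is truncated at $T$); the generator name `epochs` is an assumption, and the construction of the candidate sets $\mathcal{X}_\tau$ is not reproduced here.

```python
from typing import Iterator, Tuple


def epochs(T: int) -> Iterator[Tuple[int, int, int]]:
    """Yield (tau, first_round, last_round) for geometric epochs up to horizon T.

    Epoch tau covers rounds T0 + 1, ..., min(T0 + 2^(tau-1), T),
    where T0 = 2^(tau-1) - 1 is the number of rounds played before it.
    """
    tau, start = 1, 1
    while start <= T:
        end = min(start + 2 ** (tau - 1) - 1, T)
        yield tau, start, end
        start, tau = end + 1, tau + 1


if __name__ == "__main__":
    for tau, first, last in epochs(10):
        print(f"epoch {tau}: rounds {first}..{last}")
```

For $T \ge 1$ the number of epochs produced equals $\lceil \log_2(T + 1) \rceil$, which matches the range of $\tau$ summed over in the proof.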
Now we are only left to design appropriate base algorithms $\mathcal{A}_1, \mathcal{A}_2, \dots$ for the transparent feedback to wrap W.T.FPA around.
The Exp3.FPA algorithm. To this end, we introduce the Exp3.FPA algorithm (designed to run with transparent feedback), which borrows ideas from online learning with feedback graphs [Alon et al., 2017]. Similar algorithms for related settings have been previously proposed by Weed et al. [2016] and Feng et al. [2018]. For the familiar reader, note that our setting can be seen as an instance of online learning with strongly observable feedback graphs. In contrast to a black-box application of feedback-graph results, we shave off a logarithmic term (in the time horizon) by using a dedicated analysis. For any $x \in [0, 1]$, we denote by $\delta_x$ the Dirac distribution centered at $x$.

4: Post bid
For all $x \in \mathcal{X}$, define the reward estimate:
For all $x \in \mathcal{X}$, update the weight:
Note that the transparent feedback is sufficient to compute the reward estimates in Line 5.
Proposition 2. Let $\mathcal{X} \subset [0, 1]$ be a finite set, $T \in \mathbb{N}$ a time horizon, and tune the exploration rate as $\gamma = \sqrt{\ln(|\mathcal{X}|) / ((e - 1)T)}$. Then, the regret of Exp3.FPA against the best fixed bid in $\mathcal{X}$ is
Notice that, for each $t \in \mathbb{N}$, it holds that $\sum_{y \in \mathcal{X},\, y \ge M_t} p_t(y) \ge \gamma$. It follows, for each $x \in \mathcal{X}$ and $t \in \mathbb{N}$, that $\gamma\, g_t(x) \le 1$, and hence
Then, for each $t \in \mathbb{N}$,
which implies
Now, for each $t \in \mathbb{N}$, let $\mathcal{F}_t$ be the $\sigma$-algebra generated by $p_t$, $V_t$ and $M_t$ and denote by
First, notice that, for each $t \in \mathbb{N}$ and each $x \in \mathcal{X}$
and that
It follows that, for each $x \in \mathcal{X}$,
$\mathrm{Util}_t(B_t) + (e - 2)\gamma T$,
which, after rearranging and upper bounding, yields
Selecting $\gamma$ as in the statement of the proposition leads to the conclusion.
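To make the role of the quantity $\sum_{y \ge M_t} p_t(y)$ more tangible, here is a hedged sketch of one standard importance-weighted reward estimator under transparent feedback. It is an assumption-laden illustration, not necessarily the estimator used by Exp3.FPA: it assumes that the utility of bid $x$ is $(V_t - x)$ if $x \ge M_t$ and $0$ otherwise, that $M_t$ is always observed, and that $V_t$ is observed only when the posted bid wins. The names `utility` and `reward_estimates` are hypothetical.

```python
from typing import List, Optional, Sequence


def utility(x: float, V: float, M: float) -> float:
    """Assumed first-price utility: win (and pay x) iff x >= M."""
    return (V - x) if x >= M else 0.0


def reward_estimates(
    grid: Sequence[float],
    p: Sequence[float],
    bid: float,
    V: Optional[float],
    M: float,
) -> List[float]:
    """Importance-weighted estimates of the utility of every bid in the grid.

    V is only read when the posted bid wins (bid >= M); otherwise it may be None.
    """
    # Probability, under p, of posting a winning bid (i.e. of observing V).
    win_mass = sum(py for y, py in zip(grid, p) if y >= M)
    won = bid >= M
    estimates = []
    for x in grid:
        if won and x >= M:
            estimates.append((V - x) / win_mass)
        else:
            # Losing bids have utility 0; if we lost, V is unobserved.
            estimates.append(0.0)
    return estimates
```

Averaging the estimate over the random posted bid recovers the true utility of each grid point whenever the winning mass is positive, and a lower bound $\gamma$ on that mass keeps the estimates bounded by $1/\gamma$ in absolute value.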
Putting together Propositions 1 and 2 yields the desired rate.

A √ T Lower Bound for the i.i.d. Environment
We complement the positive result of Theorem 4 with a matching lower bound of order $\sqrt{T}$. The idea underlying our hard instance is to embed the well-known lower bound for prediction with (two) experts into our framework: we construct two smooth distributions that are "similar" but have two different optimal bids whose performance is separated, so that no learner can identify the correct distribution without suffering regret of order at least $\sqrt{T}$.
Theorem 5. Consider the problem of repeated bidding in first-price auctions in a stochastic i.i.d. $\sigma$-smooth environment with full feedback, for $\sigma \in (0, 1/9]$. Then, any learning algorithm $\mathcal{A}$ satisfies
Proof. We prove the theorem by Yao's principle: we show that there exists a distribution over stochastic $\sigma$-smooth environments such that any deterministic learning algorithm $\mathcal{A}$ suffers $\Omega(\sqrt{T})$ regret against it, in expectation. We do that in two steps. First, for every $\varepsilon \in (0, 1/2)$ we construct a pair of $1/9$-smooth distributions that are hard to discriminate for the learner. Then, we prove that, for the right choice of $\varepsilon$, any learner suffers the desired regret against a uniform mixture of them. For visualization, we refer to Figure 3. As a tool for our construction, we introduce a baseline probability measure $\mathbb{P}^0$, such that the sequence $(V, M), (V_1, M_1), (V_2, M_2), \dots$ is $\mathbb{P}^0$-i.i.d., and $(V, M)$ has distribution $\mathbb{P}^0_{(V,M)}$ (for a refresher on push-forward measures, see Appendix A.1) whose pdf is $f^0 = 8\,(\mathbb{I}_{Q^+} + \mathbb{I}_{Q^-})$, where $Q^+ = (0, 1/4) \times (0, 1/4)$ and $Q^- = (3/4, 1) \times (1/4, 1/2)$. A convenient way to visualize this distribution is to draw a uniform random variable $U_t$ in the square $Q^+$ and then toss an unbiased coin. If the coin yields heads, then $(V_t, M_t)$ is equal to $U_t$; otherwise $(V_t, M_t)$ coincides with $U_t$ translated by $(3/4, 1/4)$. With some simple computation, it is possible to explicitly compute the expected utility of posting any bid $b \in [0, 1]$ when $(V_t, M_t)$ is drawn following the distribution $\mathbb{P}^0$ (and expectation $\mathbb{E}^0$):
The function $\mathbb{E}^0[\mathrm{Util}_t(b)]$ has two global maxima in $[0, 1]$, of value $1/128$, attained at $1/16$ and $7/16$ (see purple line in Figure 3). For any $\varepsilon \in (0, 1/2)$, we also define two additional (perturbed) probability measures $\mathbb{P}^{\pm\varepsilon}$, such that the sequence $(V, M), (V_1, M_1), \dots$ is $\mathbb{P}^{\pm\varepsilon}$-i.i.d. and the distribution $\mathbb{P}^{\pm\varepsilon}_{(V,M)}$ of $(V, M)$ has density:
Note, $\|f^{\pm\varepsilon}\|_\infty < 9$, while $\|f^0\|_\infty = 8$; therefore all the distributions considered in this proof are $1/9$-smooth. To visualize these new perturbed distributions, recall the construction of $\mathbb{P}^0_{(V,M)}$ using the coin toss and the uniform random variable $U$: in this case, the coin is biased, and the probability of tails is $(1 \pm \varepsilon)/2$. It is possible to explicitly compute the expected utility under these perturbed distributions for any bid $b \in [0, 1]$:
For visualization, we refer to Figure 3 (bottom). The crucial property of the distributions we constructed is that the instantaneous regret of not playing in the "correct" region is $\Omega(\varepsilon)$; formally we have the following result. For the sake of readability, we postpone the proof of this claim to Appendix A.3.
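The stated maxima of the baseline expected utility can be sanity-checked directly from the coin-toss description of $\mathbb{P}^0$. The sketch below is an illustration under the assumption that $\mathrm{Util}_t(b) = (V_t - b)$ when $b \ge M_t$ and $0$ otherwise; the function name `expected_utility_p0` is hypothetical.

```python
def expected_utility_p0(b: float) -> float:
    """Expected utility of bid b under the baseline measure P^0.

    With probability 1/2, (V, M) is uniform on (0, 1/4) x (0, 1/4);
    with probability 1/2, it is uniform on (3/4, 1) x (1/4, 1/2).
    V and M are independent within each square.
    """

    def square_term(m_low: float, v_mean: float) -> float:
        # P[M <= b] for M uniform on (m_low, m_low + 1/4), clipped to [0, 1].
        win_prob = min(max(4.0 * (b - m_low), 0.0), 1.0)
        # Given a win, the expected gain is E[V] - b.
        return win_prob * (v_mean - b)

    return 0.5 * square_term(0.0, 0.125) + 0.5 * square_term(0.25, 0.875)


if __name__ == "__main__":
    grid = [i / 1600 for i in range(1601)]
    best = max(expected_utility_p0(b) for b in grid)
    print(best, expected_utility_p0(1 / 16), expected_utility_p0(7 / 16))
```

On a fine grid the maximum value is $1/128$, and it is attained at both $b = 1/16$ and $b = 7/16$, in agreement with the text.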
Claim 1. There exist two disjoint intervals $I_+$ and $I_-$ in $[0, 1]$ such that, for any $\varepsilon \in (0, 1/2)$ and any time $t$, the following hold:
Since the two distributions are "$\varepsilon$-close"¶, any learner needs at least $1/\varepsilon^2$ rounds to discriminate which one of the two distributions it is actually facing, paying each error with an instantaneous regret of $\Omega(\varepsilon)$ (Claim 1). All in all, any learner suffers a regret that is $\Omega\big(\varepsilon \cdot \frac{1}{\varepsilon^2} + \varepsilon T\big)$, which is of the desired $\Omega(\sqrt{T})$ order for the right choice of $\varepsilon \approx T^{-1/2}$. As the last step of the proof, we formalize the above argument. Fix $\varepsilon = 1/(4\sqrt{T})$ and rename $\mathbb{P}^{+\varepsilon} = \mathbb{P}^1$ and $\mathbb{P}^{-\varepsilon} = \mathbb{P}^2$. Similarly, denote with $I_1$ and $I_2$ the two intervals $I_+$ and $I_-$ as in the statement of Claim 1. For each $j \in \{0, 1, 2\}$, consider the run of $\mathcal{A}$ against the stochastic environment which draws $(V_1, M_1), (V_2, M_2), \dots$ i.i.d. from $\mathbb{P}^j$. Let $N_1$ be the random variable that counts the number of times that algorithm $\mathcal{A}$ posts a bid in $I_1$. Similarly, $N_2$ counts the number of times that it posts a bid in $I_2$. For $i = 1, 2$, we have the following crucial relation for the expected value of $N_i$ under $\mathbb{P}^i$. Note, the result holds because the two distributions are so similar that the deterministic algorithm $\mathcal{A}$ bids in the wrong region a constant fraction of the time steps. For the formal proof, we refer the reader to Appendix A.3.
Claim 2. The following inequality holds:
We finally have all the ingredients to conclude the proof. Consider an environment that selects uniformly at random either $\mathbb{P}^1$ or $\mathbb{P}^2$ and then draws the $(V_t, M_t)$ i.i.d. following it. We prove that the algorithm $\mathcal{A}$ suffers $\Omega(\sqrt{T})$ regret against this randomized environment and, by a simple averaging argument, against at least one of them. Specifically, if $b^\star_i$ is the optimal bid in the scenario determined by $\mathbb{P}^i$, for $i \in \{1, 2\}$, we have
where $(*)$ follows by Claim 1 and the choice of $\varepsilon$, and $(\bullet)$ by Claim 2.
¶ In Appendix A.3 we formally prove that their total variation is $O(\varepsilon)$.
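The choice $\varepsilon = 1/(4\sqrt{T})$ can be motivated with a small numerical check. Under the coin-toss description, a perturbed measure differs from $\mathbb{P}^0$ only through the bias of the coin, so (as a simplifying assumption made for this illustration; the formal bound is the one in Appendix A.3) the per-round KL divergence is that between a coin with bias $(1 + \varepsilon)/2$ and a fair coin. The helper names are hypothetical.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def total_kl(T: int) -> float:
    """KL accumulated over T i.i.d. rounds with eps = 1 / (4 sqrt(T))."""
    eps = 1.0 / (4.0 * math.sqrt(T))
    return T * kl_bernoulli((1 + eps) / 2, 0.5)


if __name__ == "__main__":
    for T in (1, 10, 1_000, 1_000_000):
        print(T, total_kl(T))
```

The accumulated divergence stays below $1/16$ for every horizon, so after $T$ rounds the two scenarios remain statistically hard to tell apart, which is the intuition behind Claim 2.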

The Adversarial Setting
In this section we complete the perspective on repeated bidding in first-price auctions by investigating the adversarial environment. In particular, we consider two models: the standard one, where the sequence $(V_1, M_1), (V_2, M_2), \dots$ is chosen upfront in a deterministic oblivious way, and the smooth environment, where the sequence $(V_1, M_1), (V_2, M_2), \dots$ is some $\sigma$-smooth stochastic process. In Section 4.1 we construct an algorithm achieving $T^{2/3}$ regret in the bandit feedback model under the smoothness assumption; this result, together with the lower bound of the same order for the semi-transparent feedback (Theorem 3), settles the problem for these two feedback regimes. Then, in Section 4.2 we provide another upper bound, namely an algorithm achieving $\sqrt{T}$ regret in the transparent feedback model under the smoothness assumption; this result, together with the lower bound of the same order for the full feedback (Theorem 5), settles the problem for these two feedback regimes. Finally, in Section 4.3 we provide a lower bound proving that the non-smooth adversarial environment is too hard to learn, even when the learner has access to full feedback.

Smooth - Bandit Feedback
The smoothness assumption regularizes the objective function: if $(V_t, M_t)$ is smooth, then the expected utility is Lipschitz.
Proof. Let $x > y$ be any two bids in $[0, 1]$; we have:
Interestingly, we only need the marginal distribution of $M_t$ to be $\sigma$-smooth for the previous lemma to hold. This Lipschitzness property has the immediate corollary that any fine enough discretization of $[0, 1]$ contains a bid whose utility is close to the optimal one.
Lemma 5 (Discretization Lemma). Let $\mathcal{X}$ be any finite grid of bids in $[0, 1]$, and let $\delta(\mathcal{X})$ be the largest distance of a point in $[0, 1]$ to $\mathcal{X}$ (i.e., $\delta(\mathcal{X}$
Proof. Fix any such sequence and let $b^\star$ be a fixed bid such that
where (L) follows by Lipschitzness and Lemma 4. Combining the right-hand side with Equation (4) concludes the proof of the lemma.
We can combine the above discretization lemma with any (optimal) bandit algorithm to get the desired bound on the regret. For details, we refer to the pseudocode of Discretized Bandit.

Discretized Bandit
1: input: Time horizon $T$, bandit algorithm $\mathcal{A}$ for gains in $[-1, 1]$, grid of $K$ bids $\mathcal{X}$
2: Initialize $\mathcal{A}$ on $K$ actions, one for each $x \in \mathcal{X}$, time horizon $T$
3: for each round $t = 1, 2, \dots, T$ do
4:   Receive from $\mathcal{A}$ the bid $B_t \in \mathcal{X}$
5:   Post bid $B_t$ and observe feedback $Z_t$
6:   Reconstruct $\mathrm{Util}_t(B_t)$ from $Z_t$ and feed it to $\mathcal{A}$

Theorem 6. Consider the problem of repeated bidding in first-price auctions in an adversarial $\sigma$-smooth environment with bandit feedback. Then there exists a learning algorithm $\mathcal{A}$ such that
Proof. We prove that algorithm Discretized Bandit with the right choice of learning algorithm $\mathcal{A}$ and grid of bids $\mathcal{X}$ achieves the desired bound on the regret. As learning algorithm $\mathcal{A}$ we use (a rescaled version of) the Poly INF algorithm [Audibert and Bubeck, 2010]: since Poly INF is designed to run with gains in $[0, 1]$ while the utilities we observe are in $[-1, 1]$, we first apply the reward transformation $x \mapsto \frac{x+1}{2}$ to the observed utilities. This transformation costs a multiplicative factor of 2 in the regret guarantees of Poly INF. The analysis builds on the discretization result in Lemma 5, by choosing as $\mathcal{X}$ the uniform grid of $\lceil T^{1/3} \rceil + 1$ equally spaced bids on $[0, 1]$ (note, $\delta(\mathcal{X}) \le T^{-1/3}$), so that the discretization error and the $\sqrt{KT}$ bandit regret are both of order $T^{2/3}$. Fix any $\sigma$-smooth environment $\mathcal{S}$; by Lemma 5, the following chain of inequalities holds:
The second inequality follows from the guarantees of (the rescaled version of) Poly INF [Audibert and Bubeck, 2010, Theorem 11].
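The trade-off behind the choice of the grid can be illustrated with a short sketch. It computes $\delta(\mathcal{X})$ for a uniform grid and compares the discretization term $T \cdot \delta(\mathcal{X})$ with the bandit term $\sqrt{|\mathcal{X}| T}$; the dependence on $\sigma$ and all constants are omitted, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def uniform_grid(K: int) -> List[float]:
    """Return the K + 1 equally spaced bids 0, 1/K, ..., 1."""
    return [k / K for k in range(K + 1)]


def delta(grid: Sequence[float]) -> float:
    """Largest distance from a point of [0, 1] to the grid."""
    g = sorted(grid)
    half_gaps = [(b - a) / 2 for a, b in zip(g, g[1:])]
    return max([g[0], 1.0 - g[-1]] + half_gaps)


def regret_terms(T: int, K: int) -> Tuple[float, float]:
    """Return (discretization term T * delta, bandit term sqrt(#bids * T))."""
    grid = uniform_grid(K)
    return T * delta(grid), math.sqrt(len(grid) * T)


if __name__ == "__main__":
    T = 10**6
    K = 100  # roughly T^(1/3)
    print(regret_terms(T, K), T ** (2 / 3))
```

With $K \approx T^{1/3}$ both terms are of order $T^{2/3}$, whereas a much finer grid makes the bandit term dominate.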

Smooth - Transparent Feedback
For transparent feedback, we combine two tools: the adversarial discretization result (Lemma 5) and the algorithm Exp3.FPA for learning with transparent feedback on a finite grid. Note, using any other $\sqrt{KT}$ black-box learning algorithm (like in the previous section for bandits) would yield a suboptimal regret bound of $T^{2/3}$.
Theorem 7. Consider the problem of repeated bidding in first-price auctions in an adversarial $\sigma$-smooth environment with transparent feedback. Then there exists a learning algorithm $\mathcal{A}$ such that
Proof. Consider algorithm Exp3.FPA on the uniform grid $\mathcal{X}$ of $\lceil \sqrt{T} \rceil + 1$ bids, with $\delta(\mathcal{X}) \le 1/\sqrt{T}$. Fix any $\sigma$-smooth environment $\mathcal{S}$; Lemma 5 implies the following:
where the second inequality follows from Proposition 2.

The (Non-Smooth) Adversarial Model
The positive results provided in the previous sections hold under either one of two conditions: the environment is stochastic and the learner has at least the semi-transparent feedback (Theorem 1 says that bandit feedback is not enough), or the environment uses smooth distributions. These settings allow the learner to compute a discrete class of representative bids efficiently. In this section, we formally argue that learning is impossible if any of these assumptions is dropped. Specifically, the standard adversarial environment that generates the sequence without any smoothness constraint is too strong. In particular, we construct a randomized sequence $(V_1, M_1), (V_2, M_2), \dots$ that induces any learner to suffer at least linear regret. This construction shares some similarities with the lower bound construction in Theorem 1, the main difference being that the best bid $b^\star$ is randomized and hidden in such a way that even a learner having access to full feedback cannot pinpoint it.
Theorem 8. Consider the problem of repeated bidding in first-price auctions in an adversarial environment with full feedback. Then, any learning algorithm $\mathcal{A}$ satisfies $R_T(\mathcal{A}) \ge T/24$.
Proof. We prove the result via Yao's principle, showing that there exists a randomized environment $\mathcal{S}$ such that any deterministic learning algorithm suffers at least $T/24$ regret against it. The random sequence posted by $\mathcal{S}$ is based on two randomized auxiliary sequences $L_1, L_2, \dots$ and $U_1, U_2, \dots$ defined as follows. They are initialized to $L_0 = 1/2$, $U_0 = 2/3$. They then evolve recursively as follows:
For each realized sequence of the $(L_t, U_t)$ pairs, the actual sequence of the $(M_t, V_t)$ selected by $\mathcal{S}$ is constructed as follows. At each time step $t$, the environment selects $(M_t, V_t) = (L_t, 1)$ or $(U_t, 0)$, uniformly at random, so that the distribution is characterized by two levels of independent randomness: the auxiliary sequence of shrinking intervals and the choice between $(L_t, 1)$ and $(U_t, 0)$. We move our attention to the expected performance of the best fixed bid in hindsight. For each realization of the random auxiliary sequence, there exists a bid $B^\star$ such that (i) it wins all the auctions $(M_t, V_t)$ of the form $(L_t, 1)$ (which we may call "good auctions" because they bring positive utility when won) and (ii) it loses all the auctions $(M_t, V_t)$ of the form $(U_t, 0)$ (called "bad auctions" because they bring negative utility when won). Thus its expected utility at each time step is at least $1/6$: with probability $1/2$ the environment selects a good auction, which induces a utility of $(1 - L_t) \ge 1/3$. All in all, the optimal bid achieves an expected utility of at least $T/6$. Consider now the performance of any deterministic algorithm $\mathcal{A}$: for any fixed time $t > 1$ and possible realization of the past observations, the learner posts some deterministic bid $B_t$. If $B_t < L_{t-1}$, then it gets 0 utility, so we only consider the following cases:

A Appendix
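A.1 Measure and Information-Theoretic Notation and Known Facts
We recall that, given two probability measures $\mathbb{P}$ and $\mathbb{Q}$ on a measurable space $(\Omega, \mathcal{F})$, $\mathbb{Q}$ is said to be absolutely continuous with respect to $\mathbb{P}$ (and we write $\mathbb{Q} \ll \mathbb{P}$) if, for all $E \in \mathcal{F}$ such that $\mathbb{P}[E] = 0$, it holds that $\mathbb{Q}[E] = 0$. Whenever $\mathbb{Q} \ll \mathbb{P}$, the Radon-Nikodym theorem states that there exists a density (called the Radon-Nikodym derivative of $\mathbb{Q}$ with respect to $\mathbb{P}$) $\frac{d\mathbb{Q}}{d\mathbb{P}} \colon \Omega \to [0, \infty)$ such that, for all $E \in \mathcal{F}$, it holds that $\mathbb{Q}[E] = \int_E \frac{d\mathbb{Q}}{d\mathbb{P}} \, d\mathbb{P}$. See [Bass, 2013, Theorem 13.4] for a reference.
If $(\Omega, \mathcal{F}, \mathbb{P})$ is a probability space, $(\mathcal{X}, \mathcal{F}_{\mathcal{X}})$ is a measurable space, and $X$ is a random variable from $(\Omega, \mathcal{F})$ to $(\mathcal{X}, \mathcal{F}_{\mathcal{X}})$, the push-forward measure of $\mathbb{P}$ by $X$ is denoted by $\mathbb{P}_X$. In this case, we recall that the push-forward measure is defined as the unique probability measure on $\mathcal{F}_{\mathcal{X}}$ defined via $\mathbb{P}_X[E] = \mathbb{P}[X \in E]$, for all $E \in \mathcal{F}_{\mathcal{X}}$.
If $(\Omega, \mathcal{F})$ and $(\Omega', \mathcal{F}')$ are two measurable spaces, their product $\sigma$-algebra is denoted by $\mathcal{F} \otimes \mathcal{F}'$. We recall that $\mathcal{F} \otimes \mathcal{F}'$ is the $\sigma$-algebra of subsets of $\Omega \times \Omega'$ generated by the collection of subsets of the form $F \times F'$, where $F \in \mathcal{F}$ and $F' \in \mathcal{F}'$. If $(\Omega, \mathcal{F}, \mathbb{P})$ and $(\Omega', \mathcal{F}', \mathbb{P}')$ are two probability spaces, the product measure of $\mathbb{P}$ and $\mathbb{P}'$ is denoted by $\mathbb{P} \otimes \mathbb{P}'$. We recall that $\mathbb{P} \otimes \mathbb{P}'$ is the unique probability measure defined on $\mathcal{F} \otimes \mathcal{F}'$ which satisfies $(\mathbb{P} \otimes \mathbb{P}')[E \times E'] = \mathbb{P}[E] \, \mathbb{P}'[E']$ for all $E \in \mathcal{F}$ and $E' \in \mathcal{F}'$.
If $(\Omega, \mathcal{F}, \mathbb{P})$ is a probability space, $(\mathcal{X}, \mathcal{F}_{\mathcal{X}})$ and $(\mathcal{Y}, \mathcal{F}_{\mathcal{Y}})$ are measurable spaces, $X$ is a random variable from $(\Omega, \mathcal{F})$ to $(\mathcal{X}, \mathcal{F}_{\mathcal{X}})$, and $Y$ is a random variable from $(\Omega, \mathcal{F})$ to $(\mathcal{Y}, \mathcal{F}_{\mathcal{Y}})$, the conditional probability of $X$ given $Y$ is denoted by $\mathbb{P}_{X|Y}$, where, for each $E \in \mathcal{F}_{\mathcal{X}}$, we recall that $\mathbb{P}_{X|Y}[E] = \mathbb{P}[X \in E \mid Y]$ and that $\mathbb{P}_{X|Y}[E]$ is a $\sigma(Y)$-measurable random variable.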
The following result has been proven in Cesa-Bianchi et al. [2023].
Theorem 9. Suppose that $(\mathcal{Y}, d)$ is a separable and complete metric space with $\mathcal{F}_{\mathcal{Y}}$ as the Borel $\sigma$-algebra of $(\mathcal{Y}, d)$. Let $(\Omega, \mathcal{F})$ be a measurable space, $X$ a random variable from $(\Omega, \mathcal{F})$ to $(\{0, 1\}, 2^{\{0,1\}})$, $Y$ a random variable from $(\Omega, \mathcal{F})$ to $(\mathcal{Y}, \mathcal{F}_{\mathcal{Y}})$, and $U$ a random variable from $(\Omega, \mathcal{F})$ to $([0, 1], \mathcal{B})$, where $\mathcal{B}$ is the Borel $\sigma$-algebra of $[0, 1]$. Suppose that $\mathbb{P}, \mathbb{Q}$ are probability measures defined on $\mathcal{F}$, and $p \in (0, 1)$, $q \in [0, 1]$ are such that:
• $U$ is a uniform random variable on $[0, 1]$ both under $\mathbb{P}$ and $\mathbb{Q}$, i.e., we have that
• $U$ is independent of $X$ both under $\mathbb{P}$ and $\mathbb{Q}$, i.e., $\mathbb{P}_{(X,U)} = \mathbb{P}_X \otimes \mathbb{P}_U$ and $\mathbb{Q}_{(X,U)} = \mathbb{Q}_X \otimes \mathbb{Q}_U$.
Then, the following are equivalent:
1. There exists a measurable function $\varphi$ from $(\{0, 1\} \times [0, 1], 2^{\{0,1\}} \otimes \mathcal{B})$ to $(\mathcal{Y}, \mathcal{F}_{\mathcal{Y}})$ such that
2. $\mathbb{Q}_Y \ll \mathbb{P}_Y$, and $\mathbb{P}_Y$-almost-surely it holds that

A.2 Missing Details of the Proof of Theorem 3
In this section, we complete the proof of Theorem 3, showing that repeated first-price auctions with semi-transparent feedback (in the following, referred to as "our problem") are no easier than a $K$-armed bandit instance based on the probability measures $\mathbb{P}^1, \dots, \mathbb{P}^K$ introduced in Theorem 3. The structure of the proof is inspired by [Cesa-Bianchi et al., 2023, Section 3].
The related bandit problem. The action space is $[K]$, where we recall that $K$ was some arbitrarily fixed natural number. Let $Y, Y_1, Y_2, \dots$ be a sequence of $\{0, 1\}^K$-valued random variables such that, for any $k \in \{0, 1, \dots, K\}$, the sequence is $\mathbb{P}^k$-i.i.d. and, for all $j \in [K]$
This sequence of latent random variables will determine the rewards of the actions. The reward function is $\rho \colon (i, y) \mapsto \frac{23 + 2y(i)}{192}$ and the feedback received after playing an action $I_t$ at time $t$ is $Y_t(I_t)$ (which is equivalent to receiving the bandit feedback $\rho(I_t, Y_t)$ gathered at time $t$). For any $k \in \{0, \dots, K\}$ and any $i \in [K]$ the expected reward is
Mapping our problem into this bandit problem. Assume that $K \ge 3$. We partition the interval $[0, 1]$ in the following $K$ disjoint regions:
such that $b \in J_i$ (for a pictorial representation of the map $\iota$, see Figure 4).
Simulating the feedback. To lighten the notation, besides the already defined random functions $\psi_1, \psi_2, \dots$, define also:
The next lemma shows that we can use the feedback observed in the bandit problem together with some independent noise to simulate exactly the feedback of our problem.
Lemma 6. For each $b \in [0, 1]$, there exists $\varphi$
Proof of Lemma 6. A direct verification shows that, for all $k \in$
Thus, for each $b \in [0, 1]$, by Theorem 9, there exists (and we fix)
We now show that any algorithm $\mathcal{A}$ for our problem can be transformed into an algorithm $\tilde{\mathcal{A}}$ for the bandit problem that suffers no larger regret. To do so, we begin by formally explaining how algorithms for our problem work.
Functioning of an algorithm $\mathcal{A}$ for our problem. A randomized algorithm $\mathcal{A}$ for our problem is a sequence of functions that take as input a sequence of random seeds $U_1, U_2, \dots$ and some feedback $Z_1, Z_2, \dots$ and generate bids $B_t$ as described below. At time $t = 1$, $\mathcal{A}$ selects a bid $B_1$ as a deterministic function of $U_1$ and observes feedback $Z_1 = \psi_1(B_1)$. Inductively, for any $t \ge 2$, $\mathcal{A}$ selects a bid $B_t$ as a deterministic function of $U_1, \dots, U_t, Z_1, \dots, Z_{t-1}$ (where $Z_s = \psi_s(B_s)$, for all $s \in [t - 1]$). For all $k \in \{0, \dots, K\}$, the sequence of seeds is a $\mathbb{P}^k$-i.i.d. sequence of uniform random variables on $[0, 1]$ that is $\mathbb{P}^k$-independent of $(V, M), (V_1, M_1), (V_2, M_2), \dots$.
Building $\tilde{\mathcal{A}}$ from $\mathcal{A}$. We now show how to map $\mathcal{A}$ to an algorithm $\tilde{\mathcal{A}}$ (that shares the same seeds for the randomization) for the bandit problem that suffers a worst-case regret that is no larger than that of $\mathcal{A}$.
To do so, consider a sequence $U', U'_1, \dots$ of random variables that, for all $k \in \{0, \dots, K\}$, is a $\mathbb{P}^k$-i.i.d. sequence of uniforms on $[0, 1]$ that $\tilde{\mathcal{A}}$ can access as a further source of randomness. We will assume that, for all $k \in \{0, \dots, K\}$, the four sequences $Y, Y_1, \dots$, $(V, M), (V_1, M_1), \dots$, $U, U_1, \dots$, and $U', U'_1, \dots$ are independent of each other. The algorithm $\tilde{\mathcal{A}}$ acts as follows. At time 1, $\tilde{\mathcal{A}}$ plays the arm $I_1 = \iota(B'_1)$, where $B'_1 = B_1$ is the bid played by $\mathcal{A}$ at round $t = 1$ (chosen as a deterministic function of the random seed $U_1$). Then $\tilde{\mathcal{A}}$ observes the bandit feedback $Y_1(I_1)$ and feeds back to $\mathcal{A}$ the surrogate feedback
This way, we defined by induction the randomized algorithm $\tilde{\mathcal{A}}$. By induction on $t$, one can show that, if $B_1, B_2, \dots$ are the bids played by $\mathcal{A}$ on the basis of the feedback $Z_1 = \psi_1(B_1), Z_2 = \psi_2(B_2), \dots$, then, for all $k \in \{0, \dots, K\}$, we have
which leads to
(the last equality is a definition). Now we are left to show only that for any algorithm $\tilde{\mathcal{A}}$ for the bandit problem which plays actions $I_1, I_2, \dots$, there exists $k \in [K]$ such that
(the first equality is a definition). By Yao's Minimax principle, it is sufficient to show this for deterministic algorithms $\tilde{\mathcal{A}}$ for the bandit problem.
Lemma 7. Fix any deterministic algorithm $\tilde{\mathcal{A}}$ for the bandit problem on $K$ actions; then there exists $k \in [K]$ such that $R^k_T(\tilde{\mathcal{A}}) \ge \frac{3}{10^4} T^{2/3}$.
Proof. For any deterministic algorithm $\tilde{\mathcal{A}}$ for the bandit problem on $K$ actions, let $I_1, I_2, \dots$ be the actions played by $\tilde{\mathcal{A}}$ on the basis of the sequential feedback received $Z_1, Z_2, \dots$ and define $N_t(i)$ as the random variable counting the number of times the learning algorithm $\tilde{\mathcal{A}}$ plays action $i$, up to time $t$, for any $i \in [K]$ and any time $t \in [T]$:
We relate the expected values of $N_T(k)$ under $\mathbb{P}^0$ and $\mathbb{P}^k$ as a function of the expected number of times the algorithm plays the corresponding action $k$. This formalizes the intuition that, to discriminate between the different $\mathbb{P}^k$, the learner needs to play exploring actions.
Claim 3. The following inequality holds true for any $k \in [K]$:
Proof of Claim 3. For any $t \in [T]$, the action $I_t = I_t(Z_1, \dots, Z_{t-1})$ selected by $\tilde{\mathcal{A}}$ at round $t$ is a deterministic function of $Z_1, \dots, Z_{t-1}$, for each $k \in [K]$. In formula, we then have the following
$\big\| \mathbb{P}^k_{(Z_1, \dots, Z_{t-1})} - \mathbb{P}^0_{(Z_1, \dots, Z_{t-1})} \big\|_{\mathrm{TV}}$,
where $\|\cdot\|_{\mathrm{TV}}$ denotes the total variation norm. We now move our attention towards bounding the total variation norm. To that end, we use Pinsker's inequality and apply the chain rule for the KL divergence $\mathrm{KL}$. For each $k \in [K]$ and $t \in [T]$ we have the following:
$\big\| \mathbb{P}^0_{(Z_1, \dots, Z_t)} - \mathbb{P}^k_{(Z_1, \dots, Z_t)} \big\|_{\mathrm{TV}} \le \sqrt{\tfrac{1}{2} \mathrm{KL}\big( \mathbb{P}^0_{(Z_1, \dots, Z_t)}, \mathbb{P}^k_{(Z_1, \dots, Z_t)} \big)}$
Now, since $\varepsilon = \frac{1}{4K} \le \frac{1}{4} \le \frac{2}{3}$, the following useful inequality holds:
We can combine the inequalities in Equation (8) and Equation (9) into Equation (7) and plug in the bound to obtain:
Once we have this upper bound on the total variation of the random variables $(Z_1, \dots, Z_t)$ under $\mathbb{P}^0$ and $\mathbb{P}^k$, we can get back to the initial Equation (6) and obtain the desired bound via Jensen:
Averaging the quantitative bounds in Claim 3 for all $k$ in $[K]$, and applying Jensen's inequality, we get the following:
Now, we have all the ingredients to lower bound the average regret suffered by $\tilde{\mathcal{A}}$. Note that every time a suboptimal arm is played, the learner suffers (expected) instantaneous regret equal to $\frac{1}{144} \cdot \varepsilon$.

A.3 Missing Details of the Proof of Theorem 5
Claim 1. There exist two disjoint intervals $I_+$ and $I_-$ in $[0, 1]$ such that, for any $\varepsilon \in (0, 1/2)$ and any time $t$, the following hold:
Proof. For any $\varepsilon \in (0, \frac{1}{2})$, the distributions $\mathbb{P}^{\pm\varepsilon}$ are such that the set of all the bids that induce non-negative utility $\mathbb{E}^{\pm\varepsilon}[\mathrm{Util}_t(b)]$ is contained in the two disjoint intervals $I_+ = [0, \frac{1}{8}]$ and $I_- = [\frac{1}{4}, 1]$‖. We consider separately the two cases $\mathbb{P}^{+\varepsilon}$ and $\mathbb{P}^{-\varepsilon}$. We start from the former. By simply looking at the definition (2), it is clear that $\mathbb{E}^{+\varepsilon}[\mathrm{Util}_t(b)]$ is monotonically increasing in $\varepsilon$ for any $b \in I_+$; on the contrary, it is monotonically decreasing for $b \in I_-$. We have the following:
We need a preliminary result for the proof of Claim 2. Recall, we use the same random variable $(V, M)$ to denote the highest competing bid/valuation pair drawn from the different probability distributions. When we change the underlying measure, we are changing its law. Consider now the push-forward measures on $[0, 1]^2$ (with the Borel $\sigma$-algebra) induced by these three measures: $\mathbb{P}^0_{(V,M)}$, $\mathbb{P}^{+\varepsilon}_{(V,M)}$ and $\mathbb{P}^{-\varepsilon}_{(V,M)}$. With some simple calculations (similarly to what is done in, e.g., Appendix B of Slivkins [2019]) it is possible to bound the KL divergence:
‖ The choice of $I_+$ and $I_-$ is not tight.

Figure 2: Left: The support of the base density $f$ lies inside the yellow and green regions. The perturbation $g_{w,\varepsilon}$ of $f$ occurs inside the green region, where the four rectangles $R^1_{w,\varepsilon}, \dots, R^4_{w,\varepsilon}$ (in red and blue) lie. Right: The corresponding qualitative plots of $b \mapsto \mathbb{E}^0[\mathrm{Util}_t(b)]$ (black, dotted) and $b \mapsto \mathbb{E}^{w,\varepsilon}[\mathrm{Util}_t(b)]$ (red, solid).
§ Note, we use the notation $\mathbb{I}_A(x)$ to denote the indicator function that has value 1 when $x \in A$, and 0 otherwise.

Theorem 4. Consider the problem of repeated bidding in first-price auctions in a stochastic i.i.d. environment with transparent feedback. Then there exists a learning algorithm $\mathcal{A}$ such that $R_T(\mathcal{A}) \le 3 \dots$
The statement of the theorem holds for W.T.FPA run with the base algorithm of each epoch $\tau$ being Exp3.FPA tuned with $\gamma = \gamma(\tau) = \sqrt{\ln(|\mathcal{X}_\tau|) / ((e - 1) 2^{\tau-1})}$. Substituting the guarantees of Proposition 2 into those of Proposition 1 and recalling that $|\mathcal{X}_\tau| \le 2^{\tau-1}$ for each epoch $\tau = 2, 3, \dots$, yields the desired bound.

Figure 3: The expected utility function for three different distributions: $\mathbb{P}^0$ in purple, $\mathbb{P}^{+\varepsilon}$ in orange, and $\mathbb{P}^{-\varepsilon}$ in green.

Figure 4: A representation of the map $\iota$ through which the bids in the first-price auction problem are related to the $K$ arms of the bandit problem. The interval $[0, 1]$ is partitioned into $K$ disjoint intervals, the first and the last one of length $1/4 + 2\varepsilon$, and all the ones in between of length $2\varepsilon$. $\iota$ maps each bid to the index of the interval to which it belongs.
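The map $\iota$ described in this caption can be sketched in a few lines. The interval lengths follow the caption with $\varepsilon = 1/(4K)$ and $K \ge 3$; the half-open convention at the interval endpoints is an assumption made here for illustration, and the function name `iota` is hypothetical.

```python
def iota(b: float, K: int) -> int:
    """Map a bid b in [0, 1] to the index (1..K) of the interval containing it.

    The first and last intervals have length 1/4 + 2*eps, the K - 2 middle
    ones have length 2*eps, with eps = 1/(4K). Intervals are taken closed on
    the left and open on the right (an assumed convention), except the last.
    """
    if K < 3:
        raise ValueError("the construction assumes K >= 3")
    eps = 1.0 / (4 * K)
    left = 0.25 + 2 * eps    # right endpoint of the first interval
    right = 0.75 - 2 * eps   # left endpoint of the last interval
    if b < left:
        return 1
    if b >= right:
        return K
    # Middle intervals are indexed 2, ..., K - 1; the min guards against rounding.
    return min(K - 1, 2 + int((b - left) / (2 * eps)))


if __name__ == "__main__":
    print([iota(b, 4) for b in (0.3, 0.4, 0.55, 0.9)])
```

The lengths add up to one, since $2(1/4 + 2\varepsilon) + (K - 2) \cdot 2\varepsilon = 1/2 + 2K\varepsilon = 1$.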
Inductively, for any time $t \ge 2$, assuming that $\tilde{\mathcal{A}}$ played arms $I_1, \dots, I_{t-1}$ and fed back to $\mathcal{A}$ the surrogate feedback $Z'_1, \dots, Z'_{t-1}$, then
1. $\tilde{\mathcal{A}}$ plays the arm $I_t = \iota(B'_t)$, where $B'_t$ is the bid played by $\mathcal{A}$ at round $t$ (chosen as a deterministic function of the random seeds $U_1, \dots, U_t$ and past surrogate feedback $Z'_1, \dots, Z'_{t-1}$).
2. $\tilde{\mathcal{A}}$ observes the bandit feedback $Y_t(I_t)$ and feeds back to $\mathcal{A}$ the surrogate feedback Z