Learning to Bid in Contextual First Price Auctions

In this paper, we investigate the problem of how to bid in repeated contextual first price auctions. We consider a single bidder (learner) who repeatedly bids in first price auctions: at each time $t$, the learner observes a context $x_t\in \mathbb{R}^d$ and decides her bid based on historical information and $x_t$. We assume a structured linear model of the maximum bid of all other bidders, $m_t = \alpha_0\cdot x_t + z_t$, where $\alpha_0\in \mathbb{R}^d$ is unknown to the learner and $z_t$ is randomly sampled from a noise distribution $\mathcal{F}$ with log-concave density function $f$. We consider both \emph{binary feedback} (the learner can only observe whether she wins or not) and \emph{full information feedback} (the learner can observe $m_t$) at the end of each time $t$. For binary feedback, when the noise distribution $\mathcal{F}$ is known, we propose a bidding algorithm based on the maximum likelihood estimation (MLE) method that achieves regret at most $\widetilde{O}(\sqrt{\log(d) T})$. Moreover, we generalize this algorithm to the setting of binary feedback where the noise distribution is unknown but belongs to a parametrized family of distributions. For full information feedback with an \emph{unknown} noise distribution, we provide an algorithm that achieves regret at most $\widetilde{O}(\sqrt{dT})$. Our approach combines an estimator for log-concave density functions with the MLE method to learn the noise distribution $\mathcal{F}$ and the linear weight $\alpha_0$ simultaneously. We also provide a lower bound showing that any bidding policy in a broad class must incur regret at least $\Omega(\sqrt{T})$, even when the learner receives full information feedback and $\mathcal{F}$ is known.


Introduction
Recently, first price auctions have become the predominant auction mechanism on major display advertising platforms, replacing second price auctions [10,8]. First price auctions have gained favor because they are more transparent and credible [1], in the sense that there is no uncertainty in the final price upon winning [6]. Compared with second price auctions, first price auctions are no longer truthful, i.e., reporting the true value is not the optimal strategy for each advertiser. In light of this, advertisers face a new challenge in practice: how should an advertiser bid in a first price auction when it is hard to know the others' bidding strategies?
In real display ads systems, a huge number of online ads are sold repeatedly via auctions. Since advertisers participate in auctions very frequently to compete for ad placements, it is important for them to optimize their bidding strategies in repeated auctions to maximize their long-term rewards. In addition, advertisers may receive some contextual information about a query before submitting bids, including information about the publisher and the user. Given this context, advertisers can estimate their value for the query and decide their bids to compete for the ad slots. In this work, we formulate the above problem as a standard contextual online learning problem.
A single advertiser (learner) repeatedly bids in contextual first price auctions and observes a context x_t ∈ R^d before submitting a bid at each time t. The learner submits a bid b_t based on the context x_t, and the seller uses a first price auction to determine the winner and charges the winner her own bid.
In first price auctions, it is not enough for advertisers to bid optimally knowing only their own value; it is also necessary for them to understand the distribution of their competitors' bids [20]. In this work, we assume a structured linear model of the maximum bid of the other bidders (other than the learner), m_t = α_0 · x_t + z_t, for some unknown α_0 ∈ R^d, where z_t ∼ F and the density function f of F is log-concave. This assumption provides a simple model relating the maximum competing bid m_t to x_t, especially in the absence of any additional characterization. For the learner, the learning task is to simultaneously learn α_0 and the noise distribution F (if it is unknown). In repeated first price auctions, the learner receives some information feedback at the end of each time step. In this work, we provide no-regret learning algorithms for the learner under two different information models: (1) partial (binary) feedback, where the learner can only observe whether she wins or not; and (2) full information feedback, where the learner can observe m_t after bidding at each time t.
Main Contributions. First, we characterize the optimal clairvoyant bidding strategy in contextual first price auctions, when the noise distribution F and parameter α_0 are known, in Section 2.3. Our characterization utilizes the log-concavity of the density function f of distribution F. This optimal clairvoyant bidding strategy also serves as the benchmark strategy in the regret definition.
For binary feedback, we first assume F is (fully) known and propose a no-regret learning algorithm that achieves regret at most Õ(√(log(d) T)). Our algorithm is episode-based: in each episode s we use the estimated parameter α̂_{s−1} from the previous episode s − 1 to decide the learner's current bids, and update the estimated parameter α̂_s at the end of episode s using only the data from episode s. This episodic structure is inspired by Cesa-Bianchi et al. [9], is widely used in the online learning literature, and has a number of advantages, e.g., it requires less computation to update the parameters of the model and can be implemented offline at the end of each episode. We utilize the maximum likelihood estimation (MLE) method to estimate α̂_s in each episode s. Moreover, we extend our algorithm to the setting where F is only partially known, i.e., F is parameterized by a known base distribution F_0 (e.g., the standard normal distribution) and an unknown variance parameter σ^2. The regret of our algorithm in this setting is still bounded by Õ(√(log(d) T)) under some reasonable technical assumptions.
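As a small standalone sketch (not from the paper's pseudo code), the episode schedule T_s = T^{1−2^{−s}} used in our theorems can be simulated as follows; the number of episodes grows like log log T:

```python
import math

def episode_lengths(T):
    """Doubling-style schedule T_s = T^(1 - 2^-s); the last episode is
    clipped so that the lengths sum to exactly T."""
    lengths, covered, s = [], 0, 1
    while covered < T:
        Ts = min(int(T ** (1 - 2.0 ** (-s))), T - covered)
        lengths.append(Ts)
        covered += Ts
        s += 1
    return lengths
```

For example, a horizon of T = 10^6 steps is covered by only 5 episodes, consistent with the S ≤ log log T bound used in the regret analysis.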
For full information feedback, we consider the setting where F is unknown but f is still log-concave. We provide an episode-based algorithm that simultaneously learns the noise distribution F (approximately) and the parameter α_0. We propose a novel approach combining the log-concave density estimator proposed in [12] with the MLE method to learn F and α_0 simultaneously in each episode. Under reasonable assumptions (standard in linear regression), our algorithm achieves regret at most Õ(√(dT)); whether the dependence on d can be improved is left as an open question.
Our final result is a lower bound on the regret for the full information feedback model, even with a known noise distribution. We consider a broad class of bidding policies and prove that any algorithm in this class must incur Ω(√T) regret on some instance. Despite their simple (greedy episodic) structure, our algorithms require novel ideas and non-trivial technical contributions. In the full feedback model, we propose a new approach combining an estimator for log-concave functions with the MLE technique. To prove the regret bound of the algorithm, we provide a new uniform convergence bound for the log-concave noise distribution (see Theorem 4.2). This bound was not previously known and requires a delicate balance of the parameters. In the binary feedback model with a partially known noise distribution, our algorithm achieves regret bounded by Õ(√(log(d) T)). Our result improves the regret bound O(d√T) of Javanmard and Nazerzadeh [19] for a similar setting, by slightly strengthening the assumption on the covariance matrix Σ = E[x_t x_t^T] (see Assumption 3.2).
Related Work. First price auctions have received a lot of attention in the mechanism design and machine learning communities recently. For instance, Wang et al. [23] characterize the Bayesian Nash equilibrium for first price auctions with discrete values and continuous bids, Balseiro et al. [5] study the equilibrium bidding strategies of contextual first price auctions with budgets, and Feng et al. [13] propose a gradient-based approach to adaptively update and optimize reserve prices for first price auctions in an online manner.
Our work is closely related to the learning-to-bid literature. The work in [4] first considers the problem of learning to bid in first price auctions by treating the value as a context. Subsequently, Han et al. [17,16] extended this learning-to-bid model to other settings with different feedback models and different generative models for competitors' bids. The main difference between our model and the above papers is that there is a public context (feature) x_t ∈ R^d observed before bidding at each time t, and the learner decides her bid based on her value and the context x_t. Our model allows more flexibility in the correlation between the valuation and the competing bids (through the context x_t), compared with [4,17,16]. This is more realistic in practice, since the learner can observe some context before submitting her bid and knows this context affects the competing bids as well. The loosely related works Weed et al. [24] and Feng et al. [14] consider the problem where the learner can only observe her value when she wins the auction.
Last but not least, our work is also related to papers in the contextual pricing field, e.g., [22,21,19,15,11]. In particular, Javanmard and Nazerzadeh [19] also assume log-concavity of the noise distribution of the valuation function in the contextual pricing problem. For the binary feedback model, our approaches generalize the methodology for contextual pricing in [19] to bidding algorithms in repeated contextual first price auctions. Javanmard and Nazerzadeh [19] focus on the high-dimensional setting where α_0 is sparse and the feature dimension d is larger than T. The algorithm proposed in [19] utilizes MLE with an L_1 regularizer and achieves an O(s_0 log d log T) regret bound for the binary feedback model with known noise distribution, under a bounded eigenvalue assumption on the matrix Σ = E[x_t x_t^T], where s_0 is the sparsity parameter of α_0, i.e., ∥α_0∥_0 ≤ s_0. In this paper, we do not assume sparsity, and we also consider the setting with full information feedback and an unknown noise distribution. In addition, Golrezaei et al. [15] propose a different way to learn the noise distribution and linear weight simultaneously for contextual pricing by using the ordinary least squares (OLS) method; however, this approach requires the noise distribution to be bounded or sub-Gaussian and cannot be applied in our full information feedback model. To address this difficulty, we propose a novel approach combining a non-parametric log-concave density estimator with the MLE method to design our bidding algorithm for the full information feedback model with unknown noise distribution.

Repeated Contextual First Price Auctions
We consider the problem of online learning in repeated contextual first price auctions. There is a single seller who repeatedly sells items (e.g., ad slots on publishers' pages) to multiple bidders (e.g., advertisers) through first price auctions. Throughout this paper, we focus on a single bidder in a large population of bidders over a time horizon T. In the rest of the paper, we call this single bidder the learner; she aims to maximize her cumulative utility over the time horizon T.
At each time t, the learner receives a public context x_t ∈ X ⊆ R^d with ∥x_t∥_2 ≤ 1 (x_t is also revealed to the other bidders and the seller), and x_t is drawn i.i.d. from an unknown prior distribution D. Based on the learner's historical information up to time t − 1 and the realization of the context x_t, the learner submits a bid b_t ∈ [0, 1]. Let m_t be the maximum bid of all other bidders at time t. We assume there exists a known valuation function β_0 : x ∈ X → [0, 1], which outputs the value of the learner given the input context. In other words, at each time t, given the realized context x_t, the learner obtains the value v_t = β_0(x_t). In this paper, for simplicity, we assume a linear model relating the context x_t and the maximum bid of all other bidders m_t, i.e., there exists an unknown parameter α_0 ∈ R^d with ∥α_0∥_1 ≤ W such that
m_t = α_0 · x_t + z_t,
where z_t is i.i.d. sampled from an unknown mean-zero distribution F and W is a publicly known parameter.
For notational simplicity, we allow m_t to be negative, which does not affect our regret results. We call F the noise distribution and use f and F to denote the probability density function (PDF) and cumulative distribution function (CDF) of the noise, i.e., f(x) = F′(x), ∀x ∈ R. For notational simplicity, we denote ϕ(x) = x + F(x)/f(x). Let u(b, x) be the expected utility of the learner with bid b, given a context x, i.e.,
u(b, x) = (β_0(x) − b) · F(b − α_0 · x).
It is easy to see that u(b, x) ∈ [0, 1] for any x and any b ≤ β_0(x). For notational simplicity, we denote u_t(b) := u(b, x_t) as the utility of the learner at time t with context x_t.
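As a small sketch (with a zero-mean Gaussian CDF standing in for the noise distribution F, an assumption made here only for illustration), the expected utility u(b, x) = (β_0(x) − b) · F(b − α_0 · x) can be computed directly:

```python
import math

def normal_cdf(z, sigma=1.0):
    """CDF of a zero-mean Gaussian: one log-concave choice for the noise F."""
    return 0.5 * (1.0 + math.erf(z / (sigma * math.sqrt(2.0))))

def expected_utility(b, x, alpha0, beta0, F=normal_cdf):
    """u(b, x) = (beta0(x) - b) * F(b - alpha0 . x):
    the margin upon winning times the probability of winning."""
    margin = beta0(x) - b
    win_prob = F(b - sum(a * xi for a, xi in zip(alpha0, x)))
    return margin * win_prob
```

For example, with alpha0 = [0.3, 0.2] and beta0(x) = 0.9, bidding b = 0.9 (the value itself) gives zero utility, while shading the bid below the value yields strictly positive expected utility.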
Feedback Models. In repeated contextual first price auctions, the learner can receive different feedback at the end of each time step, depending on the information released by the seller. In this paper, we investigate two feedback models:
1. Binary feedback: the learner only observes the indicator δ_t = I{b_t ≥ m_t}.
2. Full information feedback: the learner observes the maximum bid of all other bidders m_t.
Regret. Let π*(x) be the optimal clairvoyant bidding strategy when the learner knows α_0, β_0 and the distribution F, i.e., π*(x) = arg max_b u(b, x). The goal of the learner is to design a bidding strategy that decides the bid b_t at each time t, formalized in the following definition.
Definition 2.1 (Regret). The regret of the learner over the time horizon T is defined as
Regret(T) = Σ_{t=1}^{T} E[u_t(π*(x_t)) − u_t(b_t)].
Here b_t depends on the past history (the realizations of x_τ, δ_τ, etc., for τ ≤ t − 1), so (β_0(x_t) − b_t)F(b_t − α_0 · x_t) is a random variable.

Technical Assumptions
In addition to the linear model assumption for m_t, we make several assumptions on the noise distribution F for theoretical purposes.
Assumption 2.2. The density function f is differentiable and log-concave.
Log-concavity is a widely used assumption in the economics literature [3]. Note that if the density function f is log-concave, then the cumulative distribution function F and the reliability function 1 − F are both log-concave [2]. Most common distributions, such as the normal, uniform, Laplace, exponential and logistic distributions, satisfy the above assumption.
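As a quick numerical sanity check (an illustration, not part of the paper), log-concavity of f means the second derivative of log f is nonpositive everywhere; a central second difference verifies this for two of the distributions named above:

```python
import math

def second_diff(logf, z, h=1e-3):
    """Central second difference of log f at z; log-concavity of f means
    this quantity is <= 0 everywhere (up to discretization error)."""
    return (logf(z + h) - 2.0 * logf(z) + logf(z - h)) / (h * h)

# log-densities of two distributions satisfying Assumption 2.2
log_gaussian = lambda z: -z * z / 2.0 - 0.5 * math.log(2.0 * math.pi)
log_logistic = lambda z: -z - 2.0 * math.log(1.0 + math.exp(-z))
```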
In addition, we impose the following assumption (Assumption 2.3) on the density function f. This assumption holds for any distribution with a differentiable density function: Eq. (4) holds trivially on any bounded interval [−W, 1 + W], and Eq. (5) holds because F and 1 − F are both log-concave.

Optimal Clairvoyant Bidding Policy
In this part, we consider the optimal clairvoyant bidding strategy when the learner knows α_0 and the noise distribution F. Consider the utility of the learner at time t, u_t(b) = (β_0(x_t) − b)F(b − α_0 · x_t), and let b*_t = arg max_b u_t(b) be the optimal bid at time t, given context x_t. Suppose b*_t ≥ 0; by the first-order condition, we have ϕ(b*_t − α_0 · x_t) = β_0(x_t) − α_0 · x_t. By the definition of the function ϕ, we have the following proposition.
Proposition 2.4. ϕ(·) is a strictly increasing function, and 0 < (ϕ^{−1})′(x) ≤ 1.
The above proposition implies that the optimal bid b*_t, given context x_t, can be represented as
b*_t = α_0 · x_t + ϕ^{−1}(β_0(x_t) − α_0 · x_t).
Given the above characterization of the optimal clairvoyant bidding strategy, we can rewrite the regret accordingly.

Binary Feedback Model
In this section, we consider the least informative feedback model -- binary feedback, where the learner can only observe whether she wins or not at the end of each time step.

Binary Feedback with Known Noise Distribution
In this section, we assume the learner knows the noise distribution F, i.e., f and F are known. In this case, the learner only needs to learn α_0.
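Since F is known here, the clairvoyant bid b*_t = α_0 · x_t + ϕ^{−1}(β_0(x_t) − α_0 · x_t) from Section 2.3 can be evaluated numerically: ϕ is strictly increasing, so ϕ^{−1} can be computed by bisection. The sketch below assumes Gaussian noise (an illustrative choice of F, not the paper's general setting):

```python
import math

SQRT2PI = math.sqrt(2.0 * math.pi)

def phi(z, sigma):
    """phi(z) = z + F(z)/f(z) for zero-mean Gaussian noise; strictly increasing."""
    F = 0.5 * (1.0 + math.erf(z / (sigma * math.sqrt(2.0))))
    f = math.exp(-z * z / (2.0 * sigma * sigma)) / (sigma * SQRT2PI)
    return z + F / f

def optimal_bid(v, alpha_dot_x, sigma, lo=-2.0, hi=2.0, iters=60):
    """b*_t = alpha0.x_t + phi^{-1}(v_t - alpha0.x_t), with phi^{-1}
    evaluated by bisection on the bracket [lo, hi]."""
    target = v - alpha_dot_x
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if phi(mid, sigma) < target:
            lo = mid
        else:
            hi = mid
    return alpha_dot_x + 0.5 * (lo + hi)
```

For instance, with v = 0.8, α_0 · x_t = 0.3 and σ = 0.2, the optimal bid is about 0.40: the learner shades her bid well below her value.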
Algorithm. Our bidding algorithm runs in episodes, similarly to Cesa-Bianchi et al. [9] and Javanmard and Nazerzadeh [19]. The time horizon T is divided into S episodes, where episode s contains T_s time steps. Let Γ_s denote the set of time steps in episode s, so that |Γ_s| = T_s. For any time step t in the first episode, we simply set b_t = 1. For any time step t in episode s (s ≥ 2), i.e., t ∈ Γ_s, we set the learner's bid to
b_t = α̂_{s−1} · x_t + ϕ^{−1}(β_0(x_t) − α̂_{s−1} · x_t),
where α̂_{s−1} is the estimate of α_0 based on the observations {x_t, δ_t, b_t}, t ∈ Γ_{s−1}, from the previous episode s − 1. That is, we replace α_0 by α̂_{s−1} in the optimal clairvoyant bidding policy shown in Eq. (7) to set the bid b_t at time t, ∀t > T_1. If the estimator α̂_{s−1} is close to α_0, the expected utility u_t(b_t) will be close to the optimal expected utility u_t(b*_t) (see Lemma B.1). The pseudo code of our bidding algorithm for this setting is shown in Algorithm 1. In each episode s, we estimate α_0 using the maximum likelihood estimation (MLE) method. Specifically, we notice that at each time t, δ_t = 1 with probability F(b_t − α_0 · x_t). Therefore, we let L_s(α) denote the negated log-likelihood function for α in episode s,
L_s(α) = −(1/T_s) Σ_{t∈Γ_s} [δ_t log F(b_t − α · x_t) + (1 − δ_t) log(1 − F(b_t − α · x_t))],
where δ_t = I{b_t ≥ m_t}. Based on our log-concavity assumption on F and 1 − F, the negated log-likelihood L_s(α) is convex for every s = 1, 2, ..., S. Therefore, we can run a standard gradient descent algorithm to minimize the loss function L_s(α).
ALGORITHM 1: Bidding algorithm in the binary feedback model with known noise distribution
for each time step t in the first episode: the learner observes x_t and submits the bid b_t = 1; the learner observes δ_t. end
Estimate α_0 by α̂_1 = arg min_{∥α∥_1 ≤ W} L_1(α).
for each episode s ≥ 2 and each time step t ∈ Γ_s: the learner observes x_t and submits b_t = α̂_{s−1} · x_t + ϕ^{−1}(β_0(x_t) − α̂_{s−1} · x_t); the learner observes δ_t. end
Update the estimator of α_0 in episode s by α̂_s = arg min_{∥α∥_1 ≤ W} L_s(α). end
Regret Analysis. We state the regret bound for the setting considered in this subsection below (Theorem 3.1); the full proof is deferred to Appendix B.
Proof Sketch. Our proof follows the same spirit as Theorem 4 in [19]. First, we bound the per-step regret in each episode by the estimation error of α̂_{s−1} (Lemma B.1), and then bound this estimation error via the convergence of the MLE.
Remark. Theorem 3.1 does not rely on the bounded eigenvalue assumption on the matrix Σ = E[x_t x_t^T] or on the sparsity parameter s_0. With these two assumptions, Javanmard and Nazerzadeh [19] show that an O(s_0 log d log T) regret bound is achievable in this binary feedback model with known noise distribution.
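A minimal sketch of the episode-end MLE step (assuming Gaussian noise for concreteness; the paper only requires log-concavity, and the projection onto the l1-ball of radius W is omitted in this sketch):

```python
import math

def F(z):
    """Standard normal CDF (the known noise distribution in this sketch)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def f(z):
    """Standard normal PDF."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def nll_grad(alpha, data):
    """Gradient of the negated log-likelihood
    L(alpha) = -avg[ d*log F(b - a.x) + (1-d)*log(1 - F(b - a.x)) ],
    where each record is (x, b, delta)."""
    g = [0.0] * len(alpha)
    for x, b, delta in data:
        z = b - sum(a * xi for a, xi in zip(alpha, x))
        w = delta * f(z) / F(z) - (1 - delta) * f(z) / (1.0 - F(z))
        for i in range(len(alpha)):
            g[i] += w * x[i] / len(data)
    return g

def fit_alpha(data, d, lr=0.5, steps=300):
    """Plain gradient descent on the convex NLL, starting from zero."""
    alpha = [0.0] * d
    for _ in range(steps):
        g = nll_grad(alpha, data)
        alpha = [a - lr * gi for a, gi in zip(alpha, g)]
    return alpha
```

On synthetic data generated from the linear model m_t = α_0 · x_t + z_t, this recovers α_0 up to the usual statistical error.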

Extension to Partially-Known Noise Distribution
In this section, we extend to the case where the noise distribution F is parameterized by a zero-mean base noise distribution F_0 and a variance σ^2 (σ > 0), where F_0 is known (e.g., N(0, 1)) but σ is unknown. We denote ρ_0 = 1/σ. Without loss of generality, we assume |ρ_0| ≤ W. In this case, the learner needs to simultaneously learn α_0 and σ. Let f_0 and F_0 denote the density function and cumulative distribution function of F_0, which are known to the learner, and let ϕ_0(x) = x + F_0(x)/f_0(x).
Modified Algorithm. The algorithm follows the same pattern as Algorithm 1, and the pseudo code is shown in Algorithm 2 in Appendix E. The main difference is how to estimate α_0 and ρ_0 simultaneously. By the definition of δ_t = I{b_t ≥ m_t}, we observe δ_t = 1 with probability F_0(ρ_0 b_t − ρ_0 α_0 · x_t). To simplify the presentation, we re-parametrize (α_0, ρ_0) by µ_0 = α_0 ρ_0 and write the negated log-likelihood function in each episode s as
L_s(µ, ρ) = −(1/T_s) Σ_{t∈Γ_s} [δ_t log F_0(ρ b_t − µ · x_t) + (1 − δ_t) log(1 − F_0(ρ b_t − µ · x_t))].
In Algorithm 2, we update the estimator (µ̂_s, ρ̂_s) of (µ_0, ρ_0) within a valid set Λ by minimizing this loss function (Eq. 11) at the end of each episode s.
For each time step t in the first episode, we set the bid b_t = 1. For any time t in episode s (s ≥ 2), we set the bid
b_t = max{∆, α̂_{s−1} · x_t + σ̂_{s−1} ϕ_0^{−1}((β_0(x_t) − α̂_{s−1} · x_t)/σ̂_{s−1})},
where α̂_{s−1} = µ̂_{s−1}/ρ̂_{s−1} and σ̂_{s−1} = 1/ρ̂_{s−1}. As the astute reader may notice, in this setting we only consider bids larger than a small positive constant ∆ (e.g., one cent). For theoretical purposes, the assumption b_t ≥ ∆ guarantees the strong convexity of L_s(µ, ρ) w.r.t. ρ, so that we can bound (ρ̂_s − ρ_0)^2 in each episode. From a practical perspective, this assumption holds trivially, since display ads platforms usually require a minimum bid, e.g., one cent, to compete for ad slots. In effect, we replace (µ_0, ρ_0) by the estimator (µ̂_{s−1}, ρ̂_{s−1}) in the optimal bidding policy (footnote 3).
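A sketch of the reparametrized likelihood (a standard normal base distribution F_0 is assumed here for illustration):

```python
import math

def F0(z):
    """Standard normal CDF: the known base distribution in this sketch."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def nll(mu, rho, data):
    """Negated log-likelihood in the reparametrized variables (mu, rho),
    with mu = rho * alpha0 and P(delta_t = 1) = F0(rho*b_t - mu . x_t).
    Each record is (x, b, delta)."""
    total = 0.0
    for x, b, delta in data:
        p = F0(rho * b - sum(m * xi for m, xi in zip(mu, x)))
        total -= delta * math.log(p) + (1 - delta) * math.log(1.0 - p)
    return total / len(data)
```

Minimizing this loss over (µ, ρ) in the valid set Λ (e.g., by projected gradient descent) and reading off α̂ = µ̂/ρ̂ and σ̂ = 1/ρ̂ recovers the original parameters.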
For theoretical purposes, we need the following assumption on the context distribution (Assumption 3.2): the matrix Σ − E[x_t]E[x_t]^T is strictly positive definite. Bounded-eigenvalue assumptions on Σ are commonly made in the convergence analysis of linear models. It is well known that Σ − E[x_t]E[x_t]^T is positive semi-definite; we strengthen it here to be strictly positive definite. This assumption holds for many common distributions of the context x_t, such as the uniform, the truncated normal, and, in general, truncated versions of many more distributions.
Regret Analysis. To begin with, we state the benchmark used in the regret analysis of this section. To be consistent with our bidding space, we consider a slightly weaker but practical benchmark in which all bids are truncated below at ∆. Therefore, for any realized context x_t, the optimal bidding policy (benchmark) is
b*(x_t) = max{∆, α_0 · x_t + σ ϕ_0^{−1}((β_0(x_t) − α_0 · x_t)/σ)}.
Comparing against this benchmark, we state the main regret bound of this section; the proof is deferred to Appendix B.2.
Theorem 3.3. Suppose Assumptions 2.2, 2.3 and 3.2 hold. Setting T_s = T^{1−2^{−s}}, s = 1, 2, ..., then with probability at least 1 − δ, the regret (w.r.t. the benchmark defined in Eq. (14)) achieved in the binary feedback model with partially known noise distribution is bounded by Õ(√(log(d) T)), where Õ ignores log T and log log T terms.
3 Since F_0 and 1 − F_0 are both log-concave, the optimal clairvoyant bidding policy (without truncation at ∆) is b*_t = α_0 · x_t + σ ϕ_0^{−1}((β_0(x_t) − α_0 · x_t)/σ).
Remark. Javanmard and Nazerzadeh [19] study the contextual pricing problem in a very similar setting, i.e., the noise distribution of the valuation belongs to a known (parameterized) class with unknown parameters. Our result improves the regret bound O(d√T) of [19] by using a slightly stronger assumption on Σ (Assumption 3.2). In our proof, we show that the loss function L_s(µ, ρ) is strongly convex with high probability (Lemma B.6). The proof of this lemma utilizes Schur complements and advanced matrix inequalities.

Full Information Feedback Model
In this section, we consider the full information feedback model with an unknown noise distribution, i.e., the learner has no information about the noise distribution; however, she can always observe the highest bid of all other bidders, m_t. Without knowledge of the noise distribution F, the learner cannot directly use the naive MLE method of Section 3.1 to estimate α_0.
Following the same spirit as Section 3.1, our algorithm is still episode-based: in each episode s, we use the estimated noise distribution F̂_{s−1} and parameter α̂_{s−1} from the (s − 1)-th episode to determine the learner's bids, and we update these estimators at the end of episode s using the data observed during episode s. The main difficulty is how to update the estimators of F and α_0 in each episode. To handle this challenge, we propose a new approach, combining a non-parametric log-concave density estimator with the MLE method, to learn α_0 and F simultaneously.
Non-parametric estimation of f. We first introduce the non-parametric estimator of the density function f, given any linear weight estimator α. This non-parametric estimator is from [12], and we generalize it here to incorporate different estimates of α_0. In each episode s, given the realized x_t, m_t, t ∈ Γ_s, and any linear weight estimator α, we form the noise samples z_t(α) = m_t − α · x_t.
For notational simplicity, it is without loss of generality to re-parameterize f(z) = exp(Ψ(z)), where Ψ(z) is a concave function of z. Then, given any linear weight α, it is equivalent to optimize the estimator Ψ̂_s(·; α) to obtain a density estimator f̂_s(·; α) = exp(Ψ̂_s(·; α)) in each episode s. Let F̄_s be the empirical distribution of the noise samples {z_t}_{t∈Γ_s} in episode s, i.e., F̄_s(z) = (1/T_s) Σ_{t∈Γ_s} I{z_t ≤ z}. Dümbgen and Rufibach [12] characterize the optimizer Ψ̂_s(·; α) as well as the estimator F̂_s(z; α) when α = α_0, as follows.
Lemma 4.1 ([12]). The optimizer Ψ̂_s(·; α_0) exists and is unique.
Given the above characterization of Ψ̂_s(·; α_0) and F̂_s(·; α_0), we provide a uniform convergence bound for |F̂_s(z; α_0) − F(z)| in the following theorem (Theorem 4.2), which holds with probability at least 1 − δ; the proof is deferred to Appendix C.1.
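As a lightweight sketch of this step (using the plain empirical CDF as a stand-in for the shape-constrained log-concave estimator of [12]; this substitution is an intentional simplification):

```python
def residuals(data, alpha):
    """Noise samples z_t(alpha) = m_t - alpha . x_t formed from the
    full-information observations (x_t, m_t) of one episode."""
    return [m - sum(a * xi for a, xi in zip(alpha, x)) for x, m in data]

def empirical_cdf(samples):
    """Empirical CDF of the residual samples: a simple stand-in for the
    log-concave MLE of the noise CDF."""
    xs = sorted(samples)
    n = len(xs)
    def F_hat(z):
        lo, hi = 0, n  # count of samples <= z via binary search
        while lo < hi:
            mid = (lo + hi) // 2
            if xs[mid] <= z:
                lo = mid + 1
            else:
                hi = mid
        return lo / n
    return F_hat
```

When α is close to α_0, the residuals are close to true noise draws, and the estimated CDF converges uniformly to F (the role played by Theorem 4.2 for the log-concave estimator).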
Algorithm. As before, the algorithm consists of S episodes; episode s contains T_s time steps, and Γ_s is the set of time steps in episode s. Given the non-parametric estimator F̂_s(·; α) introduced above, our algorithm for the full information feedback model is as follows: • For any time step t in the first episode, the learner sets the bid b_t = 1.
• For any time step t in episode s (s ≥ 2), i.e., ∀t ∈ Γ_s, the learner sets the bid using the plug-in rule with α̂_{s−1} and F̂_{s−1}(·; α̂_{s−1}), where α̂_{s−1} is the estimator of α_0 based on the data observed in episode s − 1 and F̂_{s−1}(·; α̂_{s−1}) is the estimator of the noise distribution (CDF) shown in Eq. (15). To compute α̂_s, we minimize the following MLE loss function L_s(α), where ε_t(α) = b_t − α · x_t and ∥α∥_1 ≤ W. The pseudo code is presented in Algorithm 3 in Appendix E. Indeed, L_s(α) is convex almost everywhere, since F̂_s and 1 − F̂_s are both log-concave by construction. We can still solve this optimization problem via a gradient descent approach, but we need to recompute F̂_s(·; α) to obtain the gradient of the loss function L_s at α in each iteration of gradient descent. If F̂_s can be computed efficiently, then combined with gradient descent our algorithm is computationally efficient. In this paper, we focus on the regret analysis and leave the computational efficiency as a future direction.
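The plug-in bid can be computed without inverting ϕ by a direct grid search over the utility (a sketch; the grid resolution is an arbitrary choice, not from the paper):

```python
def plug_in_bid(v, alpha_hat_dot_x, F_hat, grid=1000):
    """argmax over a bid grid of (v - b) * F_hat(b - alpha_hat . x_t),
    using the episode's estimated noise CDF F_hat."""
    best_b, best_u = 0.0, float("-inf")
    for i in range(grid + 1):
        b = i / grid
        u = (v - b) * F_hat(b - alpha_hat_dot_x)
        if u > best_u:
            best_b, best_u = b, u
    return best_b
```

With an accurate F_hat, this matches the clairvoyant bid up to the grid resolution.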
Regret Analysis. We provide the regret bound for the full information feedback model in the following theorem: our algorithm achieves regret at most Õ(√(dT)). The full proof is deferred to Appendix C.2.

Lower Bound
In this section, we show a lower bound on the regret for the full information feedback model with a known noise distribution, i.e., F is known and m_t is always revealed at the end of each time t.
As we know, if α_0 is known, the optimal bidding strategy is b*_t = α_0 · x_t + ϕ^{−1}(β_0(x_t) − α_0 · x_t). Let H_t be the history observed up to time t, and consider the set of bidding policies Π in which the bid at time t takes this clairvoyant form with α_0 replaced by some estimator α_t based on H_t (Eq. (19)). Here α_t can be regarded as an (inaccurate) estimator of α_0, and Π captures a wide class of informational bidding policies. Indeed, when we restrict attention to the bidding policies in Π, we can show that any bidding policy π ∈ Π must incur expected regret Ω(√T), as stated in the following theorem. The proof is rather technical and is deferred to Appendix D.
Theorem 5.1. For any T, assume that the market noise z_t, 1 ≤ t ≤ T, is fully observed and that z_t ∼ N(0, σ^2), where σ is known. Let Π be the set of bidding policies π defined in Eq. (19); then any bidding policy π ∈ Π must incur expected regret Ω(√T).

Future Work
In this paper, we assume a linear model for m_t with respect to the context x_t; a natural future direction is to extend to non-linear models. We also assume the context x_t is randomly sampled from a fixed but unknown distribution; it would be interesting to design a no-regret bidding algorithm for contextual first price auctions when the contexts are generated adversarially. In the future, we are interested in generalizing our algorithms to other contextual untruthful auctions (beyond first price auctions). In addition, we assume the learner can evaluate the value β_0(x_t) before submitting her bid; it would be exciting to incorporate the setting where the learner cannot observe her value unless she wins the auction.

A Useful Technical Lemmas
Lemma A.1 (Schur complement [18]). Let M = [A, B; B^T, C], where A is positive definite (invertible) and C is symmetric. Then M ⪰ 0 if and only if the Schur complement C − B^T A^{−1} B ⪰ 0.
Lemma A.2 (Matrix Inverse Lemma). Let A be invertible; then for any constant λ > 0, we have the stated identity.
To prove Theorem 3.1, we introduce some auxiliary lemmas. Our proof is inspired by Javanmard and Nazerzadeh [19]. First, we bound the difference between the optimal expected utility and the expected utility achieved by our bidding algorithm at each time t in episode s (s ≥ 2) by Θ(|x_t · (α_0 − α̂_{s−1})|^2). The proof involves a case analysis.
Lemma B.1. For any s ≥ 2 and any t ∈ Γ_s, let b*_t be the optimal bid given context x_t. Then u_t(b*_t) − u_t(b_t) ≤ (2B_2 + B_3) |x_t · (α_0 − α̂_{s−1})|^2, where B_2 and B_3 are positive constants defined in Assumption 2.3.

Proof. Firstly, for any
By the second-order Taylor theorem, we have, for some b̄ between b_t and b*_t, the first stated expansion. Then, again by the second-order Taylor theorem, for some b̄ between b_t and b*_t, we obtain the second expansion. In summary, we have the claimed bound, where the second inequality holds because (ϕ^{−1})′(x) ≤ 1 for all x.
The following technical lemma is used to bound the regret in each episode, as shown in the proof of Theorem 3.1 below; its proof is deferred to Appendix B.
Lemma B.2. In each episode s ≥ 1, the stated bound holds with probability at least 1 − δ/(2S).
Proof. By the second-order Taylor theorem, we have, for some ᾱ on the line segment between α_0 and α̂_s, the stated expansion. Given the definition of L_s(α), we obtain the stated decomposition, where η_t(α) and ζ_t(α) are defined as follows. Based on the construction of our algorithm, (x_t, b_t) is independent of z_t. Thus, b_t − α_0 · x_t is independent of z_t for any t ∈ Γ_s. The claim then follows from Hoeffding's inequality and a union bound over each coordinate of ∇L_s(α_0).
Given the above lemmas, we now turn to prove Theorem 3.1.
Proof of Theorem 3.1. We first bound the total regret in each episode s ≥ 2 using Lemma B.1, with C = 2B_2 + B_3. We then decompose the term ⟨α_0 − α̂_{s−1}, Σ(α_0 − α̂_{s−1})⟩ in the stated way. By Hoeffding's inequality and a union bound over all indices i, j ∈ [d], the stated bound holds with probability at least 1 − δ/(2S). Combining with Lemma B.2, for any s ≥ 2, the bound holds with probability at least 1 − δ/S. Therefore, by a union bound over all episodes s = 2, ..., S, with probability at least 1 − δ, the total regret is bounded as claimed, where the first inequality holds because u_t(·) is bounded in [0, 1]. Finally, we bound S: it is easy to verify that S ≤ log log T. This completes the proof.
Theorem 3.3. Suppose Assumptions 2.2, 2.3 and 3.2 hold. Setting T_s = T^{1−2^{−s}}, s = 1, 2, ..., then with probability at least 1 − δ, the regret (w.r.t. the benchmark defined in Eq. (14)) achieved in the binary feedback model with partially known noise distribution is bounded by Õ(√(log(d) T)), where Õ ignores log T and log log T terms.
To begin with, it is straightforward to show the following two propositions, which can be directly derived from Assumption 2.2.
Proposition B.3. The density function f_0 is differentiable and log-concave.

Proposition B.4. There exist positive constants h
To prove Theorem 3.3, we provide some auxiliary lemmas. The first lemma bounds the difference between the optimal expected utility and the expected utility achieved by our algorithm at each time t in episode s.
Given the above lemma, to bound the regret we need to bound |⟨µ̂_s − µ_0, x_t⟩|^2 and (ρ̂_s − ρ_0)^2 simultaneously in each episode s. First, we show that L_s(µ, ρ) is γ-strongly convex with high probability in Lemma B.6.
Denote the minimum eigenvalue of Σ by λ_0 > 0 and set λ* accordingly. Next, we prove that E[x̃_t x̃_t^T] ⪰ λ* I with high probability, where I is the identity matrix and x̃_t denotes the augmented context vector. On the other hand, it is trivial to show Σ − λ* I ⪰ 0; therefore M′ ⪰ 0, which is equivalent to E[x_t]E[x_t]^T ⪯ λ* I. In addition, λ_max(x̃_t x̃_t^T) ≤ 1, since ∥x_t∥_∞ ≤ 1 and b_t ≤ 1 for any time t. Then, by the matrix Chernoff bound, setting δ = 2S(d + 1)e^{−2ε^2 T_s}, we obtain that with probability at least 1 − δ/(2S), and for T_s sufficiently large, L_s(µ, ρ) is γ-strongly convex almost everywhere. Given the strong convexity of L_s(µ, ρ), we can bound the distance ∥(µ̂_s, ρ̂_s) − (µ_0, ρ_0)∥_2^2 in the following lemma.
Lemma B.7. Suppose Assumption 3.2 holds. For any δ ∈ (0, 1), the stated bound holds with probability at least 1 − δ/S, where γ is defined in the statement of Lemma B.6.
Given the above technical lemmas, we are ready to prove Theorem 3.3.
Proof of Theorem 3.3. Let b*_t = arg max_b u(b, x_t). Then the regret achieved in each episode s can be represented as follows, for some constant C′ depending on C and λ, and holds with probability at least 1 − δ/S, where γ is defined in the statement of Lemma B.6. For any z ∈ [−W, 1 + W], the probability that there exists a point y ∈ {z_t}_{t∈Γ_s} such that |y − z| ≤ r_s is at least the stated quantity, for some ᾱ on the line segment between α and α̂_s. Given the definition of L_s(α; κ_s), we obtain the stated decomposition, where η_t(α) and ζ_t(α) are defined as before. Based on the construction of our algorithm, (x_t, b_t) is independent of z_t. Therefore, ε_t(α_0) = b_t − α_0 · x_t is independent of z_t for any t ∈ Γ_s. Thus, by Theorem 4.2 and Proposition C.1, the bound holds with probability at least 1 − δ/(4S). Then, by Hoeffding's inequality and a union bound, it holds with probability at least 1 − δ/(2S). By the optimality of α̂_s, plugging into Eq. (20), and using ζ_t(α) ≥ l_W from Proposition C.1, we obtain the following. Recall that X_s denotes the context matrix with rows x_t, t ∈ Γ_s, corresponding to the T_s auctions in episode s. The above inequality implies the stated bound with probability at least 1 − δ/(2S), where the second inequality holds by Proposition C.2 and the third by the Cauchy-Schwarz inequality.
Proof. Let $\hat F_s$ be the empirical distribution of the samples $\{m_t - \hat\alpha_s \cdot x_t\}_{t \in \Gamma_s}$.
First, we give a uniform convergence bound for $|\hat F_s(z) - F(z)|$. The proof is analogous to the proof of Lemma 1 in [15]. The main challenge is that we cannot directly apply the DKW inequality, since $\hat\alpha_s$ depends on $z_t$, $t \in \Gamma_s$. To handle this challenge, we bound $\hat F_s(z)$ from below and from above separately. Since $\|\hat\alpha_s - \alpha_0\|_2 \le \kappa_s$ and $\|x_t\|_2 \le 1$, conditioned on $\|\hat\alpha_s - \alpha_0\|_2 \le \kappa_s$, for any $\gamma > 0$, with $A$ defined in Lemma C.3, the regret achieved in episode $s$ can be bounded, where the second inequality holds under the condition stated in Lemma C.3.

To prove this theorem, we first state the following auxiliary lemmas.

Lemma D.1. Suppose $b_t^* > 0$. Then there exists a constant $c_1 > 0$ (depending on $W$ and $\sigma$) such that $u_t''(b_t^*) < -c_1$ for any $t \ge 1$. Further, there exists a constant $\delta > 0$ (depending on $W$ and $\sigma$) such that $u_t'(b) > 0$ for any $b \in [0, \delta)$, and hence $b_t^* \ge \delta$.

Let $g(\cdot)$ and $G(\cdot)$ be the PDF and CDF of the standard normal distribution, respectively. Then we can write the utility function and its derivatives accordingly. Considering the optimal bid $b_t^* > 0$, the first-order condition $u_t'(b_t^*) = 0$ holds.

Lemma D.2 (Javanmard and Nazerzadeh [19]). Let $x \in \mathbb{R}^d$ be a random vector whose coordinates are chosen independently and uniformly at random from $\{-1, 1\}$. Further, suppose that $v \in \mathbb{R}^d$ and $\delta > 0$ are deterministic.

Lemma D.3 (Javanmard and Nazerzadeh [19]). Consider the linear model (1) and assume that the maximum bids from competitors $m_t$, $1 \le t \le T$, are fully observed, and that the contexts $x_t$ are generated i.i.d., with coordinates chosen independently and uniformly at random from $\{-1, 1\}$ at each time $t$. We further assume that the noise in the market value is generated as $z_t \sim N(0, \sigma^2)$.
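The DKW inequality that cannot be applied directly here reads, for the empirical CDF $\hat F_n$ of $n$ i.i.d. samples from $F$ (stated with Massart's tight constant):

```latex
% Dvoretzky–Kiefer–Wolfowitz inequality:
\Pr\!\Big[\sup_{z \in \mathbb{R}} \big|\hat F_n(z) - F(z)\big| > \varepsilon\Big]
  \;\le\; 2 e^{-2 n \varepsilon^2}, \qquad \varepsilon > 0 .
% The residuals m_t - \hat\alpha_s \cdot x_t are not i.i.d. draws from F,
% because \hat\alpha_s depends on the same noise realizations z_t; this is
% why the argument above bounds \hat F_s(z) from below and above separately.
```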
Then, conditional on the historical contexts, the bound follows, where the first inequality holds because $G((\delta - \alpha_0 \cdot x_t)/\sigma) \le 1$ and the second inequality holds by the properties of the standard normal distribution and $\alpha_0 \cdot x_t \le W$. Given the above inequality, we have $b_t^* \ge \delta$. By the second-order Taylor theorem, the expansion holds for some $b$ between $b_t$ and $b_t^*$. For any $b_t$ generated by a bidding policy in $\Pi$:

• When $b_t > 0$, the corresponding bound holds.

Then we can lower bound the minimax regret of any policy in $\Pi$ by $\Omega(\sqrt{T})$, up to logarithmic factors.
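The role of Lemma D.1 in the Taylor step above can be sketched as follows (illustrative, reusing the lemma's constant $c_1$ and assuming $u_t'' < -c_1$ holds at the intermediate point):

```latex
% Second-order Taylor expansion of u_t around the optimal bid b_t^*,
% using the first-order condition u_t'(b_t^*) = 0:
u_t(b_t) \;=\; u_t(b_t^*) + u_t'(b_t^*)\,(b_t - b_t^*)
  + \tfrac{1}{2}\, u_t''(\bar b)\,(b_t - b_t^*)^2
\;\le\; u_t(b_t^*) - \tfrac{c_1}{2}\,(b_t - b_t^*)^2 ,
% for some \bar b between b_t and b_t^*. The per-round regret is thus at
% least (c_1/2)(b_t - b_t^*)^2, converting estimation error of the optimal
% bid directly into regret, which drives the lower bound.
```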

Theorem 3.1. Suppose Assumptions 2.2 and 2.3 hold. Setting $T_s = T^{1-2^{-s}}$, $s = 1, 2, \cdots$, then with probability at least $1 - \delta$, the regret achieved in the binary feedback model with known noise distribution is at most $R(T) \le \widetilde{O}(\sqrt{\log(d/\delta)\,T})$, where $\widetilde{O}$ omits $\log\log T$ terms.

Theorem 4.3 (Regret Bound). Suppose Assumptions 2.2, 2.3, and 3.2 hold, and $T$ is sufficiently large. Given $T_s = T^{1-2^{-s}}$, $s = 1, 2, \cdots$, then with probability at least $1 - \delta$, the regret is bounded by $R(T) \le \widetilde{O}(\sqrt{d \log(d/\delta)\,T})$, where $\widetilde{O}$ ignores $\log T$ and $\log\log T$ terms.

Proof Sketch. The main challenge in this proof is to bound the difference between the estimator $\hat\alpha_s$ and $\alpha_0$, as well as the distance between $\hat F_s$ and $F$. First, we bound the $L_2$ distance between $\hat\alpha_s$ and $\alpha_0$ in Lemma C.3. The proof of this lemma generalizes the idea of Theorem 3.1, combined with the uniform convergence bound on $|\hat F_s(z; \alpha_0) - F(z)|$ from Theorem 4.2. Then, we show in Lemma C.4 that if $\|\hat\alpha_s - \alpha_0\|_2 \le O(\sqrt{d \log(T_s)/T_s})$ holds, then $|\hat F_s(z; \hat\alpha_s) - F(z)| \le O(\sqrt{d \log(T_s)/T_s})$ for all $z \in [-W, 1+W]$ with high probability. Since Lemma C.3 implies that $\|\hat\alpha_s - \alpha_0\|_2 \le O(\sqrt{d \log(T_s)/T_s})$ holds with high probability, this yields a uniform convergence bound on $|\hat F_s(z; \hat\alpha_s) - F(z)|$.
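The choice $T_s = T^{1-2^{-s}}$ balances the per-episode regret; a short calculation shows why (a sketch of the standard doubling-style argument, assuming the estimate from episode $s-1$ gives per-round error $\widetilde{O}(\sqrt{d/T_{s-1}})$, which is not the paper's exact constant):

```latex
% The episode lengths satisfy
T_s^2 / T_{s-1} \;=\; T^{\,2(1 - 2^{-s}) - (1 - 2^{-(s-1)})} \;=\; T^{1} \;=\; T ,
% so the regret accumulated in episode s is at most
T_s \cdot \widetilde{O}\!\big(\sqrt{d / T_{s-1}}\big)
  \;=\; \widetilde{O}\!\big(\sqrt{d \, T_s^2 / T_{s-1}}\big)
  \;=\; \widetilde{O}\!\big(\sqrt{d T}\big).
% Summing over the S \le \log\log T episodes gives the claimed bound,
% at the cost of only \log\log T extra factors.
```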

B Omitted Proofs from Section 3

B.1 Proof of Theorem 3.1
Proof of Theorem 3.1. For convenience, we restate the theorem: Suppose Assumptions 2.2 and 2.3 hold. Setting $T_s = T^{1-2^{-s}}$, $s = 1, 2, \cdots$, then with probability at least $1 - \delta$, the regret achieved in the binary feedback model with known noise distribution is at most $R(T) \le \widetilde{O}(\sqrt{\log(d/\delta)\,T})$, where $\widetilde{O}$ omits $\log\log T$ terms.

Suppose Lemma C.3 holds and Lemma C.4 (setting $\delta := \delta/2S$) holds simultaneously. Then the stated inequality holds with probability at least $1 - \delta/S$ by the union bound. Taking a union bound over the $S$ episodes, and using the facts that $T_s = \sqrt{T \cdot T_{s-1}}$ and $S \le \log\log T$, we complete the proof.

D Omitted Proofs from Section 5

D.1 Proof of Theorem 5.1