No-Regret Learning in Bilateral Trade via Global Budget Balance

Bilateral trade models the problem of intermediating between two rational agents — a seller and a buyer — both characterized by a private valuation for an item they want to trade. We study the online learning version of the problem, in which at each time step a new seller and buyer arrive and the learner has to set prices for them without any knowledge about their (adversarially generated) valuations. In this setting, known impossibility results rule out the existence of no-regret algorithms when budget balance has to be enforced at each time step. In this paper, we introduce the notion of global budget balance, which only requires the learner to fulfill budget balance over the entire time horizon. Under this natural relaxation, we provide the first no-regret algorithms for adversarial bilateral trade under various feedback models. First, we show that in the full-feedback model the learner can guarantee Õ(√T) regret against the best fixed prices in hindsight, and that this bound is optimal up to poly-logarithmic terms. Second, we provide a learning algorithm guaranteeing a Õ(T^{3/4}) regret upper bound with one-bit feedback, which we complement with a Ω(T^{5/7}) lower bound that holds even in the two-bit feedback model. Finally, we introduce and analyze an alternative benchmark that is provably stronger than the best fixed prices in hindsight and is inspired by the literature on bandits with knapsacks.


Introduction
Bilateral trade is a classic economic problem where two agents, a seller and a buyer, are interested in trading a good. Both agents are characterized by a private valuation for the item, and their goal is to maximize their own utility. Solving this problem requires the design of a mechanism that intermediates between the two parties, facilitating the trade. Ideally, the mechanism should maximize efficiency (i.e., trade whenever the buyer's valuation exceeds the seller's) while ensuring that agents behave according to their true preferences (incentive compatibility), and that the utility each agent derives from participating in the mechanism is non-negative (individual rationality). These properties ensure favorable outcomes for the agents, yet they do not guarantee the economic viability of the mechanism. To see this, consider the following mechanism ℳ. ℳ asks the agents for their valuations, s for the seller and b for the buyer, and makes the trade happen if it is convenient (i.e., if s ≤ b). In case of a trade, ℳ then charges s to the buyer and pays b to the seller. It is not hard to see that ℳ enforces incentive compatibility and individual rationality, and is efficient by design. However, it exhibits the major drawback of allowing the intermediary to incur a net loss whenever b > s. To avoid such situations, a crucial constraint in bilateral trade is budget balance, which forbids the mechanism from subsidizing the agents. As highlighted by the above example, an incentive compatible mechanism maximizing efficiency for bilateral trade may not be budget balanced. This phenomenon was first observed by Vickrey [1961]; subsequently, Myerson and Satterthwaite [1983] provided a more general impossibility result by showing the existence of instances where a fully efficient mechanism that satisfies incentive compatibility, individual rationality, and budget balance does not exist. This result holds even when probabilistic information on the agents' valuations is available. To circumvent these
impossibility results, the extensive subsequent research primarily focuses on finding approximately efficient mechanisms in the Bayesian setting. There, various incentive compatible mechanisms exist that give a constant-factor approximation to the social welfare (see, e.g., Blumrosen and Dobzinski [2014], Kang et al. [2022]), while more recent works also consider the harder problem of approximating the gain from trade [McAfee, 2008, Blumrosen and Mizrahi, 2016, Brustle et al., 2017, Deng et al., 2022, Fei, 2022]. While the Bayesian assumption of having perfect knowledge about the underlying distributions of valuations is, in some sense, necessary for extracting meaningful approximations to the social welfare [Dütting et al., 2021], it is important to observe that this assumption is oftentimes unrealistic. Following the recent line of work initiated by Cesa-Bianchi et al. [2021], we study this fundamental mechanism design problem through the lens of regret minimization in a repeated setting where at each time t a new seller/buyer pair arrives. The seller arriving at time t has a private valuation s_t representing the lowest price they are willing to accept for the item. Analogously, the buyer has a private valuation b_t representing the highest price they are willing to pay for the item. The learner, without any knowledge about the private valuations at the current time t, posts two (possibly randomized) prices: p_t to the seller and q_t to the buyer. A trade happens when both agents agree to trade, i.e., when s_t ≤ p_t and q_t ≤ b_t. After posting (p_t, q_t), the learner observes some feedback about the transaction, and is awarded the gain from trade GFT_t(p_t, q_t) := I{s_t ≤ p_t}I{q_t ≤ b_t}(b_t − s_t). The goal of the learner is to maximize the overall gain from trade or, equivalently, to minimize the regret with respect to the best price in hindsight. Prior research has investigated the impact of different budget balance notions on the problem's learnability. When the mechanism is constrained to enforce per-round strong budget balance (i.e., p_t = q_t at each time step t), it is possible to attain sublinear regret only when the sequence of valuations is drawn i.i.d. from some fixed unknown distribution, and either the learner has full feedback, or some stringent assumptions regarding the sequence of valuations are enforced. Specifically, in the partial feedback regime, valuations have to be drawn i.i.d. from a smooth distribution, independently for the seller and the buyer [Cesa-Bianchi et al., 2021, Cesa-Bianchi et al., 2024]. If the learner is only required to enforce (step-wise) weak budget balance (i.e., p_t ≤ q_t for each t), then Azar et al. [2022] provide a learning algorithm achieving sublinear 2-regret when the sequence of valuations is generated by an oblivious adversary.¹ They also show that this result is tight: no algorithm can achieve sublinear (2 − ε)-regret in the adversarial case, for any constant ε > 0. In an attempt to overcome this barrier, Cesa-Bianchi et al. [2023] show that sublinear regret can be achieved beyond the i.i.d. stochastic setting, under the assumption that the adversary is constrained to choose randomized (possibly non-stationary) sequences of valuations that are not "too concentrated" (i.e., under a σ-smooth adversary model). Inspired by the positive results obtained in the literature by transitioning from strong to weak budget balance, we investigate the following natural open question: Is it possible to achieve sublinear regret against an oblivious adversary in the repeated bilateral trade problem under a realistic notion of budget balance? We answer this question positively by introducing global budget balance, where the learner is required to maintain budget balance only "overall". The idea behind global budget balance is to allow the learner to reinvest the profit gained in previous rounds (obtained by posting a lower price to the seller than to the buyer), with the constraint that the learner cannot subsidize the market over the whole time horizon. Formally, a learning algorithm that posts prices (p_1, q_1), (p_2, q_2), . . . is globally budget balanced if the following inequality holds almost surely: Σ_{t=1}^T Profit_t(p_t, q_t) ≥ 0.
The profit Profit_t(p_t, q_t) := I{s_t ≤ p_t}I{q_t ≤ b_t}(q_t − p_t) is non-negative when p_t ≤ q_t, and may drop below zero only by posting prices that are not step-wise budget balanced, i.e., p_t > q_t. We argue that this constraint is more realistic than the restrictive notions of per-round budget balance. For instance, in contexts like ride-hailing platforms (such as Uber and Lyft), the platform might opt to forgo some short-term profit to enhance other metrics, like the overall welfare of the system.
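To make the two quantities concrete, here is a minimal sketch of the per-round profit and of the global budget balance check (helper names are ours, not the paper's):

```python
def profit(p, q, s, b):
    """Profit of posting price p to the seller and q to the buyer against
    valuations (s, b): (q - p) if both agents accept, 0 otherwise."""
    return (q - p) if (s <= p and q <= b) else 0.0

def is_globally_budget_balanced(prices, valuations):
    """Global budget balance: the cumulative profit over the whole horizon is
    non-negative, even though individual rounds may run a deficit."""
    total = sum(profit(p, q, s, b)
                for (p, q), (s, b) in zip(prices, valuations))
    return total >= 0
```

For instance, a first round with a positive margin (prices (0.3, 0.7)) can pay for a later non-budget-balanced round (prices (0.6, 0.4)).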
¹The α-regret measures the difference between the gain from trade of the best fixed price in hindsight and α times that of the algorithm (see, e.g., Kakade et al. [2009]).

Overview of Our Results
We report here an overview of our results; we also refer to Table 1 for a comparison with the state of the art. In this paper we introduce the notion of global budget balance for the repeated bilateral trade problem, and provide the following results in terms of regret with respect to the best fixed price in hindsight in the adversarial case: • In the full feedback model, where the learner observes the seller and buyer valuations after posting prices, we design a learning algorithm characterized by a Õ(T^{1/2}) regret upper bound (Theorem 4.2). We also prove that no learning algorithm can improve this bound by more than a poly-log(T) factor (Theorem 4.4).
• In the one-bit feedback model, where the learner can only observe whether the trade happened or not, we show that it is possible to guarantee a Õ(T^{3/4}) regret upper bound (Theorem 5.4).
Then, we provide an Ω(T^{5/7}) lower bound (note 5/7 ≈ 0.714), which holds even in the two-bit feedback model, where the learner can observe which agent accepted and which declined the offered prices (Theorem 5.5).
These results demonstrate how the notion of global budget balance enables online learnability, allowing us to provide the first no-regret algorithms for repeated bilateral trade within an oblivious adversary framework, in contrast to the per-round approaches considered in previous works. Furthermore, the regret rates separate full feedback from the two partial feedback models (one or two bits). Under partial feedback, the surprising lower bound of Ω(T^{5/7}), together with the Õ(T^{3/4}) upper bound, marks a clear separation between this problem and other partial feedback models (e.g., partial monitoring [Bartók et al., 2014] and online learning with feedback graphs [Alon et al., 2017], where the minimax regret has been characterized to fall into one of three admissible rates: √T, T^{2/3}, and T). This separation had already been hinted at in the special case of a σ-smooth adversary by Cesa-Bianchi et al. [2023]. Finally, inspired by work on bandits with knapsacks (see Section 1.3 for detailed references), we introduce a stronger learning benchmark: the best fixed feasible distribution over prices. Such a benchmark is allowed to post prices that are not per-round budget balanced, but is globally budget balanced "in expectation".
• We show that there exists a constant ε0 > 0 such that it is impossible to achieve sublinear α-regret against this benchmark for any α ∈ [1, 1 + ε0) (Theorem 6.2).
• We prove that the best feasible distribution over prices collects at most twice the gain from trade extracted by the best fixed price in hindsight (Theorem 6.3).This implies the existence of algorithms with sublinear 2-regret against this new benchmark.
• We show that the multiplicative gap of 2 between the gain from trade attainable by the two different benchmarks is tight (Theorem 6.5).
First, we observe that the task of learning the best feasible distribution over prices is reminiscent of the problem of bandits with knapsacks in the presence of replenishment [Kumar and Kleinberg, 2022, Slivkins et al., 2023, Bernasconi et al., 2024a]. In contrast to previous work, we consider the more challenging adversarial setting and provide learning algorithms whose competitive ratio is an absolute constant. In the adversarial bandits with knapsacks literature, the only setting where sublinear Θ(1)-regret can be achieved is when the available budget is Ω(T) [Castiglioni et al., 2022], while in general the competitive ratio is O(log T) [Immorlica et al., 2022]. Second, the tight multiplicative gap of 2 between the two benchmarks suggests that, to design a better learning algorithm with sublinear α-regret with respect to the best feasible distribution (for α ∈ (1 + ε0, 2)), a more direct approach is needed.

Challenges and Techniques
The key aspects that distinguish bilateral trade from standard online learning models with full or bandit feedback lie in two main features: the action space and the challenging partial feedback structure. The applicability of previous results to our model is significantly limited due to adversarial input sequences and the need to handle the global budget balance constraint effectively.
Action space. The action space is continuous and bidimensional (prices belong to [0, 1]^2), and neither the gain from trade nor the profit function is continuous in the prices posted. This makes it challenging to discretize the space with a finite grid such that the best prices on the grid perform similarly to the best prices in [0, 1]^2, while keeping the grid small enough that its best pair of prices can be learned online. In the absence of any probabilistic or smoothness assumption on the adversary, we cannot rely on a "smoothing trick" to induce regularity on the expected gain from trade, as in previous works [Cesa-Bianchi et al., 2023].
Partial Feedback. Partial feedback models for bilateral trade are inherently challenging. The one-bit feedback model only informs the learner on whether the trade happened or not, which is significantly less informative than the traditional bandit feedback model, since the learner cannot even reconstruct the gain from trade received for the specific prices it posted. For example, if the learner posts price 1/2 to both agents, and they accept the trade, there is no way of distinguishing the case in which the gain from trade is constant (e.g., valuations are (0, 1)) from the case in which the gain from trade is arbitrarily small (e.g., valuations are (1/2 − ε, 1/2 + ε) for some small ε > 0).
On the other hand, if one of the two agents rejects the trade, then the learner can only infer loose bounds on the valuations.
Gain from Trade vs. Profit trade-off. Global budget balance requires that the cumulative sum of profits at the end of the time horizon be greater than or equal to 0. Therefore, the learner has to maximize its cumulative gain from trade while accumulating enough profit to enforce global budget balance. Balancing this trade-off is a complex task due to the different nature of the two objectives: gain from trade is maximized by setting identical prices for both agents, whereas profit is maximized by selecting prices that are "far from each other". To see this, consider an instance where valuations are either (s_t, b_t) = (0, 1) or (s_t, b_t) = (1/2 − ε, 1/2 + ε) with equal probability, for some small ε > 0. To achieve maximum expected profit, the learner would always set the price at 0 for the seller and 1 for the buyer. On the other hand, to maximize the expected gain from trade, the learner would always offer 1/2 to both agents.
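The tension can be checked numerically on this very instance (a sketch with ε = 0.01; helper names are ours):

```python
eps = 0.01
instance = [(0.0, 1.0), (0.5 - eps, 0.5 + eps)]  # equiprobable (s, b) pairs

def gft(p, q, s, b):
    """Gain from trade when posting p to the seller and q to the buyer."""
    return (b - s) if (s <= p and q <= b) else 0.0

def profit(p, q, s, b):
    return (q - p) if (s <= p and q <= b) else 0.0

# Prices (0, 1) trade only when (s, b) = (0, 1): expected profit 1/2.
exp_profit_wide = sum(profit(0.0, 1.0, s, b) for s, b in instance) / 2
exp_gft_wide = sum(gft(0.0, 1.0, s, b) for s, b in instance) / 2

# Price 1/2 to both agents trades in both cases: higher expected GFT, zero profit.
exp_profit_mid = sum(profit(0.5, 0.5, s, b) for s, b in instance) / 2
exp_gft_mid = sum(gft(0.5, 0.5, s, b) for s, b in instance) / 2
```

The "wide" prices collect expected profit 1/2 but expected gain from trade only 1/2, while the single price 1/2 collects expected gain from trade 1/2 + ε and no profit at all.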
Our Two-Phase Approach. Our learning algorithms follow a two-phase approach, initially focusing on maximizing profit through a carefully designed multiplicative grid P_K of candidate prices, and then switching to maximizing gain from trade on a different (additive) grid Q_K of non-budget-balanced prices. At a high level, the first phase is used to collect budget, which can subsequently be reinvested in the second phase. This poses several challenges due to the non-stationary nature of the adversary. The pairs of prices in Q_K, which are not per-round budget balanced, enable the algorithm to circumvent the negative results that hinder discretization in scenarios with per-round budget balance (see, e.g., the "needle in a haystack" phenomenon in Theorem 7 of Cesa-Bianchi et al. [2024]). The multiplicative nature of the grid P_K is crucial in ensuring that the gain from trade accrued by the algorithm during the first phase does not yield too much regret. This last result is surprising since, in the first phase, the learning algorithm is maximizing profit, an objective that is inherently orthogonal to the gain from trade. Finally, the scarcity of feedback in the one-bit feedback model is addressed via a carefully designed estimation technique that allows the learner to estimate the gain from trade at one point of the grid Q_K by posting two different prices. In contrast to the technique by Azar et al. [2022], our procedure is "asymmetric" in how it deals with the seller and the buyer, and it provides biased estimates.
Lower bounds. Besides the typical challenges in proving lower bounds for repeated bilateral trade with respect to the best fixed price in hindsight, in our model the learner is allowed to post prices that are not per-round budget balanced (i.e., it may be the case that p_t > q_t). This considerably complicates the construction of the hard instances, as any algorithm could temporarily sacrifice some profit by posting prices with p_t > q_t to extract a large gain from trade (which the fixed price benchmark may not be able to obtain). To deter this kind of behavior, we incorporate into the hard instances certain unfavorable trade opportunities that dissuade the learner from setting prices that are not budget balanced. This additional complication comes at some cost: in the partial (two-bit) feedback model we recover a lower bound of Ω(T^{5/7}), whereas the corresponding lower bound by Cesa-Bianchi et al. [2023] is Ω(T^{3/4}).
Partial feedback. Repeated bilateral trade naturally involves challenges due to partial feedback. Therefore, our work aligns with the research that explores online learning with feedback models beyond the conventional full feedback and bandit models. Our one- and two-bit feedback models share similarities with graph-structured feedback [Alon et al., 2017] and with the partial monitoring framework [Cesa-Bianchi et al., 2006, Bartók et al., 2014].
Bandits with knapsacks. Another related line of work is that of online learning under long-term constraints. Some works study the case of static constraints and develop projection-free algorithms with sublinear regret and constraint violations [Mahdavi et al., 2012, Jenatton et al., 2016], while others study the case of time-varying constraints [Mannor et al., 2009, Yu et al., 2017, Sun et al., 2017]. Badanidiyuru et al. [2018] introduced and solved the (stochastic) bandits with knapsacks (BwK) framework, in which they consider bandit feedback and stochastic objective and cost functions. In this model, the learner's objective is to maximize utility while guaranteeing that, for each of the m available resources, cumulative costs stay below a certain budget B. Other optimal algorithms for stochastic BwK were proposed by Agrawal and Devanur [2019], Immorlica et al. [2022]. The setting with adversarial inputs was first studied in Immorlica et al. [2022], where the baseline considered is the best fixed distribution over arms. Achieving no-regret is not possible under this baseline and, therefore, they provide no-α-regret guarantees for their algorithm. If we denote by ρ the per-iteration budget of the learner, the best-known guarantees on the competitive ratio α are 1/ρ in the case in which B = Ω(T) [Castiglioni et al., 2022], and O(log m log T) in the general case [Kesselheim and Singla, 2020]. When considering a benchmark similar to the adversarial BwK scenario, we show that our algorithm ensures an α = 2 guarantee. Kumar and Kleinberg [2022] recently proposed a generalization of the stochastic BwK model in which resource consumption can be non-monotonic; that is, resources can be replenished or renewed over time. Our model also admits replenishment. It should be noted that, in our setting, directly utilizing techniques from BwK is not feasible due to the complex continuous action space and the limited availability of feedback, which is less informative compared to traditional bandit feedback.

Repeated Bilateral Trade
We study the repeated bilateral trade problem in an online learning setting, where the learner has to enforce global budget balance and the sequence of valuations is generated by an oblivious adversary.
The learning protocol. The learner repeatedly interacts with the environment according to the following protocol (see also the pseudocode). At each time step t, a new pair of buyer and seller arrives, characterized by valuations s_t ∈ [0, 1] and b_t ∈ [0, 1], respectively. Without knowing s_t and b_t, the learner posts two prices: p_t ∈ [0, 1] to the seller, and q_t ∈ [0, 1] to the buyer. If both the seller and the buyer accept (i.e., s_t ≤ p_t and q_t ≤ b_t), then the learner is awarded the gain from trade GFT_t(p_t, q_t) := I{s_t ≤ p_t}I{q_t ≤ b_t}(b_t − s_t), which corresponds to the increase in social welfare generated by the trade. To simplify the notation, we omit the second argument of GFT_t (and of Profit_t) when the same price is posted to both agents. After posting the prices, the learner does not observe directly the gain from trade or the valuations, but receives some feedback z_t.

Global budget balance.
For each time step t, the notion of profit of the learner is naturally defined: if the agents accept prices p_t and q_t, then the learner receives a net profit of q_t − p_t ∈ [−1, 1]. Unlike the case of the gain from trade, the learner naturally knows its profit at the end of each time step, as it sets the prices and always observes whether the trade occurred. The learner maintains a budget B_t, which is initially 0 (B_0 = 0) and is updated at each time step according to the profit generated or consumed: B_t ← B_{t−1} + Profit_t(p_t, q_t). We restrict the learner to enforce a global budget balance property, which states that the final budget B_T has to be non-negative with probability 1. In practice, we require the learner to always post prices p_t, q_t such that p_t − q_t ≤ B_{t−1}.²

Feedback models. In this paper, we study three feedback models, which we list here in increasing order of intricacy: • Full feedback: at the end of each round, the agents reveal their valuations (i.e., z_t = (s_t, b_t)).
• Two-bit feedback: the agents only reveal their willingness to accept the prices offered by the learner (i.e., z_t is composed of the two bits (I{s_t ≤ p_t}, I{q_t ≤ b_t})). • One-bit feedback: the learner only observes whether the trade happened or not (i.e., z_t = I{s_t ≤ p_t}I{q_t ≤ b_t}). These feedback models are not only interesting from the theoretical learning perspective, but they are also well motivated in terms of practical applications. The full-feedback model can be used to describe sealed-bid-type auctions, while the two partial feedback settings (one- and two-bit) enforce the desirable property (for the agents) of revealing a minimal amount of information to the learner.
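The three feedback models can be summarized in a few lines (an illustrative sketch; the string labels are ours):

```python
def feedback(model, p, q, s, b):
    """What the learner observes after a round, under each feedback model."""
    seller_accepts = s <= p   # I{s_t <= p_t}
    buyer_accepts = q <= b    # I{q_t <= b_t}
    if model == "full":
        return (s, b)                           # both valuations revealed
    if model == "two-bit":
        return (seller_accepts, buyer_accepts)  # who accepted
    if model == "one-bit":
        return seller_accepts and buyer_accepts # whether the trade happened
    raise ValueError(model)
```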
Regret with respect to the best fixed price. The goal is to maximize the total gain from trade on a fixed and known time horizon T while enforcing the global budget balance condition. Following the literature on repeated bilateral trade [Cesa-Bianchi et al., 2021], we measure the performance of a learning algorithm in terms of its regret with respect to the best fixed price(s) in hindsight. For any learning algorithm A and sequence of valuations S = {(s_t, b_t)}_{t=1}^T we define: R_T(A, S) := max_{(p,q) ∈ [0,1]^2} Σ_{t=1}^T GFT_t(p, q) − E[Σ_{t=1}^T GFT_t(p_t, q_t)], where the sequence S induces the GFT_t functions and the expectation is with respect to the (possibly) randomized prices p_t and q_t generated by the learning algorithm A. One simple property that follows immediately from the definition is that, for any sequence of valuations, there exists a fixed pair of identical prices that maximizes the gain from trade. This means that the notion of "best price in hindsight" is well defined, and confirms the intuition that posting two different prices only helps during learning, but does not impact the maximization of the gain from trade in hindsight. Finally, we define the regret of an algorithm A (without the dependence on a specific sequence of valuations) as its worst-case performance: R_T(A) := sup_S R_T(A, S), where the sup is over the set of all possible sequences of T pairs of valuations.
A stronger benchmark: the best feasible distribution over prices. In this paper we also introduce a new (stronger) benchmark for the study of repeated bilateral trade: the best fixed budget-feasible distribution over prices. This benchmark captures the flexibility of the global budget balance condition, and it arises naturally from the literature on bandits with knapsacks. Before proceeding with the definition, let Δ([0, 1]^2) be the family of all probability measures over the measurable space ([0, 1]^2, ℬ), where ℬ denotes the Borel σ-algebra.
Definition 2.1 (Best feasible distribution). For any sequence S of seller's and buyer's valuations, we define the best fixed budget-feasible distribution over prices as the solution of: sup_{γ ∈ Δ([0,1]^2)} E_{(p,q)∼γ}[Σ_{t=1}^T GFT_t(p, q)] subject to E_{(p,q)∼γ}[Σ_{t=1}^T Profit_t(p, q)] ≥ 0, where E_{(p,q)∼γ} denotes that the expectation is with respect to prices (p, q) sampled according to γ.
This definition is well posed and there exist optimal distributions whose support contains either one or two pairs of prices.For a formal proof of this fact we refer to Proposition A.3 in Appendix A.1.
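Since an optimal distribution can be taken with support of at most two pairs of prices, the benchmark can be brute-forced on a finite price grid: enumerate single feasible pairs, plus mixtures of a profitable pair and a deficit pair with the mixing weight chosen so that the expected-profit constraint binds. A sketch (function names are ours):

```python
from itertools import product

def best_feasible_distribution(valuations, grid):
    """Brute-force the best budget-feasible distribution over a finite price grid,
    using the fact that an optimal distribution mixes at most two price pairs."""
    pairs = list(product(grid, repeat=2))  # (p, q): price to seller, price to buyer
    def totals(p, q):
        trades = [(s, b) for s, b in valuations if s <= p and q <= b]
        return sum(b - s for s, b in trades), sum(q - p for _ in trades)
    stats = [totals(p, q) for p, q in pairs]
    # A single pair is feasible iff its cumulative profit is non-negative.
    best = max((g for g, c in stats if c >= 0), default=0.0)
    # Mix a profitable pair with a deficit pair, weighted so that the expected
    # profit is exactly 0 (the constraint binds at an optimal mixture).
    for (g1, c1), (g2, c2) in product(stats, repeat=2):
        if c1 >= 0 > c2:
            lam = -c2 / (c1 - c2)  # weight on the profitable pair
            best = max(best, lam * g1 + (1 - lam) * g2)
    return best
```

On sequences where some trades can only be reached by deficit prices, such a mixture strictly beats every single feasible pair, which is exactly what makes this benchmark stronger than the best fixed price.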

Price Discretizations and Two-Phase Algorithm
In this section we present our two-phase meta-algorithm, preceded by two key results on how to discretize the price space in a way that ensures certain essential properties about profit and gain from trade. First, in Section 3.1 we prove that the gain from trade of the best fixed price in hindsight is close to that of the best pair of (non-budget-balanced) prices on a suitable "additive" grid. Second, in Section 3.2 we construct a hybrid "multiplicative-additive" grid in which each interval of a one-dimensional additive grid is further divided into sub-intervals of geometrically decreasing length. This grid has the surprising property that the profit of the best fixed pair of prices on it is close to the gain from trade generated by the best fixed price in the [0, 1] interval, up to a poly-logarithmic multiplicative factor. Finally, we introduce our two-phase learning approach via the meta-algorithm GFT-Max.

Additive Grid for Gain from Trade
For any integer K, we denote by U_K := {j/K : j ∈ {0, 1, . . . , K}} the uniform additive grid on [0, 1]. Similarly, we denote by Q_K := {((j+1)/K, j/K) : j ∈ {0, 1, . . . , K − 1}} the set of pairs formed by contiguous points in the K-uniform grid such that the first element of the pair is greater than the second. This latter grid can be proved to enjoy the desirable property of well-approximating the gain from trade of the best fixed price, while violating the global budget balance condition by a small amount. The argument behind the approximation guarantee is simple: if p* is the best fixed price in hindsight, then the pair of prices ((j*+1)/K, j*/K) such that p* belongs to the interval [j*/K, (j*+1)/K] is nearly as good as p*. We have the following result.
Then, by summing up the gain from trade obtained by posting ((j*+1)/K, j*/K), we immediately obtain the first part of the statement by applying at each t either case (i) or case (ii). The second part of the statement follows from the observation that the per-round deficit for posting prices ((j*+1)/K, j*/K) is at most 1/K. This concludes the proof.
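The grid of contiguous-point pairs and its approximation property can be checked numerically. The key observation is captured by the assertion below: if the best fixed price lies in [j/K, (j+1)/K], then posting (j+1)/K to the seller and j/K to the buyer captures every trade that the fixed price captures, at a per-round deficit of exactly 1/K (the sequence below is an illustrative random one):

```python
import random

def gft(p, q, s, b):
    """Gain from trade of posting price p to the seller and q to the buyer."""
    return (b - s) if (s <= p and q <= b) else 0.0

K, T = 20, 500
rng = random.Random(0)
vals = [tuple(sorted((rng.random(), rng.random()))) for _ in range(T)]  # s_t <= b_t

# Pairs of contiguous grid points, posting the higher price to the seller.
Q = [((j + 1) / K, j / K) for j in range(K)]

# Best fixed price (approximated on a fine grid) vs. best pair of the grid.
best_fixed = max(sum(gft(p, p, s, b) for s, b in vals)
                 for p in (i / 1000 for i in range(1001)))
best_pair = max(sum(gft(p, q, s, b) for s, b in vals) for p, q in Q)
```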

Multiplicative Grid for Profit
For any K, we construct the two-dimensional grid P_K starting from the points of the one-dimensional grid U_K. For each q ∈ U_K, we add to P_K points of the form (q − 2^{−i}, q) and (q, q + 2^{−i}), for i = 0, 1, . . . , ⌈log T⌉, so that they define intervals of geometrically decreasing length to the left and upward of (q, q). Formally, we define P_K as the union of P_K^− and P_K^+ (see also Figure 1): P_K^− := {(q − 2^{−i}, q) : q ∈ U_K, i ∈ {0, . . . , ⌈log T⌉}} ∩ [0, 1]^2 and P_K^+ := {(q, q + 2^{−i}) : q ∈ U_K, i ∈ {0, . . . , ⌈log T⌉}} ∩ [0, 1]^2. The additive-multiplicative nature of P_K endows it with two crucial properties: (i) its cardinality is O(K log T) and thus depends only linearly on K, and (ii) the profit of the best pair of prices in P_K is at least an Ω(1/log T) fraction of the GFT of the best fixed price in [0, 1], up to an additive factor of O(T/K).

Proposition 3.3. For any K and any sequence of valuations, we have: max_{(p,q) ∈ P_K} Σ_{t=1}^T Profit_t(p, q) ≥ Ω(1/log T) · max_{p ∈ [0,1]} Σ_{t=1}^T GFT_t(p) − O(T/K).

Proof. Fix the sequence S of valuations and let p* be the price maximizing the gain from trade in U_K. We have the following chain of inequalities:

max_{p ∈ [0,1]} Σ_{t=1}^T GFT_t(p) ≤ 2 Σ_{t=1}^T GFT_t(p*) + 2T/K = 2 Σ_{t=1}^T (b_t − p*) I{s_t ≤ p* ≤ b_t} + 2 Σ_{t=1}^T (p* − s_t) I{s_t ≤ p* ≤ b_t} + 2T/K. (3)

We bound separately the first and the second term on the right-hand side of the inequality. Starting with Σ_t (b_t − p*) I{s_t ≤ p* ≤ b_t}, we can rewrite the expression through a case analysis depending on the interval of the discretization in which b_t is located. For each time step t, we have

(b_t − p*) I{s_t ≤ p* ≤ b_t} ≤ 2 Σ_{i=0}^{⌈log T⌉} 2^{−i} I{s_t ≤ p*, b_t − p* ∈ [2^{−i}, 2^{−i+1})} + 1/T,

where we used the fact that 2^{−⌈log T⌉} ≤ 1/T. Let N_i be the number of time steps satisfying the condition s_t ≤ p* and b_t − p* ∈ [2^{−i}, 2^{−i+1}). Summing over t yields

Σ_{t=1}^T (b_t − p*) I{s_t ≤ p* ≤ b_t} ≤ 2 Σ_{i=0}^{⌈log T⌉} 2^{−i} N_i + 1 ≤ 2(⌈log T⌉ + 1) max_{(p,q) ∈ P_K} Σ_{t=1}^T Profit_t(p, q) + 1. (5)

To obtain Equation (5) we use that, for any i ∈ {0, . . . , ⌈log T⌉}, if N_i > 0 then it must be the case that p* + 2^{−i} ∈ [0, 1]. Therefore, for any such i, it is possible to obtain a profit of 2^{−i} on each of the corresponding N_i time steps by posting the pair (p*, p* + 2^{−i}), which is guaranteed to belong to P_K since p* ∈ U_K by construction, and hence 2^{−i} N_i ≤ max_{(p,q) ∈ P_K} Σ_{t=1}^T Profit_t(p, q). A similar argument can be carried over for the other term of Equation (3), yielding:

Σ_{t=1}^T (p* − s_t) I{s_t ≤ p* ≤ b_t} ≤ 2(⌈log T⌉ + 1) max_{(p,q) ∈ P_K} Σ_{t=1}^T Profit_t(p, q) + 1. (6)

Finally, we plug Equations (5) and (6) into Equation (3), and use K ≤ T to conclude the proof. □
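The construction of the multiplicative-additive grid described above can be sketched directly, with the pruning to [0, 1]^2 made explicit (T sets the number of geometric scales):

```python
import math

def build_P(K, T):
    """Grid of price pairs around each point u of the uniform grid, with margins
    shrinking geometrically (2^0, 2^-1, ...), kept inside [0, 1]^2."""
    U = [j / K for j in range(K + 1)]
    scales = [2.0 ** -i for i in range(math.ceil(math.log2(T)) + 1)]
    P_minus = {(u - d, u) for u in U for d in scales if u - d >= 0}
    P_plus = {(u, u + d) for u in U for d in scales if u + d <= 1}
    return P_minus | P_plus
```

Every pair (p, q) in the resulting grid has a strictly positive margin q − p, so its profit is non-negative on every round, and the cardinality is O(K log T): (K + 1) grid points times 2(⌈log T⌉ + 1) scales.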

Our Two-Phase Meta-Algorithm: GFT-Max
We describe our two-phase learning approach by presenting the meta-algorithm GFT-Max. For details we refer to the pseudocode. The algorithm takes in input a budget threshold B and an integer K (which induces the two grids P_K and Q_K), and employs two regret minimizers, A_P for the profit and A_G for the gain from trade, as internal routines. In the first phase (Line 1), the algorithm uses the function Profit-Max to maximize profit until the collected budget reaches the given threshold B. This is achieved by running a regret minimizer A_P over the set P_K of pairs of prices (see Section 3.2) using profit as objective. Then, in the second phase (from Line 2 onward), the algorithm exploits a regret minimizer A_G to maximize the gain from trade over the grid Q_K, whose prices are "almost budget-balanced" and consume only a small fraction of the previously acquired budget (see Proposition 3.1). In Section 4 and Section 5 we provide regret upper bounds for this meta-algorithm in the full and one-bit feedback models, respectively. The budget threshold B, the regret minimizers, and the grid parameter K are tuned according to the specific case considered.
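The two-phase structure can be sketched as follows. For brevity, a uniform-random sampler stands in for the two regret minimizers (which the actual pseudocode instantiates with tuned learning algorithms), so only the budget mechanics are illustrated:

```python
import math
import random

def run_gft_max(valuations, K, B, rng):
    """Skeleton of the two-phase loop: phase 1 accumulates profit on the
    multiplicative grid until the budget reaches B; phase 2 spends it on the
    almost-budget-balanced additive grid."""
    T = len(valuations)
    scales = [2.0 ** -i for i in range(math.ceil(math.log2(T)) + 1)]
    P = [(u - d, u) for u in (j / K for j in range(K + 1))
         for d in scales if u - d >= 0]          # margin q - p = d > 0
    Q = [((j + 1) / K, j / K) for j in range(K)]  # deficit exactly 1/K per trade
    budget, t = 0.0, 0
    while t < T and budget < B:                   # Phase 1: collect profit
        p, q = rng.choice(P)
        s, b = valuations[t]
        if s <= p and q <= b:
            budget += q - p
        t += 1
    while t < T:                                  # Phase 2: spend on GFT
        p, q = rng.choice(Q)
        s, b = valuations[t]
        if s <= p and q <= b:
            budget += q - p                       # -1/K per trade
        t += 1
    return budget
```

With B = √T and K = √T (the tuning used for the full feedback result), the final budget is at least B − T/K = 0 whenever phase 1 reaches the threshold, and is trivially non-negative otherwise, so global budget balance holds regardless of how the internal samplers behave.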

Full Feedback
We start by studying the full feedback model, where the agents reveal their valuations (s_t, b_t) at the end of each time step t. Here, the learner has counterfactual information regarding all the prices they could have posted, independently of the pair of prices actually posted at time t. In Section 4.1, we first present a two-phase learning algorithm (GFT-Max) which guarantees Õ(√T) regret with respect to the best fixed price in hindsight. In Section 4.2 we complement this result by proving that this is tight, up to poly-logarithmic terms. We start the analysis by looking at the first phase of GFT-Max, Profit-Max (reported as a function in the pseudocode of GFT-Max). We employ the Hedge algorithm (see, e.g., Section 5.3 of Slivkins [2019]) as the regret minimizer A_P, which is run over the action space given by the prices in P_K. As a first step, we note that the gain from trade of any fixed price in the first phase (which terminates at the stopping time τ) is not too large.
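Hedge itself, in the flavor used here (full feedback over a finite action set, multiplicative weights), admits a compact textbook sketch; this is not the paper's exact tuning, and rewards are assumed to lie in a bounded range:

```python
import math

class Hedge:
    """Multiplicative-weights regret minimizer over a finite action set with
    full feedback (see, e.g., Slivkins [2019], Section 5.3)."""

    def __init__(self, n_actions, eta):
        self.weights = [1.0] * n_actions
        self.eta = eta  # learning rate

    def distribution(self):
        """Current probability distribution over the actions."""
        z = sum(self.weights)
        return [w / z for w in self.weights]

    def update(self, rewards):
        """Full-feedback update: one observed reward per action."""
        self.weights = [w * math.exp(self.eta * r)
                        for w, r in zip(self.weights, rewards)]
```

In the first phase, the actions would be the pairs of prices in the profit grid and the reward of each pair its realized profit at time t, which the learner can compute for every pair thanks to full feedback.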
Lemma 4.1. Consider Profit-Max with budget threshold B, grid P_K, and the Hedge learning algorithm as A_P.
Then, with probability at least 1 − 1/T, we have max_{p ∈ [0,1]} Σ_{t=1}^τ GFT_t(p) ≤ O(log T) · (B + √(T log(T |P_K|))) + O(T/K).

Proof. We start by observing that, by Proposition 3.3, there exists a pair of prices (p*, q*) ∈ P_K such that max_{p ∈ [0,1]} Σ_{t=1}^τ GFT_t(p) ≤ O(log T) · Σ_{t=1}^τ Profit_t(p*, q*) + O(T/K). Hedge maintains a distribution x_t ∈ Δ(P_K) at each t ∈ [T], and such distributions guarantee that the expected regret is O(√(t log |P_K|)) [Slivkins, 2019]. In particular, given t ∈ [T], we have Σ_{r=1}^t Profit_r(p*, q*) ≤ Σ_{r=1}^t E_{(p,q)∼x_r}[Profit_r(p, q)] + O(√(t log |P_K|)). By applying the Azuma-Hoeffding inequality for each round t ∈ [T], and union bounding over the possible stopping times, we get that, with probability at least 1 − 1/T, we can write the following also for the (random) stopping time τ: Σ_{t=1}^τ Profit_t(p*, q*) ≤ Σ_{t=1}^τ Profit_t(p_t, q_t) + O(√(T log(T |P_K|))). This yields the following chain of inequalities: max_{p ∈ [0,1]} Σ_{t=1}^τ GFT_t(p) ≤ O(log T) · (Σ_{t=1}^τ Profit_t(p_t, q_t) + √(T log(T |P_K|))) + O(T/K) ≤ O(log T) · (B + 1 + √(T log(T |P_K|))) + O(T/K), where the last step uses that, by definition of the stopping time τ, the realized cumulative profit Σ_{t=1}^τ Profit_t(p_t, q_t) is at most B + 1. This concludes the proof.

□
Lemma 4.1 helps us bound the regret of GFT-Max up to the (random) time step τ, when the algorithm switches from profit to gain from trade maximization. Setting B = √T and K = √T, and using Hedge as the regret minimizer also in the second phase, yields the following result.

Theorem 4.2. Consider the repeated bilateral trade problem in the full feedback model. There exists a learning algorithm A that respects global budget balance and whose regret with respect to the best fixed price in hindsight satisfies R_T(A) ≤ 92 log^{3/2}(T) √T.
Proof. We prove that GFT-Max, with the proper choice of the budget threshold M, the grids, and the algorithms A_P and A_G, achieves the desired regret bound while enforcing global budget balance. First, we show that the algorithm enforces global budget balance for any value of the stopping time τ. By construction, the profit at time τ (i.e., right after the end of the first phase, in which we employ the subroutine Profit-Max) is at least M. Moreover, in each round t ∈ {τ + 1, ..., T} of the second phase, the profit is at least −1/K, since the posted pair (q + 1/K, q) loses at most the grid step 1/K. Hence, the cumulative profit at time T is at least M − T/K, which is non-negative for M = K = √T. Then, we prove the upper bound on the cumulative regret. We start by considering the regret accumulated in the interval {τ + 1, ..., T}: by Proposition 3.1, and by the regret bound of Hedge when the rewards range in [−1/K, 1] (with Hedge instantiated to maximize the gain from trade over the grid in the second phase), the gain from trade collected in the second phase is within Õ(√T) of that of the best fixed price. Then, assume that the bound in Lemma 4.1 holds, which happens with probability at least 1 − 1/T. By employing Equation (7) and Lemma 4.1, we can show that, with probability at least 1 − 1/T, the gain from trade of the algorithm is at least GFT_T(p*) − 90 log^{3/2}(T) √T, where p* is the best fixed price in hindsight. By rearranging, and using the fact that the gain from trade is always non-negative to control the complementary low-probability event, we conclude that the expected regret is at most 92 log^{3/2}(T) √T. This concludes the proof.

Ω(√T) Lower Bound with Full Feedback
We present a lower bound showing that the regret rate in Theorem 4.2 is optimal up to poly-logarithmic factors. The lower bound is based on the following stochastic sequence: at each time step t, the pair (s_t, b_t) is drawn uniformly at random among three pairs of valuations: (0, 1/4), (3/4, 1), and (3/4, 1/4). These three points naturally partition the [0, 1]² square into four regions (see Figure 2). Crucially, prices in the [3/4, 1] × [0, 1/3] region (green in Figure 2) incur a negative expected gain from trade, while prices in the [0, 3/4) × (1/3, 1] region (white in Figure 2) miss all trades. Therefore, the only reasonable option for any learner is to post prices in the two remaining regions (orange in Figure 2), each with an expected gain from trade of 1/12. This allows for a reduction to an expert problem with two available actions (one for each of the two orange regions). This construction highlights a key difficulty compared to lower bounds for per-round budget balanced algorithms: we need to disincentivize the learner from choosing non-budget-balanced prices below the diagonal. We have the following theorem, which is preceded by a preliminary lemma.
Lemma 4.3. Let S_n be the position after n steps of a symmetric random walk on the line starting from 0. Then, for n large enough, E|S_n| ≥ (2/3)√n.
Proof. It is well known that the expected distance of a random walk from the origin grows like Θ(√n). Formally, the following asymptotic result holds (see, e.g., Palacios [2008]): E|S_n| / √n → √(2/π). Observe that √(2/π) > 2/3, which proves the claim for n large enough. □
We can now prove the lower bound: any learning algorithm, even one only required to satisfy global budget balance, suffers regret Ω(√T) in the full feedback model.
Proof. We prove this result via Yao's principle [Yao, 1977]. We apply the easy direction of the theorem, which reads (using our terminology) as follows: the regret R_T(A) of a randomized learner A against the worst-case valuation sequence is at least the regret of the optimal deterministic learner against a stochastic sequence of valuations S, where the expectation is with respect to the stochastic valuation sequence S, and the deterministic learner posts the prices (p_t, q_t). In particular, we construct a randomized instance S such that any deterministic learning algorithm must suffer, in expectation with respect to the randomness of S, at least c√T regret for some constant c. The randomized instance is constructed as follows: at each time step t ∈ [T], the adversary selects uniformly and independently at random one of the three points (0, 1/4), (3/4, 1), and (3/4, 1/4). We first compute a lower bound on the expected gain from trade achieved by the best fixed price in hindsight, and then we provide an upper bound on the expected gain from trade attainable by any deterministic learning algorithm. Combining these two intermediate results yields the statement via Yao's principle. Let N_0 be a random variable denoting the number of times that (3/4, 1/4) is realized. Analogously, let N_1 (resp., N_2) be the number of times in which (0, 1/4) (resp., (3/4, 1)) is realized. Clearly, N_0 + N_1 + N_2 = T, and E[N_i] = T/3 for each i ∈ {0, 1, 2}.
Conditioning on N_0, the remaining T − N_0 valuations are either (0, 1/4) or (3/4, 1), sampled uniformly and independently at random. Then, Equation (8) follows by considering a symmetric random walk on the line over T − N_0 steps that moves left when (s_t, b_t) = (0, 1/4) and right when (s_t, b_t) = (3/4, 1). Taking the expectation (with respect to N_0) of the first and last terms of the previous chain of inequalities, where the last line follows from Markov's inequality, yields the desired lower bound on the benchmark. Now, we construct an upper bound on the gain from trade achievable by any deterministic learning algorithm (even without the constraint of enforcing global budget balance). Consider what happens at each fixed time step t: the history of the realized valuations up to that point deterministically induces the pair of prices (p_t, q_t) posted by the learning algorithm. We now prove that, no matter which (p_t, q_t) is chosen, the learner does not achieve more than an expected gain from trade of 1/12 per round.
Therefore, no matter what the learner does, its expected gain from trade is at most T/12. We conclude the proof of the theorem by combining Equation (9) and Equation (10), where the randomness is with respect to the sequence generated by the randomized adversary. This concludes the proof. □
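The Θ(√n) growth of the expected distance invoked in Lemma 4.3 and in Equation (8) is easy to verify numerically; a small Monte Carlo sketch (the limit constant √(2/π) is the one reported by Palacios [2008]):

```python
import random

def mean_abs_displacement(n_steps, n_trials, seed=0):
    """Monte Carlo estimate of E|S_n| for a symmetric +/-1 random walk.

    Asymptotically E|S_n| / sqrt(n) -> sqrt(2/pi) ~ 0.7979 > 2/3,
    which is the inequality the lemma relies on.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trials):
        pos = 0
        for _ in range(n_steps):
            pos += 1 if rng.random() < 0.5 else -1
        total += abs(pos)
    return total / n_trials
```

For n = 400, the estimate is close to √(2 · 400/π) ≈ 15.96, safely above (2/3)√400 ≈ 13.3.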

Partial Feedback
In this section, we study the more challenging partial feedback models. In Section 5.1, we provide a positive result for the case of one-bit feedback (z_t = I{s_t ≤ p_t} · I{q_t ≤ b_t}), where the learner only observes whether the trade happened or not. In particular, we show that GFT-Max, with a suitable initialization, achieves a regret of order Õ(T^{3/4}). Differently from the full-information setting, the design of a no-regret algorithm for the gain from trade (i.e., A_G) is particularly challenging, as we need to build an estimator for the gain from trade while only playing non-budget-balanced prices from the grid. In Section 5.2, we complement the regret upper bound by proving that every algorithm has regret at least Ω(T^{5/7}), even with two-bit feedback (z_t = (I{s_t ≤ p_t}, I{q_t ≤ b_t})), i.e., where each agent separately reveals their willingness to accept the posted prices. One of the main challenges posed by such a lower bound resides in handling non-budget-balanced prices, as any algorithm could temporarily sacrifice some profit while collecting large GFT.
5.1 Õ(T^{3/4}) Upper Bound with One-Bit Feedback
We show how to employ GFT-Max, with a suitable choice of the budget threshold and grid parameters, and of the regret minimizers A_P and A_G, to achieve the desired regret bound. Section 5.1.1 presents a regret-minimizing algorithm that can be employed as A_P, while Section 5.1.2 provides a suitable regret minimizer to be employed as A_G. Finally, in Section 5.1.3, we present the final regret upper bound.

Regret Minimizer for Profit under Partial Feedback
As in the full-information setting, we exploit Profit-Max to maximize the profit until the accrued budget is at least a given threshold M. In particular, we instantiate the subroutine Profit-Max with EXP3.P [Auer et al., 2002] as the regret minimizer A_P, run on the price grid. The following lemma shows that the gain from trade of any fixed price in the first phase is small enough up to the stopping time τ that terminates the first phase.
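For intuition, here is a sketch of the importance-weighting idea underlying EXP3-style bandit learners. This is the plain EXP3 update, omitting the extra confidence term that distinguishes EXP3.P [Auer et al., 2002] and yields its high-probability guarantees; the reward function and tuning below are illustrative:

```python
import math
import random

def exp3(n_actions, horizon, gamma, reward_fn, seed=0):
    """EXP3-style learner: only the played action's reward is observed.

    The importance-weighted estimate ghat = g / p(i) makes the update
    unbiased even though the other arms' rewards stay hidden; the gamma
    floor keeps every arm explored so the estimates stay bounded.
    """
    rng = random.Random(seed)
    weights = [1.0] * n_actions
    total = 0.0
    for t in range(horizon):
        z = sum(weights)
        probs = [(1 - gamma) * w / z + gamma / n_actions for w in weights]
        u, i, acc = rng.random(), 0, probs[0]
        while u > acc and i < n_actions - 1:  # sample an arm from probs
            i += 1
            acc += probs[i]
        g = reward_fn(t, i)      # bandit feedback: reward of arm i only
        total += g
        ghat = g / probs[i]      # importance-weighted estimate
        weights[i] *= math.exp(gamma * ghat / n_actions)
    return total
```

On a two-arm instance with rewards 0.8 and 0.2, the learner concentrates on the better arm up to the forced exploration floor.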
Lemma This concludes the proof. □

Regret Minimizer for Gain from Trade under Partial Feedback
A crucial ingredient we need is an estimation procedure capable of extracting quantitative information on the gain from trade while having access to only one bit of feedback. More precisely, we need an estimation procedure for the gain-from-trade function (p, q) ↦ GFT_t(p, q) over the grid. A similar challenge is faced in Azar et al. [2022], where the action set consists of a discretization of a single price (i.e., their estimation procedure posts the same price p to both seller and buyer). However, in our scenario, such symmetry no longer applies: our grid employs distinct prices for the seller and the buyer (q + 1/K and q, respectively). Thus, our estimation procedure GFT-Est has an asymmetric structure (see the pseudocode, in particular Lines 17 and 20). First, GFT-Est draws a sample from a Bernoulli distribution with parameter (qK + 1)/(K + 1) (Line 15). If the result is 1, it posts price q to the buyer, and the seller receives a price drawn uniformly at random from [0, q + 1/K] (Line 17). Otherwise, if the result is 0, GFT-Est posts price q + 1/K to the seller, and the buyer's price is drawn uniformly at random from [q, 1]. We denote the final estimate at time t by GFT̂_t(q + 1/K, q) (Line 20). Overall, our estimator has a small bias, as formalized in the following lemma.
We can thus conclude the proof by observing that: where the last inequality holds since  − + 1+ ≤ 2 for all  ∈ [−1, 1] and  < 1.

□
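To make the one-bit estimation idea concrete, here is a sketch of the symmetric single-price case of Azar et al. [2022] that the text contrasts GFT-Est with. It rests on the identity (b − s) · I{s ≤ p ≤ b} = ∫₀¹ I{s ≤ min(u, p)} · I{max(u, p) ≤ b} du, so posting the budget balanced pair (min(U, p), max(U, p)), with U uniform in [0, 1], makes the one-bit trade indicator an unbiased estimate of GFT(p). GFT-Est is the asymmetric analogue for the pair (q + 1/K, q), which is why it only achieves small bias rather than exact unbiasedness:

```python
import random

def gft_est_single_price(p, s, b, rng):
    """One-bit, unbiased GFT estimator for a single price p (sketch).

    Draw U ~ Unif[0,1] and post seller price min(U,p), buyer price
    max(U,p): the trade bit I{s <= min(U,p)} * I{max(U,p) <= b} has
    expectation exactly (b - s) * I{s <= p <= b}.
    """
    u = rng.random()
    seller_price, buyer_price = min(u, p), max(u, p)
    trade = (s <= seller_price) and (buyer_price <= b)  # the one bit observed
    return 1.0 if trade else 0.0
```

Averaging the returned bits over many rounds with fixed (s, b) = (0.2, 0.8) and p = 0.5 recovers GFT = 0.6.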
Given the estimation procedure GFT-Est, it is possible to turn any no-regret algorithm for the full-feedback setting into a regret minimizer for the partial-feedback setting via the standard block-decomposition technique (see, e.g., Chapter 4 of Nisan et al. [2007]). The procedure, which we call Block-Decomposition, is described in the pseudocode. We assume that Hedge is employed as the full-feedback regret minimizer A.
Block-Decomposition works by subdividing the time horizon T into L contiguous blocks B_1, ..., B_L of equal size, so that each block covers T/L consecutive rounds. In each block, we select K time steps uniformly at random (one for each pair of prices in the grid), and we randomly assign each of these time steps to one pair of prices. Formally, for each block j, we have a one-to-one map h_j, chosen uniformly at random, from the pairs in the grid to rounds in block B_j. We call the image of h_j the exploration rounds, and we denote the set of such rounds by S_j.
For any block j, the algorithm builds a vector r̂_j such that the entry r̂_j(p, q) is an estimate of the reward of the pair (p, q) in block B_j. To do so, for any block j and pair of prices (p, q) in the grid, we let r̂_j(p, q) = GFT̂_t(p, q), where t = h_j(p, q) and GFT̂_t(p, q) is computed through the estimation procedure GFT-Est with prices (p, q) (Lines 10 and 11). For any block j, the exploration rounds in S_j are used to build r̂_j. In all the other rounds in B_j \ S_j, the algorithm plays according to the strategy x_j over the grid (Line 8) computed by A at the beginning of block j (Line 5). At the end of each block j, the full-information subroutine A is updated using r̂_j (Line 13).
Let GFT̄_j(p, q) = Σ_{t ∈ B_j} GFT_t(p, q) / |B_j| be the average GFT over block B_j. Since we choose the exploration rounds uniformly at random within block B_j, for any pair (p, q) in the grid the estimate r̂_j(p, q) is, in expectation, close to GFT̄_j(p, q), where the last equality follows from Lemma 5.2. This yields the following guarantee on the regret of Block-Decomposition, where x_t denotes the distribution over the grid employed to sample (p_t, q_t) at time t ∈ B_j.
Proof. Let R^H_{L,K} be the regret accumulated by Hedge over L rounds when it observes utilities in [0, 1] and plays over K actions. Each exploration round can cost at most 1 with respect to playing according to x_j, and there are LK such rounds in total. It is known that R^H_{L,K} ≤ 4√(L log K) (see, e.g., Slivkins [2019]). Then, by setting K = T^{1/4} and L = T^{1/2}, we obtain the claimed Õ(T^{3/4}) bound, where the last inequality holds for all T ≥ 2. This concludes the proof. □
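The block structure can be sketched as follows (generic reward oracles stand in for the bilateral-trade feedback and for GFT-Est; the grid, the map h_j, and the tuning are simplified placeholders, not the paper's pseudocode):

```python
import math
import random

def block_decomposition(n_pairs, n_blocks, block_len, reward, estimate, seed=0):
    """Full-feedback Hedge driven by per-block one-sample estimates.

    Each block reserves one uniformly random exploration round per pair
    to build the estimated reward vector r_hat; every other round plays
    Hedge's current distribution, and Hedge is updated once per block.
    """
    rng = random.Random(seed)
    assert block_len >= n_pairs, "need one exploration round per pair"
    eta = math.sqrt(math.log(n_pairs) / n_blocks)
    weights = [1.0] * n_pairs
    total = 0.0
    for _ in range(n_blocks):
        z = sum(weights)
        probs = [w / z for w in weights]
        explore = rng.sample(range(block_len), n_pairs)
        schedule = dict(zip(explore, range(n_pairs)))  # round -> explored pair
        r_hat = [0.0] * n_pairs
        for t in range(block_len):
            if t in schedule:
                i = schedule[t]
                r_hat[i] = estimate(i, rng)  # one-sample estimate for pair i
            else:
                u, i, acc = rng.random(), 0, probs[0]
                while u > acc and i < n_pairs - 1:  # sample from Hedge's mix
                    i += 1
                    acc += probs[i]
            total += reward(i, rng)
        for i in range(n_pairs):  # full-information update, once per block
            weights[i] *= math.exp(eta * r_hat[i])
    return total
```

With three candidate pairs of fixed mean rewards, the loop concentrates on the best pair while paying only K exploration rounds per block.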

Putting Everything Together
GFT-Max, with the two regret minimizers described in Sections 5.1.1 and 5.1.2, guarantees an Õ(T^{3/4}) bound on the regret.

Theorem 5.4. Consider the repeated bilateral trade problem in the one-bit feedback model. There exists a learning algorithm A that respects global budget balance and whose regret with respect to the best fixed price in hindsight satisfies:
R_T(A) ≤ 1282 · T^{3/4} log² T.
Proof. The proof follows the same structure as that of Theorem 4.2. In this case, we set the budget threshold to T^{3/4} and the grid parameter to K = T^{1/4}, and consider GFT-Max with EXP3.P [Auer et al., 2002] as A_P and Block-Decomposition as A_G. In the resulting chain of inequalities, the second inequality follows from Lemma 5.1 and the third one from Lemma 5.3. Substituting these parameter choices, and rearranging (with probability at least 1 − 1/T, and using the fact that the gain from trade is always non-negative to control the complementary event), we obtain that the expected regret is at most 1282 · T^{3/4} log² T. This concludes the proof.
□
5.2 Ω(T^{5/7}) Lower Bound with Two-Bit Feedback
In this section, we provide a lower bound for learning the best price against any oblivious adversary, under global budget balance constraints and two-bit feedback. Our construction builds upon the one by Cesa-Bianchi et al. [2023], but exhibits two key differences. First, we are not constrained to use smooth value distributions. This allows us to simplify the construction, avoiding the reduction to online learning with feedback graphs. Second, we only require algorithms to be globally budget balanced (instead of per-round weakly budget balanced); looser budget balance constraints enhance the capabilities of the learning algorithm. All in all, we derive a lower bound, T^{5/7} ≈ T^{0.714}, that is slightly looser than their Ω(T^{3/4}). We further elaborate on this comparison at the end of the section.
Theorem 5.5. Consider the problem of repeated bilateral trade in the two-bit feedback model. Any learning algorithm that satisfies global budget balance suffers regret at least Ω(T^{5/7}).
The rest of the section is devoted to the proof of Theorem 5.5; for the missing details, we refer to Appendix A.2. Our lower bound construction is based on M stochastic sequences of valuations. Each of these sequences is sampled in an i.i.d. way from a distribution of valuations with two key properties: (i) the distributions are close to one another with respect to a suitable statistical distance (see Lemma 5.11), and (ii) any pair of prices that reveals information on the underlying instance is highly suboptimal in terms of GFT (i.e., gathering information is "costly", see Lemma 5.8). We proceed in five steps.
i) Building a set of hard instances. We start by introducing a set of M = M(T), to be specified later, hard instances of the bilateral trade problem. Our goal is to show that any learning algorithm has regret at least Ω(T^{5/7}) in at least one of the M instances. We define a distribution μ_j ∈ Δ([0, 1]²) of valuations (s, b) over [0, 1]² for each j ∈ {0, ..., M − 1}, where we have M − 1 "perturbed" distributions corresponding to indices j ∈ {1, ..., M − 1}, and a "base" distribution corresponding to j = 0. Let ℓ = 1/12, Δ = ℓ/(M − 1), and ε = Δ/2. Then, for any instance j ∈ {0, ..., M − 1}, the distributions μ_j are supported on the same set V of finitely many valuations. We describe the set V by partitioning it into six different sets; an illustration of the valuation set can be found in Figure 3a. First, we define the two sets V_1 and V_2 (respectively red and blue in Figure 3a). These valuations are "balanced out" by the M valuations in V_3 (green in Figure 3a): V_3 = {(0, (1 − ℓ)/2 − ε + iΔ) : i = 0, ..., M − 1}. Moreover, we have a set V_4 of "deficit-generating" valuations (brown in Figure 3a), and a single valuation belonging to V_5 (orange in Figure 3a). We conclude by defining the set V_6 (purple in Figure 3a) of the four "extremal" valuations (in practice, they are needed for Lemma 5.11 to hold). We assign different probabilities to the valuations in each set V_i depending on the instance: every instance j ∈ {1, ..., M − 1} shares the probabilities of the base distribution, except that the probability of a small, j-dependent subset of valuations is perturbed by ε. Figure 3: Figure 3a represents the valuation support V of the instance distributions μ_j, while Figure 3b represents the value of posting the same price to the seller and the buyer in instance μ_j.
ii) Analysis of the gain from trade. As a first step, we argue that, without loss of generality, we can focus on algorithms that only play actions in the grid generated by the valuations in V. Consider in fact any instance j ∈ {1, ..., M − 1} and any algorithm A. Similarly to the proof of Proposition A.3 (more specifically, Claim A.4 therein), one can easily prove that there exists an equivalent algorithm A′ (in terms of feedback, GFT, and profit) whose distribution over prices is supported on the grid generated by the valuations in V.
Lemma 5.6. For any instance j ∈ {1, ..., M − 1}, the GFT of the optimal fixed budget balanced price is attained on the diagonal. The previous lemma characterizes the optimal fixed budget balanced price. Then, we show that all strategies that are not budget balanced are dominated. Indeed, one of the main challenges of our reduction is that, in general, a globally budget balanced algorithm could obtain a larger GFT by temporarily sacrificing some profit and posting prices (p, q) with q < p. In the following lemma, we show that our instances are built in such a way that these strategies are dominated and can thus be discarded. Intuitively, every pair of prices (p, q) that tries to gain a higher GFT than the one obtained by playing on the diagonal must also win the trades in V_4. Then, since trades in V_4 have negative GFT and happen with sufficiently high probability, posting prices with q < p is dominated.

Intuitively, the previous lemma shows that exploring is costly. Indeed, as we show in the following paragraph, the algorithm must post a price of at least (1 + ℓ)/2 to gain information on the instance, i.e., on the index j that determines it.
iii) Analysis of the feedback. In the two-bit feedback model, for a valuation (s, b), posting prices (p, q) generates the feedback (I{s ≤ p}, I{q ≤ b}). Now, we show that, for any instance μ_j and any posted prices (p, q), the distribution of the feedback is independent of the instance almost everywhere. Specifically, the feedback distribution depends on the instance j only within a "small" and instance-dependent region ℱ_j of prices, defined for every instance j ∈ {1, ..., M − 1}. It is a simple exercise to see that, for each pair of prices outside the sets ℱ_j, the feedback received by the learner is independent of the specific instance generating the valuations (see [Cesa-Bianchi et al., 2023, Claim 2] for a similar result).
Lemma 5.9. For all (p, q) ∈ [0, 1]² \ ∪_{j′ ∈ {1, ..., M−1}} ℱ_{j′}, the feedback distribution is the same under every instance.
iv) Price regions. The properties uncovered so far naturally partition the square [0, 1]² into the following three regions: • Exploration regions. We have the M − 1 regions ℱ_j. These are the regions in which the probability of observing a certain two-bit feedback depends on the instance μ_j from which the valuations are sampled.
• Exploitation regions. We define the regions ℰ_j for any j ∈ {1, ..., M − 1}. All these regions are such that the GFT collected by posting (p, q) ∈ ℰ_j is close to (and smaller than or equal to) the optimal GFT, i.e., the one obtained by posting (p*_j, p*_j + ε). • Dominated regions. We define 𝒟 as the remaining set of prices. It is easy to verify that, by posting (p, q) ∈ 𝒟, one obtains a GFT of at most g_1, the suboptimal level of Lemma 5.7.
Figure 4 shows the partition of the square [0, 1]² into exploration, exploitation, and dominated regions, depicted in red, orange, and green, respectively. Next, we define the random variables counting the number of times an algorithm plays in the exploration, exploitation, and dominated regions, respectively. Then, we can upper bound the gain from trade of an algorithm A by considering only the number of times A plays in each region. In particular, in any instance j: • Cost of exploration: the GFT collected by posting prices in ℱ_i is at most g_2 for all i (Lemma 5.8); • Exploitation: the GFT collected by posting prices in ℰ_i is at most g_1 + ε · I{i = j} (Lemma 5.7); • Cost of domination: the GFT collected by posting prices in 𝒟 is at most g_1 (Lemma 5.7).
Formally, these observations lead to the following upper bound.
v) Relating the algorithm's behavior on different instances. We now relate the expected number of exploitation rounds in the different instances j. This difference depends on the probability measures P_j and P_0 through Pinsker's inequality, applied to a suitably defined multinomial random variable that encodes the four possible feedbacks observed when playing in the exploration regions ℱ_j.
Then, combining all the previous results leads to the following lemma, which gives a regret lower bound in terms of T, M, and ε.
By using Lemma 5.12, we can readily conclude the proof of Theorem 5.5 as follows. Let ε = T^{−a} and M = T^{b}, with a, b > 0; we now simply have to optimize over the choice of the parameters a and b. In doing so, we need to take into account the additional constraints necessary for the instance distributions μ_j to be well defined (Equations (12)–(14)). These constraints relate ε to the probabilities assigned to V_1 and V_4; since the deficit-generating probability must be of the same order as that of V_1, they imply ε ≤ 1/M², that is, T^{−a} ≤ T^{−2b}, which yields a ≥ 2b. Note that this dominates the constraint ε < 1/M (equivalently, a ≥ b) that would have been implied by Equation (13) alone.
Connection with the Ω(T^{3/4}) lower bound of Cesa-Bianchi et al. [2023]. While our result and the one of Cesa-Bianchi et al. [2023] build on similar constructions (at least conceptually), we obtain a weaker lower bound. The main reason is that the learner in Cesa-Bianchi et al. [2023] is weakly budget balanced, while in our work the learner only has a global budget balance constraint. To preclude this option to the learner, we penalize the GFT of prices in the lower triangle by adding the set of valuations V_4. If the probability of V_4 is large enough with respect to that of V_1, then posting prices in the lower triangle is dominated; in particular, we must choose this probability to be of the same order as that of V_1, as we prove in Lemma 5.7. Once the lower triangle is dominated, we can conceptually reduce our problem to the one of Cesa-Bianchi et al. [2023]. However, the choice above imposes the additional constraint a ≥ 2b, which is not needed in the original construction. Hence, they can set a = b = 1/4 and get a bound of Ω(T^{3/4}). This difference is depicted in Figure 5.

Best Feasible Distribution of Prices
In this section, we analyse the regret with respect to the best fixed distribution over prices that satisfies global budget balance on average. First, we present a negative result that cleanly separates this new benchmark from the best fixed price in hindsight: in Theorem 6.2, we prove that it is impossible to achieve sublinear (1 + δ)-regret, for a small enough constant δ > 0, with respect to the best feasible distribution, even in the full feedback setting. On the positive side, we show that the two benchmarks are only a multiplicative factor of 2 apart (Theorem 6.3). This implies that any learning algorithm that exhibits sublinear regret with respect to the best fixed price in hindsight automatically achieves sublinear 2-regret with respect to the best feasible distribution. Finally, we complement this positive result by proving that the multiplicative gap of 2 is tight (Theorem 6.5).

Linear Lower Bound
The best feasible distribution has a crucial advantage with respect to any budget balanced learner: it has the possibility to "run some deficit" in a preliminary phase of the sequence, as it knows it will be possible to extract enough profit to ensure global budget balance in later stages. For instance, consider a sequence where (s_t, b_t) is either (0, 1/3) or (2/3, 1) for all t ≤ T/2. Any learning algorithm has to enforce budget balance at time T/2 (to guard against the possibility that (s_t, b_t) = (0, 0) for all future t), while the randomized benchmark, which knows the future, may run a deficit and collect more gain from trade by posting the budget-unbalanced prices (2/3, 1/3) with some probability. Inspired by this example, we state the following lemma. Lemma 6.1. For any algorithm A that enforces global budget balance, there exists a deterministic sequence of valuations S_1 with the following properties: (i) the expected gain from trade of A is at most T/9; (ii) the valuations (s_t, b_t) are either (0, 1/3) or (2/3, 1) for all t ≤ T/2; (iii) the valuations (s_t, b_t) are equal to (0, 0) for all t > T/2.
Proof. Consider the following randomized instance: (s_t, b_t) = (0, 0) for t > T/2, while in the first T/2 time steps the valuations are either (0, 1/3) or (2/3, 1), independently and uniformly at random. Each realization of this randomized sequence clearly satisfies requirements (ii) and (iii). Finally, we show that, in expectation (with respect to the randomization of the algorithm and of the instance), the total gain from trade of A is at most T/9; the existence of an instance S_1 with the desired properties then follows by an averaging argument. Focus on the first T/2 time steps, and let N_1 be the random variable counting the time steps in which A posts a price of at least 2/3 to the seller and at most 1/3 to the buyer. Moreover, let N_2 = T/2 − N_1. By assumption, algorithm A is globally budget balanced, which implies N_1/3 ≤ N_2/6. The first term of the inequality follows from the fact that, every time the learner posts such unbalanced prices, it loses at least 1/3 in revenue. The second term follows from the fact that, by posting any other pair of prices, the learner can extract an expected revenue of at most 1/6 per round (i.e., the trade happens with probability 1/2, and the learner collects at most 1/3 in revenue). On the other hand, N_1 is directly proportional to the final gain from trade of A, and the constraint implies N_1 ≤ T/6; thus the best possible gain from trade is achieved for N_1 = T/6 and N_2 = T/3, which yields an expected gain from trade of at most T/9. This concludes the proof of the claim.

□
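The optimization over N_1 at the end of the proof is a one-line linear program; a quick brute-force sketch (taking the per-round bounds from the proof, GFT 1/3 and deficit 1/3 for an unbalanced round, GFT and profit at most 1/6 for a balanced one, as exact):

```python
def best_gft_split(T):
    """Maximize N1/3 + N2/6 over N1 + N2 = T/2 subject to N1/3 <= N2/6.

    N1 counts the unbalanced rounds (GFT 1/3, deficit 1/3 each) and
    N2 = T/2 - N1 the balanced ones (GFT and profit at most 1/6 each);
    the constraint encodes global budget balance, since the rounds
    after T/2 generate no profit at all.
    """
    half = T // 2
    best_val, best_n1 = -1.0, None
    for n1 in range(half + 1):
        n2 = half - n1
        if n1 / 3 <= n2 / 6:  # deficit covered by accrued profit
            val = n1 / 3 + n2 / 6
            if val > best_val:
                best_val, best_n1 = val, n1
    return best_n1, best_val
```

For T = 3600, the optimum is N_1 = T/6 = 600 with value T/9 = 400, matching the lemma.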
The lemma is crucial in proving the impossibility result in the following Theorem, which holds even under full feedback.
where we used that the prices (2/3, 1/3) are posted with probability 3/7 and always induce a negative profit of 1/3, and that, by construction, there are at least 3T/4 time steps where (s_t, b_t) = (ŝ, b̂). To conclude the proof, we analyze in a similar way the total gain from trade achieved by A. All in all, we have constructed an instance S_2 where A exhibits an expected gain from trade of at most 5T/18, while OPT is at least 2T/7. Since (1 + δ) · 5/18 < 2/7 for every δ < 1/35, A suffers linear (1 + δ)-regret. □

Comparison of the Two Benchmarks
Surprisingly, the performance of the optimal fixed price is not far from that of the optimal globally budget balanced distribution: the total gain from trade of the best feasible distribution σ* is at most 2 · GFT_T(p*).
Proof. Fix any sequence of valuations, and let p* be as in the statement. By standard analytic arguments, it is possible to show that there exists an optimal feasible distribution σ* whose support consists of either one or two points (we refer to Proposition A.3 in Appendix A.1 for a formal proof). We prove the result using this σ* and considering two separate cases, according to the cardinality of the support of σ*. If the support of σ* consists of only one point (p, q), then, since σ* respects budget feasibility, it is safe to assume without loss of generality that this point lies above the diagonal, i.e., p ≤ q, and that the gain from trade achieved by σ* is at most that provided by p*. In the second case, the support of σ* consists of two different points (p_1, q_1) and (p_2, q_2). If both pairs of prices lie in the upper-left triangle (i.e., p_1 ≤ q_1 and p_2 ≤ q_2), then the total gain from trade is at most that of p*, by maximality of p*. If one of the two pairs of prices is strongly budget balanced, say p_1 = q_1, then the only possibility (by the budget balance condition) is that the other pair never incurs a negative profit, so that its gain from trade is once again at most that of p*. All in all, the only meaningful case to study is when p_1 < q_1 and p_2 > q_2. Consider then this case, and let T_0 be the set of time steps in which the trade is lost by (p_2, q_2). In all the other time steps t ∈ [T] \ T_0, the prices (p_2, q_2) make the trade happen. We further partition these time steps as follows:
The sets T_0, ..., T_3 partition the time horizon. Now, for each of these subsets of time steps T_i, it is possible to define two functions G_i and P_i over [0, 1]²: G_i measures the gain from trade and P_i the profit accumulated over the time steps in T_i. We adopt the usual convention of omitting the second argument when it coincides with the first. Clearly, the sum of the G_i yields the total GFT, while the sum of the P_i yields the total profit. We relate the values of the functions G_0, G_1, G_2 at (p_2, q_2) with the total gain from trade it collects. The trades in T_0 are lost by (p_2, q_2), so it holds that G_0(p_2, q_2) = 0 and P_0(p_2, q_2) = 0. For T_1 and T_2, simple bounds hold. We then move our attention to T_3, where a more sophisticated argument is needed. As a preliminary step, we prove that the profit extracted by (p_1, q_1) is at most the optimal gain from trade. Let γ_1 (respectively, γ_2) be the probability with which σ* draws (p_1, q_1) (respectively, (p_2, q_2)); we then have a chain of inequalities in which the first follows by the definition of T_3, the second by the fact that the only negative profit incurred by posting (p_2, q_2) comes from T_3, the third by the global budget balance of σ*, and the last one by Equation (16). We finally have all the ingredients to conclude the proof: the total gain from trade of σ* is at most 2 · GFT_T(p*) (by Equations (15) and (17)), where the last inequality follows by optimality of p* with respect to the budget balanced prices (p_1, q_1) and using that γ_1 + γ_2 = 1.

□
As a corollary, we have that any algorithm achieving sublinear regret with respect to the best fixed price also guarantees sublinear 2-regret with respect to the best feasible distribution over prices.
Corollary 6.4. Let A be a learning algorithm for the repeated bilateral trade problem which guarantees an upper bound of f(T) on the regret with respect to the best fixed price in hindsight. Then, the 2-regret of A with respect to the best budget feasible distribution over prices is at most f(T).
Surprisingly, the factor 2 between the two benchmarks is optimal (Theorem 6.5). This implies that the analysis of the performance of the algorithms in Corollary 6.4 is essentially tight.
Proof. Fix the time horizon T, and let ε > 0 be a small number to be set later. Consider the sequence where (s_t, b_t) = (0, 1/2 − ε) if t is odd, and (s_t, b_t) = (1/2 + ε, 1) otherwise. Any fixed price can make at most half of the trades happen, with a total gain from trade of at most (T/2)(1/2 − ε).
Consider now the distribution over prices σ selecting (p_1, q_1) = (1/2 + ε, 1/2 − ε) with probability γ = (1 − 2ε)/(1 + 6ε), and (p_2, q_2) = (0, 1/2 − ε) otherwise. We conclude the proof by arguing that σ satisfies the budget balance constraint and attains a total gain from trade that is roughly twice that of the best fixed price p*. First, we show that σ is globally budget balanced: the deficit of 2ε per round incurred by (p_1, q_1) is exactly compensated by the profit of (p_2, q_2), where in the last equality we use the definition of γ. We then move our attention to the gain from trade: since (p_1, q_1) makes every trade happen while (p_2, q_2) only captures the odd rounds, the total gain from trade of σ is T(1/2 − ε)(γ + (1 − γ)/2) = (1 + γ) · GFT_T(p*).
Since 1 + γ = (2 + 4ε)/(1 + 6ε) ≥ 2 − 8ε, letting ε → 0 shows that the factor 2 cannot be improved, which yields the desired result.
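The constants in this construction can be checked numerically; a small sketch computing the closed-form expectations (no simulation; the quantities follow the accounting in the proof above):

```python
def benchmark_gap(T, eps):
    """Closed-form check of the factor-2 construction.

    Valuations alternate between (0, 1/2 - eps) and (1/2 + eps, 1).
    The unbalanced pair (1/2 + eps, 1/2 - eps) trades every round with
    GFT 1/2 - eps and profit -2*eps; the balanced pair (0, 1/2 - eps)
    trades only in the odd rounds, with GFT and profit 1/2 - eps.
    """
    gamma = (1 - 2 * eps) / (1 + 6 * eps)
    best_fixed = (T / 2) * (0.5 - eps)  # best fixed price in hindsight
    dist_gft = T * (gamma * (0.5 - eps) + (1 - gamma) * 0.5 * (0.5 - eps))
    dist_profit = T * (gamma * (-2 * eps) + (1 - gamma) * 0.5 * (0.5 - eps))
    return best_fixed, dist_gft, dist_profit
```

With ε = 0.001, the distribution's expected profit is 0 (globally budget balanced at equality) and its GFT is about 1.992 times that of the best fixed price, approaching 2 as ε → 0.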

□

7 Final Remarks and Open Problems
In this paper, we introduce the notion of global budget balance in the repeated bilateral trade problem. With this notion, we show for the first time that it is possible to achieve sublinear regret with respect to the best fixed price in hindsight, without relying on any additional assumption. In the full feedback model, we prove that the minimax regret rate of the learning problem is Θ(√T) up to poly-logarithmic factors, while in the partial feedback models we provide an upper bound on the regret of order Õ(T^{3/4}), complemented by an Ω(T^{5/7}) lower bound. Our regret bounds prove a clear separation between the two feedback models, but leave an open gap between the T^{5/7} and T^{3/4} rates under partial feedback. Inspired by Bandits with Knapsacks, we formulated a new benchmark: the best feasible distribution over prices. Against this harder benchmark, we prove that it is possible to achieve sublinear 2-regret, while no algorithm can achieve sublinear (1 + δ_0)-regret. We leave as an open question the characterization of the optimal competitive ratio α ∈ [1 + δ_0, 2] obtainable against this benchmark.

As a first step, we show that the support of σ can be restricted to a discrete grid V. To simplify the exposition, we sort the sets of valuations {0, 1, s_1, ..., s_T} and {0, 1, b_1, ..., b_T} in increasing order. Formally, we define the set {s̄_0 = 0, s̄_1, ..., s̄_T, s̄_{T+1} = 1}, where s̄_i ≤ s̄_{i+1} for each i, and {s̄_i}_{i=0}^{T+1} = {s_t}_{t=1}^{T} ∪ {0, 1}. Similarly, we define the set {b̄_0 = 0, b̄_1, ..., b̄_T, b̄_{T+1} = 1}, where b̄_i ≤ b̄_{i+1} for each i, and {b̄_i}_{i=0}^{T+1} = {b_t}_{t=1}^{T} ∪ {0, 1}.³ The grid V contains all the points of the form (s̄_i, b̄_j) with i, j ∈ {0, 1, ..., T + 1}.

A.1 Well Posedness of the Two Benchmarks
³For the sake of clarity, we assume that s_t ≠ s_{t'} and s_t, s_{t'} ∉ {0, 1}, and that b_t ≠ b_{t'} and b_t, b_{t'} ∉ {0, 1}, for all distinct t, t' ∈ [T]. It is easy to extend our results to the general setting.
Proof. Since the weights α_j(v) are positive for all v ∈ S_1 ∪ S_2 ∪ S_3 ∪ S_4 ∪ S_5 and all j ∈ {0, ..., K − 1}, we only need to prove that α_j(v) is positive for all v ∈ S_6. Using the upper bound on ε_tot (which is at most 1/32), we obtain the desired positivity; this, together with the fact that ε_6 ≤ 1/4, proves that all the probabilities in the instances' distributions α_j are well defined. Thus, for all j we have E_j[GFT(p_1, q_1, s, b) − GFT(p_1, q_1', s, b)] > 0. The proof is concluded by noting that, for any (p, q) ∈ Γ ∩ {(p, q) ∈ [0, 1]² | q < p} in the lower triangle, the gain from trade is upper bounded by that of some (p_1, q_1'), where the first inequality follows by Lemma 5.6 while the last inequality holds for any T ≥ 2. We then divide the analysis into two cases. Intuitively, the first corresponds to the case in which the algorithm does not explore enough (i.e., E_0[N_T] is small) and therefore cannot correctly identify the instance; in the second, the algorithm spends a large amount of time exploring. □

Proposition A.1. For any sequence of valuations S there exists a price p* ∈ [0, 1] such that:

R_T(A, S) = Σ_{t=1}^{T} GFT_t(p*) − E[Σ_{t=1}^{T} GFT_t(p_t, q_t)].

Proof. Denote the cumulative gain from trade of a pair of prices (p, q) as follows:

Λ(p, q) = Σ_{t=1}^{T} GFT_t(p, q).    (18)

This function is upper semi-continuous on the upper-left triangle {(p, q) ∈ [0, 1]² | p ≤ q} (see Claim A.2), thus the sup in the definition is indeed a max. For the remaining part of the statement, let (p̂, q̂) be any pair of prices in the arg max of Equation (1). It is easy to see that any price p* ∈ [p̂, q̂] achieves the same total gain from trade, while trivially respecting the budget balance constraint. □

Claim A.2. The function Λ defined in Eq. (18) is upper semi-continuous on {(p, q) ∈ [0, 1]² | p ≤ q}.

Proof of Claim A.2. The function Λ is the sum of T terms of the following form: GFT_t(p, q) = I{s_t ≤ p} I{q ≤ b_t} (b_t − s_t). Moreover, for pairs in {(p, q) ∈ [0, 1]² | p ≤ q}, the gain from trade is non-zero only for steps t ∈ [T] such that s_t ≤ b_t. This implies that Λ is the sum of at most T step functions that are upper semi-continuous. □

Proposition A.3. The definition of the best fixed distribution is well posed. Moreover, there always exists a feasible distribution γ* with support of size at most two that attains the sup.

Proof. Fix any sequence of valuations {(s_t, b_t)}_{t=1}^{T}; we introduce two auxiliary functions: G(p, q) = Σ_{t=1}^{T} GFT_t(p, q) and P(p, q) = Σ_{t=1}^{T} Profit_t(p, q). We can rewrite the program in Definition 2
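Proposition A.1 suggests a simple finite search: since Λ only changes at valuation points, the best fixed price can be found by brute force over the candidate prices induced by the sequence. A minimal sketch (function names are ours):

```python
def gft_of_price(valuations, p):
    # posting the same price p to both sides: trade iff s_t <= p <= b_t
    return sum((b - s) for (s, b) in valuations if s <= p <= b)

def best_fixed_price(valuations):
    # the candidate prices are the valuations themselves plus the endpoints;
    # gft_of_price is piecewise constant between consecutive candidates
    candidates = sorted({0.0, 1.0}
                        | {s for s, _ in valuations}
                        | {b for _, b in valuations})
    return max(candidates, key=lambda p: gft_of_price(valuations, p))

p_star = best_fixed_price([(0.2, 0.8), (0.3, 0.9)])
```

Here any price in [0.3, 0.8] makes both trades happen, matching the observation in the proof that every price in [p̂, q̂] attains the same total gain from trade.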

Learning Protocol of Repeated Bilateral Trade

[Protocol box: at each time step t, a new seller and buyer with private valuations (s_t, b_t) arrive, the learner posts prices (p_t, q_t), and the budget of the learner is updated as B_t ← B_{t−1} + Profit_t(p_t, q_t).]
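The interaction protocol can be summarized by a minimal simulation loop; this is a sketch under our reading of the protocol, with `pricing_rule` a stand-in for any learning algorithm and the valuations purely illustrative.

```python
def run_protocol(valuations, pricing_rule):
    budget = 0.0
    for t, (s, b) in enumerate(valuations, start=1):
        p, q = pricing_rule(t)            # learner posts prices (p_t, q_t)
        traded = (s <= p) and (q <= b)    # both agents accept their price
        if traded:
            budget += q - p               # B_t <- B_{t-1} + Profit_t(p_t, q_t)
    return budget

# example: a constant pricing rule on a two-step sequence
budget = run_protocol([(0.2, 0.6), (0.4, 0.9)], lambda t: (0.3, 0.5))
```

Global budget balance requires `budget >= 0` at the end of the horizon, rather than `q - p >= 0` at every single step.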
(Line 8) play (p, q) such that h_t(p, q) = 1; otherwise (Line 11) play prices drawn from x_j, the distribution computed by Hedge for block j. The following lemma states precisely the guarantees of Block-Decomposition when its two parameters are set to T^{1/4} and T^{1/2}, respectively.

Theorem 6.2. Fix any constant c ∈ [1, 36/35) and any globally budget balanced learning algorithm A with full feedback. Then there exists a sequence of valuations such that the c-regret of A is linear in T.

Proof. Fix any c ∈ [1, 36/35) and any learning algorithm A. Starting from the sequence S_1 of Lemma 6.1, construct a second sequence of valuations S_2 that coincides with S_1 in the first half of the time horizon. In the second half we set (s_t, b_t) = (ŝ, b̂) for all t > T/2, where (ŝ, b̂) is the most frequent valuation pair in the first half of S_1. We compare the total gain from trade collected by A on S_2 with that of the best feasible distribution of prices over S_2, whose gain from trade we denote by OPT. The expected gain from trade of A on S_2 is at most T/9 in the first half (Claim 6.1) and T/6 in the second half (as it can extract at most 1/3 gain from trade in each of the T/2 time steps). Therefore, A extracts at most 5T/18 expected gain from trade in total. On the other hand, the best feasible distribution γ* must perform at least as well as the feasible distribution γ under which the prices are (ŝ, b̂) with probability 4/7 and (2/3, 1/3) with the remaining probability. First, we argue that the distribution γ is indeed budget-feasible.

Denote by p*, resp. γ*, the best fixed price, resp. the best feasible distribution. Then, for any sequence of valuations: E_{(p,q)∼γ*}[Σ_{t=1}^{T} GFT_t(p, q)] ≤ 2 Σ_{t=1}^{T} GFT_t(p*).

In the second case, the algorithm spends a large amount of time exploring (i.e., E_0[N_T] is large) and thereby accumulates large regret (by Lemma 5.8). Then, there must exist at least one instance j ∈ {1, ..., K − 1} in which N_{j,T} = Ω(N), and hence in the base instance the regret is at least Ω(εN²).
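The factor-2 relation above (the best feasible distribution collects at most twice the gain from trade of the best fixed price) can be sanity-checked numerically. The sketch below brute-forces over budget-feasible distributions supported on at most two grid points, placing the optimal mixing weight either at the boundary or where the profit constraint is tight; this is an illustration on a random instance, not a proof, and all names are ours.

```python
import itertools
import random

def totals(valuations, p, q):
    # total gain from trade and total profit of the fixed price pair (p, q)
    g = sum(b - s for (s, b) in valuations if s <= p and q <= b)
    pr = sum(q - p for (s, b) in valuations if s <= p and q <= b)
    return g, pr

def best_fixed_price_gft(valuations):
    # best single price posted to both sides: trade iff s_t <= p <= b_t
    cand = {0.0, 1.0} | {v for pair in valuations for v in pair}
    return max(sum(b - s for (s, b) in valuations if s <= p <= b) for p in cand)

def best_support2_gft(valuations, grid):
    # best budget-feasible mixture of at most two price pairs from the grid
    best = 0.0
    stats = {x: totals(valuations, *x) for x in grid}
    for x, y in itertools.combinations_with_replacement(grid, 2):
        (gx, px), (gy, py) = stats[x], stats[y]
        weights = [0.0, 1.0]
        if px != py:
            weights.append(-py / (px - py))  # weight making the profit constraint tight
        for w in weights:
            if 0.0 <= w <= 1.0 and w * px + (1 - w) * py >= -1e-12:
                best = max(best, w * gx + (1 - w) * gy)
    return best

random.seed(0)
vals = [tuple(sorted((random.random(), random.random()))) for _ in range(5)]
s_pts = sorted({0.0, 1.0} | {s for s, _ in vals})
b_pts = sorted({0.0, 1.0} | {b for _, b in vals})
grid = [(s, b) for s in s_pts for b in b_pts]
ratio_holds = best_support2_gft(vals, grid) <= 2 * best_fixed_price_gft(vals) + 1e-9
```

Restricting to support-two mixtures only lowers the benchmark, so the inequality must also hold for the restricted search.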