Post-hoc Selection of Pareto-Optimal Solutions in Search and Recommendation

Information Retrieval (IR) and Recommender Systems (RS) tasks are moving from computing a ranking of final results based on a single metric to multi-objective problems. Solving these problems leads to a set of Pareto-optimal solutions, known as the Pareto frontier, in which no objective can be further improved without hurting the others. In principle, all the points on the Pareto frontier are potential candidates to represent the best model selected with respect to the combination of two or more metrics. To our knowledge, there are no well-recognized strategies to decide which point should be selected on the frontier. In this paper, we propose a novel, post-hoc, theoretically-justified technique, named "Population Distance from Utopia" (PDU), to identify and select the one-best Pareto-optimal solution from the frontier. In detail, PDU analyzes the distribution of the points by investigating how far each point is from its utopia point (the ideal performance for the objectives). The possibility of considering fine-grained utopia points allows PDU to select solutions tailored to individual user preferences, a novel feature we call "calibration". We compare PDU against existing state-of-the-art strategies through extensive experiments on tasks from both IR and RS. Experimental results show that PDU, both alone and combined with calibration, notably impacts the solution selection. Furthermore, the results show that the proposed framework selects a solution in a principled way, irrespective of its position on the frontier, thus overcoming the limits of other strategies.


INTRODUCTION
Many tasks in Information Retrieval (IR) and Recommender Systems (RS) involve the optimization of multiple objective functions. As an example, consider the IR task of diversifying search results where, given a user query, we require the IR system to return a list of results that are both relevant for the user and diverse with respect to the possible "facets" of the query [40]. Addressing this task requires designing a two-objective ranking function that jointly maximizes both the relevance and the diversity of the result list. The same considerations can be made in RS. Although recommendation accuracy is considered the gold-standard measure to assess the quality of suggestions, in recent years RSs have been required to meet other beyond-accuracy metrics to avoid obvious [46] and unfair [51] recommendations. Therefore, the choice of a recommendation model and its settings entails several criteria with trade-offs among them, resulting in a non-trivial challenge.
Multi-objective Optimization (MOO) has recently attracted several interesting IR and RS contributions [15,41,51]. MOO deals with Pareto optimality, i.e., the identification of solutions where no objective can be further improved without damaging the others. Pareto-optimal solutions are in turn collected in the so-called Pareto frontier, a set of (possibly infinite) non-dominated solutions.
Existing approaches for MOO can be classified into two categories: i) heuristic search and ii) scalarization. In the first category, multi-objective evolutionary algorithms are used to ensure that the emerging solutions are not dominated by each other, even if they can still be dominated by Pareto-optimal solutions not visited by the algorithm [6,39]. In the second category, scalarization methods aggregate multiple objectives into one objective, possibly guaranteeing Pareto optimality. Scalarization approaches can exploit model aggregation techniques, which combine the output of different models trained on the single objectives. Alternatively, label aggregation techniques combine the labels of the training samples a priori, and the optimization is performed using the aggregated label. Aggregation techniques may involve setting the importance or priority of the different objectives by weighting each objective through a scalar function, e.g., Linear Scalarization [31] and Weighted Chebyshev [27]. Conversely, some techniques work by constraining the objectives of the problem, e.g., ε-Constraint [17], leading to a unique non-dominated solution.
Pareto optimality is commonly achieved by many different Pareto-optimal solutions. However, IR and RS MOO tasks generally require identifying a single Pareto-optimal solution to be deployed in the system. To the best of our knowledge, no strategies specifically tailored to IR and RS tasks have been previously proposed [51]. The state-of-the-art techniques from MOO theory are in fact aimed at identifying a set of Pareto-optimal solutions, without addressing the problem of choosing, post hoc, one among the (possibly many) solutions identified for the IR and RS tasks. Indeed, many works in the IR and RS literature, although exploiting the techniques discussed above, either i) do not consider the problem of selecting a single best solution to the multi-objective problem or ii) do not discuss the criteria adopted to select a single Pareto-optimal solution [53].
In this paper, we fill this gap by introducing "Population Distance from Utopia" (PDU), a novel, flexible, post-hoc strategy for selecting the one-best Pareto-optimal solution among the ones lying on the Pareto frontier for IR and RS tasks. PDU relies on the observation that the Pareto-optimal point coordinates are an aggregation (usually the mean) of the model performance for each sample, i.e., queries in IR and users in RS, on multiple objectives. PDU exploits the notion of "Utopia point" as the ideal optimization target. Differently from the methods from MOO theory, which are devised to solely consider the mean performance values when selecting a single Pareto-optimal solution, PDU is designed to set a utopia point for each sample of the dataset. This feature allows choosing a solution not only based on the "global" performance achieved by the IR/RS model, but also at a more fine-grained resolution that considers multiple quality criteria expressed at the sample level. We call this feature "calibrated" selection. In detail, the novel contributions of this paper are:
• We formally introduce PDU as a novel technique that allows one to select, in a principled way, the best Pareto-optimal solution previously identified by a state-of-the-art MOO technique.
• We provide a thorough comparison of PDU against state-of-the-art selection strategies. The analysis shows that PDU is the only selection method that allows identifying a "calibrated" solution, i.e., one based on ideal targets expressed at the sample level.
• We experimentally compare PDU against state-of-the-art strategies on well-known IR and RS tasks by exploiting public data. The results show that, unlike other methods, PDU can identify Pareto-optimal solutions regardless of their position on the frontier. Moreover, PDU calibration can lead to the selection of significantly different trade-offs.
• We release a GitHub repository with our implementation of PDU and the state-of-the-art competitors, as well as the data used in the experiments, to allow full reproducibility of the results.
Note that a MOOP as defined in Equation (1) supposes that improving one objective means minimizing it. However, an objective $f_i(\cdot)$ of the vector of functions $\mathbf{f}(\cdot)$ might have either a maximization or a minimization goal. In this sense, the maximization of the function $f_i(\cdot)$ may be readily rewritten as the minimization of $-f_i(\cdot)$ to meet Equation (1).

Pareto Optimality. In a single-objective optimization problem, the optimal solution is the one whose objective function value satisfies the scalar relation "less than or equal" ($\leq$) with respect to every other feasible solution. In contrast, in a MOOP, since typically there is no single global solution, it is necessary to determine a set of points that all fit a predetermined definition of an optimum. Hence, the concept of Pareto optimality is usually adopted, which relies on the Pareto dominance relation [47]. A vector $\mathbf{x}^{\star}$ Pareto-dominates a vector $\mathbf{x}$, denoted by $\mathbf{x}^{\star} \prec \mathbf{x}$, if and only if

$$\exists i \in \{1, \dots, k\} \mid f_i(\mathbf{x}^{\star}) < f_i(\mathbf{x}) \ \text{ and } \ f_j(\mathbf{x}^{\star}) \leq f_j(\mathbf{x}) \ \ \forall j \in \{1, \dots, i-1, i+1, \dots, k\}.$$

A solution $\mathbf{x}^{\star} \in \mathcal{X}$ is Pareto optimal if there does not exist another solution $\mathbf{x} \in \mathcal{X}$ such that $\mathbf{f}(\mathbf{x}) \prec \mathbf{f}(\mathbf{x}^{\star})$. In other words, a point is Pareto optimal if there is no other point that improves at least one objective function without hurting another one. Then, solving the problem in Equation (1) means finding the solutions $\mathbf{x} \in \mathcal{X}$ such that their images $\mathbf{f}(\mathbf{x})$ are not Pareto-dominated by any other vector in the feasible set. The set of non-Pareto-dominated solutions $\mathcal{P}^{\star} \subseteq \mathcal{X}$ is called the Pareto-optimal set in the feasible set, and is formally defined as

$$\mathcal{P}^{\star} = \{\mathbf{x} \in \mathcal{X} \mid \nexists\, \mathbf{x}' \in \mathcal{X} : \mathbf{f}(\mathbf{x}') \prec \mathbf{f}(\mathbf{x})\}.$$

The image of the Pareto-optimal set $\mathcal{P}^{\star}$ in the objective function space is called the Pareto frontier, i.e.,

$$\mathcal{F}^{\star} = \{\mathbf{f}(\mathbf{x}) \mid \mathbf{x} \in \mathcal{P}^{\star}\}.$$

Utopia and Nadir Points. Once a solution set $\mathcal{P}^{\star}$ for the problem in Equation (1) is obtained, the decision-making process requires the selection of a single optimal solution from the Pareto frontier. Generally, the utopia point helps to implement this process [31]. A point $\mathbf{f}^{\diamond} \in \mathbb{R}^{k}$ is a utopia point if and only if $f^{\diamond}_i = \min_{\mathbf{x}} \{f_i(\mathbf{x}) \mid \mathbf{x} \in \mathcal{X}\} \ \forall i \in \{1, 2, \dots, k\}$. Generally, the utopia point is an ideal point in $\mathbb{R}^{k}$ that is unattainable. Hence, a common approach consists in selecting the solution closest to the utopia point as the best one, where, in most cases, the term closest refers to the solution that minimizes the Euclidean distance to the utopia point. However, it is not necessary to restrict closeness to the case of a Euclidean norm [31].
Along with the utopia point, the nadir point also helps select a solution from the Pareto frontier. Dually to the utopia point, the nadir point represents the point in the objective function space having the worst possible values for each objective. A point $\mathbf{f}^{\triangle} \in \mathbb{R}^{k}$ is a nadir point if and only if $f^{\triangle}_i = \max_{\mathbf{x}} \{f_i(\mathbf{x}) \mid \mathbf{x} \in \mathcal{X}\} \ \forall i \in \{1, 2, \dots, k\}$. Compared to the utopia point, determining the nadir point can be challenging, even for simple problems [26].
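The definitions of Pareto dominance, Pareto frontier, utopia point, and nadir point given above can be illustrated with a minimal Python sketch. It assumes that every objective is minimized and that a small finite list of objective vectors stands in for the feasible set; the function names and the toy values are hypothetical and are not taken from the released implementation.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a Pareto-dominates b (assumption: all objectives are minimized)."""
    no_worse = all(ai <= bi for ai, bi in zip(a, b))
    strictly_better = any(ai < bi for ai, bi in zip(a, b))
    return no_worse and strictly_better


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def utopia_point(points: List[Point]) -> Point:
    """Component-wise minimum over the (finite) candidate set."""
    return tuple(min(column) for column in zip(*points))


def nadir_point(points: List[Point]) -> Point:
    """Component-wise maximum over the (finite) candidate set."""
    return tuple(max(column) for column in zip(*points))


# Hypothetical objective vectors (f1, f2), both to be minimized.
candidates = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (5.0, 5.0)]

frontier = pareto_frontier(candidates)
print(frontier)                  # [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0)]
print(utopia_point(candidates))  # (1.0, 1.0)
print(nadir_point(candidates))   # (5.0, 5.0)
```

In this toy example, (3.0, 4.0) is dominated by (2.0, 3.0) and (5.0, 5.0) is dominated by (1.0, 5.0), so only three points remain on the frontier, and the utopia point (1.0, 1.0) is not attained by any candidate.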

BACKGROUND

Selection Strategies
The Pareto frontier consists of a set of equally optimal solutions. Some methods to select a single Pareto-optimal solution assume the existence of a decision-maker [25]. These methods are known as Multi-Criteria Decision Making (MCDM) strategies, where a decision-maker has knowledge of the preferences (hierarchy) among the objectives. However, decision-makers do not always know how to weigh the different objectives [5]. Moreover, in some cases, the complexity of the problem makes it difficult for a human decision-maker to evaluate and compare different options comprehensively. Conversely, mathematical methods can provide consistent, objective, and impartial decision-making approaches. In this work, we focus on and outline mathematical strategies for selecting a solution from the Pareto frontier, i.e., strategies applicable in the absence of "a priori knowledge" that can feed an MCDM strategy.
Knee Point.
The Knee Point [5] strategy aims to identify a knee of the Pareto frontier. The rationale is that solutions different from the knee point would exhibit limited improvement for one objective and a substantial deterioration for the others. As described by Branke et al. [5], these strategies were born as a variation of multi-objective evolutionary algorithms to find the knee regions on the Pareto frontier. Consequently, when other algorithms compute the Pareto frontier, the extracted knee region may not have a knee-featured shape, thus making this strategy less convenient. Several methods to identify the knee point are proposed in the literature, mainly differing in the number of objectives they support.

Angle-based method (A-KP). When dealing with two objectives, the reflex angle between the slopes of the two vectors through a point $p_i = (x_i, y_i)$ and its two neighbors, i.e., $p_{i-1} = (x_{i-1}, y_{i-1})$ and $p_{i+1} = (x_{i+1}, y_{i+1})$, on the Pareto frontier can be considered an efficient indication of whether the point can be classified as a knee [5]. The Pareto-optimal point having the maximum reflex angle computed from its neighbors is considered the knee [12]. If no neighbor to the left (right) is found, a vertical (horizontal) line is used to calculate the angle. Even though this method is efficient in a two-dimensional scenario, it becomes impractical for more than two objectives, especially regarding the choice of neighbors.

Utility-based method (U-KP). A valid alternative to overcome the limitation of the angle-based method is adopting a marginal utility function. Let us consider a set of $k$ objective functions $f_i(\cdot)$ and $n$ sets of $k$ uniformly distributed weights $\mathbf{w}$, with $w_i \in [0, 1]$ such that $\sum_{i=1}^{k} w_i = 1$ [5]. The resulting utility function is then $U(\mathbf{x}, \mathbf{w}) = \sum_{i=1}^{k} w_i \cdot f_i(\mathbf{x})$. The Pareto-optimal solution having the minimum utility value for most weight configurations is the knee point.
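The utility-based knee point described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation evaluated in the paper: all objectives are minimized, weights are drawn uniformly on the simplex by normalizing exponential draws, and the names `sample_weights` and `utility_knee_point`, the number of weight sets, and the seed are hypothetical.

```python
import random
from collections import Counter
from typing import List, Tuple

Point = Tuple[float, ...]


def sample_weights(k: int, rng: random.Random) -> List[float]:
    """Draw k non-negative weights summing to 1 (uniform on the simplex)."""
    raw = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


def utility_knee_point(frontier: List[Point],
                       n_weight_sets: int = 1000,
                       seed: int = 0) -> Point:
    """Return the point with minimum weighted-sum utility for most weight sets."""
    rng = random.Random(seed)  # fixed seed (assumption) to make the run repeatable
    k = len(frontier[0])
    wins: Counter = Counter()
    for _ in range(n_weight_sets):
        w = sample_weights(k, rng)
        utilities = [sum(wi * fi for wi, fi in zip(w, p)) for p in frontier]
        wins[utilities.index(min(utilities))] += 1
    best_index, _ = wins.most_common(1)[0]
    return frontier[best_index]


# Hypothetical two-objective frontier with a pronounced knee at (0.1, 0.1).
toy_frontier = [(0.0, 1.0), (0.1, 0.1), (1.0, 0.0)]
print(utility_knee_point(toy_frontier))  # (0.1, 0.1)
```

In this toy frontier the middle point has the lowest utility whenever the first weight lies between 0.1 and 0.9, i.e., for roughly 80% of the sampled weight sets, which is why it is returned. Without a fixed seed, the outcome can vary between runs on frontiers where the win counts are close.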

Hypervolume.
The Hypervolume [54] strategy was first introduced to compare the quality of different Pareto frontiers [14]. However, by computing the hypervolume of each solution on the Pareto frontier, this strategy can be straightforwardly exploited to select the best solution from the set [53]. Given a Pareto-optimal solution $\mathbf{x}^{\star} \in \mathbb{R}^{k}$, a reference point $\mathbf{r} \in \mathbb{R}^{k}$, and the Lebesgue measure $\lambda$, the hypervolume $HV$ of $\mathbf{x}^{\star}$ with respect to $\mathbf{r}$ is:

$$HV(\mathbf{x}^{\star}, \mathbf{r}) = \lambda\left(\{\mathbf{q} \in \mathbb{R}^{k} \mid \mathbf{x}^{\star} \preceq \mathbf{q} \preceq \mathbf{r}\}\right),$$

where $\preceq$ denotes the component-wise "less than or equal" relation. The $HV$ value is the volume of the hypercube determined by the solution $\mathbf{x}^{\star}$ and the reference point $\mathbf{r}$. The Pareto-optimal point having the maximum hypervolume is the selected one.

For the Euclidean Distance (ED) strategy, the Pareto-optimal point having the minimum Euclidean distance from the utopia point is the selected solution. Instead, the Weighted Mean (WM) requires setting the importance of each objective through a set of weights. Among all the Pareto-optimal points, the point maximizing the weighted mean corresponds to the selected solution.
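As a small illustration of the hypervolume and Euclidean distance strategies described above, the following Python sketch selects a point from a toy frontier. It assumes that all objectives are minimized, that the reference point is no better than any frontier point on each objective, and that the frontier, reference point, and utopia point values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def hypervolume(point: Sequence[float], reference: Sequence[float]) -> float:
    """Volume of the axis-aligned box spanned by a point and the reference point."""
    return math.prod(max(r - p, 0.0) for p, r in zip(point, reference))


def select_by_hypervolume(frontier: List[Point], reference: Point) -> Point:
    """Select the frontier point with the maximum hypervolume."""
    return max(frontier, key=lambda p: hypervolume(p, reference))


def select_by_euclidean_distance(frontier: List[Point], utopia: Point) -> Point:
    """Select the frontier point closest (Euclidean norm) to the utopia point."""
    return min(frontier, key=lambda p: math.dist(p, utopia))


# Hypothetical frontier (both objectives minimized), reference and utopia points.
toy_frontier = [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0)]
reference = (5.0, 5.0)
utopia = (1.0, 1.0)

print(select_by_hypervolume(toy_frontier, reference))      # (2.0, 3.0)
print(select_by_euclidean_distance(toy_frontier, utopia))  # (2.0, 3.0)
```

Here the hypervolumes are 0, 6, and 4, and the distances from the utopia point are 4, about 2.24, and 3, so both strategies pick (2.0, 3.0). Note how the extreme point (1.0, 5.0) gets zero hypervolume because it shares one coordinate with the reference point, which illustrates how strongly the outcome depends on the chosen reference.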

Related Work on MOO for IR and RS
While a discussion of the state-of-the-art selection approaches has been provided in Section 3.1, we now briefly summarize the main contributions of MOO in IR and RS. Previous works investigate the introduction of multiple criteria in IR systems, e.g., in web search and recommendation [10,11,21,44,45,49], and product search [22,29]. Carmel et al. [9] propose Stochastic Label Aggregation (SLA), a technique that performs label aggregation by randomly selecting a label per training example according to a given distribution over the labels. In RS, Lin et al. [28] propose a scalarization-based Pareto-Efficient Learning-To-Rank (PE-LTR) framework by deriving the conditions on the weighted-sum weights that ensure the solution is Pareto efficient. In the RS area, MOO techniques are routinely exploited for optimizing multiple fairness criteria beyond relevance. Ge et al. [15] propose a fairness-aware RS based on multi-objective reinforcement learning, simultaneously optimizing click-through rate (CTR), as a signal for relevance, and item exposure, as a signal for fairness. Moreover, Wu et al. [51] employ scalarization to optimize accuracy along with both provider and consumer fairness. Naghiaei et al. [32] also integrate consumer-side and producer-side fairness constraints into a re-ranking approach.

POPULATION DISTANCE FROM UTOPIA
Driven by the goal of overcoming the limitations of the other methods in a principled way for IR and RS, we propose PDU (Population Distance from Utopia), a selection strategy taking into account the distance of the query/user metric from the utopia point.
Our intuition starts from the observation that in a search and/or recommendation scenario, the Pareto frontier is populated by points representing aggregated results (usually, the average value) on metrics referring to a set of experiments. For instance, in an RS setting, we could have a frontier representing the values of two metrics: nDCG, to measure the accuracy of the model, and Intra-list Diversity (ID), to measure the diversity in the list of recommended items. Each point on the frontier may represent the corresponding values of nDCG and ID for a specific configuration of the hyperparameters. It is worth noticing that these values are computed as the value of the given metric averaged over all the system users. Suppose we focus instead on the point representing a single user. In that case, we may also reconsider the notion of utopia point in this more fine-grained view and adapt it to generalize with respect to the single user. The same observations hold in a search setting, where we have queries instead of users. The questions leading our proposal are then: i) What happens if we focus our analysis on the original points instead of their aggregated representation? ii) Can we characterize each of these fine-grained points and exploit a generalized definition of utopia point that considers even the single user/query?
We start by defining a generalized version of the utopia point.
A point $\mathbf{f}^{\circ} \in \mathbb{R}^{k}$ is a generalized utopia point if and only if $f^{\circ}_i = h_i \ \forall i \in \{1, 2, \dots, k\}$. In our definition, $h_i$ is a function that considers the characteristics of the original data and returns a desired but unattainable utopia value for the $i$-th metric. For a (non-generalized) utopia point $\mathbf{f}^{\diamond}$, we have $h_i = \min_{\mathbf{x}} f_i(\mathbf{x})$. Its definition can be driven both by system or dataset properties and by the choices of the system designer. For instance, in Section 5.1, we define $h_2$ (see Equation (15)) to quantify the users' popularity tendencies stemming from their past interactions with the items in a recommendation scenario.

Given a Pareto-optimal solution $\mathbf{x}^{\star} \in \mathbb{R}^{k}$, we can assume that it is the image of an aggregation function applied to a set of $m$ points $\mathbf{x}_i$ in $\mathbb{R}^{k}$, with $i \in \{1, \dots, m\}$. In our previous example, the points represent the values of the pairs $\langle \mathrm{nDCG}, \mathrm{ID} \rangle$ (with $k = 2$) for the $m$ users in the system. Suppose a generalized utopia point $\mathbf{f}^{\circ}_i \in \mathbb{R}^{k}$, with $i \in \{1, \dots, m\}$, is associated with each point $\mathbf{x}_i$.

Definition 4.1. The Population Distance from Utopia (PDU) is:

$$\mathrm{PDU}(\mathbf{x}^{\star}) = \log \sum_{i=1}^{m} e(\mathbf{f}^{\circ}_i, \mathbf{x}_i)^2,$$

where $e : \mathbb{R}^{k} \times \mathbb{R}^{k} \to \mathbb{R}$ is an error function that satisfies the conditions of identity, symmetry, and triangle inequality. The Pareto-optimal point having the minimum PDU is the selected solution. The error function $e(\cdot)$ is parametric, i.e., we can set any error or distance metric as $e(\cdot)$, such as the Euclidean distance or the mean squared error.
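To make Definition 4.1 concrete, here is a minimal Python sketch that reads PDU as the logarithm of the summed squared errors between each sample and its generalized utopia point, with the Euclidean distance as the error function. The function names, configuration names, and all numbers are hypothetical, and the sketch assumes that at least one sample has a non-zero error (otherwise the logarithm is undefined).

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
ErrorFn = Callable[[Sequence[float], Sequence[float]], float]


def pdu(sample_points: List[Point],
        utopia_points: List[Point],
        error: ErrorFn = math.dist) -> float:
    """log of the summed squared errors between samples and their utopia points."""
    total = sum(error(f, x) ** 2 for f, x in zip(utopia_points, sample_points))
    return math.log(total)  # assumes total > 0, i.e., utopia is not fully attained


def select_by_pdu(solutions: Dict[str, List[Point]],
                  utopia_points: List[Point]) -> str:
    """Return the name of the solution with the minimum PDU."""
    return min(solutions, key=lambda name: pdu(solutions[name], utopia_points))


# Two hypothetical Pareto-optimal configurations evaluated on m = 2 users with
# k = 2 metrics (higher is better); both share the same mean performance (0.5, 0.5).
solutions = {
    "config_a": [(0.9, 0.1), (0.1, 0.9)],
    "config_b": [(0.5, 0.5), (0.5, 0.5)],
}

# Same utopia point for every user versus user-specific (generalized) utopia points.
shared_utopia = [(1.0, 1.0), (1.0, 1.0)]
per_user_utopia = [(1.0, 0.2), (0.2, 1.0)]

print(select_by_pdu(solutions, shared_utopia))    # config_b
print(select_by_pdu(solutions, per_user_utopia))  # config_a
```

The two configurations are indistinguishable when only their mean coordinates are considered, yet the selection changes once the utopia points are defined per user. This is the behavior that the per-sample formulation is meant to expose.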
Derivation. Let us consider an objective function space $\mathbb{R}^{k}$, where $k$ is the number of objectives, and a dataset $\mathcal{D}$ of $m$ samples (users/queries). For each sample, we assume that the best possible value of each objective is known. Then, we can associate each sample with a $k$-dimensional vector $\mathbf{f}^{\circ}_i$, with $i \in \{1, \dots, m\}$, which constitutes its generalized utopia point in the objective function space $\mathbb{R}^{k}$. We use $\mathcal{F} = \{\mathbf{f}^{\circ}_i \mid i \in \{1, \dots, m\}\}$ to denote the set of all the generalized utopia points referring to the $m$ samples. Let us now consider a model $M$ that returns $k$ objective performance values for each sample in $\mathcal{D}$. As before, each sample corresponds to a $k$-dimensional vector $\mathbf{x}_i$, with $i \in \{1, \dots, m\}$, which represents the model performance for that sample in $\mathbb{R}^{k}$. We denote $\mathcal{P} = \{\mathbf{x}_i \mid i \in \{1, \dots, m\}\}$. Thus, each sample $i$ is represented by $\mathbf{f}^{\circ}_i$ and $\mathbf{x}_i$ in the objective function space: the closer the points, the better the model $M$ performs. Let us introduce an error function $e : \mathbb{R}^{k} \times \mathbb{R}^{k} \to \mathbb{R}$ satisfying the conditions of identity, symmetry, and triangle inequality. The error of the model $M$ on the $i$-th sample is $e(\mathbf{f}^{\circ}_i, \mathbf{x}_i)$. We assume that the error terms are independent and identically distributed (IID) and follow a Gaussian distribution with mean $\mu = 0$ and variance $\sigma^2$, i.e., $e(\mathbf{f}^{\circ}_i, \mathbf{x}_i) \sim \mathcal{N}(0, \sigma^2)$, whose probability density function is:

$$p\left(e(\mathbf{f}^{\circ}_i, \mathbf{x}_i)\right) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{e(\mathbf{f}^{\circ}_i, \mathbf{x}_i)^2}{2\sigma^2}\right). \tag{5}$$

We can note that if $\mathbf{f}^{\circ}_i$ and $\mathbf{x}_i$ are close, the exponential term of Equation (5) tends to 1 and the probability density increases, whereas it tends to 0 when the two points are far apart and the probability density decreases.
Then, we compute the probability density function of the error for the entire dataset $\mathcal{D}$. We observe that the model $M$ has some parameters $\Theta$. Hence, $\mathcal{P}$ can be expressed as a function $g$ of the parameters $\Theta$: $\mathcal{P} = g(\Theta)$. Then, a vector $\mathbf{x}_i \in \mathcal{P}$ can be rewritten as $\mathbf{x}_i = g(\Theta)_i$. By assuming the samples to be independent, we obtain the following expression for the likelihood function:

$$\mathcal{L}\left(e(\mathcal{F}, g(\Theta))\right) = \prod_{i=1}^{m} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{e(\mathbf{f}^{\circ}_i, g(\Theta)_i)^2}{2\sigma^2}\right).$$

Since $\mathbf{f}^{\circ}_i$ is the (generally unattainable) output we desire to have, we are interested in finding the parameters $\Theta$ for the model $M$ such that the likelihood function $\mathcal{L}(e(\mathcal{F}, g(\Theta)))$ is the highest. As the logarithm is monotonically increasing, it does not change the location of the maximum. Hence, we can compute the log-likelihood instead of the likelihood to simplify calculations:

$$\log \mathcal{L}\left(e(\mathcal{F}, g(\Theta))\right) = -\frac{m}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{m} e(\mathbf{f}^{\circ}_i, g(\Theta)_i)^2. \tag{8}$$

At this point, we make the variance term $\sigma^2$ explicit. Since we have supposed that the error term $e(\mathbf{f}^{\circ}_i, \mathbf{x}_i)$ has a Gaussian distribution with $\mu = 0$, the variance $\sigma^2$ is defined as

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} e(\mathbf{f}^{\circ}_i, g(\Theta)_i)^2.$$

By introducing this term in Equation (8), we obtain that the log-likelihood is:

$$\log \mathcal{L}\left(e(\mathcal{F}, g(\Theta))\right) = -\frac{m}{2}\log(2\pi) + \frac{m}{2}\log m - \frac{m}{2} - \frac{m}{2}\log \sum_{i=1}^{m} e(\mathbf{f}^{\circ}_i, g(\Theta)_i)^2. \tag{10}$$

By supposing to train the model $M$ on the same dataset $\mathcal{D}$ with several configurations of $\Theta$, the additive terms and the positive factor in Equation (10) that depend only on the dataset size $m$ and on the constant $1/2$ can be removed, as they are the same for every configuration and do not affect which one attains the highest log-likelihood. Hence, the only variable quantity among the different log-likelihoods is:

$$-\log \sum_{i=1}^{m} e(\mathbf{f}^{\circ}_i, g(\Theta)_i)^2. \tag{11}$$

Therefore, we are looking for the model $M$ with parameters $\Theta$ having the maximum value of the term in Equation (11):

$$\arg\max_{\Theta} \; -\log \sum_{i=1}^{m} e(\mathbf{f}^{\circ}_i, g(\Theta)_i)^2.$$

Finally, this remaining term can be easily rewritten with a positive sign, as long as we choose the configuration of $\Theta$ for the model $M$ having the minimum value for this quantity:

$$\arg\min_{\Theta} \; \log \sum_{i=1}^{m} e(\mathbf{f}^{\circ}_i, g(\Theta)_i)^2.$$

PDU allows setting a generalized utopia point for each sample of the dataset, i.e., queries and users in an IR or RS scenario, respectively. This feature allows choosing a solution not only based on the "global" performance achieved by the IR/RS model, but also at a more fine-grained resolution that considers multiple quality criteria expressed at the sample level. We call this feature calibration, since it can be usefully exploited in specific scenarios, e.g., personalization in RS, where it is possible to define generalized utopia points according to individual users' preferences. These generalized utopia points can be fixed a priori, e.g., they can be identified by the system designer or computed through functions that numerically quantify the users' tendencies, similarly to what has been done in previous works regarding calibrated recommendations [20,35,42]. We refer to this feature as Calibrated-PDU (C-PDU).
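The last steps of the derivation can be checked numerically with a short Python sketch: when the variance is replaced by the mean squared error, ranking configurations by Gaussian log-likelihood (highest first) coincides with ranking them by the logarithm of the summed squared errors (lowest first). The per-sample error values and configuration names below are hypothetical.

```python
import math
from typing import Dict, List


def gaussian_log_likelihood(errors: List[float]) -> float:
    """Zero-mean Gaussian log-likelihood with variance set to the mean squared error."""
    m = len(errors)
    sum_sq = sum(e * e for e in errors)
    sigma2 = sum_sq / m
    return -0.5 * m * math.log(2 * math.pi * sigma2) - sum_sq / (2 * sigma2)


def log_sum_squared_errors(errors: List[float]) -> float:
    """The quantity left after dropping the configuration-independent terms."""
    return math.log(sum(e * e for e in errors))


# Hypothetical per-sample errors of three configurations on the same m = 4 samples.
errors_by_config: Dict[str, List[float]] = {
    "theta_1": [0.2, 0.4, 0.1, 0.3],
    "theta_2": [0.5, 0.5, 0.6, 0.4],
    "theta_3": [0.1, 0.1, 0.2, 0.1],
}

by_likelihood = sorted(errors_by_config,
                       key=lambda c: gaussian_log_likelihood(errors_by_config[c]),
                       reverse=True)
by_log_sum = sorted(errors_by_config,
                    key=lambda c: log_sum_squared_errors(errors_by_config[c]))

print(by_likelihood)  # ['theta_3', 'theta_1', 'theta_2']
print(by_log_sum)     # ['theta_3', 'theta_1', 'theta_2']
```

The summed squared errors are 0.30, 1.02, and 0.07, respectively, so both criteria order the configurations identically, as expected when all configurations are evaluated on the same number of samples.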

Feature Comparison
In Section 3.1, we have presented the most-used techniques to choose a single best solution belonging to a Pareto frontier. However, as also stated by Wu et al. [51], there is no consensus on the strategy to solve this task in the IR and RS communities. Not surprisingly, all methods have some advantages and limitations, leading to a lack of an ideal strategy [26]. Hence, a comparison of the features provided by PDU and state-of-the-art techniques is needed. Specifically, we identify some desirable features the techniques should have. Table 1 discusses the main properties of PDU and other state-of-the-art techniques.

First, the strategy should be suitable even when dealing with more than two objectives. In this regard, the angle-based knee point is the only ineffective method.

Second, the strategy should not need any additional knowledge. Most techniques require additional problem information, i.e., the reference point (HV), the (generalized) utopia point (ED, PDU), and a set of weights (WM). Since the results of a given strategy can largely depend on such information, a fair strategy should require as little additional information as possible. The weights should be set by a decision-maker with deep knowledge of the hierarchy among the objectives. In contrast, the reference and the (generalized) utopia points are ordinarily intrinsic to the problem. Despite some common practices (e.g., using the nadir point) [26], it has been shown that determining a reference point r for HV is generally more challenging [18,26], and a badly defined reference point can lead to inconsistent evaluation results [24]. Indeed, having a reference point slightly different from the nadir point could lead to incongruous evaluation, as experimentally demonstrated by Ishibuchi et al. [19]. Therefore, the utopia point is the additional information that is easiest to obtain and exploit for this task.

Third, the strategy should not require scaling the range of the objectives. Scaling may be needed for strategies whose calculation involves objective blending, i.e., U-KP, ED, WM, and PDU. When the objectives have different scales, the bigger the range of an objective, the bigger its contribution to the selection of a solution. However, the choice of scaling the objectives is left to the system designer.

Fourth, the strategy should be deterministic. The U-KP strategy requires randomly extracting a set of weights from a uniform distribution. This could potentially affect the consistency and reproducibility of results.

Fifth, the strategy should equally promote the solutions regardless of their position on the Pareto frontier. The strategies blending the objectives are not biased toward selecting solutions from particular regions of the Pareto frontier. This is not true for the HV strategy, which tends to promote the solutions on the concave region of a Pareto frontier.
Final Observations and Calibration. To summarize, none of the strategies has all the properties. However, some considerations can be made. A-KP and U-KP are characterized by major drawbacks. The former can be utilized only in contexts considering two objectives. The latter is nondeterministic. Furthermore, no technique is simultaneously able to select a solution irrespective of its position on the Pareto frontier and independent of scaling the objective ranges before calculation. In this regard, a system designer could prefer to adopt a technique able to fairly choose a solution regardless of its position on the Pareto frontier (as done by U-KP, ED, WM, and PDU). Indeed, scaling the objectives can be easily performed with a simple operation such as min/max normalization. Furthermore, this operation is up to the system designer, who can consider the range of the objectives in specific applications. Concerning the additional knowledge problem, only A-KP and U-KP do not need supplementary information. However, as stated before, they are characterized by major drawbacks. Hence, some additional knowledge is required. Among the remaining techniques, PDU and ED exploit additional information that is easier to define, i.e., the utopia point.
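As an illustration of the min/max normalization mentioned above, the following Python sketch rescales each objective of a set of points to the range [0, 1] before an objective-blending strategy is applied. The function name and the sample values (an effectiveness-like score and a much larger time-like cost) are hypothetical; mapping a constant objective to 0.0 is an assumption made only to avoid a division by zero.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def min_max_scale(points: List[Point]) -> List[Point]:
    """Rescale every objective (column) of the given points to [0, 1]."""
    columns = list(zip(*points))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    scaled = []
    for point in points:
        scaled.append(tuple(
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(point, lows, highs)
        ))
    return scaled


# Hypothetical points whose two objectives live on very different scales.
raw_points = [(0.50, 10.0), (0.52, 40.0), (0.55, 100.0)]
for original, rescaled in zip(raw_points, min_max_scale(raw_points)):
    print(original, "->", rescaled)
```

After rescaling, the first and last points map to (0.0, 0.0) and (1.0, 1.0), and the second objective no longer dominates distance-based computations simply because its raw range is larger.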
Looking further, the proposed PDU allows us to define a utopia point for each sample in the dataset. While the other approaches exploit only the aggregated performance of the models, PDU opens up a novel "calibrated" way to select the one-best Pareto-optimal solution tailored to individual sample characteristics. To the best of our knowledge, this is the first attempt to introduce this kind of feature in the task of selecting Pareto-optimal solutions.
From now on, when no confusion arises, we will use utopia point to refer also to a generalized utopia point.

EXPERIMENTAL EVALUATION
We now present an experimental evaluation based on public data that aims at answering the following research questions:
RQ1: How do PDU and other state-of-the-art selection strategies behave w.r.t. the discussed properties (see Section 4.2)?
RQ2: How does the distribution of the points composing each point on the Pareto frontier influence the selection of a solution?
RQ3: How does the calibration feature impact the selection of a solution?

Experimental Scenarios
Driven by the observation that, in IR and RS settings, the Pareto frontier is populated by points representing aggregated results, we analyze the selection strategies in these two settings.

Information Retrieval Scenario. Concerning the IR scenario, we focus on an ad-hoc search task by dealing with the efficiency / effectiveness / energy-consumption trade-off of query processing in IR systems based on machine-learned ranking models [7]. IR systems heavily exploit supervised techniques for learning document ranking models that are both effective and efficient, i.e., able to retrieve, within a limited time budget, high-quality documents relevant to users' queries. State-of-the-art learning-to-rank models include ensembles of regression trees trained with gradient boosting algorithms, e.g., LambdaMART [7,52], and deep neural networks, e.g., NeuralNDCG [37]. Since ranking is a complex task and the training datasets are large, the learned models are complex and computationally expensive at inference time. The tight constraints on query response time thus require suitable solutions to provide an optimal trade-off between efficiency and ranking quality [8,16,30].
In this scenario, we use the LambdaMART [7,52] implementation available in LightGBM [23] to train ranking models based on ensembles of regression trees, and Neural Networks (NN) trained in PyTorch [36] following the optimization methodology proposed in [33]. The models are trained on MSN30K [38], a public and widely-used dataset for learning to rank. The evaluation employs 11 LambdaMART and 5 Neural Network ranking models, each tested on the 6,306 queries of the MSN30K test set. We measure the ranking quality of each model in terms of average nDCG@10 ($f_1$), and average ranking time (seconds per document) ($f_2$). For the LambdaMART configurations, we also measure the average energy consumption (Joules per document) ($f_3$). The average ranking time of each model has been measured by using QuickScorer [30], while energy consumption has been measured by using the Mammut library [13]. Efficiency experiments are performed on a dedicated Intel Xeon CPU E5-2630 v3 clocked at 2.4 GHz in single-thread execution. QuickScorer is compiled using GCC 9.2.1 with the -O3 option.
In this IR experimental scenario, we focus on selecting the best efficiency/effectiveness trade-off for query processing.
Recommendation Scenario. Concerning the RS scenario, we consider two of the main problems of recommendation algorithms, i.e., the accuracy of the recommendations and the tendency to over-suggest popular items. Often, the ability of an RS to provide accurate recommendations competes with its capability of including long-tail items in such suggestions [32], inducing a trade-off. Hence, we consider two objectives. We compute the Recall ($f_1$) to measure the accuracy of suggestions and the average percentage of items in the long-tail (APLT) [2] ($f_2$) to measure to what extent an RS can recommend unpopular items:

$$\mathrm{APLT} = \frac{1}{|U_t|} \sum_{u \in U_t} \frac{|\{i \mid i \in (L_u \cap \Phi)\}|}{|L_u|},$$

where $|U_t|$ is the number of users in the test set, $L_u$ is the list of recommended items to user $u$, and $\Phi$ is the set of long-tail items. The higher the metric, the higher the number of niche items suggested. Specifically, we interpret APLT from two perspectives, identifying two experimental scenarios. On the one hand, we assess APLT from the perspective of provider-side fairness. Provider-side fairness can be quantified as the models' ability to expose items to users evenly [1,2,51]. Indeed, the over-recommendation of popular items, i.e., the so-called unfairness of popularity bias, may be perceived as unfair by providers whose long-tail items are under-represented in the suggestions. Hence, in this scenario, the goal is to choose a model that promotes relevant items without affecting niche items' visibility.
In this first RS experimental scenario, we focus on selecting the best recommendation model dealing with multiple objectives.
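The APLT metric used in this scenario can be sketched in Python as follows. The sketch assumes that every user receives a non-empty recommendation list; the function name, user identifiers, item identifiers, and the long-tail set are hypothetical.

```python
from typing import Dict, List, Set


def aplt(recommendations: Dict[str, List[str]], long_tail: Set[str]) -> float:
    """Average, over users, of the share of recommended items in the long tail."""
    shares = [
        sum(1 for item in items if item in long_tail) / len(items)
        for items in recommendations.values()  # assumes non-empty lists
    ]
    return sum(shares) / len(shares)


# Hypothetical top-4 recommendation lists and long-tail item set.
recommendations = {
    "u1": ["a", "b", "c", "d"],
    "u2": ["a", "e", "f", "g"],
}
long_tail_items = {"c", "d", "e", "f", "g"}

print(aplt(recommendations, long_tail_items))  # 0.625
```

In this toy case, user u1 receives two long-tail items out of four and user u2 receives three out of four, giving an average of 0.625.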
On the other hand, we evaluate APLT from the final user's point of view. Indeed, certain users may prefer to consume popular items, while others prefer niche items. Consequently, exclusively recommending mainstream items would hurt the experience of long-tail users, and vice versa. The approach of calibrated recommendation has proven to be a valuable solution in this research direction [35,42]. A recommendation list is calibrated with respect to popularity when the set of items it covers matches the user's profile in terms of item popularity [3]. Inspired by the concept of popularity-based calibrated recommendation, for each user, we compute the values of the APLT target ($f_2$) stemming from their popularity profile. To this end, we compute the user-level APLT utopia values using the weighted combination of mean and standard deviation method described by Jugovac et al. [20]. We consider the set of users $U$, the set of items $I$, and the mean number of transactions in the training set. For each item $i \in I$, we assess its popularity $\pi_i$ by counting the number of transactions the item is involved in. For each user $u \in U$, we define the set $\Gamma_u = \{\pi_i \mid u \text{ interacted with } i\}$. We quantify the popularity tendency of user $u$ as $\rho_u = \alpha \cdot \mu(\Gamma_u) + \beta \cdot \sigma(\Gamma_u)$, where $\alpha$ and $\beta$ are set to a fixed value of 1 as done in [20], and $\mu(\cdot)$ and $\sigma(\cdot)$ are the mean and standard deviation operators, respectively. The higher $\rho_u$ is, the more user $u$ has consumed mainstream items in their past interactions. Finally, we normalize $\rho_u$ and compute the APLT utopia value $f^{\circ}_2$ for each user (Equation (15)), where $\Phi$ and $\Psi$ are the sets composed of the $\pi_i$ values such that $i$ is one of the least and most consumed items, respectively. With this normalization, the higher $f^{\circ}_2$ is, the less popular the user profile is. In this second RS experimental scenario, we show how important a calibrated technique is for choosing the best recommendation model dealing with multiple objectives.
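The first part of this procedure, i.e., item popularity and the user popularity tendency, can be sketched in Python as below. This is only an illustrative sketch: it assumes that the tendency is the weighted sum of the mean and the standard deviation of the popularity values in the user profile with both weights equal to 1, it uses the population standard deviation (an assumption), and it does not reproduce the normalization of Equation (15). All names and transactions are hypothetical.

```python
import statistics
from collections import Counter
from typing import Dict, List, Tuple

Transaction = Tuple[str, str]  # (user, item)


def item_popularity(transactions: List[Transaction]) -> Counter:
    """Number of transactions each item is involved in."""
    return Counter(item for _, item in transactions)


def user_popularity_tendency(transactions: List[Transaction],
                             alpha: float = 1.0,
                             beta: float = 1.0) -> Dict[str, float]:
    """Weighted sum of mean and std of the popularity of each user's items."""
    popularity = item_popularity(transactions)
    profiles: Dict[str, List[int]] = {}
    for user, item in transactions:
        profiles.setdefault(user, []).append(popularity[item])
    return {
        user: alpha * statistics.mean(values) + beta * statistics.pstdev(values)
        for user, values in profiles.items()
    }


# Hypothetical training transactions: item "a" is popular, the others are niche.
transactions = [
    ("u1", "a"), ("u2", "a"), ("u4", "a"),
    ("u1", "b"), ("u3", "c"), ("u3", "d"),
]

tendency = user_popularity_tendency(transactions)
print(tendency["u1"])  # 3.0 (mean 2 plus standard deviation 1)
print(tendency["u3"])  # 1.0 (only niche items)
```

In this toy data, user u3 interacted only with niche items and gets the lowest tendency value, so a calibrated utopia value for this user would ask for a larger share of long-tail items than for user u1.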
In the two experimental scenarios presented for RS, we exploit the EASE$^R$ recommendation model [43], which works like a shallow autoencoder. This model is characterized by a single hyperparameter to tune, i.e., the L2-norm regularization ($\lambda$). Nevertheless, it has been shown that it often outperforms other state-of-the-art recommender systems [4]. Specifically, we explore the hyperparameter $\lambda$ by training 48 configurations on the book-domain dataset Goodreads [48] (18,892 users, 25,475 items, and 1,378,033 transactions) and on the music-domain dataset Amazon Music [4] (14,354 users, 10,027 items, and 145,523 transactions). We split the datasets following the 70-10-20 hold-out strategy. Thus, the evaluation of this scenario employs 48 solutions in the objective function space, each tested on the remaining users of the test set (18,070 for Goodreads, and 14,354 for Amazon Music).
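For readers unfamiliar with this model, the sketch below shows the closed-form training step commonly used for this family of shallow autoencoders: the regularized item-item Gram matrix is inverted and each column is rescaled so that the diagonal of the weight matrix is zero. The function name, the toy interaction matrix, and the two regularization values are hypothetical; the 48 configurations explored in the experiments are not reproduced here.

```python
import numpy as np


def fit_ease(interactions: np.ndarray, l2: float) -> np.ndarray:
    """Return the item-item weight matrix for a given L2 regularization weight."""
    X = np.asarray(interactions, dtype=float)
    gram = X.T @ X                                 # item-item co-occurrence matrix
    gram[np.diag_indices_from(gram)] += l2         # add the L2 term on the diagonal
    P = np.linalg.inv(gram)
    B = P / (-np.diag(P))                          # rescale column j by -1 / P[j, j]
    np.fill_diagonal(B, 0.0)                       # an item must not predict itself
    return B


# Hypothetical binary user-item matrix (4 users, 3 items).
X = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]])

# Each regularization value yields one candidate solution in the objective space.
models = {l2: fit_ease(X, l2) for l2 in (0.5, 10.0)}
B = models[0.5]
scores = X @ B  # predicted preference scores, one row per user
```

Each trained configuration is then evaluated on the objectives, producing one point per regularization value.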

Experimental Methodology
The different hyperparameter configurations introduced before, for the two IR and RS settings, generate solutions in the objective function space for each specific experimental scenario. Once the Pareto-optimal solutions that compose the Pareto frontier are identified, we select one by applying PDU and the other selection strategies we analyzed in this work. The selected solutions are then analyzed according to the features introduced in Section 4.2. Moreover, we investigate in detail how the formulation of PDU and its calibration feature influence the choice of the one-best solution by looking at the distribution of the points composing that solution. This section details the experimental settings employed for each selection strategy. We refer to the reference point and the utopia point with $r$ and $f^{\circ}$, respectively. Furthermore, we use the Euclidean distance as the distance function in the formulation of PDU, to have an immediate comparison with ED and assess the impact of the distribution of the points composing a solution. Tables 2, 3, and 4 report the results for the solutions chosen by at least one strategy. For the sake of completeness, the reader may find the complete sets of results in the GitHub repository. The best values for each metric are reported in bold, while the arrows indicate whether better stands for low ↓ or high ↑ values.
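The identification of the Pareto-optimal solutions can be sketched with a simple non-dominated filter. The function name and the toy values (accuracy to maximize, seconds to minimize) are hypothetical and serve only to illustrate the dominance check.

```python
from typing import Sequence

import numpy as np


def pareto_mask(points: np.ndarray, maximize: Sequence[bool]) -> np.ndarray:
    """Boolean mask of the non-dominated rows of `points`."""
    # Flip the sign of minimized objectives so that higher is always better.
    signed = np.where(maximize, points, -points)
    keep = np.ones(len(signed), dtype=bool)
    for i in range(len(signed)):
        others = np.delete(signed, i, axis=0)
        # A point is dominated if another one is no worse everywhere
        # and strictly better on at least one objective.
        no_worse = np.all(others >= signed[i], axis=1)
        better = np.any(others > signed[i], axis=1)
        keep[i] = not np.any(no_worse & better)
    return keep


# Hypothetical solutions: (accuracy to maximize, seconds to minimize).
solutions = np.array(
    [[0.520, 0.004], [0.510, 0.002], [0.500, 0.003], [0.515, 0.004]]
)
front_mask = pareto_mask(solutions, maximize=[True, False])
frontier = solutions[front_mask]
```

In this toy case the third solution is dominated by the second (more accurate and faster) and the fourth by the first (more accurate at the same cost), so only the first two lie on the frontier.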
Experimental settings for the IR scenario. A nadir point cannot be established for the IR scenario because two of the objectives, i.e., efficiency and energy consumption, are not bounded in the opposite direction of the optimization target. For this reason, we define the reference point by slightly worsening the worst values reached by the optimal solutions available. By doing so, we end up setting $r = (0.5, 0.00002, 0.001)$ for HV. Moreover, we set $f^{\circ} = (1, 0, 0)$ for ED, and for each sample in the dataset in PDU. As regards WM, we treat the objectives equally by setting each weight to 0.5.
Finally, in this scenario, we do not apply any normalization to the objective values achieved with the different models.
Experimental settings for the RS scenario. Differently from the IR scenario, a nadir point can be established here because the two objectives under consideration, i.e., Recall and APLT, are bounded. We thus set $r = (0, 0)$ for HV, and $f^{\circ} = (1, 1)$ for ED. As before, we give equal importance to the objectives in WM by setting each weight to 0.5. Concerning PDU, we set $f^{\circ}_1 = 1$ for each sample utopia point, as we want all users to have accurate recommendations. Instead, we set $f^{\circ}_2 = 1$ in the first RS experimental scenario, while we compute specific values of $f^{\circ}_2$ for each user as in Equation (15) in the second RS experimental scenario. Finally, in both RS scenarios, we apply a min-max normalization to the objectives.

We discuss the results obtained in the different scenarios. We first divide the discussion according to both IR and RS scenarios for RQ1 and RQ2. Then, we answer RQ3 by exploiting the second RS scenario.
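As an illustration of the RS settings, the sketch below applies a min-max normalization to a set of solutions and then selects one with ED (minimum Euclidean distance to the utopia point) and one with WM (maximum weighted mean with equal weights). The function names and the three (Recall, APLT) points are hypothetical, and both objectives are assumed to be maximized.

```python
import numpy as np


def min_max(points: np.ndarray) -> np.ndarray:
    """Rescale every objective to [0, 1] over the available solutions."""
    points = np.asarray(points, dtype=float)
    low, high = points.min(axis=0), points.max(axis=0)
    span = np.where(high > low, high - low, 1.0)  # avoid division by zero
    return (points - low) / span


def select_ed(points: np.ndarray, utopia: np.ndarray) -> int:
    """Index of the solution closest to the utopia point."""
    return int(np.argmin(np.linalg.norm(points - utopia, axis=1)))


def select_wm(points: np.ndarray, weights: np.ndarray) -> int:
    """Index of the solution with the highest weighted mean of the objectives."""
    return int(np.argmax(points @ weights))


# Hypothetical Pareto-optimal solutions expressed as (Recall, APLT).
front = np.array([[0.10, 0.40], [0.14, 0.30], [0.20, 0.05]])
normalized = min_max(front)

ed_choice = select_ed(normalized, utopia=np.array([1.0, 1.0]))
wm_choice = select_wm(normalized, weights=np.array([0.5, 0.5]))
```

Both strategies operate on a single aggregated point per solution, which is why they tend to agree on well-balanced solutions such as the middle one in this toy frontier.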

Performance Comparison (RQ1)
IR scenario. We answer RQ1 by first focusing on the IR scenario. The results for this scenario are summarized in Tables 2 (LambdaMART) and 3 (Neural Networks). The plots in Figures 1a and 1c show the Pareto-optimal points selected by the different techniques for the cases considering two and three objectives regarding the LambdaMART models, respectively. Figure 1b shows the points selected in the case of the Neural Networks models.
Regarding the two-objective case, we observe that the methods blending the objectives (PDU, ED, WM) select the same Pareto-optimal solution lying on the boundary of the Pareto frontier for both families of models, thus maximizing the accuracy at the cost of efficiency. In contrast, HV chooses an inner solution of the Pareto frontier in both cases, i.e., more efficient models that, however, show a significantly lower performance in terms of nDCG compared to the selection provided by PDU (0.5225 vs. 0.5179 for LambdaMART, and 0.5185 vs. 0.5144 for the Neural Network). It is worth noting that, in this case, no transformation has been applied to the scale of the objectives, and the values of the Pareto solutions on the efficiency scale lead the points to be closer to the utopia value $f^{\circ}_2 = 0$. If a min-max normalization is applied to the objectives, PDU still selects the same solution.

Another essential implication arising from this analysis is that, in this scenario, we cannot establish the nadir point, which makes the definition of the reference point challenging. Consequently, the results may differ depending on how we define the reference point. Indeed, as we push the reference point away from the Pareto frontier, HV selects a boundary solution, as done by PDU. In light of the above results, we observe that if the information related to the nadir point is unavailable, the definition of the reference point can strongly affect the selection of the final solution. Moreover, if the reference point is estimated by looking at the collection of the considered solutions, i.e., by slightly worsening the worst values reached by them, HV promotes the solution in the middle. Indeed, defining the reference point in such a way makes the volume of those solutions, computed as in Equation (2), higher than any other. Thus, HV treats the points lying on the boundaries of the Pareto frontier unequally.

Finally, it is worth highlighting that U-KP, although reported in Figures 1a and 1b, is not deterministic. Indeed, by executing this method several times, it may choose different points, as the weights of the utility function (see Section 3.1.1) are randomly drawn from a uniform distribution.
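The sensitivity of a volume-based selection to the reference point can be reproduced on a toy frontier. The sketch below is only an assumption-laden illustration: it scores each solution by the volume of the box spanned between the solution and the reference point, which is one plausible reading of the volume mentioned above and not necessarily the exact formulation of Equation (2). The function name, the frontier, and the two reference points are hypothetical, and both objectives are expressed so that higher is better.

```python
import numpy as np


def select_by_volume(points: np.ndarray, reference: np.ndarray) -> int:
    """Index of the solution spanning the largest box with the reference point."""
    gaps = np.asarray(points, dtype=float) - np.asarray(reference, dtype=float)
    volumes = np.prod(gaps, axis=1)  # product of the per-objective improvements
    return int(np.argmax(volumes))


# Hypothetical frontier: a boundary solution, an inner one, another boundary one.
front = np.array([[1.0, 0.0], [0.7, 0.2], [0.0, 0.3]])

# Reference obtained by slightly worsening the worst observed values.
near_choice = select_by_volume(front, reference=np.array([-0.05, -0.05]))

# Reference pushed far away from the frontier.
far_choice = select_by_volume(front, reference=np.array([-10.0, -10.0]))
```

With the nearby reference point the inner solution spans the largest box, whereas with the distant one the boundary solution with the largest sum of objectives prevails, mirroring the behavior discussed above.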
Moving to the three-objective formulation of the IR scenario for the LambdaMART models, Figure 1c shows that when introducing the energy consumption objective, the methods tend to choose a more efficient model than the one selected in the two-objective scenario. As before, PDU and ED tend to maximize the accuracy, as opposed to HV, which still selects solutions in the middle. The three-dimensional scenario confirms two behaviors observed in the two-dimensional one. First, the solution selected by HV depends on the chosen reference point, since it is not possible to define a nadir point. Second, U-KP still exhibits a non-deterministic behavior. Finally, we claim that PDU and ED perform the most convenient selection from a qualitative perspective. By looking at Tables 2 and 3, we see that they choose the models with higher values of nDCG for all IR cases. Indeed, both efficiency and energy consumption objectives are closer to their respective utopia values. This means that more complex models, chosen by PDU and ED, guarantee a considerable improvement in ranking accuracy at the cost of a small loss in efficiency and energy consumption. Conversely, HV chooses models that exhibit a considerable decrease in terms of nDCG.

RS scenario. For the first RS experimental scenario, we report the results achieved in Table 4 for the Goodreads dataset (Figure 1d) and for the Amazon Music dataset (Figure 1e). For both datasets, we notice that two well-separated clusters characterize the Pareto frontier. On the one hand, in Goodreads the EASE$^R$ configurations with lower L2-norm ($\lambda$) values, which belong to the top-center cluster, strike a balance between the objectives. In contrast, the second cluster (bottom-right), i.e., $\lambda$ between 10 and 100 in Table 4, maximizes Recall at the expense of the exposure of the items (lower values of APLT). On the other hand, in Amazon Music, these two clusters of configurations follow the opposite behavior. On the one side, the configurations with $\lambda$ between 0.2 and 1 maximize APLT to the detriment of Recall (top-left cluster). On the other side, the remaining configurations do not promote either Recall or APLT (bottom-right cluster).

In this scenario, HV suffers less from the problem of promoting solutions in the center of the Pareto frontier. Indeed, differently from the IR scenario, here it is possible to define the nadir point as a reference point because we know the lower bounds (0 for both APLT and Recall). Consequently, even though HV selects an inner solution in the Goodreads case, it chooses a point that tends to maximize APLT for the Amazon Music dataset. PDU follows the behavior of HV when selecting the solutions for both datasets. Considering that PDU selects an outer point of the Pareto frontier in the IR scenario, this supports its ability to promote the available solutions equally, regardless of their position on the Pareto frontier. WM and ED select a solution belonging to the top-center cluster in Goodreads and to the bottom-right cluster in Amazon Music, thus enhancing the trade-off between accuracy, measured in terms of Recall, and item exposure in both cases. Finally, U-KP still exhibits a non-deterministic performance.
To answer RQ1, we conclude by observing that PDU overcomes some limitations of its competitors HV and U-KP. Indeed, PDU selects the one-best Pareto-optimal solution regardless of its position on the Pareto frontier and in a deterministic way. Moreover, it exploits the concept of utopia point as additional information. Such a concept is more convenient to use than the reference point used in HV, since, depending on the problem addressed, the nadir point can be difficult to define.

Impact of the Points Distribution (RQ2)
We now answer RQ2 by investigating the impact that the distribution of the points composing a solution on the Pareto frontier has on the selection. Indeed, PDU is the only strategy considering these points at a more fine-grained resolution. This analysis is done on both the IR (Tables 2 and 3) and RS (Table 4) scenarios. To this end, we recall that we have set the Euclidean distance as the distance function in the formulation of PDU (Equation (4)). Hence, even if both PDU and ED rely on the Euclidean distance, they work differently in the two experimental scenarios. This observation provides insights into the impact of the distribution of the points on the selection.

RS scenario. PDU and ED choose different solutions for both RS datasets. In this regard, the distribution of the user data points in the objective function space plays a crucial role, as visually depicted by Figure 2 for the Goodreads dataset. Indeed, the distribution of the solution with $\lambda = 0.5$, chosen by PDU, shows that more points are oriented toward the utopia point than in the solution selected by ED. To confirm this fact, we compute the mean Euclidean distance of the user points to the utopia point for both solutions. The results confirm that the EASE$^R$ configuration selected by PDU has a lower average Euclidean distance, i.e., 1.3498 for $\lambda = 0.5$, than the configuration chosen by ED, i.e., 1.352 for $\lambda = 1$. The same impact is observed regarding the Amazon Music dataset. Here, PDU and ED select different model configurations having $\lambda = 0.3$ and $\lambda = 10$, respectively. As before, the EASE$^R$ configuration selected by PDU ($\lambda = 0.3$) has a lower average Euclidean distance, i.e., 1.2361, than the configuration chosen by ED ($\lambda = 10$), i.e., 1.279.

IR scenario. Concerning the IR two-objective cases, PDU and ED choose the same solution for both LambdaMART and Neural Networks models. When introducing energy consumption as the third objective for the LambdaMART models, ED still selects the same configuration with 878 trees and 64 leaves. Conversely, PDU chooses a more efficient model (300 trees and 64 leaves). Once more, the mean Euclidean distance of the query points to the common utopia point is lower for the model selected by PDU than for the model chosen by ED (0.4813 vs. 0.4945).
To conclude, the answer to RQ2 is that the distribution of the points composing a solution with respect to a common utopia point has a significant impact on the final selection. This is an important fact, as it paves the way to defining selection strategies that take the distribution of the points into account and perform the selection in a more fine-grained, sample-level way.
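A toy example may help to see why a sample-level strategy and an aggregate-level one can disagree even when both rely on the Euclidean distance. The sketch below assumes that the aggregate-level score is the distance between the mean point of a solution and the utopia point, and that the sample-level score is the mean of the per-sample distances to the utopia point; the latter is a simplified reading of PDU based on the analysis above, not the exact formulation of Equation (4). All names and values are hypothetical.

```python
import numpy as np


def aggregate_distance(samples: np.ndarray, utopia: np.ndarray) -> float:
    """Distance between the mean point of a solution and the utopia point."""
    return float(np.linalg.norm(samples.mean(axis=0) - utopia))


def sample_level_distance(samples: np.ndarray, utopia: np.ndarray) -> float:
    """Mean of the per-sample distances to the utopia point."""
    return float(np.linalg.norm(samples - utopia, axis=1).mean())


utopia = np.array([1.0, 1.0])

# Hypothetical per-user (Recall, APLT) points for two solutions.
solutions = {
    "A": np.array([[1.0, 0.0], [0.0, 1.0]]),  # users served in opposite ways
    "B": np.array([[0.4, 0.4], [0.4, 0.4]]),  # users served uniformly
}

aggregate_choice = min(solutions, key=lambda k: aggregate_distance(solutions[k], utopia))
sample_choice = min(solutions, key=lambda k: sample_level_distance(solutions[k], utopia))
```

Solution A has the better mean point, but each of its users is far from the utopia point, so the sample-level score prefers solution B. This is the kind of disagreement that the mean distances reported above capture.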

Impact of Calibration on the Selection (RQ3)
Finally, we analyze the impact of the calibration introduced for PDU using the second RS scenario, where we aim to tailor the selection according to the users' item popularity tastes.To this end, we assess the selection performed by Calibrated-PDU (C-PDU).
Starting from the Amazon Music dataset, the average of the APLT utopia values computed with Equation (15) (0.83) reveals that the dataset's users generally prefer less popular items. Indeed, C-PDU selects the EASE$^R$ model with $\lambda = 0.3$. This solution lies in the top-left cluster of Figure 1e, maximizing APLT with a loss of Recall. In this case, C-PDU behaves similarly to PDU and HV.
Moving to the Goodreads dataset, it is characterized by users with more mainstream tastes, since the average of the APLT utopia values is equal to 0.65. Surprisingly, C-PDU is, together with U-KP (which, however, has a non-deterministic behavior), the only strategy among the ones tested that selects a model configuration belonging to the bottom-right cluster in Figure 2, where the solutions achieve higher accuracy values without promoting APLT, thus following the mainstream tastes of the users. These experimental results already qualitatively show the impact of defining a utopia point for each user on the final selection, since C-PDU is the only strategy that captures the users' popularity profiles for both datasets. We deepen the analysis further by considering the model configurations chosen by PDU and C-PDU for Goodreads, i.e., $\lambda = 0.5$ and $\lambda = 60$, respectively. We observe that, although the model with $\lambda = 0.5$ performs better on average APLT, the model with $\lambda = 60$ has a lower variance of the mean absolute error (0.036 for $\lambda = 60$ vs. 0.039 for $\lambda = 0.5$) between the utopia values and the model performance values for each user. This indicates that C-PDU selects the model that generally follows the users' popularity profiles better. In addition, this model provides more accurate recommendations on average. Hence, C-PDU chooses the model that performs better in terms of accuracy and also matches the popularity tastes of the users.
To conclude, the answer to RQ3 is that the calibration feature of PDU allows dealing with the ideal targets for each sample.This confirms that calibration is a viable way to move the selection of the Pareto-optimal solution to a more fine-grained resolution that can lead to significantly different choices in terms of the trade-off selected.
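The effect of calibration can be illustrated by replacing the common utopia point with one utopia point per user. As before, the score below (mean of the per-user Euclidean distances) is a simplified reading of PDU and not its exact formulation; the function name, the per-user points, and the utopia values are hypothetical.

```python
import numpy as np


def mean_distance(samples: np.ndarray, utopias: np.ndarray) -> float:
    """Mean per-user distance; `utopias` is one point or one row per user."""
    return float(np.linalg.norm(samples - utopias, axis=1).mean())


# Hypothetical per-user (Recall, APLT) points for two candidate models.
solutions = {
    "niche": np.array([[0.3, 0.9], [0.3, 0.9]]),
    "mainstream": np.array([[0.4, 0.2], [0.4, 0.3]]),
}

# Common utopia point: every user is assumed to want maximal APLT.
common = np.array([1.0, 1.0])

# Calibrated utopia points: both users have mainstream tastes (low APLT target).
per_user = np.array([[1.0, 0.2], [1.0, 0.3]])

plain_choice = min(solutions, key=lambda k: mean_distance(solutions[k], common))
calibrated_choice = min(solutions, key=lambda k: mean_distance(solutions[k], per_user))
```

With a common utopia point the model pushing long-tail items is preferred, while the user-level targets move the choice to the model that matches the mainstream profiles and is also more accurate.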

CONCLUSION AND FUTURE WORK
In this work, we proposed PDU, a novel, theoretically-justified post-hoc technique to select the one-best Pareto-optimal solution among the ones lying on the Pareto frontier in search and recommendation scenarios. To our knowledge, PDU is the only selection technique in the literature that can be "calibrated", i.e., it can choose the best Pareto-optimal solution based on ideal targets expressed on single queries or users. We comprehensively compared the properties of PDU with those of competitor techniques. We conducted an extensive experimental evaluation focusing on both IR and RS scenarios, showing that the formulation and the calibration feature of PDU have a notable impact on the solution's selection. In the future, we will explore PDU by exploiting other distance metrics (e.g., Chebyshev and Manhattan). This work could also open the way to the formulation of a new loss function based on the PDU derivation, to directly train a ranking model on multiple objectives simultaneously.

Figure 1: Pareto-optimal solutions for the IR and RS scenarios. The colored shapes represent the best Pareto-optimal point selected by the strategies under evaluation.

Figure 2: Distribution of the users' data points in the objective function space Recall / APLT for the solutions selected by PDU (left) and ED (right). The color of a point indicates the number of users in that point.

Table 1: Overview of the properties of PDU and other state-of-the-art selection strategies. The symbols ✓ (✗, -) indicate that the method has (does not have, could not have) the specified property.

Table 3: Neural Networks selected solutions in the IR scenario. The objectives are accuracy (nDCG) and efficiency (Seconds). Column headers: Seconds ↓, PDU ↓, HV ↑, U-KP ↑, ED ↓, WM ↑.