Representation Online Matters: Practical End-to-End Diversification in Search and Recommender Systems

As the use of online platforms continues to grow across all demographics, users often express a desire to feel represented in the content. To improve representation in search results and recommendations, we introduce end-to-end diversification, ensuring that diverse content flows throughout the various stages of these systems, from retrieval to ranking. We develop, experiment with, and deploy scalable diversification mechanisms in multiple production surfaces on the Pinterest platform, including Search, Related Products, and New User Homefeed, to improve the representation of different skin tones in beauty and fashion content. Diversification in production systems includes three components: identifying requests that will trigger diversification, ensuring diverse content is retrieved from the large content corpus during the retrieval stage, and finally, balancing the diversity-utility trade-off in a self-adjusting manner in the ranking stage. Our approaches, which evolved from using a Strong-OR logical operator to bucketized retrieval at the retrieval stage and from greedy re-rankers to multi-objective optimization using determinantal point processes for the ranking stage, balance diversity and utility while enabling fast iterations and scalable expansion to diversification over multiple dimensions. Our experiments indicate that these approaches significantly improve diversity metrics, with a neutral to positive impact on utility metrics and improved user satisfaction, both qualitatively and quantitatively, in production. An accessible PDF of this article is available at https://drive.google.com/file/d/1p5PkqC-sdtX19Y_IAjZCtiSxSEX1IP3q/view


INTRODUCTION
Over the past decade, the use of online platforms has grown among all demographics and many communities have expressed the need to feel represented in content surfaced online [30]. While representation has gradually improved in some media [28], it remains lacking on social media platforms and in search results and recommendations [18,29]. As technology becomes increasingly integrated into the daily lives of billions of people globally, it is crucial for online platforms to reflect the diverse communities they serve. Search and recommendation systems play a significant role in users' online experiences in various applications, from content discovery to entertainment and from e-commerce to media streaming. By paying close attention to the diversity reflected in their content, these systems can break away from historical patterns of bias and move towards a more inclusive and equitable online experience.
Improving representation online can facilitate content discovery for a more diverse user base by reflecting their inclusion on the platform. This, in turn, demonstrates the platform's ability to meet their needs and preferences. In addition to improved user experience and satisfaction, this can have a positive business impact through increased engagement, retention, and trust in the platform. Gaining a deeper understanding of the diversity of user experiences and perspectives can also lead to a more diverse content corpus, which can significantly drive innovation and creativity.
In this paper, we aim to address the challenge of diversification in large-scale search and recommender systems. Our focus is on diversification mechanisms for visual discovery on Pinterest. Pinterest is the visual inspiration platform people all around the world use to discover the world's most inspiring ideas, plan their best lives, and shop to make their plans a reality. Over 460 million users [15] use Pinterest monthly to discover ideas and products from a corpus of over 11 billion visual bookmarks called Pins. Pins can be images, videos, or products saved from the web or created by Pinners, creators, publishers, and businesses on the platform. People can search for Pins, save the ones they like, and click on a Pin to visit a website and learn more. We focus on diversifying recommendations on three surfaces on Pinterest: Search, Related Products recommendations, and New User Homefeed. Specifically, we develop, experiment with, and deploy scalable diversification mechanisms that utilize a visual skin tone signal [12] to support the representation of a wide range of skin tones in recommendations, as shown in Figure 1 for fashion recommendations in the Related Products surface.
The end-to-end diversification process consists of several components. First, requests that will trigger diversification need to be detected across different categories and locales. Second, the diversification mechanism must ensure that diverse content is retrieved from the large content corpus. Finally, the diversity-aware ranking stage needs to balance the diversity-utility trade-off when ranking content, and to accommodate diversification across several dimensions, such as the skin tone visible in the image as well as the user's various interests. Multi-stage diversification allows the mechanism to operate throughout the pipeline, from retrieval to ranking, to ensure that diverse content passes through all the stages of a recommender system, from billions of items to a small set that is surfaced in the application.
In this work, we make multiple novel contributions to the area of diverse representation in recommender systems.
(1) We present the first visual skin tone diversification production deployment, to the best of our knowledge, to improve representation online in large-scale search and recommender systems.
(2) We developed and productionized a multi-stage diversification system that operates at both the retrieval and ranking stages. For ranking, we developed greedy re-rankers and multi-objective optimization using Determinantal Point Process (DPP), and for retrieval, we implemented a Strong-OR operator for search over token-based indices, as well as Overfetch-and-Rerank and Bucketized-ANN Retrieval over embedding-based indices.
(3) We share learnings from productionizing diversification in a recommender system used by hundreds of millions of users. We also present the challenges, steps, and design choices to mitigate those problems. Our approaches are practical and can easily be translated to other large and complex recommender systems.
(4) We provide empirical results that demonstrate the effectiveness of our various approaches at aligning users' desire for diversity and their utility with the recommendations.
As we could increase diversity without negatively impacting utility, and sometimes even increase both, the results suggest that some recommender systems may not be operating at the Pareto frontier between diversity and utility. As diversity at the end of the pipeline is upper-bounded by diversity earlier in the pipeline, the diversity-utility Pareto frontier can be improved by ensuring diversity end-to-end throughout the pipeline, hence the importance of earlier diversification at retrieval.
The rest of this paper is organized into five sections. First, in Section 3, we describe the diversification problem, formulate the diversity metrics, and set up the general mathematical framework. Second, in Section 4, we outline our approach for diversification at the ranking stage of a recommender system by presenting the Round-Robin and DPP-based methods for diversifying a list of Pins with utility scores. Third, in Section 5, we present three methods for diversifying at the retrieval layer of a recommender system that alleviate the problem of the lack of diversity at the ranking stage. In Section 6, we specifically focus on challenges and solutions to enable the practical application of these approaches in a large-scale production recommender system. Finally, in Section 7, we present the empirical results of our diversity-aware ranking and retrieval methods in terms of diversity and utility metrics in our search and recommender systems.
In this paper, we detail the methods and results of our research and discuss their implications for the field of recommender systems and the broader tech industry. By addressing the challenge of diversification using practical approaches, it is possible to create more inclusive and equitable products that better cater to diverse communities.

RELATED WORK
Ranking with novelty and diversity has long been of interest in the field of information retrieval (IR). Diversity is usually defined as a goal for rankings to include different subtopics or cover different possible intents of a query [5,6,38]. Many methods define evaluation measures for this task, e.g., precision and recall variants that count the number of unique subtopics retrieved [37], "information nugget" based Discounted Cumulative Gain (DCG) that penalizes retrieval of redundant topics [9], and other generalizations of classical IR metrics to account for diversification [1]. Maximizing these measures is generally NP-hard, but the objective functions usually admit a submodular structure [6]. Another goal is minimizing the probability of abandonment, or in other words, maximizing the probability of finding the intent in the top-k positions [27]. Radlinski et al. [26] distinguish between extrinsic and intrinsic diversity, where the former is needed due to the uncertainty in the information need (multiple meanings of the query) whereas the latter is needed as a part of the information need. In contrast, diversity for fair representation in recommender systems [13,35] is motivated by a different aspect of diversity that is viewed from the perspective of the items being retrieved and ranked, rather than the information need of the user. In this paper, we present the first visual skin tone multi-stage diversification production deployment to improve representation online in large-scale search and recommender systems with hundreds of millions of users and billions of multi-modal Pins.
A wide range of techniques have been leveraged to balance diversity and utility in search and recommender systems [2,5,7,13,32,35]. They range from greedy re-ranking heuristics [13], potentially using priority queues [35], to ranking methods based on pairwise similarity that greedily select results to balance the trade-off between the relevance to the query and the redundancy with respect to previously selected results, such as Maximal Marginal Relevance (MMR) [5], and to multi-objective optimization with probabilistic models such as DPP [7,32]. DPP was found to be more effective at diversifying recommendations than MMR [7] because it accounts for the similarity of all pairs in the union of the selected set and the candidate item, not just the similarity between the candidate item and previously selected items. Solving DPP is NP-hard. However, thanks to its submodular property, a DPP solution can be efficiently approximated using a greedy algorithm [14] that selects the next item such that the incremental determinant is maximized, and it can be accelerated by updating the Cholesky factor of the DPP kernel incrementally [7].

MULTI-STAGE DIVERSIFICATION IN SEARCH AND RECOMMENDER SYSTEMS

Background
Advanced search and recommender systems that operate at a large scale, with hundreds of millions of active users and billions of items, tend to be very complex and have multiple components. These systems leverage machine learning (ML) models trained to optimize certain objectives given inputs like queries, content features, user features, and past interactions between users and items that happened on the platform. The data in these systems can be multi-modal. For instance, an input query can be in the form of text, such as queries typed by users in a search box; it can be visual when users input an image to search; or it can be a multi-modal item that consists of an image or video, a title, a description, and a link to a website. These systems often comprise two major stages: retrieval and ranking [8,21], sometimes followed by additional business logic. Items are retrieved and ranked, then the list is surfaced to the user.
Retrieval: The retrieval stage consists of one or more candidate generators that narrow down the set of candidates from a large corpus of items (in the range of $10^6$ to $10^{10}$) to a much narrower set (in the range of $10^2$ to $10^3$) based on some predicted scores, such as the relevance of the items to the query and the user. To achieve high recall and low latency, these systems often leverage search indices, which contain information about each item in a way that enables efficient retrieval. Token-based indices use tokens, e.g., words, as the basic unit of indexing, and are commonly used for text-based search. Embedding-based indices use continuous, dense representations of items to enable retrieval using techniques like approximate nearest neighbor search. These embeddings may be learned through traditional collaborative filtering techniques such as matrix factorization [19] or more advanced methods such as Graph Neural Networks [34] or user sequence modeling [24]. A typical recommender system employs multiple candidate generators, each satisfying different criteria, and their candidates are aggregated before being passed to the next stage.
Ranking: In the ranking stage, the goal is to find an ordering of the items that maximizes an objective or a combination of objectives. The objectives may include utility metrics, diversity objectives, and additional business goals. While utility metrics are highly dependent on the application, ranking methods are often simplified to point-wise methods where the first part of ranking is item scoring. The utility scores may be generated by one or multiple ML models trained to optimize certain objectives, such as predicting the probability of an item being relevant to the query, or being clicked, saved, purchased, etc. The second part of ranking is called blending, where multiple objectives are combined to generate a ranked list. A common blending approach for multi-objective optimization is a weighted combination, so that different surfaces can tune or learn weights that best align with their intent.
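The weighted-combination blending step can be illustrated with a minimal sketch. The function name `blend_and_rank`, the objective names, and the weight values are hypothetical and chosen only for illustration; they are not taken from the production system described here.

```python
from typing import Dict, List, Tuple


def blend_and_rank(
    item_scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank items by a weighted sum of their per-objective scores."""
    blended = {
        # Objectives without a configured weight contribute nothing.
        item: sum(weights.get(obj, 0.0) * score for obj, score in scores.items())
        for item, scores in item_scores.items()
    }
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)
```

In such a scheme, each surface would supply its own `weights` dictionary, which is one simple way for different surfaces to express different intents over the same item scores.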

Diversity in Recommendations
Diversity Dimension: Diversification aims to ensure that the ranked list of items surfaced by the system is diverse with respect to a diversity dimension of interest. Diversity dimensions may include explicit dimensions such as demographics (e.g., age, gender), geographic or cultural attributes (e.g., country, language), domain-specific dimensions (e.g., skin tone ranges in beauty, cuisine type in food), business-specific dimensions (e.g., merchant sizes), but also other implicit dimensions that may not be expressed directly but can be modeled using latent representations (e.g., embedding, clustering). While this work presents an example of production deployment of skin tone diversification, the proposed techniques are not limited to this single dimension and can support diversification more broadly, including the intersectionality of multiple diversity dimensions. We denote the set of groups under a diversity dimension as $D$, and each individual group is denoted by $d_i$ for $i \in \{1, \dots, |D|\}$.
Diversity Metric: Given a set of queries $Q$, we define the top-$k$ diversity of a ranking system $R$ as the fraction of queries where all groups under our diversity dimension, i.e., all $d_i \in D$, are represented in the top $k$ ranked results, denoted by $R_k(q)$. Formally, $\mathrm{Div@}k(R)$ is defined as
$$\mathrm{Div@}k(R) = \frac{1}{|Q|} \sum_{q \in Q} \prod_{i=1}^{|D|} \mathbb{I}\left[ d_i \in R_k(q) \right] \qquad (1)$$
where $\mathbb{I}$ is the indicator function and $d_i \in R_k(q)$ means that at least one item of group $d_i$ appears in $R_k(q)$. Note that the top-$k$ results $R_k(q)$ are over the items for which the diversity dimension is defined. For instance, in the case of skin tone ranges, a Pin whose image does not include any skin tone would not contribute to visual skin tone diversity. Thus it will not be counted in the top-$k$ and will be skipped in the diversity metric computation.
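The metric definition above can be sketched in a few lines. This is a minimal illustration, assuming each query's results are given as an ordered list of group labels, with `None` marking an item for which the diversity dimension is undefined; the function name `div_at_k` and this input encoding are assumptions made for the example.

```python
from typing import Hashable, Optional, Sequence


def div_at_k(
    rankings: Sequence[Sequence[Optional[Hashable]]],
    groups: Sequence[Hashable],
    k: int,
) -> float:
    """Fraction of queries whose top-k labeled results cover every group."""
    if not rankings:
        return 0.0
    required = set(groups)
    covered = 0
    for labels in rankings:
        # Skip items without a defined label, then keep the first k that remain.
        top_k = [g for g in labels if g is not None][:k]
        if required.issubset(top_k):
            covered += 1
    return covered / len(rankings)
```

Note that unlabeled items are dropped before the cut-off is applied, which mirrors the statement that such items are not counted in the top-k.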
In this work, we choose $\mathrm{Div@}k(R)$ as the primary diversity metric because of its simplicity and intuitiveness as an evaluation metric for the product experience. However, we also compute additional operational metrics to gain a deeper understanding of the effects of diversification on the resulting distribution. One example is the normalized entropy of the distribution with respect to the diversity dimension in the top-$k$ items, where the normalization is done against a target distribution. In the case of a uniform target distribution, the metric is called Shannon Equitability [31]. Some other metrics proposed in prior work also use divergence measures [13].
Multi-stage diversification: Both retrieval and ranking stages directly impact the diversity of the final content surfaced in the application. The diversity metric at the output of the retrieval stage upper-bounds the diversity at the output of ranking. Hence, the retrieval layer needs to generate a sufficiently diverse set of candidates to ensure that the ranking stage has enough items in each group to generate a final diverse ranking set. However, diversity at the retrieval stage is not a sufficient condition to guarantee that a utility-focused ranker will surface a diverse ordering at the top of the ranking, where users are more likely to focus their attention [10] and to interact with items, especially when such items belong to the long tail of the distribution [25]. Thus, the ranker also needs to be diversity-aware.
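For the uniform-target case of the normalized-entropy metric mentioned earlier in this section, a minimal sketch is given below. It assumes the top-k items are provided as a list of group labels and that the normalization divides the Shannon entropy by its maximum, the natural logarithm of the number of groups; the function name is hypothetical, and non-uniform targets are not covered.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def shannon_equitability(labels: Sequence[Hashable], num_groups: int) -> float:
    """Entropy of the group distribution divided by its maximum, ln(num_groups)."""
    if num_groups < 2 or not labels:
        return 0.0
    counts = Counter(labels)
    total = len(labels)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_groups)
```

A value near 1 indicates that the groups are close to evenly represented in the list, while a value of 0 indicates that a single group accounts for all labeled items.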
In our ranking diversification experiments on different production surfaces described in Section 7, we observed gains in diversity metrics with a neutral to positive impact on utility metrics, such as user engagement. This suggests that the systems might not have been operating at the Pareto frontier between diversity and utility, as we could increase diversity without negatively impacting utility, and sometimes even increase both. As the diversity at the end of the pipeline is limited by the diversity earlier in the pipeline, we can also shift the diversity-utility Pareto frontier further by ensuring diversity end-to-end throughout the pipeline, particularly by introducing diversification earlier at retrieval. Lastly, it should be noted that the retrieval diversity metric is itself limited by the diversity of the content corpus. Diversification at retrieval and ranking cannot correct a lack of representation in the item corpus itself; this can be improved by sourcing more diverse content. In Sections 4 and 5, we discuss how we introduced diversification at the ranking and retrieval stages of our recommender systems, respectively.
Triggering logic: A real-world system may receive search and recommendation requests that span a wide range of categories, such as fashion, beauty, home decor, food, travel, etc. The diversity dimension of interest depends on the application; for example, skin tone range diversification is applicable to fashion and beauty, but not to home decor. Thus, upon receiving a request, the system needs to determine whether to trigger diversification according to the dimension of interest. The triggering logic needs to account for the diversity dimension, the application, the production surface, and the local context, such as country and language, and can be based on heuristics or ML models, such as models that predict the category of a query.
Diversification and personalization need not be a zero-sum game; they may jointly contribute to an improved user experience.Depending on the application or surface, diversification may be expected in specific dimensions, while personalization may be desirable in others.For instance, for a query related to fashion in Search or Related Products surfaces, some users may expect to see personalized results in terms of the fashion characteristics of the Pins (e.g., clothing style, fabric, pattern) that relate to those of fashion Pins they have previously interacted with, and yet expect to see a variety of skin tone ranges represented in the human models wearing the fashion Pins in the result images.In this case, skin tone diversification allows users to explore fashion images representing a wide range of skin tones rather than a narrower set of results in a single visible skin tone range, creating a sense of inclusion and belonging.While some users may expect skin tone diversity in Related Products or Search surfaces, they may also expect their Homefeed to be attuned to their interests as reflected by the Pins they chose to interact with, and thus may not expect the same level of exploration and of skin tone diversity in their Homefeed recommendations as in other surfaces.In this work, diversification is triggered for categories and surfaces selected based on user feedback, user research, and data analysis on skin tone related Search query modifiers that highlight a need for diversity in similar requests.The triggers in this work include beauty and fashion categories in Search, Related Products, and New User Homefeed (the initial cold-start Homefeed surfaced to a new user).Longer term, one can envision a more advanced system that gives users control over the level of diversification of their results in various contexts through explicit user guidance or learned preferences.

DIVERSIFICATION AT THE RANKING STAGE
We start with a focus on the ranking stage to achieve diversification of results since it is the last stage of a recommender system and it has a direct impact on the metrics we aim to enhance. A basic approach to diversify at the ranking stage would be to boost or discount scores for content that is underrepresented according to a diversity dimension. While boosters are simple to implement, they tend to add to the technical debt if tuned only at the time they are introduced and if multiple boosters are not optimized together. Instead, we leverage a diversity-aware ranking stage that takes as input a list of items with utility scores and their diversity dimensions, and produces a ranking according to a combination of diversity and utility objectives. In this section, we describe two algorithms to achieve diversification through diversity-aware ranking: Round-Robin (Section 4.1) and Determinantal Point Process (Section 4.2).

Round-Robin (RR)
The first approach we used is a class of simple greedy re-rankers that take as input a list of items ranked by their utility scores, together with each item's diversity dimension, and produce a diversified ranking. Given an ordered list of items $p_1, \dots, p_n$, we construct $|D|$ ordered sub-lists, one for each group under the diversity dimension, containing the items that have a utility score above the threshold. Then, we rebuild a ranked list by greedily selecting the top item of each sub-list one by one. All the items that do not belong to a sub-list, for instance, because they do not have a diversity dimension defined or have utility scores below the threshold, are ranked at the same position as in the original list. Figure 3(a) shows an example where RR is used to diversify a ranked list of Pins with respect to four groups $\{d_1, d_2, d_3, d_4\}$. The list is re-ordered such that the distribution of skin tone ranges is more uniform in the top positions, where users often pay the most attention. The first Pin stays in its original position, and RR cycles through the skin tone ranges, picking the highest-ranked Pin from each sub-list one at a time. A modification to this algorithm could add randomization within a window of size $|D|$ (i.e., 4 in the example) so that, in addition to diversification, it also helps preserve the user experience by ensuring that there is no fixed cyclic order.
In practical scenarios, the number of items in each sub-list will likely not be even, and RR may exhaust some sub-lists earlier than others. A few options can be used to handle such cases, such as simply skipping a sub-list when no more items from it are available (e.g., in Figure 3(a), when RR attempts to select the 10th Pin but there are no Pins left in the $d_3$ sub-list). Another approach could be to merge some of the remaining sub-lists for a more even distribution between lighter and darker skin tones, i.e., merging the sub-lists for $d_1$ and $d_2$, and the ones for $d_3$ and $d_4$, and then continuing RR by alternating elements between the two combined sub-lists.
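The Round-Robin procedure, with the "skip exhausted sub-lists" option, can be sketched as follows. This is an illustrative sketch rather than the production re-ranker: the tuple layout of `Item`, the function name, and the choice to cycle through groups in order of their first appearance in the input list are all assumptions made for the example.

```python
from collections import deque
from typing import Deque, Dict, Hashable, List, Optional, Tuple

Item = Tuple[str, float, Optional[Hashable]]  # (item_id, utility, group or None)


def round_robin_rerank(ranked: List[Item], threshold: float) -> List[Item]:
    """Greedy round-robin re-ranking of a utility-sorted list.

    Items with no group, or with utility below `threshold`, keep their original
    positions; the remaining slots are refilled by cycling over the groups.
    """
    sublists: Dict[Hashable, Deque[Item]] = {}
    eligible_slots: List[int] = []
    for pos, item in enumerate(ranked):
        _, utility, group = item
        if group is not None and utility >= threshold:
            sublists.setdefault(group, deque()).append(item)
            eligible_slots.append(pos)

    result = list(ranked)
    # Assumption: groups are cycled in order of first appearance, so the top
    # eligible item keeps its position.
    order = deque(sublists.keys())
    for pos in eligible_slots:
        # Permanently drop sub-lists that have been exhausted.
        while not sublists[order[0]]:
            order.popleft()
        group = order[0]
        result[pos] = sublists[group].popleft()
        order.rotate(-1)
    return result
```

Because the number of eligible slots equals the total number of items across sub-lists, the inner loop always finds a non-empty sub-list before the cycle runs out.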
RR is a simple, intuitive, and efficient approach to diversification. However, it does not balance diversity and utility, and it does not easily generalize to multiple diversity dimensions or multiple utility score thresholds. To avoid these limitations, in the next section, we describe a multi-objective optimization framework that can balance various utility functions and diversity.

Determinantal Point Process (DPP)
A Determinantal Point Process [20,22] is a machine-learnable probabilistic model used in physics for repulsion modeling and more recently in recommender systems [32]. Applications of DPP range from producing diverse samples of a large database [32] to characterizing various observed phenomena like the spatial distribution of fermions in optical beams, the setting in which DPPs were originally introduced [22]. DPPs are particularly useful in ML for tasks such as subset selection, where the goal is to select a subset of points from a larger set that is diverse or representative in some sense. In this section, we give a brief overview of DPP and how it can be applied for diversity-aware re-ranking.
The basic idea behind a DPP is to model the probability of selecting a set of items $Y$ from a set of size $N$ as the determinant of a kernel matrix $L_Y$, where $L$ is a kernel function that encodes the utility of the items and the similarity between pairs of items, and $L_Y$ is the kernel matrix of the subset $Y$. The determinant of $L_Y$ can be thought of as a measure of how spread out the points in $Y$ are in the feature space defined by the kernel function $L$. The diagonal entry $L_{ii}$ represents the utility of the $i$-th item; in our case, this is the score with which the items were originally ranked. The off-diagonal entry $L_{ij}$, in contrast, represents the similarity between items $i$ and $j$, which in our case depends on the diversity dimension, e.g., the skin tone range in the Pin image. The kernel is chosen such that $L$ is a positive semidefinite (PSD) kernel matrix and has a Cholesky decomposition, and hence $L$ can be written in terms of three ingredients: a diagonal matrix $U$ that encodes the utility $u_i$ of each item, a parameter $\theta$ that governs the trade-off between utility and diversity, and a feature vector $\Phi_i$ for the $i$-th item. For our use case, $\Phi\Phi^{\top}$ is the symmetric similarity matrix, which we henceforth denote by $S$. In terms of set selection in DPP, the probability of selecting a subset $Y$ is proportional to the determinant of $L_Y$.
The log determinant is a weighted sum between a utility term and a diversity term balanced by the parameter $\theta$. Finally, given a value of $\theta$ and kernel matrix $L$, the goal is to find a subset $Y$ that maximizes the determinant of $L_Y$:
$$Y^{*} = \arg\max_{Y \subseteq \{1, \dots, N\},\ |Y| = k} \det(L_Y)$$
The use of a determinant means that, based on the choice of kernel matrix, $Y^{*}$ would include items with high utility scores while avoiding ones that are similar to others in the subset. Finding such a subset of a given size $k$ is an NP-hard problem. However, because of its submodular property, it can be efficiently approximated using a greedy algorithm [17]. In the greedy solution, we start with $Y_0 = \emptyset$ (the empty set), then iteratively add one item at a time to the selected set using the following update rule:
$$Y_j = Y_{j-1} \cup \Big\{ \arg\max_{i \in \{1, \dots, N\} \setminus Y_{j-1}} \det\big(L_{Y^{w}_{j-1} \cup \{i\}}\big) \Big\}$$
where $w$ is a repulsion window size used to only consider the last $w$ selected items (i.e., $Y^{w}_{j-1}$ in the equation) when computing the argmax. The sliding window size $w$ is generally used in applications of DPP in recommender systems to make the optimization more efficient, as the tolerance for similar items may increase as their distance in the ranking increases.
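The greedy selection with a repulsion window can be sketched as below. This is a minimal, unoptimized illustration, not the production algorithm: it recomputes small determinants from scratch instead of using incremental Cholesky updates, and it assumes a particular kernel parameterization, with entry (i, j) equal to exp(theta * u_i) * S_ij * exp(theta * u_j), which is a common choice in the DPP literature but is not confirmed by the text above. All function and variable names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def det(matrix: Matrix) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def build_kernel(utilities: Sequence[float], similarity: Matrix, theta: float) -> Matrix:
    """Assumed parameterization: L_ij = exp(theta*u_i) * S_ij * exp(theta*u_j)."""
    scale = [math.exp(theta * u) for u in utilities]
    n = len(utilities)
    return [[scale[i] * similarity[i][j] * scale[j] for j in range(n)] for i in range(n)]


def greedy_dpp(kernel: Matrix, k: int, window: int) -> List[int]:
    """Pick up to k items; each pick maximizes the determinant of the kernel
    restricted to the last `window` selected items plus the candidate."""
    selected: List[int] = []
    remaining = set(range(len(kernel)))
    while remaining and len(selected) < k:
        recent = selected[-window:] if window > 0 else []

        def score(i: int) -> float:
            idx = recent + [i]
            return det([[kernel[r][c] for c in idx] for r in idx])

        # Sorting makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a similarity matrix that assigns a high value to pairs of items from the same group, a small `theta` tends to interleave groups, while a large `theta` tends to reproduce the utility ordering.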
Figure 3(b) shows a hypothetical example of how DPP would re-rank as compared to RR given an appropriate value of the parameter $\theta$. Note that setting $\theta$ to a very low value would make DPP focus primarily on diversity and hence make the ranking equivalent to RR. On the contrary, setting $\theta$ to a very high value would make DPP focus on utility, and hence behave like utility-based ranking. During the implementation, the kernel matrix $L$ can be learned using a deep neural network (e.g., in [32]), and $\theta$ can be tuned, e.g., through a grid search with other hyperparameters using offline replay or through A/B experiments.
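The effect of the trade-off parameter at its extremes can be made concrete with a short worked derivation. The derivation below assumes one common parameterization of the kernel, in which the utility $u_i$ of item $i$ enters through a diagonal scaling $e^{\theta u_i}$ applied on both sides of the similarity matrix $S$; this form is an assumption for illustration and may differ from the exact kernel used in this work. Here $Y$ is the selected subset and $S_Y$ is the similarity matrix restricted to $Y$.

```latex
% Assumed parameterization (not confirmed by the surrounding text):
%   L = diag(e^{\theta u_1}, ..., e^{\theta u_N}) S diag(e^{\theta u_1}, ..., e^{\theta u_N})
\begin{align*}
\det(L_Y) &= \Big(\prod_{i \in Y} e^{2\theta u_i}\Big)\,\det(S_Y), \\
\log\det(L_Y) &= \underbrace{2\theta \sum_{i \in Y} u_i}_{\text{utility term}}
              + \underbrace{\log\det(S_Y)}_{\text{diversity term}} .
\end{align*}
```

Under this assumption, as $\theta$ approaches zero the utility term vanishes and only the diversity term drives the selection, whereas for large $\theta$ the utility term dominates and the selection approaches a pure utility ordering.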
In comparison to RR, DPP takes into account both the utility scores and similarity and is able to balance their trade-off. For multiple diversity dimensions, DPP can be operationalized with a joint similarity matrix to account for the intersectionality between different dimensions. This can be further extended to a function where, for each item, all diversity dimensions (skin tone, item categories, etc.) are provided and the return is a combined value that represents the joint dimensions. A simpler option is to add a diversity term in the weighted sum shown in equation 4 for each dimension. In the case of a large number of diversity dimensions, dimensionality reduction techniques can be used.
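One possible way to build a joint similarity matrix from several diversity dimensions is the element-wise product of per-dimension similarity matrices, sketched below. This construction is an assumption chosen for illustration, not necessarily the one used in this work; the function names and the `same_group` similarity value are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def group_similarity(labels: Sequence[str], same_group: float) -> Matrix:
    """1.0 on the diagonal, `same_group` for distinct items sharing a label, else 0.0."""
    n = len(labels)
    return [
        [1.0 if i == j else (same_group if labels[i] == labels[j] else 0.0) for j in range(n)]
        for i in range(n)
    ]


def joint_similarity(matrices: Sequence[Matrix]) -> Matrix:
    """Element-wise product of per-dimension similarity matrices."""
    n = len(matrices[0])
    joint = [[1.0] * n for _ in range(n)]
    for m in matrices:
        for i in range(n):
            for j in range(n):
                joint[i][j] *= m[i][j]
    return joint
```

With this choice, two items repel each other only when they match on every dimension, so each intersection of groups behaves like its own group. The element-wise product of positive semidefinite matrices is itself positive semidefinite (Schur product theorem), so the result remains a valid similarity matrix for a DPP kernel.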

DIVERSIFICATION AT RETRIEVAL STAGE
The ability to diversify in the ranking stage is often limited by the availability of candidates from all groups in the retrieved candidate set. All the techniques proposed in Section 4 are limited to the set of candidates retrieved by the different candidate sources in the first stage, and hence, for specific queries, it may not be possible to diversify the ranking at all in the ranking stage. To tackle this limitation, we propose a set of techniques that increase the diversity of candidates at the retrieval layer to enhance the ability of the re-rankers to diversify at a later stage. This section presents three techniques: Overfetch-and-Rerank, Strong-OR logic, and Bucketized-ANN Retrieval. While Overfetch-and-Rerank is generally applicable for any retrieval stack, it has certain limitations that can make it impractical for many real-world diversification scenarios. Hence, we propose Strong-OR and Bucketized-ANN Retrieval to tackle these limitations for two specific use cases of retrieval.

Overfetch-and-Rerank at Retrieval
One of the simplest ways to increase the diversity of the candidate set at retrieval is to fetch a candidate set of a larger size (Overfetch). In this case, we may define the desired diversity criterion as the property that the candidate set contains a minimum threshold number of candidates from each group. For example, if we want to retrieve a candidate set of size $K$ through K-nearest neighbor search in an embedding space, we could expand the neighborhood size to $K'$ nearest neighbors ($K' > K$) such that the resultant candidate set has at least the minimum threshold number of candidates from each group. As the next step, to only pass $K$ candidates to the ranking stage, we can perform a Round-Robin selection of a subset of size $K$ from this over-fetched set of size $K'$, for example, selecting one candidate at a time from each skin tone range until $K$ candidates are selected (Rerank). However, as expected, the expanded size of the neighborhood $K'$ to be explored is limited by the increase in latency that the retrieval stage can afford. Hence, we choose a hyperparameter $K_{\max}$ such that $K'$ never exceeds $K_{\max}$. The overfetching will stop when either the minimum threshold in each group is met or when $K' = K_{\max}$. In Appendix A.3, we discuss the choice of this parameter.
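The Overfetch-and-Rerank loop can be sketched as follows. This is an illustration under stated assumptions: the `knn` callable stands in for a nearest-neighbor lookup that returns the closest candidates in distance order, the doubling schedule for growing the neighborhood is an arbitrary choice, and candidates whose group is not listed in `groups` are dropped at the selection step. None of these details are specified in the text above.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Candidate = Tuple[str, Hashable]  # (item_id, group)


def overfetch_and_rerank(
    knn: Callable[[int], List[Candidate]],
    groups: Sequence[Hashable],
    k: int,
    min_per_group: int,
    k_max: int,
) -> List[Candidate]:
    """Grow the neighborhood until every group has `min_per_group` candidates
    (or `k_max` is reached), then round-robin select `k` of them."""
    if k <= 0 or not groups:
        return []
    k_prime = k
    while True:
        fetched = knn(k_prime)
        counts = {g: 0 for g in groups}
        for _, g in fetched:
            if g in counts:
                counts[g] += 1
        if all(c >= min_per_group for c in counts.values()) or k_prime >= k_max:
            break
        k_prime = min(2 * k_prime, k_max)  # doubling schedule is an assumption

    # Per-group lists preserve the distance order returned by `knn`.
    buckets = {g: [c for c in fetched if c[1] == g] for g in groups}
    selected: List[Candidate] = []
    cursor = 0
    while len(selected) < k and any(buckets.values()):
        g = groups[cursor % len(groups)]
        if buckets[g]:
            selected.append(buckets[g].pop(0))
        cursor += 1
    return selected
```

Each widening step issues another lookup, which is where the latency cost noted above comes from; the `k_max` cap bounds the number of such steps.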

Bucketized-ANN Retrieval
One of the most commonly used retrieval methods in an embedding-based search index is the approximate nearest neighbor (ANN) search. For embedding-based retrieval, the users, items, and queries are all embedded into the same space, and for applications like search and recommender systems, the system wants to retrieve the items that are closest to the query or user embedding in terms of a chosen distance metric (e.g., cosine distance). Since computing pairwise distances for all query-item pairs is prohibitive in a practical recommender system, this nearest neighbor search is often performed using approximation algorithms that rely on efficient data structures, e.g., k-Dimensional Tree [3], Locality-sensitive Hashing (LSH) [16], and Hierarchical Navigable Small Worlds (HNSW) [23]. In this work, we will refer to these methods as ANN search methods. Most of these approximation algorithms partition the embedding space into multiple regions and perform a search in it. In larger-scale recommender systems where this search is over billions of items, these ANN methods are implemented as a distributed system, like Yianilos [33] for example.
For this work, we follow the general architecture of an ANN search system in which a root node sends a request to a few leaf nodes, each of which in turn queries several segments that perform a nearest neighbor search in different subregions of the embedding space (as shown in Figure 4). Suppose there are L leaves and S segments per leaf. To find the K nearest neighbors for a given query embedding, each segment returns K potential nearest neighbor candidates to its leaf, which aggregates these S × K candidates and retains only the top K before passing them along to the root. The root is then responsible for choosing the top K candidates from the L × K candidates it receives, whose exact distances are computed during the process. Note that the size of the graph, in this case, is L × S × K.
For the Bucketized-ANN Retrieval approach, we modify the aggregation step (at the leaf and the root level) to also aggregate the top K_d candidates from each group d ∈ D into buckets corresponding to each of the groups under the diversity dimension (in addition to aggregating the top-K candidates in the overall pool). In other words, each leaf now aggregates a set of K candidates and |D| buckets with (at most) K_d candidates each. This helps preserve the top candidates belonging to each group (whose distances are already computed) from being dropped during the aggregation steps, without incurring the high cost of expanding the entire aggregation graph as in the Overfetch-and-Rerank approach.
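The modified aggregation step can be sketched as below. This is an illustrative sketch under stated assumptions: candidates are `(distance, item_id, group)` tuples, the same `aggregate` function is reused at the leaf and root levels, and the names `aggregate`, `flatten`, and `bucket_sizes` are hypothetical rather than taken from the production system.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Candidate = Tuple[float, str, Optional[str]]  # (distance, item_id, group)
Result = Tuple[List[Candidate], Dict[str, List[Candidate]]]


def aggregate(children: Sequence[Sequence[Candidate]], k: int,
              bucket_sizes: Dict[str, int]) -> Result:
    """Merge child results, keeping the overall top-k plus a bucket with the
    closest candidates of each group."""
    pool: Dict[str, Candidate] = {}
    for child in children:
        for cand in child:
            pool[cand[1]] = cand  # de-duplicate by item id
    ranked = sorted(pool.values())  # ascending distance
    top_k = ranked[:k]
    buckets = {
        group: [c for c in ranked if c[2] == group][:size]
        for group, size in bucket_sizes.items()
    }
    return top_k, buckets


def flatten(result: Result) -> List[Candidate]:
    """Everything a node passes upward: its top-k and all bucket contents."""
    top_k, buckets = result
    return top_k + [c for bucket in buckets.values() for c in bucket]


# Toy example: two leaves with two segments each; group "B" is rare and distant.
sizes = {"A": 1, "B": 1}
leaf1 = aggregate([[(0.10, "p1", "A"), (0.20, "p2", "A")],
                   [(0.15, "p3", "A"), (0.50, "p4", "B")]], k=2, bucket_sizes=sizes)
leaf2 = aggregate([[(0.12, "p5", "A"), (0.30, "p6", "A")],
                   [(0.25, "p7", "A"), (0.60, "p8", "B")]], k=2, bucket_sizes=sizes)
top, buckets = aggregate([flatten(leaf1), flatten(leaf2)], k=2, bucket_sizes=sizes)
```

Here the plain top-2 contains only group "A" items, while the "B" bucket still carries the closest "B" candidate up to the root, even though it would have been dropped at the leaf under ordinary top-K aggregation.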

Strong-OR Retrieval
In Search, one of the critical components of the retrieval stage is query understanding, where the text query is converted to a structured query (s-query). It allows the retrieval system to specify relationships between different query terms using logical operators (such as AND, OR, XOR) to connect tokens in order to narrow down or broaden the set of results, e.g., a text query may be parsed to dress AND (red OR black). Since we would like to broaden the set of results to contain candidates from underrepresented groups under the diversity dimension, we use a specialized logical operator for search called Strong-OR [11].
On a search index, Strong-OR operates similarly to the OR operator, except that it prioritizes a candidate set that satisfies multiple criteria simultaneously. More concretely, as with an OR operator, we can specify that the retrieved candidates must match either of two criteria, but in addition, we can also specify what (minimum) percentage of candidates must match each of the respective criteria. In an s-query, Strong-OR expresses the disjunction semantics across all children of the parsed query (e.g., Figure 5), and each child node in the s-query can optionally be required to match at least a specified percentage of all candidates. If there are insufficient candidates to fulfill the criteria specified, it will match as many as possible.
Given an early-stop parameter that limits the maximum number of candidates scanned (rather than scanning the entire corpus) and a criterion denoted by Γ, Strong-OR fetches the requested number of candidates that match the query (e.g., in Figure 5, Γ requires term1 to match at least a minimum percentage of candidates, 5 candidates are fetched, and at most 10 are scanned). As we scan the list from left to right, Strong-OR acts as a regular OR at first, i.e., if the result naturally satisfies Γ, the set is returned. Otherwise, during the scanning, Strong-OR promotes Γ to a required criterion; for example, in Figure 5, after position 3 in the list, Γ becomes a necessary condition, and hence candidates 6 and 9 (in red) are retrieved instead of 4 and 5. Since Strong-OR is applied in the retrieval stage during the query understanding phase, we can also add the candidates that satisfy Γ and would not have been retrieved otherwise into dedicated buckets (as in Bucketized-ANN Retrieval) to ensure that they are not dropped in the later stages of retrieval.
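The scanning behavior described above can be illustrated with a simplified sketch. It is not the search engine's actual operator: it assumes the input list already contains only candidates matching the OR query, expresses Γ as a minimum count of matching candidates rather than a percentage, and uses hypothetical names (`strong_or_scan`, `min_gamma`). The minimum of two matches in the example is an assumed value chosen so that the toy run mirrors the behavior described for Figure 5.

```python
from typing import Callable, List, Sequence


def strong_or_scan(candidates: Sequence[int],
                   matches_gamma: Callable[[int], bool],
                   k: int, min_gamma: int, early_stop: int) -> List[int]:
    """Fetch k candidates, scanning at most `early_stop` of them.
    Behaves like a plain OR until the remaining slots are all needed to
    reach `min_gamma` matches of the criterion; from then on the criterion
    is required. Falls back to skipped candidates if it cannot be met."""
    selected: List[int] = []
    skipped: List[int] = []
    gamma_count = 0
    for cand in candidates[:early_stop]:
        if len(selected) == k:
            break
        is_gamma = matches_gamma(cand)
        still_needed = max(0, min_gamma - gamma_count)
        slots_left = k - len(selected)
        if is_gamma or slots_left > still_needed:
            selected.append(cand)
            gamma_count += int(is_gamma)
        else:
            skipped.append(cand)  # criterion is now required; hold this one back
    # Best effort: if the criterion could not be fully met, fill the
    # remaining slots with the skipped candidates in their original order.
    for cand in skipped:
        if len(selected) == k:
            break
        selected.append(cand)
    return selected


# Candidates 1..10 in scan order; only 6 and 9 satisfy the criterion.
satisfies = {6, 9}
result = strong_or_scan(list(range(1, 11)), lambda c: c in satisfies,
                        k=5, min_gamma=2, early_stop=10)
# With a tighter early stop, candidate 9 is never reached and 4 fills the gap.
fallback = strong_or_scan(list(range(1, 11)), lambda c: c in satisfies,
                          k=5, min_gamma=2, early_stop=8)
```

In the first call the criterion becomes required after three candidates have been accepted, so 6 and 9 are taken in place of 4 and 5.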

PRODUCTIONIZATION CONSIDERATIONS FOR A LARGE SCALE RECOMMENDER SYSTEM
We implemented and deployed diversification approaches in a large-scale recommender system with over 460 million monthly users [15], who use Pinterest to find visual inspiration for their interests and simultaneously find products that fit their needs. We chose three different surfaces on the platform based on user feedback to diversify specific experiences, namely Search, New User Homefeed, and Related Products. On Search, users enter a text query to find Pins that match their intent. The Related Products surface recommends a list of Pins similar to the Pin selected by the user (query Pin). All Pins recommended on this surface are products available for purchase by users. Lastly, New User Homefeed is the initial Homefeed that a new user sees after signing up. These surfaces were consciously chosen keeping in mind user research and data analysis of user needs as mentioned in Section 3.2. In the rest of this section, we present multiple practical considerations to deploy our diversification approaches in a real-world production system.

Indexing
Deploying diversification algorithms at retrieval requires indexing the diversity dimension of Pins, e.g., the Pin skin tone range, in both embedding-based and token-based indices. For each Pin, the skin tone range (if applicable) is computed offline using a computer vision model. An offline batch workflow periodically reads the skin tone predictions generated for each Pin from a store and adds them to the indexing pipelines of each surface for fast retrieval. The indexed diversity dimension can be passed along with the candidate Pins to the ranking stage for ranking diversification. Alternatively, the ranking stage can read the Pin diversity dimension from stores or a caching service. Once the Pin diversity dimensions are available in the serving infrastructure, we implement various diversification algorithms at retrieval and ranking stages and run online experiments to assess their performance in different geographic markets.

Latency and Scaling Considerations
One of the main advantages of using RR is its implementation simplicity as a post-ranking step, even for a complex recommender system, and its minimal impact on latency due to the linear time complexity. However, because of the use of sub-lists for each group, it is hard to scale when new diversity dimensions need to be incorporated. For example, if we desire to diversify for skin tone as well as the category of the Pins (e.g., home decor, fashion, beauty), the number of possible combinations may become impractical for RR. A possible solution to this problem is to use priority queues to iterate over these multiple dimensions. Another downside of RR is that it does not clearly balance the trade-off between diversity and utility scores. We also need to define how the diversification algorithm handles Pins that do not belong to a specific group in the diversity dimension, for example, images that do not contain skin do not have a skin tone. These cases can be handled by either leaving them at their original position and only diversifying the Pins where a skin tone is detected, or by randomly sampling and assigning a diversity dimension to those Pins and having them be part of RR. Lastly, it may be beneficial to add a threshold of how deep we consider Pins for RR, either by position or the utility score. Limiting the set of Pins we perform RR over helps ensure a lower impact on the overall latency. In addition, it also takes into consideration that Pins that were originally ranked lower may be less relevant to users. That being said, it is crucial to assess the fairness of the ranking models before thresholds are used, as Pins from certain diversity dimensions may be disproportionately ranked higher or lower by some ML models.
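A minimal sketch of RR as a post-ranking step with a score threshold is given below. It adopts the first of the two options discussed above (Pins without a skin tone keep their original position) and appends Pins below the threshold in their original order; the function name `round_robin_rerank` and the tuple layout are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Sequence, Tuple

Pin = Tuple[str, float, Optional[str]]  # (pin_id, utility score, skin tone range)


def round_robin_rerank(ranked: Sequence[Pin], threshold: float) -> List[Pin]:
    """Re-order the Pins scoring at least `threshold` so that skin tone ranges
    alternate; `ranked` is assumed to be sorted by descending utility score."""
    head = [p for p in ranked if p[1] >= threshold]
    tail = [p for p in ranked if p[1] < threshold]  # kept in utility order

    # One FIFO sub-list per skin tone range, in order of first appearance.
    queues: Dict[str, Deque[Pin]] = {}
    for pin in head:
        if pin[2] is not None:
            queues.setdefault(pin[2], deque()).append(pin)

    interleaved: List[Pin] = []
    while any(queues.values()):
        for group in queues:
            if queues[group]:
                interleaved.append(queues[group].popleft())

    # Pins without a skin tone stay where they were; the other slots are
    # filled with the interleaved sequence.
    fill = iter(interleaved)
    reranked_head = [p if p[2] is None else next(fill) for p in head]
    return reranked_head + tail


ranking = [("a", 0.9, "light"), ("b", 0.8, "light"), ("c", 0.7, None),
           ("d", 0.6, "dark"), ("e", 0.5, "light"), ("f", 0.2, "dark")]
reranked_ids = [p[0] for p in round_robin_rerank(ranking, threshold=0.4)]
```

Each Pin is visited a constant number of times, which is consistent with the linear time complexity noted above.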
For DPP, the similarity matrix is computed at serving time for the list of Pins in the ranking. While the greedy iteration computes the diversity term at each step, the utility scores can be cached during the DPP iterations. Given the time complexity of DPP [7], we apply a few techniques to reduce the impact on latency, which can be optimized and evaluated through offline replay, shadow testing, or A/B experiments for each surface:
• Tuning the batch size and the window size: Given a list of N utility-ranked Pins, instead of re-ranking the entire list of size N with DPP, we can diversify the ranking of Pins only up to a certain position B (the batch size) or above a certain score threshold, over sliding windows of size w < N. For example, one may only consider diversifying up to the median (or a percentile above the median) scrolling length in the surface.
• Tuning the depth size: While diversifying a batch of size B, we can seek diverse Pins all the way to the depth size B′, where B ≤ B′ ≤ N, to generate B diversified results. Both the batch size and the depth size reduce the computation required to re-rank, but the increase in diversity in the top-k could be limited by the availability of diverse content in the explored depth.
• Batch parallelization: Building on the concept of batch size, for some surfaces where users scroll deeper, we can also diversify multiple batches of size B. This way the set of Pins is diversified throughout the ranking while limiting the initial loading time for the first set of Pins.
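The batch, depth, and window parameters can be illustrated with a naive greedy DPP re-ranker. This is a sketch under stated assumptions rather than the deployed algorithm: the kernel form (quality `exp(alpha * score)` combined with a fixed same-group / different-group similarity) is chosen here for simplicity, the log-determinant is recomputed from scratch at every step instead of using the fast incremental updates of [7], and all names (`greedy_dpp_rerank`, `alpha`, `same`, `diff`) are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def log_det(matrix: List[List[float]]) -> float:
    """Log-determinant of a small symmetric positive-definite matrix (Cholesky)."""
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(chol[i][t] * chol[j][t] for t in range(j))
            if i == j:
                if s <= 0.0:
                    return -math.inf
                chol[i][j] = math.sqrt(s)
                total += 2.0 * math.log(chol[i][j])
            else:
                chol[i][j] = s / chol[j][j]
    return total


def greedy_dpp_rerank(scores: Sequence[float], groups: Sequence[str],
                      alpha: float = 1.0, window: int = 2,
                      batch: Optional[int] = None, depth: Optional[int] = None,
                      same: float = 0.9, diff: float = 0.1) -> List[int]:
    """Return a new ordering of indices: the first `batch` positions are chosen
    greedily from the top `depth` items, the rest keep their utility order."""
    n = len(scores)
    batch = n if batch is None else min(batch, n)
    depth = n if depth is None else min(max(depth, batch), n)
    quality = [math.exp(alpha * s) for s in scores]  # utility term, computed once

    def kernel(i: int, j: int) -> float:
        if i == j:
            sim = 1.0
        else:
            sim = same if groups[i] == groups[j] else diff
        return quality[i] * sim * quality[j]

    selected: List[int] = []
    remaining = list(range(depth))
    while len(selected) < batch and remaining:
        # Only the last (window - 1) picks influence the next choice.
        recent = selected[-(window - 1):] if window > 1 else []
        best, best_val = remaining[0], -math.inf
        for i in remaining:
            idx = recent + [i]
            val = log_det([[kernel(a, b) for b in idx] for a in idx])
            if val > best_val:
                best, best_val = i, val
        selected.append(best)
        remaining.remove(best)
    chosen = set(selected)
    return selected + [i for i in range(n) if i not in chosen]


scores = [0.9, 0.8, 0.7, 0.6]
tones = ["a", "a", "a", "b"]
balanced = greedy_dpp_rerank(scores, tones, alpha=1.0, window=2, batch=2)
utility_heavy = greedy_dpp_rerank(scores, tones, alpha=10.0, window=2, batch=2)
```

With a moderate `alpha` the lower-scored Pin from the other group is promoted to the second position, while a large `alpha` lets utility dominate and leaves the original order unchanged.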

Qualitative evaluations
To evaluate the diversification of results using skin tone, we collected qualitative feedback from a diverse set of internal participants for every iteration. We presented them with a side-by-side comparison of results before and after diversification for different conditions in our A/B experiments and asked them to rate the results based on diversity and relevance. We also collected relevance evaluations through professional data labeling, where raters evaluated the relevance of results for a sample of queries. Additionally, to account for the local context in international markets, we collaborated closely with the internationalization team for a qualitative assessment of diversification and its results in various markets. These inputs were extremely important in making the final decision to launch our approaches in international markets.

RESULTS IN PRODUCTION
To study the impact of diversification on business metrics, including user engagement, and the impact on the diversity metric, we ran several A/B experiments on three surfaces on the Pinterest app: Search (for fashion and beauty-related queries), Related Products (for fashion Pin recommendations), and New User Homefeed (fashion and beauty category Pins).
There are several nuances that must be taken into consideration when measuring the success and implications of these approaches in search and recommender systems. First, appropriate metrics and guardrails must be set in place before performing diversification. Second, while some of the learnings are transferable between surfaces, each surface presents unique challenges and may differ drastically from past use cases. The differences between surfaces include, but are not limited to, active users, Pin corpora, business metrics, and surface goals. Because of these factors, comparing the change in the diversity metric between surfaces is not meaningful, as even the data used in each surface is often different and sometimes disjoint. Nonetheless, we often observed positive gains in diversity metrics coupled with a neutral or positive impact on guardrail business metrics for all the techniques described in this paper. It is also worth noting that not all surfaces require the same types of interventions, so the techniques described in earlier sections were applied to surfaces when appropriate.
The metrics reported in this paper are the result of several A/B experiments we ran in production for at least 3 weeks in the US as well as international markets. The number of users varies per surface: Search and Related Products both had a few million users per experiment group, while New User Homefeed had hundreds of thousands of users per group. In the rest of this section, we give a brief overview of the impact of these techniques on user engagement metrics and the diversity metric Div@k (we provide more details on the choice of k in Appendix A.4). We report the impact on these metrics as the percentage difference relative to control. Any impact on engagement metrics reported below was statistically significant with p-value < 0.05.

Search
In order to diversify search results with respect to skin tones, we first adopted RR with a score threshold for queries in the beauty and fashion categories in an A/B experiment that ran for five weeks in October 2020. This approach led to a 250% increase in the diversity metric and had a positive impact on engagement. We further iterated on the ranking stage by replacing RR with DPP in an experiment that ran for three weeks in April 2021, which resulted in a minor impact on the diversity metric while improving engagement and user growth metrics. The DPP approach was also launched in some international markets where we saw a similar trend in metrics, with gains in the diversity metric ranging from 200% to 400% for different countries, with respect to the non-diversified baseline. We also observed an improvement in daily active users (number of users who made at least one request to Pinterest from any device), weekly active users (derived from daily active users by looking at action type for at least 7 days), and overall time spent on the platform.
To diversify in the retrieval stage, we deployed the Strong-OR logic to improve the diversity of the Pins retrieved. The experiment was run for a period of four weeks in August 2022. Adding this retrieval logic along with existing DPP at ranking resulted in an additional 14% increase in the diversity metric as well as a positive impact on search engagement. Figure 6 shows a visual comparison of search results before and after diversification.

Related Products
In Related Products, after assessing the various cases and Pin categories where it would be appropriate to start deploying diversification, we introduced diversification in the ranking and retrieval stages for fashion and wedding-related Pins. For ranking stage diversification, we initially ran an A/B experiment with RR to diversify results post-ranking for the treatment group. This experiment was run for a period of ten weeks between September and November 2020. We observed a 270% improvement in the diversity metric Div@k (with k = 10) and a neutral impact on relevance and engagement for the treatment group compared to the control group. With the increase in diversity in the top-ranked recommendations, we observed that the diversity also increased in terms of the skin tone distribution of the Pins that users engaged with. This experiment led to a successful launch of RR to all users viewing the fashion and wedding Pins.
To better balance ranking scores and diversity, we conducted another A/B experiment comparing RR (as control) to DPP (as treatment) for a period of four weeks during March and April 2022. DPP tries to balance the utility scores of the Pins with diversity, whereas RR reorders Pins only on the basis of the diversity dimension. As expected, the treatment group therefore saw a small decrease in the diversity metric Div@k (with k = 10), while some shopping metrics, like the proportion of users with purchases, increased by 1.3%, and engagement metrics, such as clicks, long-clicks, and saves, increased by more than 5%. At the end of the experiment, DPP was deployed to all users as part of the default Related Products experience.
To introduce diversification at the retrieval stage, we implemented the Bucketized-ANN Retrieval technique to enhance the diversity of the retrieved set. In an experiment that ran for 11 weeks between December 2022 and March 2023, Bucketized-ANN Retrieval led to an increase in the diversity metric (of the entire candidate set) by 8% at the retrieval stage for the nearest neighbor search in a Pin embedding space, while the relative increase in the diversity metric was about 1%. The relatively small increase due to retrieval diversification means that more work is needed to tune the hyperparameters at the ranking stage with respect to the diverse retrieval set.
Prior to each launch, as outlined in Section 6.3, we conducted qualitative evaluations to compare the relevance of top-ranked recommendations and observed no significant changes in relevance or recommendation quality. Figure 1 shows a visual comparison of Related Products results before and after DPP-based ranking combined with Bucketized-ANN Retrieval-based diversification.

New User Homefeed
We introduced diversification as part of the new user experience so that everyone feels represented from their first interaction with the platform. We initially developed a two-dimensional RR variation, which prioritized Pin category diversity using a category RR and achieved best-effort skin tone diversity using a priority queue. Leveraging a frequency-based skin tone priority queue, it greedily selected the next Pin at each step in the re-ranking, so that skin tone ranges with lower frequency across topics were given higher priority. In an A/B experiment run over six weeks from April to June 2021, we deployed this approach for a subset of skin tone-related categories within the beauty and fashion categories. The skin tone diversity metric increased 109% with a neutral impact on Pin category diversity, engagement, and growth metrics. Iterations on our ranking system led us to replace the two-dimensional RR with a single skin tone-based RR in the ranking stage that operated over all Pins with a skin tone range across categories.
This experiment, which ran for four weeks in September 2021, led to a 650% improvement in the diversity metric as compared to the non-diversified experience with a neutral impact on engagement. Finally, we introduced diversification at the retrieval stage using Overfetch-and-Rerank in an experiment that ran for four weeks in March 2022, which increased the skin tone diversity metric by 63%, with a neutral impact on latency. In the latest deployment, DPP diversification was introduced in the ranking stage (the experiment was run for four weeks in November 2022), leading to a 462% increase in the diversity metric due to retrieval diversity alone. Ultimately, the combination of DPP and Overfetch-and-Rerank achieved the best balance in terms of diversification and utility for New User Homefeed.

ETHICAL CONSIDERATIONS
Skin tone diversification aims at improving representation by surfacing all skin tone ranges in the top results when possible. While the visible skin tone ranges in Pin images are leveraged to surface all skin tone ranges in the top results at serving time, they are not used as inputs to train ML ranking models. It is important to note that skin tone ranges are Pin features, not user features. We respect the user's privacy and do not attempt to predict the user's personal information, such as their ethnicity.

CONCLUSIONS AND FUTURE WORK
We addressed the challenge of diversification to improve representation in search and recommender systems using scalable diversification approaches at ranking and retrieval. We deployed multi-stage diversification in a large-scale production system with hundreds of millions of users, and through extensive empirical evidence showed that it is possible to create an inclusive product experience that positively impacts utility metrics such as engagement. We shared the learnings from deploying what is, to the best of our knowledge, the first visual skin tone diversification in a visual discovery recommender system. Our techniques are scalable for multiple simultaneous diversity dimensions and can support intersectionality. Future work includes developing more advanced and scalable triggering mechanisms, for instance through a model that learns which requests to diversify for the diversity dimensions of interest based on the context, including the query, surface, category, and language. In ranking, the multi-objective optimization weights that balance different objectives including diversity could be adapted over time in an automated manner. Future work in retrieval diversification can take inspiration from recent research in debiasing word embeddings [4] and fair representation learning [36] to ensure that the underlying representations in embedding-based retrieval are fair for relevant diversity dimensions. We can analyze how diversified search results and recommendations can help mitigate serving bias in systems that generate their own training data, by creating a positive feedback loop for model retraining thanks to richer interaction data from a diverse set of Pins. Finally, we can evaluate the potential impact on the diversity of Pins in the corpora in the long term.

A CHOICE OF HYPERPARAMETERS
We tune the hyperparameters for RR, DPP, Strong-OR, and Overfetch-and-Rerank using both online experiments and offline evaluations.

A.1 Round-Robin hyperparameters
In RR, score thresholds allow mitigating potential impact on utility metrics. We evaluated different values of the score threshold to determine which Pins should be included in the RR logic. The Pins that did not meet the threshold were appended towards the end in the same order they were ranked in the utility-based ranking list.

A.2 Determinantal Point Process hyperparameters
For DPP, we iterated on the different parameters: the window size, the utility-diversity trade-off parameter, the batch size, and the similarity function (Sections 6 & 4.2). The window size is usually based on how much user fatigue there is when similar Pins show close to each other. If |D| is relatively small, it is possible to set this value to |D| itself or multiples of it, but for large values of |D| further testing is necessary to properly set the window size. The batch size is the position up to which we diversify the ranking (instead of the entire candidate set) to reduce the computation during DPP. We experimented with batch sizes of 200, 400, and 800. Next, the value of the trade-off parameter determines how much weight we want to give to the utility score as compared to diversity; if it is increased by a lot, the greedy optimization of DPP becomes unstable. For the kernel transform, we tried two types of kernel: Identity and Radial basis function (RBF). The latter performed better on our data in terms of diversity. Lastly, for the similarity function, we experimented with mathematical variations of exponential, linear, and cosine similarities when comparing the skin tone ranges of two Pins. Through our experiments, we find that in several surfaces the linear similarity performed closer to RR in terms of diversity; however, the choice of this parameter may vary for each surface, given that the Pin corpus varies between them.

A.3 Overfetch-and-Rerank hyperparameters
We experimented with various values for the maximum overfetching parameter K_max for Overfetch-and-Rerank in New User Homefeed. The results revealed that setting the overfetching upper bound K_max to twice the candidate size offered a good balance between latency and diversity for the surface.

A.4 Choice of k in Div@k
We consider two factors when choosing a value of k for the diversity metric Div@k: maximizing the coverage in terms of queries, and observing perceptible changes in the diversity of the rankings. In the Search and Related Products surfaces, we choose k = 10, and for New User Homefeed we choose k = 6, to balance the two factors. While choosing a lower value of k makes it harder to satisfy a diversity constraint, choosing a higher value of k reduces the amount of data we can use to compute the diversity metric because not all users view impressions up to a large k. To make this decision, we collected Pin impression logs from the recommendation surfaces to study the relationship between k and our metric Div@k for our production ranking system. In Table 1, we show some of these values for the Related Products surface: the relative change in the diversity metric (fraction of queries with all the skin tone ranges represented in the top-k) and the relative change in the number of queries with at least k Pin impressions that contain a skin tone range. As expected, choosing a higher value of k increases the diversity metric but reduces the number of queries (or requests) used to compute the metric itself, which may lead to differences in the metric value. Hence, we choose k = 10 for this surface, as there is sufficient scope for improvement in the diversity metric within the first ten Pins while we also have enough queries to compute the metric.
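The trade-off between k and query coverage can be made concrete with a small sketch of the metric as described in this appendix: the fraction of eligible queries whose top-k contains every skin tone range. Two details are assumptions made for illustration, not statements from the paper: the top-k is taken over Pins with a detected skin tone range, and queries with fewer than k such impressions are excluded. The function name `div_at_k` is hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def div_at_k(impressions: Sequence[Sequence[Optional[str]]], k: int,
             all_groups: Sequence[str]) -> Tuple[float, int]:
    """Return (metric value, number of eligible queries).
    Each entry of `impressions` lists, in rank order, the skin tone range of
    every impressed Pin for one query (None when no skin tone was detected)."""
    required = set(all_groups)
    eligible = 0
    diverse = 0
    for groups in impressions:
        with_tone = [g for g in groups if g is not None]
        if len(with_tone) < k:
            continue  # not enough impressions to evaluate this query at depth k
        eligible += 1
        if required <= set(with_tone[:k]):
            diverse += 1
    value = diverse / eligible if eligible else math.nan
    return value, eligible


logs = [
    ["light", "medium", "dark"],        # all ranges appear in the top 3
    ["light", "light", None, "medium"], # "dark" is missing
    ["light", "dark"],                  # too few impressions for k = 3
]
value_k3, queries_k3 = div_at_k(logs, k=3, all_groups=["light", "medium", "dark"])
value_k2, queries_k2 = div_at_k(logs, k=2, all_groups=["light", "medium", "dark"])
```

In this toy log, moving from k = 2 to k = 3 makes full coverage of three ranges possible but drops one query from the computation, which is the same tension that motivates the choice of k above.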

Figure 1: Side-by-side Related Products recommendations for the query Pin "Shirt Tail Button Down" shown in the center. Left: previous experience without diversification. Right: current diversified experience.

Figure 2: Large-scale recommender systems can broadly be categorized into two stages going from items corpus to recommendations: retrieval and ranking.

Figure 3: Illustrative example of Round-Robin and DPP applied to a utility-ranked list. Each block is a Pin in the ranked list and the color denotes the skin tone range d ∈ D of the Pin image. (a) The ranked list obtained after applying Round-Robin is re-ordered such that the distribution of skin tones is more uniform in the top positions. (b) The ranked list obtained after DPP (for a specific value of the trade-off parameter) allows for optimizing a list-wise objective to trade off utility and diversity of the initial ranked list.

Figure 4: A diagram of distributed ANN retrieval aggregating candidates from segments to leaves to the root based on the distance metric while assigning top Pins with each skin tone to their corresponding buckets.

Figure 5: Example of a Strong-OR s-query over term1 and term2.

Figure 6: For the query "pink nails matte" on Search, (a) shows search results without any diversity, (b) shows diversified search results using RR with a score threshold, and (c) shows the diversified ranking for the same query using DPP.

Table 1: Comparing the relative change in the diversity metric and number of queries for different choices of k based on the impression log. Columns: k, Δ(Div@k) for the production ranking system, Δ(num queries). For this work, we choose k = 10 as the depth of ranking in our metrics.