SPM: Structured Pretraining and Matching Architectures for Relevance Modeling in Meituan Search

In e-commerce search, relevance between queries and documents is an essential requirement for a satisfying user experience. Different from traditional e-commerce platforms that offer products, users on life service platforms such as Meituan mainly search for product providers, which usually have abundant structured information, e.g., name, address, category, and thousands of products. Modeling search relevance with such rich structured content is challenging due to the following issues: (1) there are language distribution discrepancies among different fields of a structured document, making it difficult to directly adopt off-the-shelf methods based on pretrained language models like BERT; (2) different fields usually have different importance, and their lengths vary greatly, making it difficult to extract the document information helpful for relevance matching. To tackle these issues, in this paper we propose a novel two-stage pretraining and matching architecture for relevance matching with rich structured documents. At the pretraining stage, we propose an effective pretraining method that takes both the query and multiple fields of the document as inputs, including an effective information compression method for lengthy fields. At the relevance matching stage, a novel matching method is proposed that leverages domain knowledge carried by the search query to generate more effective document representations for relevance scoring. Extensive offline experiments and online A/B tests on millions of users verify that the proposed architectures effectively improve the performance of relevance modeling. The model has already been deployed online, serving the search traffic of Meituan for over a year.


INTRODUCTION
Unlike traditional e-commerce services that mainly provide products [7,15], most users on life service e-commerce platforms, like Meituan¹, search for service providers instead, such as restaurants, bars, stores, and hotels. These two forms of documents differ in that a service provider usually contains a large amount of structured information. For example, a restaurant may have structured fields like name, address, category, comments, tags, and services like dishes and coupons. Considering all these different types of structured content in relevance matching is a challenging task.
To model the semantic relevance between a query and a document, previous methods tend to use a two-stage training paradigm [8]. At the first stage, a relevance model is pretrained on a large-scale dataset with self-supervised tasks. Afterwards, the model is further trained on a downstream dataset for task-specific finetuning.
However, previous e-commerce relevance models are less effective for documents with rich structured contents [22].
Specifically, at the pretraining stage, existing relevance methods are normally designed for product search, where each document has only a limited number of fields [24]. However, in the life service scenario, there are various fields in each document, and each field may have its own structure. Under this circumstance, it is impractical to directly input all the contents of a document into the pretraining model, let alone handle the distributional discrepancy among different fields.
At the finetuning stage, the models used in the relevance task can be categorized into two architectures: cross-encoder and bi-encoder. Cross-encoder models better capture the interactions between query and document, and thus offer better performance. In the e-commerce search scenario, bi-encoder models are widely used, since this architecture can cache the representations of queries and documents, and thus reduce computation at the online inference stage [12]. However, bi-encoder methods usually achieve lower performance than cross-encoders, due to the lack of interactions between query and document.
In this paper, we propose novel pretraining and matching architectures for relevance modeling in e-commerce search, especially for documents with rich structured contents. First, based on the characteristics of the relevance task and the document structure, we propose an effective pretraining method that takes both the query text and multiple fields of the document as inputs. Specifically, this method first extracts the basic fields of a document, such as name, brand, category, and address. Then, for lengthy fields like coupons or dishes, a field compression method is designed to effectively extract the pivotal information therein. Considering the vocabulary gap between query text and document, we also add document-related queries to help the pretrained model better understand queries.
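The field compression step is only described at a high level here. As an illustration, a simple compressor that keeps the highest-weighted distinct terms of a lengthy field could look like the sketch below; the term-weight source and the `max_tokens` budget are assumptions for illustration, not details from the paper.

```python
from collections import Counter

def compress_field(items, term_weights, max_tokens=30):
    """Compress a lengthy field (e.g. thousands of dish names) into a short
    token sequence by keeping the highest-weighted distinct tokens.
    `term_weights` is a hypothetical mapping from token to importance weight
    (e.g. derived from query logs); ties fall back to in-field frequency."""
    counts = Counter()
    for item in items:
        counts.update(item.split())
    ranked = sorted(
        counts,
        key=lambda t: (term_weights.get(t, 0.0), counts[t]),
        reverse=True,
    )
    return ranked[:max_tokens]
```

For example, compressing the dish list `["beef noodle", "beef soup", "cold noodle"]` with weights favoring "noodle" keeps the tokens most likely to match user queries.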
After that, we design different mask strategies for each field to better adapt to its own word distribution. Experiments show that the proposed pretraining method brings better performance on relevance scoring with cross-encoder models, proving the effectiveness of modeling the rich structured information in documents. However, when we use a bi-encoder model instead, the performance is not improved, and is even worse than that of its original version that uses the query and name field only. This means that the document representation obtained by this model cannot effectively capture the rich information contained in the document. To solve this problem, we use query intent signals from the search engine and design information extractors to generate better document representations. This method enables the query and document information to interact earlier, i.e., in the process of generating the document representation. Therefore, it can effectively extract the information most beneficial for relevance matching.
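Per-field masking can be sketched as below; the concrete mask rates per field are hypothetical placeholders, since the paper does not list them here.

```python
import random

# Hypothetical per-field mask rates: fields whose wording matters more for
# relevance matching (e.g. name) are masked more aggressively so the model
# learns their word distribution better.
FIELD_MASK_RATE = {"name": 0.25, "address": 0.10, "category": 0.15}

def sample_mask_positions(field, tokens, rng=random):
    """Return the token indices of `tokens` selected for MLM masking,
    using the mask rate configured for `field` (default 0.15)."""
    rate = FIELD_MASK_RATE.get(field, 0.15)
    return [i for i, _ in enumerate(tokens) if rng.random() < rate]
```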
To summarize, the contributions of this paper are as follows:
• We propose an effective multi-field pretraining architecture for e-commerce search with rich structured documents, and design a field compression algorithm to extract pivotal information from lengthy fields.

RELATED WORKS

Pretraining Models
In recent years, pretrained models like BERT [2], RoBERTa [11], and XLNet [21] have brought great performance improvements on many NLP tasks. These models are first pretrained on large-scale unsupervised data, and then fine-tuned on task-specific data for downstream tasks. BERT is a representative work among them. It obtains text representations by using the transformer [18] architecture and training on two tasks: MLM (Masked Language Model) and NSP (Next Sentence Prediction). Recently, [11,21] proved that the NSP task is not indispensable. Since pretrained models usually do not model knowledge of specific downstream tasks, many works have begun to study how to encode domain knowledge into these models. Gururangan et al. [3] pointed out that continuing pretraining with domain data can yield performance improvements on domain-specific downstream tasks. Zou et al. [25] proposed a pretraining paradigm for relevance modeling in web search. Zhang et al. [23] proposed a pretraining work incorporating multi-modal information in e-commerce scenarios. Zhang et al. [24] proposed an e-commerce pretrained model with phrase masking and document neighbors for augmentation. Despite the success these models have achieved in their domains, there is still a lack of pretraining methods designed for documents with rich structured contents, which are very common in the life service e-commerce scenario.

Text Matching
The text matching (or text relevance) task computes the similarity between two sentences. In the recent decade, neural methods have flourished for their better capability to model semantic similarity. Neural text matching methods can be categorized into two types: cross-encoder methods [1,13,16,20] and bi-encoder methods [4,5,14,19]. They differ in that the former concatenates the two sentences together before feeding them into the model to calculate similarity, while the latter first obtains a representation for each of the two sentences before calculating a similarity score between them. The advantage of cross-encoder methods lies in that the information of the two sentences can interact fully, so these models normally have better performance. The advantage of bi-encoder models lies in that the trained sentence representations can be cached offline, which greatly speeds up online inference. Recently, BERT has shown its superiority on text matching tasks. To increase interactions between sentences in bi-encoder BERT methods, Reimers et al. [17] proposed Siamese BERT networks that use a shared transformer for both sentences. Khattab et al. [9] proposed ColBERT, which uses a sum-of-max operator to facilitate interaction between the representations of the two sentences. Humeau et al. [6] use additional parameters to extract multiple representations of the longer sentence. However, these methods treat the two sentences as plain text, while in industrial search engines both sides of the match have rich additional knowledge to leverage.

METHODOLOGY
In this section, we present the details of the proposed methods. First, we introduce the pretraining model designed for structured documents (Structured Pretraining BERT, i.e., SPBERT), including two parts: data construction and pretraining strategy. Then we introduce the matching architecture for structured information, and propose two information extractors to better match queries and structured documents in the e-commerce scenario.

Pretraining Model for Structured Document
We formally define the $i$-th document with structured information as $D_i = \{W_i^1, \dots, W_i^F\}$, where $W_i^f = (w_{i,1}^f, \dots, w_{i,|W_i^f|}^f)$ is the token sequence of the $f$-th field, $w_{i,j}^f$ denotes the $j$-th token in $W_i^f$, and $|W_i^f|$ denotes the number of tokens in $W_i^f$. Considering the fact that fields (such as name, brand, address, category, products, etc.) of a document in the e-commerce scenario are normally not independent of each other, we define the following objective function for pretraining:

$$\mathcal{L} = -\sum_{f=1}^{F} \sum_{j=1}^{|W_i^f|} n_{i,j}^f \log p\big(w_{i,j}^f \mid \tilde{D}_i\big),$$

where $n_{i,j}^f \ge 0$ denotes how many times token $w_{i,j}^f$ is sampled to be masked for the MLM [2] pretraining task, and $\tilde{D}_i$ denotes the masked input.
The model structure is illustrated in Figure 1. In order to distinguish the different fields that are concatenated together as input, we use a separate segment embedding for each of them. Besides, we use position embeddings starting from 0 for the query and the document respectively, so that words in the same position of the query and the name field get the same position embedding.
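A minimal sketch of this input construction illustrates the per-field segment ids and the document-side position ids restarting from 0 (tokenization and [CLS]/[SEP] handling are omitted for brevity):

```python
def build_inputs(query_tokens, field_tokens):
    """field_tokens: list of (field_name, tokens) pairs.
    Returns parallel lists of tokens, segment ids, and position ids.
    Segment 0 is the query; each document field gets its own segment id.
    Position ids restart from 0 on the document side, so the j-th query
    token and the j-th name-field token share a position embedding."""
    tokens, segments, positions = [], [], []
    for j, tok in enumerate(query_tokens):
        tokens.append(tok); segments.append(0); positions.append(j)
    pos = 0  # document-side positions restart from 0
    for seg_id, (_field, toks) in enumerate(field_tokens, start=1):
        for tok in toks:
            tokens.append(tok); segments.append(seg_id); positions.append(pos)
            pos += 1
    return tokens, segments, positions
```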
In this section, we introduce the proposed matching framework with structured information extractors.

Matching Framework.
As depicted in Figure 2, the framework uses a bi-encoder architecture to encode the tokens of the query and the document respectively and obtain their representations. The advantage of the bi-encoder architecture is that the output embeddings of the document can be calculated and cached in advance to speed up online serving [22]. A representation extraction layer is designed to extract the information most beneficial for matching from the sophisticated structured input. Details of the structured extractors are introduced in the following section. After that, a representation compression layer is adopted to reduce the dimension of the extracted representations. Then, the relevance score between query and document can be calculated through the matching layer.
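As a toy illustration of this pipeline, the sketch below uses mean pooling as the extraction step and a single projection matrix as the compression layer; the real model replaces pooling with learned extractors, and the dimensions here are assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity with a small epsilon for numerical safety."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def score(query_vec, doc_token_vecs, W_compress):
    """Bi-encoder scoring sketch: pool document token representations
    (a pooling baseline standing in for the extraction layer), compress
    them with a learned projection W_compress, then score against the
    already-compressed query representation with cosine similarity."""
    doc_repr = doc_token_vecs.mean(axis=0)  # representation extraction
    doc_repr = W_compress @ doc_repr        # representation compression
    return cosine(query_vec, doc_repr)
```

In deployment, `W_compress @ doc_repr` would be computed once per document and cached, leaving only the cosine step for online serving.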

Intention-Guided Extractor (IGE).
In this section, we further propose an intention-guided information extractor, as depicted in Figure 3. It utilizes the intention signal of the query to be matched, and extracts information according to the guidance of this knowledge. Specifically, we use the query understanding (QU) module in the search engine to provide the query intention signal, and then extract the information in the structured document according to this signal.
Formally, a group of intention-guided extractors is used to extract query-intention-specific representations from a document with $F$ fields. Here, $T$ denotes the total number of intention types, $h_t$ and $h'_t$ respectively denote the extractors for the cases where the $t$-th intention is hit or not, and $h_{f,j}$ denotes the representation of the $j$-th token from the $f$-th field after the SPBERT encoder. The resultant representations of extractors $h_t$ and $h'_t$ are denoted as $\hat{h}_t$ and $\hat{h}'_t$ respectively, which are defined as follows:

$$\hat{h}_t = \mathrm{attention}(h_t, H), \qquad \hat{h}'_t = \mathrm{attention}(h'_t, H),$$

where $H$ is the matrix of all token representations $h_{f,j}$ and $\mathrm{attention}(q, K) = \mathrm{softmax}(qK^{\top})K$. Additionally, we also use a group of $G$ global extractors in the same way as in Eq. 2. The output of the intention-guided extractor layer is $[\hat{h}_1, \hat{h}'_1, \dots, \hat{h}_T, \hat{h}'_T, \hat{g}_1, \dots, \hat{g}_G]$. Then, a compression layer is utilized to obtain a memory-friendly version of the representations for deployment. Specifically, we adopt a fully-connected network to project the extracted representations from a high dimension to a much lower dimension. For simplicity, we use the same notations for representations before and after the compression layer. Note that these are the embeddings we pre-calculate and cache when we deploy the model.
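Under the assumed attention form attention(q, K) = softmax(qK^T)K, the extractor layer can be sketched as follows; both the "hit" and "miss" outputs are computed for every intention because both are cached, and the QU-driven selection only happens later at matching time.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def ige_layer(H, hit_extractors, miss_extractors):
    """Intention-guided extractor layer sketch.
    H: (n_tokens, d) token representations over all document fields.
    hit_extractors / miss_extractors: (T, d) extractor embeddings for the
    hit / not-hit case of each intention. Returns the (T, d) attended
    representations for both cases, all of which would be cached."""
    hats_hit = [softmax(h @ H.T) @ H for h in hit_extractors]
    hats_miss = [softmax(h @ H.T) @ H for h in miss_extractors]
    return np.stack(hats_hit), np.stack(hats_miss)
```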
In the matching layer, we first index out the intention-related representations according to the signal from the QU module in the search engine. The selected intention-related representations are denoted as $[\hat{h}_{\tilde{I}_1}, \dots, \hat{h}_{\tilde{I}_T}]$. Note that although we cache $2T + G$ embeddings for each document, only $T + G$ embeddings are involved during matching. By aggregating these representations using attention against the query representation $h_q$, we get the final document representation:

$$h_d = \mathrm{attention}\big(h_q, [\hat{h}_{\tilde{I}_1}, \dots, \hat{h}_{\tilde{I}_T}, \hat{g}_1, \dots, \hat{g}_G]\big),$$

where $h_d$ is the representation after the compression layer. At last, cosine similarity is used to calculate the relevance score between the query representation and the document representation:

$$s = \cos(h_q, h_d).$$

We use the following margin-based loss to characterize relevance at a finer granularity, since the label is multi-leveled:

$$\mathcal{L} = \max\big(0, |s - y| - m\big),$$

where $y \in [0, 1]$ denotes the normalized relevance label and $m$ is the margin hyper-parameter, which is set to 0.1 in our experiments.

EXPERIMENT
To demonstrate the effectiveness of the proposed method, we conduct extensive experiments on an industrial life service e-commerce search engine. Experimental results show that for e-commerce search with rich structured documents, the proposed pretraining and matching architectures can significantly improve the performance of relevance modeling.

Evaluation Metrics
We use the following evaluation metrics to evaluate the performance of the proposed method. The Area Under Curve (AUC) is widely used as the evaluation metric in e-commerce scenarios [7,22] for evaluating relevance models. If the relevance scores given by the model are ideally all higher on the relevant samples than on the irrelevant ones, AUC reaches its maximum value of 1. The traditional AUC formula is only suitable for binary classification, but our relevance matching task has three relevance levels, so we employ a multiclass AUC formula:

$$\mathrm{AUC} = \frac{1}{N} \sum_{n=1}^{N} \frac{\sum_{i} \sum_{j} \mathbb{1}[y_{n,i} > y_{n,j}] \, \mathbb{1}[s_{n,i} > s_{n,j}]}{\sum_{i} \sum_{j} \mathbb{1}[y_{n,i} > y_{n,j}]},$$

where $y_{n,i} \ge 0$ is the relevance label for the $i$-th returned document of the $n$-th query, $s_{n,i} \in [0, 1]$ is the model score for the $i$-th returned document of the $n$-th query, and $N$ is the total number of test queries. Badcase@5 is an important metric we use to evaluate the quality of the top search results of the online system. Specifically, it calculates the proportion of queries having irrelevant cases in the top-5 ranking results:

$$\mathrm{Badcase@5} = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\big[\exists\, i \le 5 : y_{n,i} = 0\big].$$
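A hedged implementation of these two metrics is sketched below: a pairwise multi-level AUC for a single query (ties counted as half, a common convention) and the Badcase@5 proportion. The paper's exact normalization over queries may differ.

```python
def multiclass_auc(labels, scores):
    """Pairwise multi-level AUC for one query: among document pairs whose
    relevance labels differ, the fraction where the higher-labeled document
    also receives the higher model score (score ties count half)."""
    num = den = 0.0
    n = len(labels)
    for i in range(n):
        for j in range(n):
            if labels[i] > labels[j]:
                den += 1
                if scores[i] > scores[j]:
                    num += 1
                elif scores[i] == scores[j]:
                    num += 0.5
    return num / den if den else 0.0

def badcase_at_5(results):
    """results: per-query lists of top-ranked relevance labels (0 = irrelevant).
    Returns the proportion of queries with at least one irrelevant document
    among the top 5 results."""
    bad = sum(1 for labels in results if any(l == 0 for l in labels[:5]))
    return bad / len(results)
```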

Competitor System
To evaluate the effect of the proposed structured pretraining method on the relevance task, we compare the following cross-encoder methods. We use the superscript + to denote that a model uses the query and multiple fields of the document as inputs. Otherwise, the inputs consist only of the query and the name field of the document.
• BERT-Large-CE: a cross-encoder matching model based on BERT-Large (24 layers) with query and document name as input.
• BERT-Large-CE + : the same model structure with BERT-Large-CE, but with query and structured document as input.
Fields of the document are concatenated with [SEP], using the same segment embedding.
• BERT-CE+: a cross-encoder matching model based on a 6-layer BERT model distilled from BERT-Large, using query and structured document as input.
• SPBERT-Large-CE+: a cross-encoder matching model based on SPBERT-Large, the 24-layer structured pretraining model, using query and structured document as input.
• SPBERT-CE+: a cross-encoder matching model based on a 6-layer SPBERT model distilled from SPBERT-Large, using query and structured document as input.
To improve model efficiency for online serving, relevance models in the e-commerce scenario often adopt the bi-encoder architecture, which can cache representations of queries and documents in advance and thus reduce online computation. We further evaluate the performance of the pretraining model by comparing it with methods of this architecture.
• BERT-BE: a bi-encoder matching model based on BERT with query and document name as input.
• SPBERT-BE: a bi-encoder matching model based on SPBERT with query and document name as input.
• SPBERT-BE+: the same model structure as SPBERT-BE, but with query and structured document as input.
To evaluate the effectiveness of the proposed matching methods with structured extractors, we compare them with state-of-the-art late interaction methods.
• SPBERT-SBERT+: the SBERT [17] architecture using SPBERT as the pretrained model. As a late interaction method, SBERT concatenates the input query embedding, the document embedding, and the results of simple mathematical operations between them before matching. In this way, it usually gets better results than methods that directly match the query embedding against the document embedding.
• SPBERT-ColBERT+: the ColBERT [9] architecture using SPBERT as the pretrained model. Since ColBERT calculates the matching score of each query token against all document tokens, it tends to perform better than other bi-encoder methods. However, it needs to cache the token representations of all documents beforehand, which is memory-expensive and hard to deploy when documents are long.
• SPBERT-PolyEncoder+: the PolyEncoder [6] architecture using SPBERT as the pretrained model. PolyEncoder uses multiple codes to interact with the document to obtain effective document representations. By controlling the number of codes, it can flexibly trade off between memory usage and model performance.
• SPBERT-IGE+: a matching method with the intention-guided extractor elaborated in Section 3.1.2, using SPBERT as the pretrained model. This method extracts the information of structured documents for all types of intent signals.

Experimental Setting
For all the experiments, we set the learning rate to 2e-5, warm-up ratio to 10%, dropout rate to 0.1, and use the Adam [10] optimizer.
In the pretraining stage, the model parameters are set to 24 hidden layers, 16 attention heads, a hidden size of 1024, and feed-forward layers with dimension 4096. The batch size is set to 20 during training, and the model is trained distributedly on 64 NVIDIA V100 GPUs for 10 epochs.
In the distillation stage, we use the pretrained model as the teacher. The parameters of the student model are set to 6 hidden layers, 12 attention heads, a hidden size of 384, and feed-forward layers with dimension 1200. During training, the batch size is set to 8 and the model is trained distributedly on 32 NVIDIA V100 GPUs for 3 epochs.
In the finetuning stage for the relevance task, we set the sequence length to 160 for cross-encoder models. As for the bi-encoder models, we set the maximum query length to 32 and the maximum document length to 128. Before the matching layer, we use a one-layer fully-connected network as a compression layer to project the dimension of the embeddings from 384 to 32. For SPBERT-IGE+, the number of intention-guided extractors is set to 3, corresponding to the number of query intent types we use, and we adjust the number of global extractors to meet the requirements on the total number of extractors in each experiment. The embeddings of all extractors are randomly initialized before training. The hyper-parameter in Equation (5) is set to 0.01. The batch size is set to 256 during training. Early stopping is performed when AUC does not improve within 5 epochs.
To the best of our knowledge, there is no public e-commerce relevance dataset that has highly structured documents as well as query intent signals. Therefore, we report the evaluation results of the proposed method on the dataset described in Section 4.1.2 for all offline experiments.

Offline Experimental Results
Table 1 reports the experimental results of the base model BERT-Large, the structured pretraining model SPBERT-Large, and the corresponding distilled models BERT and SPBERT, using all fields or the name field as document input respectively. Comparing BERT-Large-CE with BERT-Large-CE+, we observe that adding structured inputs to unstructured pretraining models is harmful to their performance. This may be attributed to the distributional discrepancy among different fields of the structured document, which cannot be distinguished by unstructured pretraining models. On the contrary, the comparison between BERT-Large-CE and SPBERT-Large-CE+ shows that using structured input with a structured pretraining model does not lower model performance, but instead improves AUC by as much as 1.8%. This result reveals that adding structured information is beneficial for relevance tasks; moreover, this benefit can only be achieved with structured pretraining. Finally, by comparing the distilled models BERT-CE and SPBERT-CE+, it can be seen that the advantages of structured pretraining are preserved after distillation.
To improve model efficiency, e-commerce relevance methods often use the bi-encoder architecture to reduce real-time computation. Table 1 reports the performance of BERT and SPBERT in this form. First, comparing SPBERT-BE with BERT-BE, it can be observed that SPBERT still achieves better results even with unstructured input. This is because the term-weight-based masking strategy employed on the name field at the pretraining stage enables the words important to the matching task to be better learned. Note that by comparing SPBERT-BE with SPBERT-BE+, we can see that the performance of the bi-encoder model decreases when using structured input. This is due to the fact that a structured document includes multiple fields with different importance to the relevance task, so it is difficult to extract enough effective information by pooling alone. This observation inspires us to design more effective information extraction methods based on the characteristics of structured documents. To prove the effectiveness of the proposed matching architectures, we compare them with state-of-the-art late interaction bi-encoders. For fairness, the number of codes in PolyEncoder and the number of extractors in SPBERT-IGE+ are both set to 8.
As shown in Table 2, late interaction methods alleviate the performance degradation caused by adding structured information to the bi-encoder. Moreover, we observe that SPBERT-IGE+ performs better than SPBERT-SBERT+ and SPBERT-PolyEncoder+. This indicates that the proposed extractor can extract more informative representations for matching than methods that rely completely on the model itself to extract information. Note that SPBERT-IGE+ reaches performance close to SPBERT-ColBERT+, while needing to cache far fewer document representations than the latter, meaning that the proposed method is both effective and memory-friendly. We conducted an online A/B test for one week to verify that the newly proposed model (i.e., SPBERT-IGE+) improves system performance compared with the old one (i.e., BERT-BE). The results show that the new model largely improves the overall user experience of the e-commerce search system. In particular, the Badcase@5 metric decreased by 1.12%, which is statistically significant with $p < 0.05$. This shows that the structured pretraining and matching architectures with intention-guided extractors are very helpful for improving the relevance of the search system.

A.1 Importance of Each Field in Structured Input
To investigate the impact of each field in the structured pretraining model on the relevance task, we compare the performance of the following variants: models with unstructured or structured input, and models that drop each field, e.g., category, keywords, address, or brand. As shown in Figure 5, we find that each field contributes to the performance of the model, and the contribution of each field varies.

A.2 Effect of the Number of Extractors
We conduct experiments on how the number of extractors influences the performance of the proposed models. For PolyEncoder, we tune the number of codes to reach the target number. For the proposed models, we tune the number of global extractors so that the total number of extractors reaches the target number. The results in Figure 6 show that the performance of both models improves and finally reaches similar levels as the number of extractors increases. However, the number of extractors cannot be increased indefinitely, since the storage required to cache document embeddings grows linearly with the number of extractors, and there are usually tens of millions of documents in a real search system. Note that SPBERT-IGE+ significantly outperforms SPBERT-PolyEncoder+ when the number of extractors is relatively small (e.g., 4 or 8). In the online system, we deploy the SPBERT-IGE+ model with 8 extractors.

Figure 3 :
Figure 3: Illustration of the proposed intent-guided information extractor.

Figure 4 :
Figure 4: Deployment of the proposed relevance model

Figure 5 :
Figure 5: Importance of different document fields.

Figure 6 :
Figure 6: Impacts of different number of extractors.

Table 1 :
Performance of structured pretraining models on relevance task

Table 2 :
Comparison of the proposed matching methods.

4.6.1 Deployment. Meituan search faces tens of millions of queries every day, so the online system has strict restrictions on latency. Since BERT uses multiple layers of transformers, its efficiency decreases as the number of layers increases. For the sake of efficiency, we deploy the distilled BERT for online serving. As illustrated in Figure 4, to further lower the latency, we cache the extracted multiple embeddings of the documents offline. We also cache the embeddings of top queries offline, while calculating those of long-tail queries online. The intent signals of queries are obtained from the query understanding module in the search engine. Then, the matching layer uses these results to calculate the relevance score. Finally, this score is discretized into a relevance level, which is used by the search system for stratification, namely sorting the search results stably by relevance level in descending order.