Abstract
Pre-trained language representation models (PLMs) such as BERT and Enhanced Representation through kNowledge IntEgration (ERNIE) have been integral to recent improvements on various downstream tasks, including information retrieval. However, it is nontrivial to directly apply these models to large-scale web search due to the following challenges: (1) the prohibitively expensive computation of massive neural PLMs, especially over the long texts found in web documents, blocks their deployment in a web search system that demands extremely low latency; (2) the discrepancy between existing task-agnostic pre-training objectives and ad hoc retrieval scenarios, which demand comprehensive relevance modeling, is another main barrier to improving online retrieval and ranking effectiveness; and (3) creating a significant impact in real-world applications calls for practical solutions that seamlessly interweave the resulting PLM and other components into a cooperative system serving web-scale data. Accordingly, in this work we contribute a series of successfully applied techniques for tackling these issues when deploying the state-of-the-art Chinese pre-trained language model, ERNIE, in an online search engine system. We first present novel practices for performing expressive PLM-based semantic retrieval with a flexible poly-interaction scheme and for cost-efficiently contextualizing and ranking web documents with a cheap yet powerful Pyramid-ERNIE architecture. We then devise pre-training and fine-tuning paradigms that explicitly incentivize query-document relevance modeling in PLM-based retrieval and ranking using large-scale noisy and biased post-click behavioral data. We also introduce a series of effective strategies to seamlessly interweave the designed PLM-based models and other conventional components into a cooperative system. Extensive offline and online experimental results show that the proposed techniques are crucial to achieving more effective search performance. We also provide a thorough analysis of our methodology and experimental results.
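To give a concrete sense of the poly-interaction retrieval scheme mentioned above, the following is a minimal sketch of a poly-encoder-style late-interaction scorer: the query side produces several contextualized "poly codes" while each document is reduced to a single indexed embedding, and relevance is computed by attending over the query codes conditioned on the candidate document. All names, shapes, and the NumPy-based scoring function are illustrative assumptions for exposition, not the paper's actual Pyramid-ERNIE implementation.

```python
# Hypothetical poly-interaction relevance scorer (poly-encoder style sketch).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def poly_interaction_score(query_codes: np.ndarray, doc_vec: np.ndarray) -> float:
    """Score one (query, document) pair.

    query_codes: (m, d) -- m contextualized query representations ("poly codes"),
                 produced by the query-side encoder at query time.
    doc_vec:     (d,)   -- single document embedding, pre-computed offline and
                 stored in an approximate nearest-neighbor index.
    """
    # Attend over the m query codes, conditioned on the candidate document.
    attn = softmax(query_codes @ doc_vec)   # (m,) attention weights
    query_vec = attn @ query_codes          # (d,) aggregated query vector
    return float(query_vec @ doc_vec)       # dot-product relevance score

# Toy usage: 4 poly codes of dimension 8 scored against two candidate documents.
rng = np.random.default_rng(0)
codes = rng.normal(size=(4, 8))
docs = rng.normal(size=(2, 8))
print([poly_interaction_score(codes, d) for d in docs])
```

Compared with a plain bi-encoder (one vector per query), keeping multiple query-side codes allows a richer, query-conditioned interaction with each candidate document while still permitting documents to be embedded and indexed offline, which is what makes such schemes attractive for low-latency first-stage retrieval.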