Abstract
Web search engines deploy large-scale selection services on CPUs to identify a set of web pages that match user queries. An FPGA-based accelerator can exploit various levels of parallelism and provide a lower latency, higher throughput, more energy-efficient solution than commodity CPUs. However, maintaining such a customized accelerator in a commercial search engine is challenging because selection services are changed often. This article presents our design for FlexSaaS (Flexible Selection as a Service), an FPGA-based accelerator for web search selection. To address efficiency and flexibility challenges, FlexSaaS abstracts computing models and separates memory access from computation. Specifically, FlexSaaS (i) contains a reconfigurable number of matching processors that can handle various possible query plans, (ii) decouples index stream reading from matching computation to fetch and decode index files, and (iii) includes a universal memory accessor that hides the complex memory hierarchy and reduces host data access latency. Evaluated on FPGAs in the selection service of a commercial web search--the Bing web search engine—FlexSaaS can be evolved quickly to adapt to new updates. Compared to the software baseline, FlexSaaS on Arria 10 reduces average latency by 30% and increases throughput by 1.5×.
- Intel. n.d. Altera SDK for OpenCL. Available at https://www.altera.com/.Google Scholar
- Intel. n.d. Intel Vtune Amplifier. Retrieved January 26, 2019 from https://software.intel.com/en-us/intel-vtune-amplifier-xe.Google Scholar
- OpenMP. n.d. OpenMP Home Page. Retrieved January 26, 2019 from https://www.openmp.org/.Google Scholar
- Xilinx. n.d. SDAccel Development Environment. Available at https://www.xilinx.com/.Google Scholar
- Falk Scholer, Hugh E. Williams, John Yiannis, and Justin Zobel. 2002. Compression of inverted indexes for fast query evaluation. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 222--229. Google Scholar
Digital Library
- Vo Ngoc Anh and Alistair Moffat. 2005. Inverted index compression using word-aligned binary codes. Information Retrieval 8, 1 (2005), 151--166. Google Scholar
Digital Library
- Naiyong Ao, Fan Zhang, Di Wu, Douglas S. Stones, Gang Wang, Xiaoguang Liu, et al. 2011. Efficient parallel lists intersection and index compression algorithms using graphics processing units. Proceedings of the VLDB Endowment 4, 8 (2011), 470--481. Google Scholar
Digital Library
- Dan Blandford and Guy Blelloch. 2002. Index compression through document reordering. In Proceedings of the 2002 Data Compression Conference (DCC’02). IEEE, Los Alamitos, CA, 342--351. Google Scholar
Digital Library
- Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the 7th International World Wide Web Conference (WWW’98). Google Scholar
Digital Library
- Andrei Z. Broder and Monika Rauch Henzinger. 1998. Information retrieval on the web. In Proceedings of the 39th Annual Symposium on Foundations of Computer Science (FOCS’98). 6.Google Scholar
- Jared Casper and Kunle Olukotun. 2014. Hardware acceleration of database operations. In Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, New York, NY, 151--160. Google Scholar
Digital Library
- Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, et al. 2016. A cloud-scale acceleration architecture. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’16). IEEE, Los Alamitos, CA, 1--13. Google Scholar
Digital Library
- Jeffrey Dean. 2009. Challenges in building large-scale information retrieval systems: Invited talk. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining. ACM, New York, NY, 1. Google Scholar
Digital Library
- Shuai Ding, Jinru He, Hao Yan, and Torsten Suel. 2009. Using graphics processors for high performance IR query processing. In Proceedings of the 18th International Conference on World Wide Web. ACM, New York, NY, 421--430. Google Scholar
Digital Library
- Tiziano Fagni, Raffaele Perego, Fabrizio Silvestri, and Salvatore Orlando. 2006. Boosting the performance of web search engines: Caching and prefetching query results by exploiting historical usage data. ACM Transactions on Information Systems 24, 1 (2006), 51--78. Google Scholar
Digital Library
- Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, et al. 2016. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture. IEEE, Los Alamitos, CA, 243--254. Google Scholar
Digital Library
- Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, et al. 2017. In-datacenter performance analysis of a tensor processing unit. arXiv:1704.04760. Google Scholar
Digital Library
- Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym. 2008. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro 28, 2 (2008), 39--55. Google Scholar
Digital Library
- Shaoli Liu, Zidong Du, Jinhua Tao, Dong Han, Tao Luo, Yuan Xie, et al. 2016. Cambricon: An instruction set architecture for neural networks. In Proceedings of the 43rd International Symposium on Computer Architecture. IEEE, Los Alamitos, CA, 393--405. Google Scholar
Digital Library
- Ikuo Magaki, Moein Khazraee, Luis Vega Gutierrez, and Michael Bedford Taylor. 2016. ASIC clouds: Specializing the datacenter. In Proceedings of the 43rd International Symposium on Computer Architecture. IEEE, Los Alamitos, CA, 178--190. Google Scholar
Digital Library
- DAC Manning. 1995. Introduction. In Introduction to Industrial Minerals. Springer, 1--16.Google Scholar
- Jian Ouyang, Shiding Lin, Wei Qi, Yong Wang, Bo Yu, and Song Jiang. 2014. SDA: Software-defined accelerator for large-scale DNN systems. In Proceedings of the 2014 IEEE Hot Chips 26 Symposium (HCS’14). IEEE, Los Alamitos, CA, 1--23.Google Scholar
Cross Ref
- Jian Ouyang, Wei Qi, Yong Wang, Yichen Tu, Jing Wang, and Bowen Jia2016. SDA: Software-defined accelerator for general-purpose big data analysis system. In Proceedings of the 2016 IEEE Hot Chips 28 Symposium (HCS’16). IEEE, Los Alamitos, CA, 1--23.Google Scholar
- Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Stefan Hadjis, et al. 2017. Plasticine: A reconfigurable architecture for parallel patterns. In Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, New York, NY, 389--402. Google Scholar
Digital Library
- Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, et al. 2014. A reconfigurable fabric for accelerating large-scale datacenter services. In Proceedings of the ACM/IEEE 41st International Symposium on Computer Architecture (ISCA’14). IEEE, Los Alamitos, CA, 13--24. Google Scholar
Digital Library
- Rimon Tadros. 2015. Accelerating Web Search Using GPUs. Ph.D. Dissertation. University of British Columbia.Google Scholar
- Wim Vanderbauwhede, Leif Azzopardi, and Mahmoud Moadeli. 2009. FPGA-accelerated information retrieval: High-efficiency document filtering. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL’09). IEEE, Los Alamitos, CA, 417--422.Google Scholar
Cross Ref
- Di Wu, Fan Zhang, Naiyong Ao, Fang Wang, Xiaoguang Liu, and Gang Wang. 2009. A batched GPU algorithm for set intersection. In Proceedings of the 10th International Symposium on Pervasive Systems, Algorithms, and Networks (ISPAN’09). IEEE, Los Alamitos, CA, 752--756. Google Scholar
Digital Library
- Di Wu, Fan Zhang, Naiyong Ao, Gang Wang, Xiaoguang Liu, and Jing Liu. 2010. Efficient lists intersection by CPU-GPU cooperative computing. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, Workshops, and Phd Forum (IPDPSW’10). IEEE, Los Alamitos, CA, 1--8.Google Scholar
- Jing Yan, Zhan-Xiang Zhao, Ning-Yi Xu, Xi Jin, Lin-Tao Zhang, and Feng-Hsiung Hsu. 2012. Efficient query processing for web search engine with FPGAs. In Proceedings of the IEEE 20th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM’12). IEEE, Los Alamitos, CA, 97--100. Google Scholar
Digital Library
- Jiangong Zhang, Xiaohui Long, and Torsten Suel. 2008. Performance of compressed inverted list caching in search engines. In Proceedings of the 17th International Conference on World Wide Web. ACM, New York, NY, 387--396. Google Scholar
Digital Library
- Justin Zobel and Alistair Moffat. 2006. Inverted files for text search engines. ACM Computing Surveys 38, 2 (2006), 6. Google Scholar
Digital Library
- Marcin Zukowski, Sandor Heman, Niels Nes, and Peter Boncz. 2006. Super-scalar RAM-CPU cache compression. In Proceedings of the 22nd International Conference on Data Engineering (ICDE’06). IEEE, Los Alamitos, CA, 59. Google Scholar
Digital Library
Index Terms
FlexSaaS: A Reconfigurable Accelerator for Web Search Selection
Recommendations
A High Throughput No-Stall Golomb-Rice Hardware Decoder
FCCM '13: Proceedings of the 2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing MachinesInteger compression techniques can generally be classified as bit-wise and byte-wise approaches. Though at the cost of a larger processing time, bit-wise techniques typically result in a better compression ratio. The Golomb-Rice (GR) method is a bit-...
A study of results overlap and uniqueness among major web search engines
The performance and capabilities of Web search engines is an important and significant area of research. Millions of people world wide use Web search engines very day. This paper reports the results of a major study examining the overlap among results ...
An FPGA implementation for neural networks with the FDFM processor core approach
This paper presents a field programmable gate array FPGA implementation of a three-layer perceptron using the few DSP blocks and few block RAMs FDFM approach implemented in the Xilinx Virtex-6 family FPGA. In the FDFM approach, multiple processor cores ...






Comments