Abstract
In big data query processing, there is a trade-off between query accuracy and query efficiency; for example, sampling query approaches trade off query completeness for efficiency. In this article, we argue that query performance can be significantly improved by slightly lowering the possibility of query completeness, that is, the chance that a query is complete. To quantify this possibility, we define a new concept, Probability of query Completeness (hereinafter referred to as PC). For example, if a query is executed 100 times, PC = 0.95 guarantees that there are no more than 5 incomplete results among the 100 results. Leveraging probabilistic data placement and scanning, we trade off PC for query performance. We propose PoBery (POssibly-complete Big data quERY), a method that supports neither complete queries nor incomplete queries, but possibly-complete queries. Experiments conducted on HiBench show that PoBery can significantly accelerate queries while ensuring the PC. Specifically, it is guaranteed that the percentage of complete queries is larger than the given PC confidence. Through comparison with state-of-the-art key-value stores, we show that while Drill-based PoBery performs as fast as Drill on complete queries, it is on average 1.7×, 1.1×, and 1.5× faster than Drill, Impala, and Hive, respectively, on possibly-complete queries.
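The PC arithmetic in the example above can be illustrated with a minimal Python sketch. It is not PoBery's implementation: the function names `empirical_pc` and `meets_pc` are hypothetical, and each query execution is assumed to be recorded simply as a boolean that says whether its result was complete.

```python
def empirical_pc(outcomes):
    """Return the fraction of executions whose result was complete.

    `outcomes` is a list of booleans, one per query execution
    (an assumed representation, not taken from the article).
    """
    if not outcomes:
        raise ValueError("need at least one query execution")
    return sum(1 for complete in outcomes if complete) / len(outcomes)


def meets_pc(outcomes, pc):
    """Check whether the observed share of complete results reaches `pc`."""
    return empirical_pc(outcomes) >= pc


# 100 executions with 5 incomplete results, as in the abstract's example.
outcomes = [True] * 95 + [False] * 5
print(empirical_pc(outcomes))    # 0.95
print(meets_pc(outcomes, 0.95))  # True

# A sixth incomplete result would push the observed share below PC = 0.95.
print(meets_pc([True] * 94 + [False] * 6, 0.95))  # False
```

With PC = 0.95 and 100 executions, the check passes only while at most 5 results are incomplete, which matches the bound stated in the abstract.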
Index Terms
PoBery: Possibly-complete Big Data Queries with Probabilistic Data Placement and Scanning