PoBery: Possibly-complete Big Data Queries with Probabilistic Data Placement and Scanning

Published: 20 August 2021

Abstract

In big data query processing, there is a trade-off between query accuracy and query efficiency; for example, sampling-based query approaches trade query completeness for efficiency. In this article, we argue that query performance can be significantly improved by slightly reducing the possibility of query completeness, that is, the chance that a query is complete. To quantify this possibility, we define a new concept, the Probability of query Completeness (hereinafter referred to as PC). For example, if a query is executed 100 times, PC = 0.95 guarantees that no more than 5 of the 100 results are incomplete. Leveraging probabilistic data placement and scanning, we trade off PC for query performance. We propose PoBery (POssibly-complete Big data quERY), a method that supports neither complete queries nor incomplete queries, but possibly-complete queries. Experimental results on HiBench show that PoBery can significantly accelerate queries while ensuring the PC. Specifically, it is guaranteed that the percentage of complete queries is larger than the given PC confidence. Through comparison with state-of-the-art key-value stores, we show that while Drill-based PoBery performs as fast as Drill on complete queries, it is 1.7×, 1.1×, and 1.5× faster on average than Drill, Impala, and Hive, respectively, on possibly-complete queries.
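The PC example in the abstract (100 executions, PC = 0.95, at most 5 incomplete results) can be read as a simple check on observed outcomes. The sketch below is only an illustration of that arithmetic, not PoBery's algorithm; the function names `observed_pc` and `satisfies_pc` and the boolean outcome list are assumptions introduced here.

```python
def observed_pc(outcomes):
    """Fraction of query executions that returned a complete result.

    `outcomes` is a hypothetical list of booleans, one per execution:
    True if that execution's result was complete, False otherwise.
    """
    if not outcomes:
        raise ValueError("need at least one execution")
    return sum(1 for complete in outcomes if complete) / len(outcomes)


def satisfies_pc(outcomes, target_pc):
    """True if the observed fraction of complete results reaches target_pc."""
    return observed_pc(outcomes) >= target_pc


# 100 executions with 4 incomplete results: within the PC = 0.95 budget.
within_budget = [True] * 96 + [False] * 4
# 100 executions with 6 incomplete results: exceeds the budget of 5.
over_budget = [True] * 94 + [False] * 6

print(satisfies_pc(within_budget, 0.95))
print(satisfies_pc(over_budget, 0.95))
```

With a target of 0.95 over 100 executions, the first list passes the check and the second does not, which matches the "no more than 5 incomplete results" reading in the abstract.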
