Energy Analysis of Hardware and Software Range Partitioning

Published: 29 August 2014

Abstract

Data partitioning is a critical operation for manipulating large datasets because it subdivides tasks into pieces that are more amenable to efficient processing. It is often the limiting factor in database performance and represents a significant fraction of the overall runtime of large data queries. This article measures the performance and energy of state-of-the-art software partitioners, and describes and evaluates a hardware range partitioner that further improves efficiency.

The software implementation is broken into two phases, allowing separate analysis of the partition function computation and data shuffling costs. Although range partitioning is commonly thought to be more expensive than simpler strategies such as hash partitioning, our measurements indicate that careful data movement and optimization of the partition function can allow it to approach the throughput and energy consumption of hash or radix partitioning.
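The two-phase split described above can be illustrated with a minimal sketch. This is not the article's implementation: the function names (`compute_partition_ids`, `shuffle`), the binary-search partition function, and the histogram-plus-prefix-sum scatter are assumptions chosen only to show how the partition function computation can be separated from the data movement.

```python
from bisect import bisect_right


def compute_partition_ids(keys, splitters):
    """Phase 1 (partition function): map each key to the index of its range.

    `splitters` is a sorted list of boundary keys; a key equal to a
    splitter is placed in the partition to the right of it (an assumption
    of this sketch).
    """
    return [bisect_right(splitters, k) for k in keys]


def shuffle(records, part_ids, num_partitions):
    """Phase 2 (data shuffling): move records into contiguous partitions.

    Uses a histogram and an exclusive prefix sum to find where each
    partition starts, then scatters records to their destinations.
    Returns the reordered records and the start offset of each partition.
    """
    counts = [0] * num_partitions
    for p in part_ids:
        counts[p] += 1

    starts = [0] * num_partitions
    for i in range(1, num_partitions):
        starts[i] = starts[i - 1] + counts[i - 1]

    out = [None] * len(records)
    cursor = list(starts)  # next free slot in each partition
    for rec, p in zip(records, part_ids):
        out[cursor[p]] = rec
        cursor[p] += 1
    return out, starts


# Hypothetical example data: two splitters define three key ranges.
keys = [5, 12, 7, 30, 1, 20]
splitters = [10, 20]
ids = compute_partition_ids(keys, splitters)
partitioned, starts = shuffle(keys, ids, len(splitters) + 1)
```

Keeping the phases separate in this way lets each one be timed or tuned independently, for example by swapping the binary search for a hash or radix function while leaving the shuffle unchanged.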

For further acceleration, we present three contributions: a hardware range partitioner, or HARP; a streaming framework that offers a seamless execution environment for this and other streaming accelerators; and a detailed analysis of a 32nm physical design that matches the throughput of four to eight software threads while consuming just 6.9% of the area and 4.3% of the power of a Xeon core in the same technology generation.

