Abstract
Data partitioning is a critical operation for manipulating large datasets because it subdivides tasks into pieces that are more amenable to efficient processing. It is often the limiting factor in database performance and represents a significant fraction of the overall runtime of large data queries. This article measures the performance and energy of state-of-the-art software partitioners, and describes and evaluates a hardware range partitioner that further improves efficiency.
The software implementation is broken into two phases, allowing separate analysis of the partition function computation and data shuffling costs. Although range partitioning is commonly thought to be more expensive than simpler strategies such as hash partitioning, our measurements indicate that careful data movement and optimization of the partition function can allow it to approach the throughput and energy consumption of hash or radix partitioning.
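The two-phase split described above can be sketched as follows. This is a minimal illustration, not the article's implementation: the function names `range_partition_ids` and `shuffle`, the use of binary search over sorted splitter keys, and the histogram-plus-prefix-sum scatter are all assumptions chosen for clarity.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def range_partition_ids(keys: Sequence[int], splitters: Sequence[int]) -> List[int]:
    """Phase 1 (partition function): map each key to a partition index.

    `splitters` must be sorted; k splitters define k + 1 ranges.
    Binary search is an assumed strategy, not necessarily the article's.
    """
    return [bisect_right(splitters, key) for key in keys]


def shuffle(keys: Sequence[int], part_ids: Sequence[int],
            num_partitions: int) -> Tuple[List[int], List[int]]:
    """Phase 2 (data shuffling): move keys into contiguous partitions.

    Returns the reordered keys and the start offset of each partition.
    """
    # Count how many keys fall into each partition.
    counts = [0] * num_partitions
    for p in part_ids:
        counts[p] += 1

    # An exclusive prefix sum turns counts into start offsets.
    offsets = [0] * num_partitions
    for p in range(1, num_partitions):
        offsets[p] = offsets[p - 1] + counts[p - 1]

    # Scatter each key to the next free slot of its partition.
    cursor = list(offsets)
    out = [0] * len(keys)
    for key, p in zip(keys, part_ids):
        out[cursor[p]] = key
        cursor[p] += 1
    return out, offsets


keys = [15, 3, 42, 8, 27, 3]
splitters = [10, 20, 30]  # four ranges: <10, [10,20), [20,30), >=30
ids = range_partition_ids(keys, splitters)
partitioned, offsets = shuffle(keys, ids, len(splitters) + 1)
```

Keeping the phases separate means the partition function can be swapped (for example, a hash or radix function in place of the range lookup) without touching the shuffle, which is what makes it possible to analyze the two costs independently.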
For further acceleration, we present a hardware range partitioner (HARP); a streaming framework that offers a seamless execution environment for this and other streaming accelerators; and a detailed analysis of a 32nm physical design that matches the throughput of four to eight software threads while consuming just 6.9% of the area and 4.3% of the power of a Xeon core in the same technology generation.
Energy Analysis of Hardware and Software Range Partitioning