ABSTRACT
The hash table, especially its external memory version, is one of the most important index structures in large databases. Assuming a truly random hash function, it is known that in a standard external hash table with block size $b$, searching for a particular key takes only an expected average of $t_q = 1 + 1/2^{\Omega(b)}$ disk accesses for any load factor $\alpha$ bounded away from $1$. However, such near-perfect performance is achieved only when $b$ is known and the hash table is tuned specifically for that block size. In this paper we study whether it is possible to build a cache-oblivious hash table that works well with any blocking. Such a hash table automatically performs well across all levels of the memory hierarchy and needs no hardware-specific tuning, an important feature in autonomous databases.
We first show that linear probing, a classical collision resolution strategy for hash tables, can easily be made cache-oblivious, but it achieves only $t_q = 1 + O(\alpha/b)$. Then we demonstrate that it is possible to obtain $t_q = 1 + 1/2^{\Omega(b)}$, thus matching the cache-aware bound, if the following two conditions hold: (a) $b$ is a power of 2; and (b) every block starts at a memory address divisible by $b$. Both conditions hold on a real machine, although they are not stated in the cache-oblivious model. Interestingly, we also show that neither condition is dispensable: if either of them is removed, the best obtainable bound is $t_q = 1 + O(\alpha/b)$, which is exactly what linear probing achieves.
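To make the measured quantity concrete, the following is a minimal sketch, not the construction from the paper, of linear probing in a flat array whose layout never depends on $b$. It counts how many aligned blocks of size $b$ a successful search touches, with $b$ a power of 2 and every block starting at an address divisible by $b$, as in conditions (a) and (b). The names `build_table`, `blocks_touched`, and `average_blocks` are hypothetical, and a salted built-in `hash` stands in for a truly random hash function.

```python
import random


def build_table(keys, capacity, seed=0):
    """Insert distinct integer keys into a flat array using linear probing.

    Assumption: capacity is strictly larger than len(keys), so every
    probe sequence eventually reaches an empty cell.
    """
    salt = random.Random(seed).getrandbits(64)
    table = [None] * capacity
    for key in keys:
        i = hash((salt, key)) % capacity
        while table[i] is not None:
            i = (i + 1) % capacity  # step to the next cell, wrapping around
        table[i] = key
    return table, salt


def blocks_touched(table, salt, key, b):
    """Count the distinct aligned size-b blocks read while searching for key."""
    capacity = len(table)
    i = hash((salt, key)) % capacity
    blocks = set()
    while True:
        blocks.add(i // b)  # block j covers cells j*b .. j*b + b - 1
        if table[i] is None or table[i] == key:
            return len(blocks)
        i = (i + 1) % capacity


def average_blocks(table, salt, keys, b):
    """Average number of blocks touched over successful searches."""
    return sum(blocks_touched(table, salt, k, b) for k in keys) / len(keys)


if __name__ == "__main__":
    keys = random.Random(1).sample(range(10**6), 3000)
    table, salt = build_table(keys, 4096)  # load factor about 0.73
    for b in (4, 16, 64, 256):
        print(b, average_blocks(table, salt, keys, b))
```

The same table is evaluated under several block sizes without being rebuilt, which is the sense in which the layout is oblivious to $b$. The excess over one access comes from probe sequences that cross a block boundary.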