The string B-tree: a new data structure for string search in external memory and its applications

Online: 01 March 1999

Abstract

We introduce a new text-indexing data structure, the String B-tree, that can be seen as a link between some traditional external-memory and string-matching data structures. In short, it is a combination of B-trees and Patricia tries for internal-node indices that is made more effective by adding extra pointers to speed up search and update operations. Consequently, the String B-tree overcomes the theoretical limitations of inverted files, B-trees, prefix B-trees, suffix arrays, compacted tries, and suffix trees. String B-trees have the same worst-case performance as B-trees, but they manage unbounded-length strings and perform much more powerful search operations, such as the ones supported by suffix trees. String B-trees are also effective in main memory (RAM model) because they improve the online suffix tree search on a dynamic set of strings. They can also be successfully applied to database indexing and software duplication.
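To make the kind of query mentioned in the abstract concrete, the following is a minimal in-memory sketch of substring search over a set of strings. It is not the String B-tree itself: it only keeps pointers to all suffixes in lexicographic order and binary-searches them, in the style of a suffix array. It has no B-tree nodes, no Patricia tries, and no external-memory layout. The function names `build_suffix_index` and `find_occurrences` are hypothetical and chosen for illustration only.

```python
def build_suffix_index(strings):
    """Return (string_id, offset) pointers to every suffix, sorted by suffix text.

    Assumption: everything fits in main memory and the index is rebuilt
    from scratch; this naive construction is for illustration only.
    """
    pointers = [(i, j) for i, s in enumerate(strings) for j in range(len(s))]
    pointers.sort(key=lambda p: strings[p[0]][p[1]:])
    return pointers


def find_occurrences(strings, pointers, pattern):
    """Return all (string_id, offset) positions where pattern occurs.

    Every occurrence of pattern is a prefix of some suffix, and the
    suffixes that start with pattern are contiguous in sorted order,
    so two binary searches delimit the answer.
    """
    m = len(pattern)

    def head(k):
        # First m characters of the k-th suffix in sorted order.
        i, j = pointers[k]
        return strings[i][j:j + m]

    # First position whose truncated suffix is >= pattern.
    lo, hi = 0, len(pointers)
    while lo < hi:
        mid = (lo + hi) // 2
        if head(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First position whose truncated suffix is > pattern.
    hi = len(pointers)
    while lo < hi:
        mid = (lo + hi) // 2
        if head(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(pointers[start:lo])


if __name__ == "__main__":
    texts = ["banana", "bandana"]
    index = build_suffix_index(texts)
    print(find_occurrences(texts, index, "ana"))
```

Each probe in this sketch touches an arbitrary suffix, which is the access pattern that becomes expensive once the strings live on disk. The abstract's combination of B-trees with Patricia tries in the internal nodes targets that cost.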
