Abstract
We introduce a new text-indexing data structure, the String B-tree, that can be seen as a link between some traditional external-memory and string-matching data structures. In short, it is a combination of B-trees and Patricia tries for internal-node indices, made more effective by extra pointers that speed up search and update operations. Consequently, the String B-tree overcomes the theoretical limitations of inverted files, B-trees, prefix B-trees, suffix arrays, compacted tries, and suffix trees. String B-trees have the same worst-case performance as B-trees, but they manage unbounded-length strings and support much more powerful search operations, such as those supported by suffix trees. String B-trees are also effective in main memory (RAM model) because they improve the online suffix tree search on a dynamic set of strings. They can also be applied successfully to database indexing and software duplication.
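To make the kind of query concrete, the sketch below finds every occurrence of a pattern in a set of strings by binary search over their lexicographically sorted suffixes. This is not the String B-tree itself: it is a plain in-memory, suffix-array-style index with no B-tree blocking, no Patricia tries, and no update support, and it stores full suffix copies for simplicity. The function names `build_suffix_index` and `find_occurrences` are hypothetical, chosen only for this illustration.

```python
def build_suffix_index(strings):
    """Return all suffixes as sorted (suffix, string_id, offset) triples."""
    entries = []
    for sid, s in enumerate(strings):
        for off in range(len(s)):
            entries.append((s[off:], sid, off))
    entries.sort()
    return entries


def _bound(index, pattern, strict):
    """Binary search on each suffix truncated to len(pattern).

    strict=False: first position whose truncated suffix is >= pattern.
    strict=True:  first position whose truncated suffix is >  pattern.
    Truncating sorted strings to a fixed length keeps them sorted,
    so binary search remains valid.
    """
    m = len(pattern)
    lo, hi = 0, len(index)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = index[mid][0][:m]
        if prefix < pattern or (strict and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_occurrences(index, pattern):
    """Return sorted (string_id, offset) pairs where pattern occurs."""
    lo = _bound(index, pattern, strict=False)
    hi = _bound(index, pattern, strict=True)
    return sorted((sid, off) for _, sid, off in index[lo:hi])


if __name__ == "__main__":
    idx = build_suffix_index(["banana", "bandana"])
    print(find_occurrences(idx, "ana"))
```

All suffixes that start with the pattern form one contiguous range in sorted order, which is why two binary searches suffice. The abstract's claim is that the String B-tree supports this style of search in external memory and under insertions and deletions, which this static sketch does not attempt.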
The string B-tree: a new data structure for string search in external memory and its applications




