
A guided tour to approximate string matching

Online: 01 March 2001

Abstract

We survey the current techniques to cope with the problem of string matching that allows errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices. We conclude with some directions for future work and open problems.

Reviews

Paul Cull

String matching is used in a variety of areas, including word processing, information retrieval, and computational biology. This survey considers the problem of locating within a text string all substrings that are nearly the same as an input pattern string. Distance between strings is measured by the edit distance or one of its variants. No preprocessing of the text, such as building an index, is allowed. These restrictions may not be appropriate in all applications. For example, in some biological applications, movement of gene segments may make edit distance an inappropriate measure of distance between strings. Also, in many information retrieval applications, indices are created, making the no-preprocessing assumption false. In spite of these limitations, this is still an important and useful survey.

The simplest approach to finding all occurrences of a pattern of length p within a text of length t is to compare the pattern with each run of p characters within the text. This will take time O(pt). If one wants to allow for a fixed number of errors, say k, independent of the lengths of the pattern and text, then O(kpt) time will suffice. Can this obvious approach be improved? This has been an active research area for the last twenty years, and this survey outlines the improvements that have been made. As is often the case in algorithms, there may be a difference between theoretically and practically good algorithms. For example, a theoretical algorithm may only be faster for impractically large problems, or a fast algorithm may use impractical amounts of space. So, very reasonably, this survey not only discusses the theoretical advances, but also comments on the practicality of the various algorithms.

The now classical algorithm for approximate string matching is a dynamic programming algorithm. This survey points out how improvements have been made by computing the dynamic programming matrix in various orders, by avoiding calculating unneeded parts of the matrix, and by trading space for time in these calculations. The survey also describes algorithms that are based on finite automata. In practice, both types of algorithms can be improved by using the inherent parallelism of the word level over the bit or character level. The author also describes filtering algorithms, which eliminate parts of the text where there cannot be approximate matches. In a second phase, the non-eliminated portions have to be checked to see if they really are approximate matches.

A most interesting feature of this survey is the inclusion of some experimental comparisons of several of the discussed algorithms. The experiments compare the algorithms on data from various application areas. Not unexpectedly, there is no clear winner. The behavior of the algorithms depends on the statistics of the input data. The author has available on his Web site a hybrid program that does string matching by heuristically choosing one of the three algorithms that his experiments show are good on various sorts of data.

Overall, this is a thorough review of one area of string matching. There are about 150 references, of which about 15 are by this survey's author. The article is written in an easy-to-read, informal style. There are no theorem/proofs; instead there are a few short proofs and a number of heuristic arguments. There are a few lapses in English and a couple of typos, but these are minor and do not interfere with the readability. This survey should be required reading for those interested in approximate string matching.
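As an illustration of the classical dynamic programming approach mentioned in the review, the following is a minimal sketch in Python, not code taken from the survey. It assumes unit-cost edit distance (insertions, deletions, and substitutions each cost 1) and keeps only one column of the matrix at a time; the function name `approx_search` and its parameters are hypothetical names chosen for this example.

```python
def approx_search(pattern: str, text: str, k: int) -> list[int]:
    """Return the 0-based end positions in `text` where some substring
    ending there is within edit distance `k` of `pattern`.

    Illustrative sketch only: unit edit costs are assumed.
    """
    p = len(pattern)
    # col[i] holds the smallest edit distance between pattern[:i] and any
    # substring of the text ending at the current text position.
    col = list(range(p + 1))
    matches = []
    for j, ch in enumerate(text):
        # prev_diag is the previous column's value at row i - 1.
        prev_diag = col[0]
        # col[0] stays 0 because a match may start at any text position.
        for i in range(1, p + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(
                prev_diag + cost,  # match or substitution
                col[i] + 1,        # extra character in the text
                col[i - 1] + 1,    # pattern character missing from the text
            )
            prev_diag = col[i]
            col[i] = new
        if col[p] <= k:
            matches.append(j)
    return matches


if __name__ == "__main__":
    print(approx_search("abc", "xxabcxx", 0))
    print(approx_search("survey", "surgery", 2))
```

With `k = 0` the sketch reports only exact occurrences, so the first call returns `[4]`. Each text character costs O(p) work, giving O(pt) time overall and O(p) space, which is the baseline that the improvements described in the review aim to beat.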
