DOI: 10.1145/2361354.2361404 (DocEng conference proceedings)
research-article

Full-text search on multi-byte encoded documents

Online: 04 September 2012

ABSTRACT

The Burrows-Wheeler transform (BWT) has become popular in text compression, full-text search, XML representation, and DNA sequence matching. Full-text search on BWT-encoded text can be performed very efficiently using backward search. This paper studies different approaches for applying BWT to multi-byte encoded (e.g. UTF-16) text documents. While previous work has studied BWT on word-based models, and BWT can be applied directly to multi-byte encodings (by treating the document as single-byte coded), there has been no extensive study of how to utilize BWT on multi-byte encoded documents for efficient full-text search. Therefore, in this paper, we propose several ways to efficiently backward search multi-byte text documents. We demonstrate our findings using Chinese text documents. Our experimental results show that our extensions to the standard BWT method offer faster search performance and use less runtime memory.
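To make the contrast in the abstract concrete, the following is a minimal sketch of backward search over a BWT, run once with the UTF-16 document treated as a stream of single bytes and once with each 16-bit code unit treated as one symbol. It is not the method proposed in the paper: the function names (`build_index`, `backward_search`, `utf16_bytes`, `utf16_units`) are hypothetical, the suffix sorting and occurrence counting are deliberately naive, and the sample strings are made up for illustration.

```python
def utf16_bytes(s):
    """Document as single-byte symbols (UTF-16, big-endian)."""
    return list(s.encode("utf-16-be"))


def utf16_units(s):
    """Document as 16-bit code-unit symbols (UTF-16, big-endian)."""
    b = s.encode("utf-16-be")
    return [int.from_bytes(b[i:i + 2], "big") for i in range(0, len(b), 2)]


def build_index(symbols):
    """Naive BWT plus C table over a list of non-negative integer symbols."""
    text = list(symbols) + [-1]  # -1 is a unique sentinel, smaller than any symbol
    order = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    bwt = [text[i - 1] for i in order]  # symbol preceding each sorted suffix
    counts = {}
    for c in text:
        counts[c] = counts.get(c, 0) + 1
    C, total = {}, 0  # C[c] = number of symbols in text strictly smaller than c
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    return bwt, C


def backward_search(bwt, C, pattern):
    """Return (match count, backward steps taken) for a symbol pattern."""
    lo, hi = 0, len(bwt)  # half-open range of matching rows
    steps = 0
    for c in reversed(pattern):
        if c not in C:
            return 0, steps
        # Naive occurrence counts; a real index would use rank structures.
        lo = C[c] + bwt[:lo].count(c)
        hi = C[c] + bwt[:hi].count(c)
        steps += 1
        if lo >= hi:
            return 0, steps
    return hi - lo, steps


if __name__ == "__main__":
    doc, query = "中文全文检索，中文压缩", "中文"  # illustrative sample only
    for name, encode in (("bytes", utf16_bytes), ("code units", utf16_units)):
        bwt, C = build_index(encode(doc))
        print(name, backward_search(bwt, C, encode(query)))
```

In this sketch the byte-level search takes one backward step per byte, so twice as many steps as the code-unit search for the same query. A byte-level search can also report matches that straddle code-unit boundaries, so reported positions would need an alignment check. The sketch says nothing about memory use or real-world speed, which depend on how the occurrence counts are stored for the larger alphabet.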
