ABSTRACT
The Burrows-Wheeler transform (BWT) has become popular in text compression, full-text search, XML representation, and DNA sequence matching. Full-text search over BWT-encoded text can be performed very efficiently using backward search. This paper studies different approaches for applying BWT to multi-byte encoded (e.g. UTF-16) text documents. Previous work has studied BWT on word-based models, and BWT can be applied directly to multi-byte encodings by treating the document as single-byte coded, but there has been no extensive study of how to utilize BWT on multi-byte encoded documents for efficient full-text search. We therefore propose several ways to perform backward search efficiently on multi-byte text documents. We demonstrate our findings using Chinese text documents. Our experimental results show that our extensions to the standard BWT method offer faster search performance and use less runtime memory.
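To make the distinction between byte-level and multi-byte symbols concrete, the following is a minimal Python sketch of textbook backward search over a BWT. It is not the method proposed in the paper. The function names (`build_index`, `occ`, `backward_search`, `count_as_bytes`, `count_as_chars`) are hypothetical, the suffix sorting is deliberately naive, and the character-level variant assumes every character fits in a single UTF-16 code unit (Basic Multilingual Plane only).

```python
def build_index(symbols):
    """Build a naive BWT index over a sequence of non-negative integers."""
    text = list(symbols) + [-1]  # -1 is a sentinel smaller than any symbol
    # Naive suffix array: sort suffix start positions by the suffix itself.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the symbol preceding each suffix in sorted order.
    bwt = [text[i - 1] for i in sa]
    # C[c] = number of symbols in the text that are strictly smaller than c.
    counts = {}
    for s in text:
        counts[s] = counts.get(s, 0) + 1
    C, total = {}, 0
    for s in sorted(counts):
        C[s] = total
        total += counts[s]
    return bwt, C


def occ(bwt, c, i):
    """Occurrences of c in bwt[0:i] (linear scan; real indexes precompute this)."""
    return bwt[:i].count(c)


def backward_search(bwt, C, pattern):
    """Return the number of occurrences of pattern, scanning it right to left."""
    lo, hi = 0, len(bwt)  # half-open range of matching rows
    for c in reversed(list(pattern)):
        if c not in C:
            return 0
        lo = C[c] + occ(bwt, c, lo)
        hi = C[c] + occ(bwt, c, hi)
        if lo >= hi:
            return 0
    return hi - lo


def count_as_bytes(text, pattern):
    """Treat the UTF-16 (big-endian) document as single-byte coded."""
    bwt, C = build_index(text.encode("utf-16-be"))
    return backward_search(bwt, C, pattern.encode("utf-16-be"))


def count_as_chars(text, pattern):
    """Treat each character as one symbol (assumes BMP characters only)."""
    bwt, C = build_index(ord(ch) for ch in text)
    return backward_search(bwt, C, [ord(ch) for ch in pattern])


doc = "\u4e4e\u4e4e"             # two copies of U+4E4E, bytes 4E 4E 4E 4E
print(count_as_chars(doc, "\u4e4e"))  # 2
print(count_as_bytes(doc, "\u4e4e"))  # 3: one match straddles a character boundary
```

In this sketch the byte-level index takes two backward-search steps per character and also reports a match at a misaligned byte offset, while the character-level index takes one step per character at the cost of a much larger alphabet. These are the kinds of trade-offs that arise when choosing the symbol unit for multi-byte text.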
Index Terms
Full-text search on multi-byte encoded documents