A Fast and Compact Language Model Implementation Using Double-Array Structures

Published: 29 April 2016

Abstract

Language models are widely used components in fields such as natural language processing, automatic speech recognition, and optical character recognition. In statistical machine translation in particular, both translation speed and memory consumption depend heavily on the performance of the language model implementation.
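
The central operation such a component must support is the n-gram probability query with backoff. As a schematic illustration only (not the implementation proposed in this article), a Katz-style backoff query can be sketched as follows, where `probs` and `backoffs` are hypothetical lookup tables keyed by n-gram tuples:

```python
# Schematic backoff n-gram query: fall back to shorter contexts,
# accumulating backoff weights, until a matching n-gram is found.
# `probs` and `backoffs` are illustrative placeholders for whatever
# lookup structure (e.g., a trie) the implementation provides.

def query(probs, backoffs, context, word, unk_logprob=-100.0):
    penalty = 0.0  # accumulated log backoff weights
    context = tuple(context)
    while True:
        ngram = context + (word,)
        if ngram in probs:
            return probs[ngram] + penalty
        if not context:
            return unk_logprob + penalty  # unseen even as a unigram
        penalty += backoffs.get(context, 0.0)
        context = context[1:]  # drop the oldest word and back off

# Toy model: P(b | a) is stored directly; "c" is reachable only via backoff.
probs = {("a", "b"): -0.5, ("b",): -1.0, ("c",): -2.0}
backoffs = {("a",): -0.3}
print(query(probs, backoffs, ("a",), "b"))  # -0.5 (direct hit)
print(query(probs, backoffs, ("a",), "c"))  # -2.3 (backoff weight + unigram)
```

Because a decoder issues millions of such queries, the data structure behind the `probs` and `backoffs` lookups dominates both speed and memory.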

We propose a fast and compact implementation of n-gram language models that increases query speed and reduces memory usage by using a double-array structure, a trie representation known for its speed and compactness. We propose two variants: one based on backward suffix trees and the other on reverse tries. The data structure is optimized for space efficiency by embedding model parameters into otherwise unused slots of the double-array structure.
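
For readers unfamiliar with the double-array structure, the following is a minimal sketch of the classic BASE/CHECK representation (Aoe 1989) with a naive offline packer. It illustrates only the traversal rule `t = base[s] + code(c)`, validated by `check[t] == s`; it is not the backward-suffix-tree or reverse-trie variant, and it omits the parameter-embedding optimization described here:

```python
# Minimal double-array trie: a trie packed into two parallel arrays.
# A transition from state s on character code c lands at t = base[s] + c
# and is valid iff check[t] == s, giving O(1) work per character.
# The packer below is a naive first-fit illustration, not a production one.
from collections import deque

N = 1024            # fixed array size, ample for the toy example
END = 0             # code 0 marks end-of-key; letters get codes 1..26

def code(ch):
    return END if ch is None else ord(ch) - ord("a") + 1

def build(keys):
    # 1) Build an ordinary pointer-based trie.
    trie = {}
    for key in keys:
        node = trie
        for ch in key:
            node = node.setdefault(ch, {})
        node[None] = {}  # terminal marker
    # 2) Pack it breadth-first into BASE/CHECK with first-fit slot search.
    base, check = [0] * N, [-1] * N
    used = [False] * N
    used[0] = True                      # state 0 is the root
    queue = deque([(0, trie)])
    while queue:
        s, node = queue.popleft()
        if not node:
            continue
        cs = [code(ch) for ch in node]
        b = 1
        while any(used[b + c] for c in cs):
            b += 1                      # first base where all children fit
        base[s] = b
        for ch, child in node.items():
            t = b + code(ch)
            used[t], check[t] = True, s
            queue.append((t, child))
    return base, check

def lookup(base, check, key):
    s = 0
    for ch in list(key) + [None]:       # trailing None checks the end marker
        t = base[s] + code(ch)
        if t >= N or check[t] != s:
            return False
        s = t
    return True

base, check = build(["ab", "a"])
print(lookup(base, check, "ab"), lookup(base, check, "a"))   # True True
print(lookup(base, check, "b"), lookup(base, check, "abc"))  # False False
```

The slots left unused by the packer (`used[t] == False`) are the kind of space the proposed method reuses: instead of leaving them empty, model parameters such as probabilities and backoff weights are embedded there.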

We show that the reverse-trie variant of our method is among the smallest state-of-the-art implementations in model size while running at almost the same speed as the fastest implementation on perplexity calculation tasks. On translation tasks, it likewise achieves faster decoding with compact model sizes, confirming that our method exploits the efficiency of the double-array structure to balance speed and size.



Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 15, Issue 4
June 2016, 173 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/2915955

      Copyright © 2016 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 29 April 2016
      • Accepted: 1 January 2016
      • Revised: 1 November 2015
      • Received: 1 September 2014


      Qualifiers

      • research-article
      • Research
      • Refereed
