Research Article | Open Access

Converting Continuous-Space Language Models into N-gram Language Models with Efficient Bilingual Pruning for Statistical Machine Translation

Published: 09 January 2016

Abstract

The Language Model (LM) is an essential component of Statistical Machine Translation (SMT). In this article, we focus on developing efficient methods for LM construction. Our main contribution is a Natural N-grams based Converting (NNGC) method for transforming a Continuous-Space Language Model (CSLM) into a Back-off N-gram Language Model (BNLM). Furthermore, we develop a Bilingual LM Pruning (BLMP) approach that enhances LMs in SMT decoding and speeds up CSLM conversion. The proposed pruning and converting methods work jointly to convert a large LM efficiently: an LM can be pruned before it is converted from a CSLM without sacrificing performance, and it can be further improved when an additional corpus contains out-of-domain information. Experimental results on different SMT tasks indicate that the proposed NNGC and BLMP methods significantly outperform their existing counterparts in both BLEU and computational cost.
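To make the conversion idea concrete: a CSLM can assign a probability to any n-gram, but a back-off n-gram model stores explicit probabilities only for a finite list of n-grams, routing everything else through back-off weights. Below is a minimal sketch of this rescoring step, assuming a hypothetical cslm_prob query function and toy data structures; it illustrates the general CSLM-to-BNLM rescoring idea, not the authors' NNGC implementation.

import math

def convert_to_arpa(ngrams, cslm_prob):
    """Hedged sketch: rescore the "natural" n-grams of an existing
    back-off model with CSLM probabilities and recompute back-off
    weights so each history still sums to one.

    ngrams: dict mapping a history tuple to the set of next words
            observed with it in the corpus.
    cslm_prob(history, word): the CSLM conditional probability
            P(word | history); a stand-in, not a real API.
    """
    log_prob, backoff = {}, {}
    for history, words in ngrams.items():
        for w in words:
            # Explicit n-gram entries take their probability from the CSLM.
            log_prob[history + (w,)] = math.log10(cslm_prob(history, w))
        # Mass left for unseen continuations goes through the back-off
        # weight, as in a standard ARPA-format model.
        covered = sum(cslm_prob(history, w) for w in words)
        lower = sum(cslm_prob(history[1:], w) for w in words)
        backoff[history] = math.log10((1.0 - covered) / (1.0 - lower))
    return log_prob, backoff

def uniform(history, word):
    # Toy stand-in CSLM: uniform over a 10-word vocabulary.
    return 0.1

probs, bows = convert_to_arpa({("the",): {"cat", "dog"}}, uniform)

Restricting the rescoring to n-grams that actually occur in the corpus is what keeps the converted model's size bounded; a pruning pass such as BLMP would shrink the n-gram list before this loop runs, which is why pruning and converting can work jointly.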

