Abstract
The Language Model (LM) is an essential component of Statistical Machine Translation (SMT). In this article, we focus on developing efficient methods for LM construction. Our main contribution is a Natural N-grams based Converting (NNGC) method for transforming a Continuous-Space Language Model (CSLM) into a Back-off N-gram Language Model (BNLM). Furthermore, we develop a Bilingual LM Pruning (BLMP) approach for enhancing LMs in SMT decoding and for speeding up CSLM conversion. Working jointly, the proposed pruning and converting methods can convert a large LM efficiently: an LM can be effectively pruned before it is converted from a CSLM without sacrificing performance, and further improved when an additional corpus contains out-of-domain information. Experimental results on different SMT tasks indicate that the proposed NNGC and BLMP methods significantly outperform existing counterpart approaches in both BLEU and computational cost.
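The core conversion idea can be illustrated with a minimal sketch. The names below are assumptions for illustration only: `cslm_prob` stands in for a trained continuous-space LM (here a toy uniform model), and `convert_natural_ngrams` re-scores only the n-grams actually observed in a corpus (the "natural" n-grams), renormalizing per history so each conditional distribution over the observed continuations sums to one, yielding back-off-style log10 probabilities. This is a simplified sketch of the general conversion strategy, not the authors' exact algorithm.

```python
import math
from collections import defaultdict

def cslm_prob(history, word):
    # Hypothetical stand-in for a trained continuous-space LM that
    # returns P(word | history); here a toy uniform distribution.
    return 1.0 / 4

def convert_natural_ngrams(corpus, order=2):
    """Re-score only the n-grams observed in the corpus ("natural"
    n-grams) with the CSLM, then renormalize per history so each
    conditional distribution sums to 1 over the observed words."""
    # Collect, for each history, the set of continuations seen in data.
    seen = defaultdict(set)
    for sent in corpus:
        tokens = ["<s>"] * (order - 1) + sent
        for i in range(order - 1, len(tokens)):
            history = tuple(tokens[i - order + 1:i])
            seen[history].add(tokens[i])
    # Query the CSLM for each natural n-gram and renormalize.
    table = {}
    for history, words in seen.items():
        scores = {w: cslm_prob(history, w) for w in words}
        z = sum(scores.values())
        for w, p in scores.items():
            table[history + (w,)] = math.log10(p / z)  # ARPA-style log10
    return table

corpus = [["a", "b"], ["a", "c"]]
lm = convert_natural_ngrams(corpus, order=2)
```

Restricting the conversion to natural n-grams is what keeps the resulting table tractable: the CSLM defines probabilities over the full vocabulary for every history, but only observed contexts are materialized into the back-off model.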
Converting Continuous-Space Language Models into N-gram Language Models with Efficient Bilingual Pruning for Statistical Machine Translation