Phrase Table Induction Using Monolingual Data for Low-Resource Statistical Machine Translation

Published: 13 February 2018

Abstract

We propose a new method for inducing a phrase-based translation model from a pair of unrelated monolingual corpora. Our method handles phrases of arbitrary length and finds phrase pairs that are useful for statistical machine translation, without requiring large parallel or comparable corpora. First, our method generates phrase pairs by coupling source and target phrases collected separately from their respective monolingual data. Then, for each phrase pair, we compute features using the monolingual data and a small quantity of parallel sentences. Finally, incorrect phrase pairs are pruned, and a phrase table is built from the remaining phrase pairs. In our experiments on French–Japanese and Spanish–Japanese translation tasks under low-resource conditions, we observe that incorporating a phrase table induced by our method into the machine translation system leads to large improvements in translation quality. Furthermore, we show that a phrase table induced by our method is also useful in a wide range of configurations, including configurations where large parallel corpora are already available and configurations where only small monolingual corpora are available.
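The three steps described in the abstract (generate candidate phrase pairs, score them, prune) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify which features are used, so the single lexical-coverage score below, the hand-written `seed_lexicon` (standing in for information that would come from the small set of parallel sentences), the function names, and the threshold are all assumptions made for illustration only.

```python
from itertools import product


def collect_phrases(corpus, max_len=2):
    """Collect every n-gram of up to max_len tokens from whitespace-tokenized sentences."""
    phrases = set()
    for sentence in corpus:
        tokens = sentence.split()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                phrases.add(tuple(tokens[i:i + n]))
    return phrases


def lexical_score(src, tgt, seed_lexicon):
    """Hypothetical feature: average of source-side and target-side lexical coverage.

    A source word is covered if one of its seed translations occurs in tgt;
    a target word is covered if it is a seed translation of some source word.
    """
    tgt_words = set(tgt)
    covered_src = sum(1 for w in src if seed_lexicon.get(w, set()) & tgt_words)
    translations = set()
    for w in src:
        translations |= seed_lexicon.get(w, set())
    covered_tgt = sum(1 for w in tgt if w in translations)
    return 0.5 * (covered_src / len(src) + covered_tgt / len(tgt))


def induce_phrase_table(src_corpus, tgt_corpus, seed_lexicon, threshold=1.0, max_len=2):
    """Step 1: couple phrases; step 2: score each pair; step 3: prune below threshold."""
    src_phrases = collect_phrases(src_corpus, max_len)
    tgt_phrases = collect_phrases(tgt_corpus, max_len)
    table = {}
    for src, tgt in product(src_phrases, tgt_phrases):
        score = lexical_score(src, tgt, seed_lexicon)
        if score >= threshold:
            table[(" ".join(src), " ".join(tgt))] = score
    return table


if __name__ == "__main__":
    # Toy, unrelated monolingual "corpora" and a tiny assumed seed lexicon.
    src_corpus = ["le chat noir"]
    tgt_corpus = ["the black cat"]
    seed_lexicon = {"le": {"the"}, "chat": {"cat"}, "noir": {"black"}}
    for pair, score in sorted(induce_phrase_table(src_corpus, tgt_corpus, seed_lexicon).items()):
        print(pair, score)
```

Pairing every source phrase with every target phrase is quadratic in the number of phrases, so this exhaustive coupling is only workable on toy data; any realistic setting would need some way to restrict the candidate set before scoring.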
