research-article

Inducing a Bilingual Lexicon from Short Parallel Multiword Sequences

Published: 17 March 2017

Abstract

This article proposes a technique for mining bilingual lexicons from pairs of short parallel word sequences. The technique builds a generative model from a training corpus of such pairs. The model is a hierarchical nonparametric Bayesian model that induces a bilingual lexicon directly during training. It learns in an unsupervised manner and is designed to exploit characteristics of the language pairs being mined. In addition to the commonly used word-pair frequency information, the model can employ the internal character alignments within the words themselves. It is thereby capable of mining transliterations and can use reliably aligned transliteration pairs to support the mining of other words in their context. The model can also perform word reordering and word deletion during the alignment process, and it can operate in the absence of full segmentation information. In this work, we study two mining tasks based on the English-Japanese and English-Chinese language pairs and compare the proposed approach to baselines based on simpler models that use only word-pair frequency information. Our results show that the proposed method mines bilingual word pairs at higher levels of precision and recall than the baselines.
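
To make "word-pair frequency information" concrete, the following is a minimal sketch of a co-occurrence scorer over parallel word sequences. It is not the article's model or its baseline: the Dice coefficient, the function name `dice_scores`, and the toy data are all assumptions chosen for illustration, since the abstract does not specify how the baselines use frequencies.

```python
from collections import Counter
from itertools import product


def dice_scores(pairs):
    """Score each co-occurring (source, target) word pair with the Dice coefficient.

    `pairs` is a list of (source_words, target_words) tuples. Counts are taken
    per sequence pair, so a word repeated inside one sequence counts once.
    This is an illustrative frequency-only scorer, not the article's method.
    """
    src_count, tgt_count, joint_count = Counter(), Counter(), Counter()
    for src_words, tgt_words in pairs:
        src_set, tgt_set = set(src_words), set(tgt_words)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        # Every source word co-occurs with every target word in the same pair.
        joint_count.update(product(src_set, tgt_set))
    return {
        (s, t): 2.0 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in joint_count.items()
    }


# Toy data invented for illustration; not taken from the article.
toy_pairs = [
    (["tokyo", "tower"], ["東京", "タワー"]),
    (["tokyo", "station"], ["東京", "駅"]),
    (["kyoto", "station"], ["京都", "駅"]),
]
scores = dice_scores(toy_pairs)
```

A scorer of this kind sees only how often words appear together. It cannot tell that "tower" and "タワー" are transliterations of each other, which is the kind of character-level evidence the abstract says the proposed model adds.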



      • Published in

        ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 16, Issue 3
        September 2017
        167 pages
        ISSN: 2375-4699
        EISSN: 2375-4702
        DOI: 10.1145/3041821

        Copyright © 2017 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 17 March 2017
        • Accepted: 1 September 2016
        • Revised: 1 June 2016
        • Received: 1 November 2015


        Qualifiers

        • research-article
        • Research
        • Refereed
