A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families

Published: 13 November 2017
Abstract

The scarcity or absence of parallel and comparable corpora makes bilingual lexicon extraction difficult for low-resource languages. Pivot-language and cognate-recognition approaches have proven useful for inducing bilingual lexicons for such languages. We propose constraint-based bilingual lexicon induction for closely related languages by extending the constraints of a recent pivot-based induction technique and by enabling multiple symmetry assumption cycles to reach many more cognates in the transgraph. We further identify cognate synonyms to obtain many-to-many translation pairs. This article uses four datasets: one Austronesian low-resource language and three Indo-European high-resource languages. As baselines, we use three constraint-based methods from our previous work, the Inverse Consultation method, and translation pairs generated from the Cartesian product of the input dictionaries. We evaluate our results using precision, recall, and F-score. Our customizable approach allows the user to conduct cross-validation to predict the optimal hyperparameters (cognate threshold and cognate synonym threshold) across various combinations of heuristics and numbers of symmetry assumption cycles so as to obtain the highest F-score. Our proposed methods yield statistically significant improvements in precision and F-score over our previous constraint-based methods. The results show that our method has the potential to complement other bilingual dictionary creation methods, such as word alignment models using parallel corpora for high-resource languages, while also handling low-resource languages well.
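
As a rough illustration of the evaluation setup described above, the following Python sketch builds candidate translation pairs by linking two input dictionaries through shared pivot words and scores them with precision, recall, and F-score. It is only a minimal sketch under stated assumptions: the function names (`pivot_candidates`, `precision_recall_f`), the placeholder tokens (`a1`, `p1`, `c1`, ...), and the toy gold standard are hypothetical, and linking every source word to every target word that shares a pivot is just one simple reading of a product-style baseline, not the article's exact procedure.

```python
from collections import defaultdict


def pivot_candidates(src_pivot, pivot_tgt):
    """Return every (source, target) pair that shares at least one pivot word.

    src_pivot: iterable of (source_word, pivot_word) dictionary entries.
    pivot_tgt: iterable of (pivot_word, target_word) dictionary entries.
    """
    targets_by_pivot = defaultdict(set)
    for pivot, tgt in pivot_tgt:
        targets_by_pivot[pivot].add(tgt)
    return {
        (src, tgt)
        for src, pivot in src_pivot
        for tgt in targets_by_pivot.get(pivot, ())
    }


def precision_recall_f(predicted, gold):
    """Compute precision, recall, and balanced F-score for two sets of pairs."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical toy dictionaries (placeholder tokens, not real language data).
src_pivot = [("a1", "p1"), ("a2", "p1"), ("a3", "p2")]
pivot_tgt = [("p1", "c1"), ("p1", "c2"), ("p2", "c3")]

# Hypothetical gold-standard translation pairs.
gold = {("a1", "c1"), ("a2", "c2"), ("a3", "c3"), ("a4", "c4")}

candidates = pivot_candidates(src_pivot, pivot_tgt)
precision, recall, f_score = precision_recall_f(candidates, gold)
print(f"{len(candidates)} candidates, "
      f"P={precision:.2f} R={recall:.2f} F={f_score:.2f}")
```

On this toy input the unfiltered pivot links over-generate: polysemous pivot words introduce wrong pairs, which lowers precision. That is the kind of noise that constraint-based filtering and cognate thresholds are meant to reduce.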

