note

Loanword Identification in Low-Resource Languages with Minimal Supervision

Published: 20 February 2020

Abstract

Bilingual resources play an important role in many natural language processing (NLP) tasks, especially in cross-lingual scenarios. However, building such resources is expensive and time consuming. Lexical borrowing occurs in almost every language, which motivates us to detect loanwords effectively and to use pairs of a loanword (in the recipient language) and its donor word (in the donor language) to extend bilingual resources for NLP tasks in low-resource languages. In this article, we propose a novel method to identify loanwords in Uyghur. Its most important advantage is that the model relies only on large monolingual corpora and a small amount of annotated data. Our loanword identification model has two parts: loanword candidate generation and loanword prediction. In the first part, we use two large-scale monolingual corpora and a small bilingual dictionary to train a cross-lingual embedding model. Because semantically unrelated words usually cannot form loanword pairs, we generate a loanword candidate list from this model and a Uyghur word list. In the second part, we select loanwords from these candidates with a log-linear model that integrates several features, such as pronunciation similarity, part-of-speech tags, and hybrid language modeling. To evaluate the proposed method, we conduct two types of experiments: loanword identification and out-of-vocabulary (OOV) translation. Experimental results show that (1) the proposed method achieves significant F1 improvements over other models in all four loanword identification tasks in Uyghur, and (2) after extending existing translation models with loanword identification results, OOV rates in several language pairs decrease significantly and translation performance improves.
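The two-part pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy embeddings, romanized word forms, feature weights, and the 0.5 threshold are all hypothetical, `difflib.SequenceMatcher` is only a stand-in for a real pronunciation-similarity measure, and the hybrid language-model feature is omitted.

```python
import math
from difflib import SequenceMatcher


def cosine(u, v):
    """Cosine similarity between two vectors in a shared embedding space."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def generate_candidates(word_vec, donor_vecs, threshold=0.5):
    """Part 1: keep donor words that are semantically close to the word."""
    return [d for d, vec in donor_vecs.items()
            if cosine(word_vec, vec) >= threshold]


def pronunciation_similarity(a, b):
    """Stand-in feature: string similarity of romanized forms (assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def log_linear_probs(feature_vectors, weights):
    """Part 2: softmax over weighted feature sums (a log-linear model)."""
    scores = [sum(w * f for w, f in zip(weights, fv))
              for fv in feature_vectors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data (hypothetical): one Uyghur word and three donor-language words.
word, word_vec = "telefon", [1.0, 0.0]
donor_vecs = {
    "telefon": [0.9, 0.1],
    "dianhua": [0.8, 0.2],
    "kitab": [0.0, 1.0],
}

candidates = generate_candidates(word_vec, donor_vecs)

# Features per candidate: embedding similarity, pronunciation similarity,
# and a part-of-speech match indicator (fixed to 1.0 in this toy example).
features = [
    [cosine(word_vec, donor_vecs[c]), pronunciation_similarity(word, c), 1.0]
    for c in candidates
]
weights = [1.0, 2.0, 0.5]  # hypothetical; in practice learned from annotated data

probs = log_linear_probs(features, weights)
best = candidates[probs.index(max(probs))]
```

In this sketch the embedding filter discards the semantically unrelated word before any scoring happens, and the log-linear step then ranks the remaining candidates, so the form that also sounds alike ends up with the highest probability.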

