Abstract
Bilingual resources play a very important role in many natural language processing (NLP) tasks, especially in cross-lingual scenarios. However, building such resources is expensive and time-consuming. Lexical borrowing happens in almost every language, which motivates us to detect loanwords effectively and to use the resulting “loanword (in the recipient language)”-“donor word (in the donor language)” pairs to extend bilingual resources for NLP tasks in low-resource languages. In this article, we propose a novel method to identify loanwords in Uyghur. The most important advantage of this method is that the model relies only on large monolingual corpora and a small amount of annotated data. Our loanword identification model has two parts: loanword candidate generation and loanword prediction. In the first part, we use two large-scale monolingual corpora and a small bilingual dictionary to train a cross-lingual embedding model. Because semantically unrelated words can rarely be treated as loanword pairs, we generate a loanword candidate list from this model and a Uyghur word list. In the second part, we select among these candidates with a log-linear model that integrates several features, such as pronunciation similarity, part-of-speech tags, and hybrid language modeling. To evaluate the effectiveness of the proposed method, we conduct two types of experiments: loanword identification and out-of-vocabulary (OOV) translation. Experimental results show that (1) the proposed method achieves significant F1 improvements over other models in all four loanword identification tasks in Uyghur, and (2) after extending the existing translation models with loanword identification results, OOV rates in several language pairs are reduced significantly and translation performance improves.
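The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. It assumes that word vectors for both languages already live in one shared cross-lingual space and that feature values have been computed elsewhere. The function names, placeholder words, toy vectors, feature names, and weights below are all hypothetical and are not taken from the article; the sketch only shows the general shape of candidate generation by embedding similarity followed by log-linear scoring.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def generate_candidates(uyghur_vecs, donor_vecs, top_k=2):
    """Stage 1: for each Uyghur word, keep the top_k donor-language words
    that are closest in the (assumed) shared embedding space."""
    candidates = {}
    for word, vec in uyghur_vecs.items():
        ranked = sorted(
            donor_vecs,
            key=lambda d: cosine(vec, donor_vecs[d]),
            reverse=True,
        )
        candidates[word] = ranked[:top_k]
    return candidates


def log_linear_score(features, weights):
    """Unnormalised log-linear score: exp(sum_i w_i * f_i)."""
    return math.exp(sum(weights[name] * value for name, value in features.items()))


def predict(candidate_features, weights):
    """Stage 2: normalise scores over the candidates and return the most
    probable donor word together with the full distribution."""
    scores = {d: log_linear_score(f, weights) for d, f in candidate_features.items()}
    total = sum(scores.values())
    probs = {d: s / total for d, s in scores.items()}
    best = max(probs, key=probs.get)
    return best, probs


if __name__ == "__main__":
    # Toy, hand-made vectors standing in for a trained cross-lingual model.
    uyghur_vecs = {"uy_word": [0.9, 0.1]}
    donor_vecs = {
        "donor_a": [0.8, 0.2],
        "donor_b": [0.1, 0.9],
        "donor_c": [0.7, 0.4],
    }
    cands = generate_candidates(uyghur_vecs, donor_vecs, top_k=2)
    print(cands)

    # Hypothetical feature values and weights for the surviving candidates.
    weights = {"pron_sim": 2.0, "pos_match": 1.0, "lm": 0.5}
    feats = {
        "donor_a": {"pron_sim": 0.9, "pos_match": 1.0, "lm": 0.6},
        "donor_c": {"pron_sim": 0.2, "pos_match": 1.0, "lm": 0.5},
    }
    print(predict(feats, weights))
```

Filtering by embedding similarity first keeps the second stage cheap: the log-linear model only has to score a short candidate list rather than the whole donor vocabulary. In practice the weights would be learned from the small annotated set rather than set by hand as they are here.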
Index Terms
Loanword Identification in Low-Resource Languages with Minimal Supervision