research-article

Improving the Robustness of Loanword Identification in Social Media Texts

Published: 24 March 2023

Abstract

As a potential bilingual resource, loanwords play an important role in many natural language processing tasks. If loanwords in a low-resource language can be identified effectively, the resulting donor-recipient word pairs can benefit many cross-lingual natural language processing tasks. However, most studies on loanword identification focus on formal texts such as news and government documents, and loanword identification in social media texts remains under-studied. Because this setting poses many challenges and its output is useful in several downstream tasks, it deserves more attention. In this study, we present a multi-task learning architecture with deep bi-directional recurrent neural networks for loanword identification in social media texts, in which supervision for different tasks can be applied at different layers. The architecture learns higher-order feature representations from word and character sequences, together with information from basic spelling error checking, part-of-speech tagging, and named entity recognition. Experimental results on Uyghur loanword identification in social media texts for five donor languages (Chinese, Arabic, Russian, Turkish, and Farsi) show that our method outperforms several strong baseline systems. We also add the loanword detection results to the training data of neural machine translation for low-resource language pairs. Experiments show that models trained on the extended datasets achieve significant improvements over the baseline models in all language pairs.


          • Published in

            ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 4
            April 2023
            682 pages
            ISSN: 2375-4699
            EISSN: 2375-4702
            DOI: 10.1145/3588902

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 24 March 2023
            • Online AM: 23 November 2022
            • Accepted: 18 November 2022
            • Revised: 17 August 2022
            • Received: 21 January 2022