Abstract
As a potential bilingual resource, loanwords play an important role in many natural language processing tasks. If loanwords in a low-resource language can be identified effectively, the resulting donor-recipient word pairs can benefit many cross-lingual natural language processing tasks. However, most studies on loanword identification focus on formal texts such as news and government documents, and loanword identification in social media texts remains under-studied. Because this setting poses many challenges and its output can be used in several downstream tasks, more effort should be devoted to it. In this study, we present a multi-task learning architecture with deep bi-directional recurrent neural networks for loanword identification in social media texts, in which different tasks can be supervised at different layers. The architecture learns higher-order feature representations from word and character sequences, together with information from basic spelling error checking, part-of-speech tagging, and named entity recognition. Experimental results on Uyghur loanword identification in social media texts for five donor languages (Chinese, Arabic, Russian, Turkish, and Farsi) show that our method achieves the best performance compared with several strong baseline systems. We also add the loanword detection results to the training data of neural machine translation for low-resource language pairs. Experiments show that models trained on the extended datasets achieve significant improvements over the baseline models in all language pairs.
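To make the idea of supervising different tasks at different layers concrete, the following is a minimal forward-pass sketch in plain Python. It is not the architecture from the paper: the simple tanh recurrent units, the random untrained weights, the layer sizes, the label counts, and the assignment of tasks to layers in `TASKS` are all assumptions chosen for illustration, and names such as `MultiTaskTagger` are hypothetical.

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols):
    """Small random weights; a real model would learn these by training."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class RNNDirection:
    """One direction of a simple tanh recurrent layer."""
    def __init__(self, in_dim, hid_dim):
        self.w_in = rand_matrix(hid_dim, in_dim)
        self.w_rec = rand_matrix(hid_dim, hid_dim)
        self.hid_dim = hid_dim

    def run(self, inputs):
        h = [0.0] * self.hid_dim
        states = []
        for x in inputs:
            a = matvec(self.w_in, x)
            b = matvec(self.w_rec, h)
            h = [math.tanh(p + q) for p, q in zip(a, b)]
            states.append(h)
        return states

class BiRNNLayer:
    """Runs the sequence left-to-right and right-to-left, then concatenates."""
    def __init__(self, in_dim, hid_dim):
        self.fwd = RNNDirection(in_dim, hid_dim)
        self.bwd = RNNDirection(in_dim, hid_dim)

    def run(self, inputs):
        forward = self.fwd.run(inputs)
        backward = self.bwd.run(inputs[::-1])[::-1]
        return [hf + hb for hf, hb in zip(forward, backward)]

class MultiTaskTagger:
    # task -> (index of the layer that feeds its head, number of labels).
    # Both the layer assignment and the label counts are assumptions.
    TASKS = {"spell": (0, 2), "pos": (0, 12), "ner": (1, 5), "loanword": (2, 2)}

    def __init__(self, in_dim=8, hid_dim=6, n_layers=3):
        dims = [in_dim] + [2 * hid_dim] * (n_layers - 1)
        self.layers = [BiRNNLayer(d, hid_dim) for d in dims]
        self.heads = {
            task: rand_matrix(n_labels, 2 * hid_dim)
            for task, (_, n_labels) in self.TASKS.items()
        }

    def forward(self, inputs):
        per_layer = []
        h = inputs
        for layer in self.layers:
            h = layer.run(h)
            per_layer.append(h)
        # Each task head reads the hidden states of its own layer.
        return {
            task: [softmax(matvec(self.heads[task], tok)) for tok in per_layer[idx]]
            for task, (idx, _) in self.TASKS.items()
        }

# Four tokens, each a stand-in for a combined word and character representation.
sentence = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(4)]
model = MultiTaskTagger()
predictions = model.forward(sentence)
```

In a trained system, each head's loss would be back-propagated only through its own layer and the layers below it, so the auxiliary tasks shape the lower representations that the loanword head builds on.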
Improving the Robustness of Loanword Identification in Social Media Texts