Abstract
The availability of corpora is a basic requirement for conducting research in any language. Urdu is a morphologically rich language used by over 100 million people around the globe, yet the dearth of corpora remains a major reason for the lack of attention to, and advancement in, Urdu research. To address this gap, we present Roman-Urdu-Parl, the first large-scale publicly available Roman-Urdu parallel corpus, with 6.37 million sentence pairs. The corpus was collected from diverse sources, annotated using crowd-sourcing techniques, and checked for quality. It contains 92.76 million Roman-Urdu words and 92.85 million Urdu words, with a Roman-Urdu vocabulary of 42.9K words and an Urdu vocabulary of 43.8K words. Roman-Urdu-Parl has been built to capture not only the morphological and linguistic features of the language but also the heterogeneity and variation arising from demographic conditions. We validate the authenticity and quality of the corpus by applying it to two natural language processing problems: learning word embeddings and building a machine transliteration system. The corpus leads to exceptional results in both settings; for example, our machine transliteration system sets a new state of the art with a Bilingual Evaluation Understudy (BLEU) score of 84.67. We believe that Roman-Urdu-Parl can fuel and advance work in many research areas related to the Urdu language.
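The word and vocabulary figures quoted in the abstract are the kind of statistics that can be gathered in a single pass over a parallel corpus. The following is a minimal Python sketch, not the authors' procedure. It assumes plain whitespace tokenization and an in-memory list of (Roman-Urdu, Urdu) sentence pairs; the function name `corpus_stats` and the toy sentences are hypothetical and used only for illustration.

```python
from collections import Counter


def corpus_stats(pairs):
    """Count sentence pairs, tokens, and distinct word types per side.

    `pairs` is an iterable of (roman_urdu_sentence, urdu_sentence) tuples.
    Tokenization is plain whitespace splitting (an assumption; the
    corpus authors may tokenize differently).
    """
    roman_counts, urdu_counts = Counter(), Counter()
    n_pairs = 0
    for roman, urdu in pairs:
        roman_counts.update(roman.split())
        urdu_counts.update(urdu.split())
        n_pairs += 1
    return {
        "sentence_pairs": n_pairs,
        "roman_tokens": sum(roman_counts.values()),
        "urdu_tokens": sum(urdu_counts.values()),
        "roman_vocab": len(roman_counts),
        "urdu_vocab": len(urdu_counts),
    }


# Toy data for illustration only; not drawn from Roman-Urdu-Parl.
sample = [
    ("aap kaise hain", "آپ کیسے ہیں"),
    ("aap kahan hain", "آپ کہاں ہیں"),
]
stats = corpus_stats(sample)
print(stats)
```

For a corpus with millions of sentence pairs, the same function could be fed a generator that reads the pairs lazily from disk, so that only the two counters are held in memory.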
Roman-Urdu-Parl: Roman-Urdu and Urdu Parallel Corpus for Urdu Language Understanding