Abstract
Preparing corpora for low-resource languages, and developing human language technology to analyze or computationally process them, is a laborious task, primarily because expert linguists who are native speakers of these languages are unavailable and because of the time and resources required. Bhojpuri, Magahi, and Maithili, languages of the Purvanchal region of India (in the north-eastern parts), are low-resource languages belonging to the Indo-Aryan (or Indic) family. They are closely related to Hindi, a relatively high-resource language, which is why we compare them with Hindi. We collected corpora for these three languages from various sources and cleaned them to the extent possible without changing the data in them. The text belongs to different domains and genres. We calculated some basic statistical measures for these corpora at the character, word, syllable, and morpheme levels. These corpora were also annotated with part-of-speech (POS) and chunk tags. The basic statistical measures were both absolute and relative and were expected to indicate linguistic properties such as morphological, lexical, phonological, and syntactic complexity (or richness). The results were compared with a standard Hindi corpus. For most of the measures, we tried to match the corpus size across the languages to avoid the effect of corpus size, but in some cases using the full corpus turned out to be better, even when the sizes were very different. Although the results are not very clear, we tried to draw some conclusions about the languages and the corpora. For POS tagging and chunking, the data were manually annotated using the BIS tagset. The POS-tagged data comprise 16,067, 14,669, and 12,310 sentences for Bhojpuri, Magahi, and Maithili, respectively. The chunked data comprise 9,695 and 1,954 sentences for Bhojpuri and Maithili, respectively.
The inter-annotator agreement for these annotations, measured using Cohen’s Kappa, was 0.92, 0.64, and 0.74, respectively, for the three languages. These annotated corpora have been used to develop preliminary automated tools: a POS tagger, a chunker, and a language identifier. We have also developed a bilingual dictionary (Purvanchal languages to Hindi) and a synset (that can be integrated later into the Indo-WordNet) as additional resources. The main contribution of this work is the creation of basic resources to facilitate further language processing research for these languages, along with some quantitative measures of the languages and of their similarity to one another and to Hindi. For similarity, we use a somewhat novel measure of language similarity based on an n-gram-based language identification algorithm. An additional contribution is a set of baselines for three basic NLP applications (POS tagging, chunking, and language identification) for these closely related languages.
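The inter-annotator agreement figures above are Cohen’s Kappa values. As a minimal sketch of how such a value can be computed for two annotators tagging the same tokens, the statistic is kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name `cohen_kappa` and the toy tag sequences are assumptions made for illustration; they are not the authors' code or data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal probabilities of using it, summed over all labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical POS labels for six tokens (illustrative only).
annotator_a = ["NN", "VM", "NN", "JJ", "PSP", "NN"]
annotator_b = ["NN", "VM", "JJ", "JJ", "PSP", "NN"]
print(round(cohen_kappa(annotator_a, annotator_b), 4))
```

In this toy case the annotators agree on five of six tokens, yet the kappa is lower than the raw agreement of 0.83 because part of that agreement is attributable to chance.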
Linguistic Resources for Bhojpuri, Magahi, and Maithili: Statistics about Them, Their Similarity Estimates, and Baselines for Three Applications