Linguistic Resources for Bhojpuri, Magahi, and Maithili: Statistics about Them, Their Similarity Estimates, and Baselines for Three Applications

Published: 13 September 2021

Abstract

Preparing corpora for low-resource languages, and developing human language technology to analyze or computationally process them, is a laborious task, primarily because expert linguists who are native speakers of these languages are rarely available and because of the time and resources required. Bhojpuri, Magahi, and Maithili, languages of the Purvanchal region of India (in the north-eastern parts), are low-resource languages belonging to the Indo-Aryan (or Indic) family. They are closely related to Hindi, a relatively high-resource language, which is why we compare them with Hindi. We collected corpora for these three languages from various sources and cleaned them to the extent possible without changing the data in them. The text belongs to different domains and genres. We calculated some basic statistical measures for these corpora at the character, word, syllable, and morpheme levels. These corpora were also annotated with part-of-speech (POS) and chunk tags. The basic statistical measures were both absolute and relative and were expected to indicate linguistic properties such as morphological, lexical, phonological, and syntactic complexity (or richness). The results were compared with a standard Hindi corpus. For most of the measures, we tried to match the corpus size across the languages to avoid the effect of corpus size, but in some cases using the full corpus turned out to be better, even when the sizes were very different. Although the results are not clear-cut, we tried to draw some conclusions about the languages and the corpora. For POS tagging and chunking, the BIS tagset was used to manually annotate the data. The POS-tagged data comprise 16,067, 14,669, and 12,310 sentences for Bhojpuri, Magahi, and Maithili, respectively. The chunk-annotated data comprise 9,695 and 1,954 sentences for Bhojpuri and Maithili, respectively. The inter-annotator agreement for these annotations, measured with Cohen's Kappa, was 0.92, 0.64, and 0.74, respectively, for the three languages. These annotated corpora have been used to develop preliminary automated tools: a POS tagger, a chunker, and a language identifier. We have also developed a bilingual dictionary (Purvanchal languages to Hindi) and a synset resource (which can later be integrated into the Indo-WordNet) as additional resources. The main contribution of the work is the creation of basic resources to facilitate further language processing research for these languages, together with some quantitative measures about them and about their similarities among themselves and with Hindi. For similarities, we use a somewhat novel measure of language similarity based on an n-gram-based language identification algorithm. An additional contribution is a set of baselines for three basic NLP applications (POS tagging, chunking, and language identification) for these closely related languages.
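
As an aid to reading the agreement figures above, the following is a minimal sketch of how Cohen's Kappa can be computed for two annotators who tag the same tokens. It uses the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name `cohen_kappa` and the toy tag sequences are assumptions made for illustration only; they are not the authors' code or data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(labels_a)

    # Observed agreement: fraction of items that received the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement, from each annotator's marginal label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy annotations (illustrative tags, not data from the corpora).
annotator_1 = ["N_NN", "V_VM", "N_NN", "PSP", "N_NN", "V_VM"]
annotator_2 = ["N_NN", "V_VM", "JJ", "PSP", "N_NN", "N_NN"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

In this toy case the annotators agree on 4 of 6 tokens (p_o = 2/3) while chance agreement is p_e = 1/3, so kappa is 0.5. Values such as 0.92 therefore indicate agreement well above what the tag distribution alone would produce.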

