Abstract
Due to the lack of large annotated corpora, many resource-poor Indian languages struggle to reap the benefits of recent deep feature representations in Natural Language Processing (NLP). Moreover, adapting existing language models trained on large English corpora to Indian languages is often limited by data availability, rich morphological variation, and syntactic and semantic differences. In this paper, we explore representations ranging from traditional to recent efficient ones to overcome the challenges of a low-resource language, Telugu. Overall, we make several contributions for this resource-poor language: (i) large annotated datasets (35,142 sentences per task) for four NLP tasks, namely sentiment analysis, emotion identification, hate-speech detection, and sarcasm detection; (ii) lexicons for sentiment, emotion, and hate speech that improve model performance; (iii) pretrained word and sentence embeddings; and (iv) pretrained language models for Telugu (ELMo-Te, BERT-Te, RoBERTa-Te, ALBERT-Te, and DistilBERT-Te) trained on a large Telugu corpus of 8,015,588 sentences (1,637,408 sentences from Telugu Wikipedia and 6,378,180 sentences crawled from various Telugu websites). Further, we show that these representations significantly improve performance on the four NLP tasks and present benchmark results for Telugu. We argue that our pretrained embeddings are competitive with or better than existing multilingual pretrained models: mBERT, XLM-R, and IndicBERT. Lastly, fine-tuning the pretrained models yields higher performance than linear probing on the four NLP tasks, with the following F1-scores: Sentiment (68.72), Emotion (58.04), Hate-Speech (64.27), and Sarcasm (77.93).
We also experiment on publicly available Telugu datasets (Named Entity Recognition, Article Genre Classification, and Sentiment Analysis) and find that our Telugu pretrained language models (BERT-Te and RoBERTa-Te) outperform the state-of-the-art systems on all but the sentiment task. We open-source our corpus, four datasets, lexicons, embeddings, and code at https://github.com/Cha14ran/DREAM-T. The pretrained Transformer models for Telugu are available at https://huggingface.co/ltrctelugu.
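The abstract reports task performance as F1-scores. For readers unfamiliar with the metric, the following is a minimal pure-Python sketch of macro-averaged F1 (per-class F1 averaged with equal weight). The averaging scheme actually used in the paper is not stated in this excerpt, so macro-F1 is an illustrative assumption, and the toy sentiment labels below are invented:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then take the unweighted mean."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_per_class = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        f1_per_class.append(f1)
    return sum(f1_per_class) / len(f1_per_class)

# Toy binary sentiment labels: 0 = negative, 1 = positive (invented example)
gold = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 0, 1]
print(round(macro_f1(gold, pred), 4))  # → 0.6667
```

Unlike accuracy, macro-F1 weights every class equally, which matters for the imbalanced label distributions typical of hate-speech and sarcasm data.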
Am I a Resource-Poor Language? Data Sets, Embeddings, Models and Analysis for four different NLP Tasks in Telugu Language