Abstract
Code-switching, the juxtaposition of linguistic units from two or more languages in a single utterance, has recently become very common in text, thanks to social media and other computer-mediated forms of communication. In this exploratory study of English-Hindi code-switching on Twitter, we automatically create a large corpus of code-switched tweets and devise techniques to identify the relationship between successive components in a code-switched tweet. More specifically, we identify pragmatic functions, such as narrative-evaluative, negative reinforcement, and translation or semantically equivalent statements, that characterize the relation between successive components. We analyze the differences and similarities between switching patterns in code-switched and monolingual multi-component tweets. Narrative-evaluative switching (non-opinion to opinion or vice versa) strongly dominates in both code-switched and monolingual multi-component tweets, accounting for around 40% of cases. Polarity switching appears to be a prevalent phenomenon (10%) specifically in code-switched tweets, three to four times more frequent than in monolingual multi-component tweets, and the preference for expressing negative sentiment in Hindi is approximately twice that in English. Positive reinforcement appears to be an important pragmatic function for English multi-component tweets, whereas negative reinforcement plays a key role for Devanagari multi-component tweets. Our results also indicate that the extent and nature of code-switching depend strongly on the topic of discussion (sports, politics, etc.).
Identifying and Analyzing Different Aspects of English-Hindi Code-Switching in Twitter