
LFWE: Linguistic Feature Based Word Embedding for Hindi Fake News Detection

Published: 16 June 2023

Abstract

It is essential for research communities to investigate ways to authenticate news. Linguistic feature based analysis for automatically detecting false news is gaining popularity in the scientific community. However, such techniques have been developed exclusively for English, leaving low-resource languages like Hindi behind. To address this issue, we constructed a novel annotated Hindi Fake News (HinFakeNews) dataset of roughly 33,300 articles that can be used to develop automated fake news detection systems. This work provides a two-stage benchmark model for identifying fake news in Hindi using machine learning. The proposed model, LFWE (Linguistic Feature Based Word Embedding), generates word embeddings over linguistic features. This article focuses on 23 key linguistic features (15 extracted and 8 derived) for detecting Hindi fake news. These features are grouped as lexical, semantic, syntactic, psycho-linguistic, readability, and quantity features. The contribution is twofold. In the first phase, the dataset is preprocessed and linguistic features are extracted. In the second phase, feature sets are generated as word embeddings, and ensemble voting classification is carried out on the feature sets. According to the experimental findings, the LFWE model detects and classifies fake news in Hindi with an accuracy of 98.49%.
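To make the two phases more concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not list the 23 features or say whether the ensemble uses hard or soft voting, so the four features computed here, the function names `quantity_features` and `hard_vote`, and the choice of a hard majority vote are all illustrative assumptions.

```python
import re
from collections import Counter

# Assumption: sentences end at the Devanagari danda (।) or at ?, ! or a full stop.
SENTENCE_END = re.compile(r"[।?!.]+")


def quantity_features(text):
    """Compute a few simple quantity/lexical features for one article.

    Hypothetical feature set for illustration; tokens are whitespace-separated
    and any other punctuation stays attached to the word it follows.
    """
    sentences = [s.strip() for s in SENTENCE_END.split(text) if s.strip()]
    words = [w for s in sentences for w in s.split()]
    n_words = len(words)
    return {
        "sentence_count": len(sentences),
        "word_count": n_words,
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
    }


def hard_vote(predictions):
    """Return the label predicted by the most classifiers (hard majority vote)."""
    return Counter(predictions).most_common(1)[0][0]


if __name__ == "__main__":
    article = "यह खबर सच है। यह खबर झूठ है!"
    print(quantity_features(article))
    # Labels from three hypothetical base classifiers for the same article.
    print(hard_vote(["fake", "real", "fake"]))
```

In a fuller pipeline, feature dictionaries like these would be turned into vectors and passed to several base classifiers whose labels are then combined by the vote.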


Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 6
June 2023
635 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3604597

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 16 June 2023
• Online AM: 31 March 2023
• Accepted: 24 March 2023
• Revised: 17 March 2023
• Received: 22 February 2023
