Abstract
Research communities need reliable ways of authenticating news. Linguistic feature-based analysis for automatically detecting false news is gaining popularity in the scientific community. However, such techniques have been developed predominantly for English, leaving low-resource languages like Hindi behind. To address this issue, we constructed a novel annotated Hindi Fake News (HinFakeNews) dataset of roughly 33,300 articles that can be used to develop automated fake news detection systems. This work provides a two-stage benchmark model for identifying fake news in Hindi using machine learning. The proposed model, LFWE (Linguistic Feature Based Word Embedding), generates word embeddings over linguistic features. This article focuses on 23 key linguistic features (15 extracted and 8 derived) for detecting Hindi fake news. These features are grouped as lexical, semantic, syntactic, psycho-linguistic, readability, and quantity features. The model works in two phases. In the first phase, the dataset is preprocessed and linguistic features are extracted. In the second phase, feature sets are generated as word embeddings, and ensemble voting classification is carried out on the feature sets. According to experimental findings, the LFWE model classifies fake news in Hindi with an accuracy of 98.49%.
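As a rough illustration of the two phases, the sketch below computes a few surface-level features from a Hindi text and combines classifier outputs by hard majority voting. It is not the LFWE implementation: the function names, the four features shown, and the example labels are assumptions chosen for illustration, and the word-embedding step is omitted entirely.

```python
from collections import Counter

DANDA = "।"  # Devanagari sentence terminator


def quantity_features(text):
    """Return a few surface-level features for one article (illustrative only)."""
    # Treat '?' and '!' as sentence terminators alongside the danda.
    normalized = text.replace("?", DANDA).replace("!", DANDA)
    sentences = [s for s in normalized.split(DANDA) if s.strip()]
    words = normalized.replace(DANDA, " ").split()
    n_words = len(words)
    return {
        "word_count": n_words,
        "sentence_count": len(sentences),
        # Length is counted in Unicode code points, not syllables.
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
    }


def majority_vote(predictions):
    """Hard voting: return the label predicted by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]


if __name__ == "__main__":
    features = quantity_features("यह खबर सच है। यह खबर झूठ है।")
    print(features)
    # Hypothetical labels from three separately trained classifiers.
    print(majority_vote(["fake", "real", "fake"]))
```

In a full system, each feature vector would be passed to several trained classifiers and `majority_vote` would receive their predicted labels; using an odd number of voters avoids ties in the binary case.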
Index Terms
LFWE: Linguistic Feature Based Word Embedding for Hindi Fake News Detection