
Breaking the Curse of Class Imbalance: Bangla Text Classification

Published: 29 April 2022
Abstract

This article addresses the class imbalance problem in Bengali, a low-resource language. As a use case, we choose one of the most fundamental NLP tasks, text classification, using three benchmark text corpora: a fake-news dataset, a sentiment analysis dataset, and a song-lyrics dataset, each of which is severely class-imbalanced. We tackle the problem with several strategies, including data augmentation with synthetic samples, generated both as text and as embeddings, to increase the proportion of minority samples. We also ensemble deep learning models trained on subsets of the majority class, and we employ the focal loss function for class-imbalanced classification. In addition, we apply outlier detection, data resampling, and hidden-feature extraction to improve the minority-class F1 score. All of our experiments focus entirely on textual content analysis and achieve a minority-class F1 score above 90% on each of the three tasks, an excellent outcome on such highly class-imbalanced datasets.
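The focal loss mentioned in the abstract down-weights the contribution of easy, confidently classified examples so that training gradients are dominated by hard (often minority-class) examples. The sketch below is a minimal binary form of that idea, not the paper's implementation; the function name and the default values of `gamma` and `alpha` are illustrative:

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single example.

    p: predicted probability of the positive class (0 < p < 1)
    y: true label, 0 or 1
    gamma: focusing parameter; larger values suppress easy examples more
    alpha: class-balancing weight for the positive class
    """
    # p_t is the probability assigned to the true class
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # With gamma = 0 this reduces to alpha-weighted cross-entropy;
    # (1 - p_t)^gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma = 0` and `alpha = 0.5` the expression reduces to ordinary (halved) cross-entropy, while increasing `gamma` makes a confident correct prediction contribute almost nothing relative to a borderline one.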



    • Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 5
September 2022, 486 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3533669


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 29 April 2022
      • Online AM: 3 February 2022
      • Accepted: 1 January 2022
      • Revised: 1 November 2021
      • Received: 1 May 2021

      Qualifiers

      • research-article
      • Refereed
