Abstract
This article addresses the class imbalance problem in Bengali, a low-resource language. As a use case, we choose one of the most fundamental NLP tasks, text classification, using three benchmark text corpora: a fake-news dataset, a sentiment analysis dataset, and a song-lyrics dataset, each of which exhibits severe class imbalance. We tackle the problem with several strategies: data augmentation that generates synthetic text and embeddings to increase the proportion of minority samples; ensembles of deep learning models trained on subsets of the majority class; and the focal loss function for class-imbalanced classification. We further apply outlier detection, data resampling, and hidden-feature extraction to improve the minority-class F1 score. All of our experiments focus entirely on textual content analysis and achieve a minority-class F1 score above 90% on each of the three tasks, an excellent outcome on such highly class-imbalanced datasets.
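The focal loss mentioned above down-weights well-classified examples so that training concentrates on hard (often minority-class) samples. A minimal, framework-free sketch of the binary form follows; the function name and the `alpha`/`gamma` defaults are illustrative assumptions, not the authors' settings:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a single prediction.

    p: predicted probability of the positive (minority) class
    y: true label, 1 for the minority class, 0 for the majority class
    alpha: weight on the minority class; (1 - alpha) weights the majority
    gamma: focusing parameter; gamma = 0 recovers alpha-weighted cross-entropy
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The (1 - p_t)**gamma factor shrinks the loss of easy examples,
    # leaving hard misclassified examples to dominate the gradient.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy example (p_t = 0.9) contributes far less than a hard one (p_t = 0.1)
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
```

With `gamma = 0` the expression reduces to weighted cross-entropy; raising `gamma` progressively suppresses the contribution of confident, correct predictions.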
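The majority-subsetting ensemble strategy can be sketched as follows: partition the majority class into roughly balanced folds, pair each fold with the full minority set, and train one base model per balanced fold. The helper below is a hypothetical illustration of that splitting step, not the paper's exact procedure:

```python
import random

def majority_subset_folds(majority, minority, seed=0):
    """Split the majority class into balanced folds, each paired with
    the full minority set, so every base model trains on balanced data."""
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    # One fold per minority-sized chunk of the majority class
    k = max(1, len(majority) // max(1, len(minority)))
    folds = []
    for i in range(k):
        subset = shuffled[i::k]  # stride split keeps fold sizes near-equal
        folds.append(subset + minority)
    return folds

# 90 majority vs. 10 minority samples -> 9 balanced training sets of 20
maj = [("maj", i) for i in range(90)]
mino = [("min", i) for i in range(10)]
folds = majority_subset_folds(maj, mino)
```

Because every majority sample lands in exactly one fold, the ensemble sees all majority data overall while each base model trains on a balanced subset; predictions are then combined, e.g., by majority vote.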
Breaking the Curse of Class Imbalance: Bangla Text Classification