Research Article
A Two-stage Text Feature Selection Algorithm for Improving Text Classification

Published: 05 May 2021
Abstract

As the number of digital text documents grows daily, classifying text is becoming a challenging task. Each text document contains a large number of words (features), which drives down the efficiency of a classification algorithm. This article presents an optimized feature selection algorithm designed to reduce this large feature set and thereby improve the accuracy of text classification. The proposed algorithm combines noun-based filtering with word ranking to enhance the performance of the text classification algorithm. Experiments on three benchmark datasets show that the proposed algorithm achieves the highest accuracy among the algorithms compared: Term Frequency-Inverse Document Frequency, Balanced Accuracy Measure, GINI Index, Information Gain, and Chi-Square. The experimental results clearly demonstrate the strength of the proposed algorithm.
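The two-stage pipeline the abstract describes, noun-based filtering followed by word ranking, can be sketched roughly as follows. This is a minimal illustration, not the paper's method: the toy corpus and the fixed noun lexicon are hypothetical placeholders (a real pipeline would use a part-of-speech tagger such as NLTK's), and TF-IDF stands in for whichever ranking scheme the paper actually uses.

```python
import math
from collections import Counter

# Hypothetical toy corpus and noun lexicon; a real implementation would
# identify nouns with a POS tagger rather than a hand-written set.
docs = [
    "the spam filter blocks spam email",
    "the classifier labels every news article",
    "spam email wastes reader time",
]
NOUNS = {"spam", "filter", "email", "classifier", "news", "article", "reader", "time"}

def noun_filter(doc):
    """Stage 1: keep only the noun features of a document."""
    return [w for w in doc.split() if w in NOUNS]

def tfidf_rank(filtered_docs, top_k=5):
    """Stage 2: rank the surviving words by TF-IDF and keep the top_k."""
    n = len(filtered_docs)
    df = Counter()                      # document frequency per word
    for words in filtered_docs:
        df.update(set(words))
    tf = Counter(w for words in filtered_docs for w in words)  # corpus term frequency
    scores = {w: tf[w] * math.log(n / df[w]) for w in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

filtered = [noun_filter(d) for d in docs]
selected = tfidf_rank(filtered)
print(selected)
```

Non-noun words (`blocks`, `labels`, `wastes`, stop words) never reach the ranking stage, so the second stage scores a much smaller candidate set; this is the efficiency gain the two-stage design targets.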

