Abstract
As the number of digital text documents grows daily, text classification is becoming an increasingly challenging task. Each document contains a large number of words (features), which reduces the efficiency of a classification algorithm. This article presents an optimized feature selection algorithm that reduces this large feature space to improve the accuracy of text classification. The proposed algorithm uses noun-based filtering followed by word ranking to enhance classifier performance. Experiments on three benchmark datasets show that the proposed algorithm achieves the highest accuracy among the compared methods: Term Frequency-Inverse Document Frequency, Balanced Accuracy Measure, GINI Index, Information Gain, and Chi-Square. The experimental results clearly demonstrate the strength of the proposed algorithm.
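The two-stage idea described above — first filter the vocabulary down to noun candidates, then rank the survivors and keep the top terms — can be sketched as follows. This is an illustrative sketch, not the paper's exact method: the `is_noun` lookup and the TF-IDF-style score stand in for a real part-of-speech tagger and for the article's actual ranking scheme.

```python
import math
from collections import Counter

def two_stage_select(docs, is_noun, k):
    """Illustrative two-stage filter.

    docs: list of token lists; is_noun: token -> bool (stage 1 filter);
    returns the top-k terms by a simple corpus-level TF-IDF score (stage 2).
    """
    n_docs = len(docs)
    # Stage 1: noun-based filtering drops non-noun tokens.
    filtered = [[t for t in doc if is_noun(t)] for doc in docs]
    # Stage 2: rank surviving terms by term frequency * inverse document frequency.
    df = Counter()
    for doc in filtered:
        df.update(set(doc))                      # document frequency
    tf = Counter(t for doc in filtered for t in doc)  # corpus term frequency
    score = {t: tf[t] * math.log(n_docs / df[t]) for t in df}
    return sorted(score, key=score.get, reverse=True)[:k]

# Toy usage: a hand-made noun list stands in for a real POS tagger.
NOUNS = {"price", "market", "stock", "team", "game"}
docs = [
    "stock price rises fast".split(),
    "market price falls today".split(),
    "team wins game easily".split(),
]
top = two_stage_select(docs, lambda t: t in NOUNS, k=3)
```

Note that "price" scores lowest here despite being frequent: it appears in two of three documents, so its inverse document frequency penalizes it relative to the single-document nouns.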
A Two-stage Text Feature Selection Algorithm for Improving Text Classification