Abstract
Rapid advances in technology have led to an exponential increase in the number of digital text documents from disparate domains available on the Internet. This makes information retrieval a time- and resource-consuming task. A system that categorizes such documents by domain can therefore help users obtain the required information with relative ease and reduce the workload of search engines. This article presents a text categorization system (CESS) that categorizes text documents using newly proposed hybrid features that combine term frequency-inverse document frequency-inverse class frequency (TF-IDF-ICF) and modified chi-square methods. Experiments were performed on real-world Bangla documents from eight domains comprising 24,29,857 tokens, and the highest accuracy of 99.91% was obtained with multilayer perceptron-based classification. The system was also tested on the Reuters-21578 and 20 Newsgroups datasets, obtaining accuracies of 97.29% and 94.67%, respectively, to demonstrate its language-independent nature.
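The abstract names term frequency-inverse document frequency-inverse class frequency weighting but does not give its formula. The following is a minimal Python sketch of one common way such a weight can be computed, offered only as an illustration: the smoothed logarithms, the length-normalized term frequency, and the function name `tf_idf_icf` are assumptions, not the article's definitions. The modified chi-square component and the multilayer perceptron classifier are not shown, because the abstract does not specify them.

```python
import math
from collections import Counter


def tf_idf_icf(docs, labels):
    """Illustrative TF-IDF-ICF weights (assumed formulation, not the article's).

    docs:   list of token lists, one per document
    labels: class label of each document, same order as docs
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    classes = set(labels)
    n_classes = len(classes)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    # Class frequency: number of classes in which each term occurs.
    class_terms = {c: set() for c in classes}
    for tokens, label in zip(docs, labels):
        class_terms[label].update(tokens)
    cf = Counter()
    for terms in class_terms.values():
        cf.update(terms)

    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens)
        doc_weights = {}
        for term, count in tf.items():
            # Smoothed logs (an assumption) keep every weight positive.
            idf = math.log(1 + n_docs / df[term])
            icf = math.log(1 + n_classes / cf[term])
            doc_weights[term] = (count / total) * idf * icf
        weights.append(doc_weights)
    return weights
```

Under this formulation, a term that occurs in only one class receives a larger weight than an equally frequent term spread across every class, which is the intuition behind adding the inverse class frequency factor.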
Index Terms
CESS-A System to Categorize Bangla Web Text Documents