research-article

CESS-A System to Categorize Bangla Web Text Documents

Published: 18 June 2020

Abstract

Rapid technological progress has led to an exponential increase in the availability of digital text documents from disparate domains on the Internet. This makes information retrieval a time- and resource-consuming task. A system that categorizes such documents by domain can therefore help users obtain the required information with relative ease and reduce the workload of search engines. This article presents a text categorization system (CESS) that categorizes text documents using newly proposed hybrid features that combine term frequency-inverse document frequency-inverse class frequency (TF-IDF-ICF) and modified chi-square methods. Experiments were performed on real-world Bangla documents from eight domains comprising 2,429,857 tokens, and the highest accuracy of 99.91% was obtained with multilayer perceptron-based classification. To show the language-independent nature of the system, it was also tested on the Reuters-21578 and 20 Newsgroups datasets, obtaining accuracies of 97.29% and 94.67%, respectively.
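
The abstract does not give the formula for the hybrid feature, so the following is only a minimal sketch of one common way to combine term frequency, inverse document frequency, and inverse class frequency into a single term weight. The smoothed logarithms, the function name `tf_idf_icf`, and the toy English tokens are all assumptions made for illustration; the modified chi-square component and the multilayer perceptron classifier are not shown, and this should not be read as the authors' exact method.

```python
import math
from collections import Counter


def tf_idf_icf(docs, labels):
    """Weight each term of each document by TF * IDF * ICF.

    docs   : list of token lists (one list per document)
    labels : parallel list of class (domain) names
    The smoothed log forms below are an assumed variant, not the paper's.
    """
    n_docs = len(docs)
    n_classes = len(set(labels))

    doc_freq = Counter()   # number of documents containing each term
    term_classes = {}      # set of classes in which each term occurs
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            term_classes.setdefault(term, set()).add(label)

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        doc_weights = {}
        for term, count in counts.items():
            tf = count / total                                   # relative frequency in the document
            idf = math.log(1 + n_docs / doc_freq[term])          # rarer across documents -> larger
            icf = math.log(1 + n_classes / len(term_classes[term]))  # rarer across classes -> larger
            doc_weights[term] = tf * idf * icf
        weights.append(doc_weights)
    return weights


# Toy usage with hypothetical tokens and domains.
example_docs = [
    ["cricket", "match", "news"],
    ["cricket", "score", "news"],
    ["stock", "market", "news"],
]
example_labels = ["sports", "sports", "business"]
example_weights = tf_idf_icf(example_docs, example_labels)
```

In this toy corpus a term such as "cricket", which occurs in only one domain, receives a larger weight than "news", which occurs in every document and every domain. That is the intuition behind adding a class-frequency factor to plain TF-IDF.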

    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 19, Issue 5
      September 2020
      278 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/3403646

      Copyright © 2020 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 18 June 2020
      • Online AM: 7 May 2020
      • Accepted: 1 May 2020
      • Revised: 1 April 2020
      • Received: 1 February 2019

      Qualifiers

      • research-article
      • Research
      • Refereed
