Abstract
Document clustering techniques often produce clusters whose meaning cannot be interpreted without human intervention. Automatic cluster labeling is the process of assigning a meaningful phrase to a cluster as its label. This article proposes an unsupervised cluster labeling method based on noun phrase chunking. The proposed method is compared with four statistical methods: Z-Order, M-Order, T-Order, and YAKE. It is also compared with two graph-based techniques: TextRank and PositionRank. The experiments were performed on a corpus of news headlines in Urdu, a low-resource language. The effectiveness of the proposed approach was evaluated using cosine similarity, the Jaccard index, and feedback from human evaluators. The results show that the proposed method outperforms the other methods on these measures, and that the labels it produces are more relevant and semantically richer than those produced by the other approaches.
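To make the two automatic evaluation measures concrete, the following minimal sketch computes the Jaccard index and cosine similarity between a generated label and a reference label. It is an illustration under stated assumptions, not the authors' evaluation code: the abstract does not say how labels are tokenized or vectorized, so whitespace tokenization and raw term counts are assumed here, and the function names and English example phrases are hypothetical.

```python
from collections import Counter
from math import sqrt


def jaccard_index(label: str, reference: str) -> float:
    """Jaccard index between the token sets of two phrases.

    Assumption: tokens are obtained by whitespace splitting.
    """
    set_a, set_b = set(label.split()), set(reference.split())
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty phrases
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(label: str, reference: str) -> float:
    """Cosine similarity between term-count vectors of two phrases.

    Assumption: each phrase is represented by raw token counts.
    """
    vec_a, vec_b = Counter(label.split()), Counter(reference.split())
    # Counter returns 0 for missing tokens, so only shared tokens contribute.
    dot = sum(count * vec_b[token] for token, count in vec_a.items())
    norm_a = sqrt(sum(c * c for c in vec_a.values()))
    norm_b = sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Placeholder phrases; the article's corpus is Urdu news headlines.
    generated = "national cricket team"
    reference = "cricket team"
    print(jaccard_index(generated, reference))
    print(cosine_similarity(generated, reference))
```

For these placeholder phrases, two tokens are shared out of three distinct tokens, so the Jaccard index is 2/3, while the cosine similarity is 2/√6. Both measures depend on exact token overlap, which is one reason human evaluation is a useful complement.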
Index Terms
Automatic Labeling of Clusters for a Low-Resource Urdu Language