research-article

Automatic Labeling of Clusters for a Low-Resource Urdu Language

Published: 29 April 2022

Abstract

Document clustering techniques often produce clusters whose meaning cannot be interpreted without human intervention. Automatic cluster labeling is the process of assigning a meaningful phrase to a cluster as its label. This article proposes an unsupervised cluster labeling method based on noun phrase chunking. The proposed method is compared with four statistical methods (Z-Order, M-Order, T-Order, and YAKE) and with two graph-based techniques (TextRank and PositionRank). The experiments were performed on a corpus of news headlines in Urdu, a low-resource language. The approach was evaluated using cosine similarity, the Jaccard index, and feedback from human evaluators. The results show that the proposed method outperforms the other methods, producing labels that are more relevant and semantically richer.
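
The two automatic measures named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how labels are tokenized or vectorized, so this sketch assumes whitespace tokenization and raw term counts. The function names and the sample strings are hypothetical and are not taken from the article.

```python
import math
from collections import Counter


def jaccard_index(label_a: str, label_b: str) -> float:
    """Jaccard index over the sets of whitespace-separated tokens (an assumption)."""
    tokens_a, tokens_b = set(label_a.split()), set(label_b.split())
    if not tokens_a and not tokens_b:
        return 0.0  # convention chosen here for two empty labels
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def cosine_similarity(label_a: str, label_b: str) -> float:
    """Cosine similarity between raw term-count vectors of the two labels."""
    counts_a, counts_b = Counter(label_a.split()), Counter(label_b.split())
    # Only tokens present in both labels contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty label has no direction to compare
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Placeholder labels; a real evaluation would compare a generated label
    # against a reference label for the same cluster.
    generated = "election results announced"
    reference = "election results delayed"
    print(jaccard_index(generated, reference))
    print(cosine_similarity(generated, reference))
```

Both functions return a value between 0 and 1, where higher means more token overlap between the generated and reference labels. Neither captures word order or meaning, which may be one reason the evaluation also relied on human feedback.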

    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 5
      September 2022
      486 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3533669
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 29 April 2022
      • Online AM: 21 March 2022
      • Accepted: 1 January 2022
      • Revised: 1 November 2021
      • Received: 1 December 2020

      Qualifiers

      • research-article
      • Refereed