
The Rule-Based Sundanese Stemmer

Published: 21 July 2018

Abstract

We propose an iterative Sundanese stemmer that removes derivational affixes before inflexional ones. We chose this order because, in Sundanese affixation, a confix (a type of derivational affix) is applied in the last phase of the morphological process. Moreover, most Sundanese affixes are derivational, so removing derivational affixes first is reasonable. To handle ambiguity, the last recognized affix is returned as the result. As the baseline, we used a Confix-Stripping Approach that applies the Porter Stemmer to the Indonesian language. This baseline handles similar affix types but uses a different stemming order. To observe whether the baseline stems Sundanese affixed words properly, we added some features it did not cover, such as infix and allomorph removal. The evaluation used 4,453 unique affixed words collected from Sundanese online magazines. The experiment shows that, overall, our stemmer outperforms the modified baseline both in affix-type recognition accuracy and in the share of properly stemmed affixed words. Our stemmer recognized 68.87% of the Sundanese affix types and correctly stemmed 96.79% of the affixed words; the modified baseline achieved 21.70% and 71.59%, respectively.
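
The iterative, confix-first idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny lexicon, the affix lists, the order of the steps after the confix, and the names `strip_once` and `stem` are all assumptions chosen for illustration, and allomorph handling is left out entirely.

```python
# Toy lexicon and affix lists: hypothetical, for illustration only.
ROOTS = {"hirup", "budak", "baca"}
CONFIXES = [("ka", "an"), ("pa", "an")]
PREFIXES = ["di", "ka", "pa"]
SUFFIXES = ["keun", "an"]
INFIXES = ["ar", "al"]


def strip_once(word):
    """Remove one affix, trying confixes first; return (word, label) or None."""
    for pre, suf in CONFIXES:
        if (word.startswith(pre) and word.endswith(suf)
                and len(word) > len(pre) + len(suf)):
            return word[len(pre):-len(suf)], f"{pre}-...-{suf}"
    for pre in PREFIXES:
        if word.startswith(pre) and len(word) > len(pre):
            return word[len(pre):], f"{pre}-"
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf):
            return word[:-len(suf)], f"-{suf}"
    for inf in INFIXES:
        # Assumed infix position: right after the first letter (b-ar-udak).
        if word[1:1 + len(inf)] == inf and len(word) > len(inf) + 1:
            return word[0] + word[1 + len(inf):], f"-{inf}-"
    return None


def stem(word, max_passes=5):
    """Strip affixes repeatedly until a known root is reached or no rule fires."""
    current = word.lower()
    affixes = []
    for _ in range(max_passes):
        if current in ROOTS:
            break
        step = strip_once(current)
        if step is None:
            break
        current, label = step
        affixes.append(label)
    return current, affixes


print(stem("kahirupan"))
print(stem("barudak"))
print(stem("dibacakeun"))
```

Checking the lexicon before each pass keeps a known root from being stripped further. A word whose root is missing from the lexicon can still be over-stemmed by this greedy loop, which is one reason a real rule set needs more careful conditions than this sketch has.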



Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 17, Issue 4
December 2018
193 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3229525

Copyright © 2018 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 21 July 2018
• Accepted: 1 March 2018
• Received: 1 December 2017

Qualifiers

• research-article
• Research
• Refereed
