Abstract
Our research proposes an iterative Sundanese stemmer that removes derivational affixes before inflectional ones. This order was chosen because, in Sundanese affixation, a confix (a type of derivational affix) is applied in the last phase of the morphological process. Moreover, most Sundanese affixes are derivational, so removing derivational affixes first is reasonable. To handle ambiguity, the last recognized affix was returned as the result. As the baseline, we used the Confix-Stripping Approach, which applies the Porter Stemmer to the Indonesian language. The baseline handles similar affix types but uses a different stemming order. To observe whether the baseline stems Sundanese affixed words properly, features that it did not cover, such as infix and allomorph removal, were added. The evaluation used 4,453 unique affixed words collected from Sundanese online magazines. The experiment shows that, overall, our stemmer outperforms the modified baseline in both affix-type recognition accuracy and the proportion of properly stemmed affixed words. Our stemmer recognized 68.87% of the Sundanese affix types and correctly stemmed 96.79% of the affixed words; the modified baseline achieved 21.70% and 71.59%, respectively.
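The stripping order described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The affix tables, their split into derivational and inflectional groups, the minimum stem length, and the lexicon-based stopping condition are all illustrative assumptions, and "the last matching rule wins" is only one possible reading of how the last recognized affix resolves ambiguity.

```python
# Toy affix tables (illustrative assumptions, not the paper's rule set).
# Each rule is (name, prefix, suffix); a confix has both parts non-empty.
DERIVATIONAL = [
    ("di-", "di", ""),
    ("-keun", "", "keun"),
    ("ka-...-an", "ka", "an"),
]
INFLECTIONAL = [
    ("-na", "", "na"),
]
MIN_STEM_LEN = 3  # assumed guard against over-stripping short words


def strip_once(word, rules):
    """Try every rule in order and keep the LAST one that matches.

    Returns (rule_name, remainder), or None when no rule applies.
    """
    match = None
    for name, prefix, suffix in rules:
        core_len = len(word) - len(prefix) - len(suffix)
        if (core_len >= MIN_STEM_LEN
                and word.startswith(prefix)
                and word.endswith(suffix)):
            match = (name, word[len(prefix):len(word) - len(suffix)])
    return match


def stem(word, lexicon):
    """Iteratively strip affixes, trying derivational rules before inflectional ones.

    Stops when the word is found in the (hypothetical) root lexicon
    or when no rule matches any more.
    """
    removed = []
    while word not in lexicon:
        match = strip_once(word, DERIVATIONAL) or strip_once(word, INFLECTIONAL)
        if match is None:
            break
        name, word = match
        removed.append(name)
    return word, removed


if __name__ == "__main__":
    toy_lexicon = {"datang", "bawa"}
    for w in ["kadatangan", "dibawakeun", "dibawa"]:
        print(w, "->", stem(w, toy_lexicon))
```

In this sketch the confix is removed in a single step because both of its parts are checked together, which mirrors the observation that a confix is attached last and can therefore be stripped first. A real rule set would also need the infix and allomorph handling that the abstract mentions.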
The Rule-Based Sundanese Stemmer