Abstract
Statistical approaches to Korean morphological analysis have attracted recent interest. However, previous studies have relied mostly on generative models, such as hidden Markov models (HMMs), and have not exploited discriminative models such as conditional random fields (CRFs). We present a two-stage discriminative approach based on CRFs for Korean morphological analysis. Similar to methods used for Chinese, we perform two CRF-based disambiguation procedures: (1) morpheme segmentation and (2) POS tagging. In morpheme segmentation, an input sentence is segmented into a sequence of morphemes, where each morpheme unit is either atomic or compound. In POS tagging, each morpheme (atomic or compound) is assigned a POS tag. Once POS tagging is complete, we post-process the compound morphemes: each compound morpheme is further decomposed into atomic morphemes using pre-analyzed patterns and generalized HMMs obtained from the given tagged corpus. Experimental results show the promise of the proposed method.
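The data flow described above (segmentation, then POS tagging, then decomposition of compound morphemes) can be illustrated with a minimal sketch. This is not the authors' implementation: the two CRF models are replaced by hypothetical stand-in callables (`segment_labeler`, `pos_tagger`), the B/I character-labeling scheme and the compound tag notation `VV+EP` are assumptions made for illustration, and the decomposition step uses only a lookup table of pre-analyzed patterns, with the generalized HMM fallback omitted.

```python
from typing import Callable, Dict, List, Tuple

Morpheme = Tuple[str, str]  # (surface form, POS tag)


def labels_to_segments(chars: str, labels: List[str]) -> List[str]:
    """Turn per-character B/I labels into surface segments.

    "B" starts a new segment; any other label extends the current one.
    """
    segments: List[str] = []
    for ch, label in zip(chars, labels):
        if label == "B" or not segments:
            segments.append(ch)
        else:
            segments[-1] += ch
    return segments


def analyze(
    text: str,
    segment_labeler: Callable[[str], List[str]],
    pos_tagger: Callable[[List[str]], List[str]],
    patterns: Dict[Morpheme, List[Morpheme]],
) -> List[Morpheme]:
    """Run the two stages, then decompose compound morphemes."""
    # Stage 1: morpheme segmentation (stand-in for the segmentation CRF).
    segments = labels_to_segments(text, segment_labeler(text))
    # Stage 2: POS tagging of each atomic or compound unit
    # (stand-in for the tagging CRF).
    tags = pos_tagger(segments)
    # Post-processing: replace each compound unit found in the
    # pre-analyzed pattern table; other units pass through unchanged.
    result: List[Morpheme] = []
    for unit in zip(segments, tags):
        result.extend(patterns.get(unit, [unit]))
    return result


# Toy stand-ins with hard-coded outputs for the single input "갔다".
def toy_segment_labeler(text: str) -> List[str]:
    return ["B", "B"]  # hypothetical model output: 갔 | 다


def toy_pos_tagger(segments: List[str]) -> List[str]:
    return ["VV+EP", "EF"]  # hypothetical compound tag for 갔


# Hypothetical pre-analyzed pattern: 갔 -> 가 (verb stem) + 았 (past marker).
toy_patterns: Dict[Morpheme, List[Morpheme]] = {
    ("갔", "VV+EP"): [("가", "VV"), ("았", "EP")],
}

result = analyze("갔다", toy_segment_labeler, toy_pos_tagger, toy_patterns)
print(result)
```

In a real system the two stand-ins would be trained sequence models, and units missing from the pattern table would need a statistical fallback rather than passing through unchanged.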
Conditional Random Fields for Korean Morpheme Segmentation and POS Tagging