Abstract
In this study, we developed an Isarn Dharma word segmentation system. We focused on two problems in unsegmented Isarn Dharma text: word ambiguity and unknown words. Ambiguity arises frequently because Isarn Dharma is written without tone markers, so the same written form can be read with different tones and meanings. To overcome these problems, we developed an Isarn Dharma character cluster (IDCC)-based statistical model with affixation and integrated it with named entity recognition (NER); we refer to the combined method as the IDCC-C-based statistical model and affixation with NER. The method integrates IDCC-based and character-based statistical models to identify word boundaries. The IDCC-based statistical model uses IDCC features to resolve ambiguous words, while the character-based statistical model uses character features to handle unknown words. In addition, linguistic knowledge is employed to detect the boundaries of new words based on construction morphology and NER. In the evaluation, we compared the proposed method with various word segmentation methods. The experimental results showed that the proposed method performed slightly better than the other methods as the corpus size increased. On the test set, the proposed method obtained the best F-measure, 92.19, which exceeded that of the IDCC longest matching grouping method by 2.85.
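The abstract reports results as an F-measure but does not say how it was computed. The sketch below shows one common way to score a word segmentation: precision and recall are taken over word spans (character offsets), and the F-measure is their harmonic mean. This is an assumption about the general technique, not the authors' evaluation script; the function names and the toy Latin-letter strings are hypothetical.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a segmentation into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f_measure(
    gold: List[str], predicted: List[str]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for one segmented text.

    A predicted word counts as correct only if both of its boundaries
    match a word in the gold segmentation (assumed scoring scheme).
    """
    if "".join(gold) != "".join(predicted):
        raise ValueError("gold and predicted must segment the same text")
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical toy example: placeholder letters stand in for Isarn Dharma text.
gold = ["ab", "cd", "e"]
predicted = ["ab", "c", "de"]
precision, recall, f_measure = segmentation_f_measure(gold, predicted)
```

In this toy case only the first word has both boundaries right, so precision, recall, and F-measure each come out to one third. Scores such as the 92.19 reported above are on a 0 to 100 scale, which corresponds to multiplying the returned fraction by 100.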
Isarn Dharma Word Segmentation Using a Statistical Approach with Named Entity Recognition