Isarn Dharma Word Segmentation Using a Statistical Approach with Named Entity Recognition

Published: 22 February 2020

Abstract

In this study, we developed an Isarn Dharma word segmentation system, focusing on the word ambiguity and unknown word problems in unsegmented Isarn Dharma text. Ambiguous words occur frequently in Isarn Dharma because the script is written without tone markers, so the same written form can be read with different tones and meanings. To overcome these problems, we developed an Isarn Dharma character cluster (IDCC)-based statistical model with affixation and integrated it with named entity recognition (NER); we refer to the result as the IDCC-C-based statistical model and affixation with NER. This method combines the IDCC-based and character-based statistical models to identify word boundaries. The IDCC-based statistical model uses the IDCC feature to disambiguate ambiguous words, while the character-based statistical model handles unknown words using character features. In addition, linguistic knowledge is employed to detect the boundaries of new words based on construction morphology and NER. In evaluations, we compared the proposed method with various word segmentation methods. The experimental results showed that the proposed method performed slightly better than the other methods as the corpus size increased. On the test set, the proposed method obtained the best F-measure of 92.19, which was 2.85 higher than that of the IDCC longest matching grouping.
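The F-measure reported above is the harmonic mean of precision and recall. The abstract does not state the exact scoring protocol, so the following is only a minimal sketch under a common assumption: a predicted word counts as correct when its character span matches a gold-standard word exactly. The function names `word_spans` and `segmentation_f_measure` are hypothetical and do not come from the article.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f_measure(gold_words, predicted_words):
    """Return (precision, recall, F-measure) for one segmented text.

    Assumption: a predicted word is correct only if both of its
    boundaries coincide with those of a gold-standard word.
    """
    gold = word_spans(gold_words)
    predicted = word_spans(predicted_words)
    correct = len(gold & predicted)

    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Toy Latin-script example; real input would be Isarn Dharma text.
    gold = ["ab", "cd", "e"]
    predicted = ["ab", "c", "de"]
    print(segmentation_f_measure(gold, predicted))
```

In the toy example only the first word is segmented correctly, so precision and recall are both one in three. Scores in this sketch range from 0 to 1, whereas the abstract reports F-measure on a 0 to 100 scale.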

    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 19, Issue 2
      March 2020
      301 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/3358605
      Copyright © 2020 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 22 February 2020
      • Accepted: 1 August 2019
      • Revised: 1 July 2019
      • Received: 1 May 2018
      Qualifiers

      • research-article
      • Research
      • Refereed