research-article

Sentence Boundary Disambiguation for Tibetan Based on Attention Mechanism at the Syllable Level

Published: 21 February 2023

Abstract

Tibetan is a low-resource language with few existing electronic reference materials. The goal of Tibetan sentence boundary disambiguation (SBD) is to segment long text into sentences, and it is the foundation of downstream tasks such as corpus building. This study implements Tibetan SBD at the syllable level to prevent word segmentation (WS) errors from affecting the accuracy of SBD. Specifically, an attention mechanism is introduced on top of a recurrent neural network (RNN) to study Tibetan SBD. The primary objective is to determine, using a trained model, whether each shad in a Tibetan text marks the end of a sentence, and experiments on syllable embedding and component embedding measure the model's performance. The highest accuracy with Tibetan syllable embedding and component embedding is 96.23% and 95.40%, respectively, and the F1 score reaches 96.23% and 95.37%, respectively. The experimental results demonstrate that the proposed method outperforms established rule-based and statistical methods without requiring syntactic or part-of-speech (POS) tagging rules. German and English data from the Europarl corpus and Thai data from the IWSLT2015 corpus are used to validate the model's reliability and generalizability. The results demonstrate that the method is effective not only for low-resource languages but also for high-resource languages. More importantly, the experimental results of this study can be applied to downstream tasks such as machine translation and automatic summarization.
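The core idea described above can be sketched in code: treat each shad (the Tibetan delimiter) as a candidate boundary, pool the syllable embeddings of its context window with an attention mechanism, and score the pooled vector as a binary "sentence end / not end" decision. The sketch below is a minimal illustration, not the authors' exact architecture: the embedding size, window size, simple additive attention, and all names (`is_sentence_end`, `W_att`, etc.) are assumptions, and a real system would place the attention layer over RNN hidden states and train the weights rather than draw them at random.

```python
# Minimal sketch (assumed details): attention-weighted pooling over
# syllable embeddings around one shad, followed by a logistic score.
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 8   # syllable embedding size (illustrative)
WINDOW = 5    # syllables of context around the shad (illustrative)

# Toy embedding table for a tiny "vocabulary" of syllable ids.
embeddings = rng.normal(size=(100, EMB_DIM))

# Additive-attention parameters and a logistic output layer
# (randomly initialized here; learned in the real model).
W_att = rng.normal(size=(EMB_DIM,))
w_out = rng.normal(size=(EMB_DIM,))
b_out = 0.0

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def is_sentence_end(syllable_ids):
    """Score one shad given the ids of its surrounding syllables."""
    X = embeddings[syllable_ids]          # (WINDOW, EMB_DIM)
    scores = X @ W_att                    # one attention score per syllable
    alpha = softmax(scores)               # attention weights, sum to 1
    context = alpha @ X                   # attention-weighted pooling
    logit = context @ w_out + b_out
    return 1.0 / (1.0 + np.exp(-logit))   # P(shad ends the sentence)

prob = is_sentence_end([3, 17, 42, 8, 99])
```

Working at the syllable level, as here, means the classifier never depends on a word segmenter, which is the paper's stated motivation: WS errors cannot propagate into the boundary decision.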



• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 6
  November 2022
  372 pages
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3568970


          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 21 February 2023
          • Online AM: 1 April 2022
          • Accepted: 18 March 2022
          • Received: 9 August 2020
