Abstract
Tibetan is a low-resource language with few existing electronic reference materials. Tibetan sentence boundary disambiguation (SBD) segments long text into sentences and is the foundation for building corpora for downstream tasks. This study performs Tibetan SBD at the syllable level so that word segmentation (WS) errors do not affect SBD accuracy. Specifically, an attention mechanism is added to a recurrent neural network (RNN). The primary objective is to determine, using a trained model, whether each shad in Tibetan text marks the end of a sentence; experiments with syllable embedding and component embedding measure the model's performance. The highest accuracy for Tibetan syllable embedding and component embedding is 96.23% and 95.40%, respectively, and the F1 score reaches 96.23% and 95.37%, respectively. The experimental results show that the proposed method achieves better results than the established rule-based and statistical methods without relying on syntactic or part-of-speech (POS) tagging rules. German and English data from the Europarl corpus and Thai data from the IWSLT2015 corpus are used to validate the models’ reliability and generalizability. The results show that the method is effective for high-resource languages as well as low-resource ones. More importantly, the experimental results of this study can be applied to research on downstream tasks, such as machine translation and automatic summarization.
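To make the syllable-level framing concrete, the sketch below shows one way the classification candidates could be prepared: text is split into syllables at the tsheg (U+0F0B), and every shad (U+0F0D) is paired with a window of neighbouring syllables. It is an illustration only and not the authors' pipeline. The function names, the window size, and the decision to handle only the single shad (ignoring variants such as the nyis shad, U+0F0E) are assumptions, and the attention-based RNN classifier itself is not shown.

```python
TSHEG = "\u0f0b"  # Tibetan syllable delimiter
SHAD = "\u0f0d"   # candidate sentence-ending mark


def tokenize_syllables(text):
    """Split text into syllables, keeping each shad as its own token.

    Hypothetical helper: whitespace and tsheg both end a syllable.
    """
    tokens, current = [], ""
    for ch in text:
        if ch == TSHEG or ch == SHAD or ch.isspace():
            if current:
                tokens.append(current)
                current = ""
            if ch == SHAD:
                tokens.append(SHAD)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def shad_candidates(text, window=3):
    """Return (token_index, left_context, right_context) for every shad.

    `window` (an assumed value) is the number of tokens kept on each side;
    a trained model would label each candidate as sentence-final or not.
    """
    tokens = tokenize_syllables(text)
    candidates = []
    for i, tok in enumerate(tokens):
        if tok == SHAD:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            candidates.append((i, left, right))
    return candidates
```

Each candidate's left and right context would then be mapped to syllable (or component) embeddings and passed to the classifier, which outputs a binary decision for that shad.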
Index Terms
Sentence Boundary Disambiguation for Tibetan Based on Attention Mechanism at the Syllable Level