Abstract
We propose an unsupervised word segmentation model in which, for each unlabelled sentence, the learning objective is to maximize the generation probability of the sentence over all of its possible segmentations. This generation probability can be factorized recursively into the likelihood of each possible segment given its context. To capture both long- and short-term dependencies, we use a bi-directional neural language model to extract features of each segment's context. We also develop two decoding algorithms that combine the context features from both directions to produce the final segmentation at inference time, which helps to resolve word-boundary ambiguities. Experimental results show that our context-sensitive unsupervised segmentation model achieves state-of-the-art results under different evaluation settings on various Chinese datasets, and comparable results for Thai.
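The recursive factorization described above can be illustrated with a small dynamic program. The following is a minimal sketch under stated assumptions, not the model itself: `seg_logprob`, `toy_logprob`, `TOY_PROBS`, and `max_len` are hypothetical names, the toy scorer ignores its context argument, and only a single direction is shown. In the proposed model the segment likelihood would come from the bi-directional neural language model.

```python
import math
from typing import Callable, List

# Hypothetical toy segment probabilities; a real model would score each
# segment with a neural language model conditioned on its context.
TOY_PROBS = {"ab": 0.5, "a": 0.2, "b": 0.2, "c": 0.1}


def toy_logprob(context: str, segment: str) -> float:
    """Assumed scorer: ignores the context and floors unseen segments at 1e-6."""
    return math.log(TOY_PROBS.get(segment, 1e-6))


def log_sum_exp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_marginal(sentence: str,
                 seg_logprob: Callable[[str, str], float],
                 max_len: int = 4) -> float:
    """Log-probability of the sentence summed over all segmentations.

    alpha[t] is the log-probability of the first t characters, computed
    recursively from alpha[t - k] and the score of the last segment.
    """
    n = len(sentence)
    alpha = [float("-inf")] * (n + 1)
    alpha[0] = 0.0
    for t in range(1, n + 1):
        terms = [
            alpha[t - k] + seg_logprob(sentence[:t - k], sentence[t - k:t])
            for k in range(1, min(max_len, t) + 1)
        ]
        alpha[t] = log_sum_exp(terms)
    return alpha[n]


def best_segmentation(sentence: str,
                      seg_logprob: Callable[[str, str], float],
                      max_len: int = 4) -> List[str]:
    """Highest-scoring single segmentation (max instead of sum)."""
    n = len(sentence)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for t in range(1, n + 1):
        for k in range(1, min(max_len, t) + 1):
            score = best[t - k] + seg_logprob(sentence[:t - k], sentence[t - k:t])
            if score > best[t]:
                best[t], back[t] = score, t - k
    # Follow the back-pointers from the end of the sentence to the start.
    segments = []
    t = n
    while t > 0:
        segments.append(sentence[back[t]:t])
        t = back[t]
    return segments[::-1]


if __name__ == "__main__":
    print(best_segmentation("abc", toy_logprob))
    print(math.exp(log_marginal("abc", toy_logprob)))
```

Training would maximize `log_marginal` over unlabelled sentences, while inference replaces the sum with a maximum as in `best_segmentation`. Combining scores from both directions, as the two decoding algorithms do, is not attempted in this sketch.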
Index Terms
Unsupervised Word Segmentation with Bi-directional Neural Language Model