Unsupervised Word Segmentation with Bi-directional Neural Language Model

Published: 25 November 2022

Abstract

We propose an unsupervised word segmentation model in which, for each unlabelled sentence, the learning objective is to maximize the probability of generating the sentence over all of its possible segmentations. This generation probability factorizes recursively into the likelihood of each possible segment given its context. To capture both long- and short-term dependencies, we use a bi-directional neural language model to better extract features of a segment's context. We also develop two decoding algorithms that combine the context features from both directions at inference time to produce the final segmentation, which helps reconcile word-boundary ambiguities. Experimental results show that our context-sensitive unsupervised segmentation model achieves state-of-the-art performance under different evaluation settings on various Chinese datasets, and comparable results for Thai.
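The recursive factorization described above can be sketched as a standard dynamic program over segmentations. The sketch below is illustrative only: `seg_log_prob` is a hypothetical stand-in for the paper's bi-directional neural language model score of a segment given its context, and `max_len` (a cap on segment length) is an assumed hyperparameter, not taken from the paper. Marginalizing over all segmentations gives the training objective; a Viterbi-style backtrace gives one simple decoding scheme.

```python
import math

def sentence_log_prob(chars, seg_log_prob, max_len=4):
    """Marginal log-probability of the sentence, summed (via logsumexp)
    over every segmentation: alpha[t] accumulates all ways to cover
    the first t characters."""
    n = len(chars)
    alpha = [-math.inf] * (n + 1)
    alpha[0] = 0.0
    for t in range(1, n + 1):
        terms = []
        for k in range(1, min(max_len, t) + 1):
            seg = chars[t - k:t]          # candidate segment ending at t
            ctx = chars[:t - k]           # its left context
            terms.append(alpha[t - k] + seg_log_prob(seg, ctx))
        m = max(terms)                    # numerically stable logsumexp
        alpha[t] = m + math.log(sum(math.exp(x - m) for x in terms))
    return alpha[n]

def best_segmentation(chars, seg_log_prob, max_len=4):
    """Viterbi-style decoding: keep only the single best way to cover
    the first t characters, then backtrace the segment boundaries."""
    n = len(chars)
    best = [(-math.inf, 0)] * (n + 1)     # (score, length of last segment)
    best[0] = (0.0, 0)
    for t in range(1, n + 1):
        for k in range(1, min(max_len, t) + 1):
            score = best[t - k][0] + seg_log_prob(chars[t - k:t], chars[:t - k])
            if score > best[t][0]:
                best[t] = (score, k)
    segs, t = [], n
    while t > 0:
        k = best[t][1]
        segs.append(chars[t - k:t])
        t -= k
    return segs[::-1]

# Toy scorer standing in for the neural model: known "words" are cheap,
# unknown segments pay a per-character penalty. Purely illustrative.
_lex = {"the": -1.0, "cat": -1.0}
def toy_score(seg, ctx):
    return _lex.get(seg, -10.0 * len(seg))

print(best_segmentation("thecat", toy_score))  # → ['the', 'cat']
```

In the paper's setting the scorer would additionally condition on right context via the backward language model; this sketch shows only the left-to-right recursion shared by both directions.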



Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 1
January 2023, 340 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3572718

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 25 November 2022
• Online AM: 29 April 2022
• Accepted: 28 March 2022
• Revised: 15 March 2022
• Received: 17 June 2021


          Qualifiers

          • research-article
          • Refereed
