
Composing Word Embeddings for Compound Words Using Linguistic Knowledge

Published: 30 March 2023

Abstract

In recent years, distributed representations have been a fundamental technology for natural language processing. Japanese, however, has many compound words, and we often must compare the meaning of a word with that of a compound word. Moreover, word boundaries in Japanese are ambiguous because the language has no delimiters between words: e.g., “ぶどう狩り” (grape picking) is a single word according to one dictionary, whereas “ぶどう” and “狩り” are separate words according to another. This study describes an attempt to compose the word embedding of a Japanese compound word from those of its constituent words. We used the “short unit” and “long unit,” both of which are units of terms in UniDic (a Japanese dictionary compiled by the National Institute for Japanese Language and Linguistics), for constituent and compound words, respectively. We composed the word embedding of a compound word from the word embeddings of its two constituent words using a neural network. The training data for the compound-word embeddings were created from a corpus generated by concatenating corpora segmented into constituent words and into compound words. We propose using linguistic knowledge when composing word embeddings and demonstrate how it improves composition performance. To assess models with and without linguistic knowledge, we compared the cosine similarity between composed and reference word embeddings of compound words; we also evaluated our methods by ranking synonyms using a thesaurus. We compared several frameworks and algorithms that use three types of linguistic knowledge (semantic patterns, part-of-speech patterns, and compositionality scores) and investigated which of them improves composition performance. The experiments demonstrated that multitask models combining the classification of part-of-speech patterns with the estimation of compositionality scores achieved high performance.
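The core pipeline the abstract describes, composing one compound-word (“long unit”) embedding from two constituent (“short unit”) embeddings with a neural network and scoring the result by cosine similarity against a reference vector, can be sketched as follows. This is a minimal illustration, not the paper's architecture: the dimensionality, the single tanh layer, and the random stand-in vectors are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 100  # embedding dimensionality (hypothetical)

# Composition network: map the concatenation of two constituent-word
# embeddings to a single compound-word embedding.
W = rng.normal(scale=0.1, size=(DIM, 2 * DIM))
b = np.zeros(DIM)

def compose(v1: np.ndarray, v2: np.ndarray) -> np.ndarray:
    """Compose a compound-word embedding from two constituent embeddings."""
    return np.tanh(W @ np.concatenate([v1, v2]) + b)

def cosine_similarity(a: np.ndarray, c: np.ndarray) -> float:
    """Evaluation metric: cosine between composed and reference vectors."""
    return float(a @ c / (np.linalg.norm(a) * np.linalg.norm(c)))

# e.g., the constituents of "ぶどう狩り": "ぶどう" (grape) and "狩り" (picking);
# random vectors stand in for trained short-unit embeddings here.
v_grape = rng.normal(size=DIM)
v_picking = rng.normal(size=DIM)
v_composed = compose(v_grape, v_picking)

# Stand-in for the trained long-unit embedding of the compound word.
v_reference = rng.normal(size=DIM)
score = cosine_similarity(v_composed, v_reference)
```

In the multitask variants the abstract reports as strongest, this composition network would additionally share its hidden representation with a part-of-speech-pattern classification head and a compositionality-score regression head, training all objectives jointly.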



• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 2 (February 2023), 624 pages
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3572719


Publisher

Association for Computing Machinery, New York, NY, United States

      Publication History

      • Published: 30 March 2023
      • Online AM: 7 September 2022
      • Accepted: 29 August 2022
      • Revised: 22 August 2022
      • Received: 7 October 2021
