Abstract
In recent years, distributed representations have become a fundamental technology for natural language processing. Japanese, however, has many compound words, and we often need to compare the meaning of a word with that of a compound word. Moreover, word boundaries in Japanese are ambiguous because the language has no delimiters between words: for example, “ぶどう狩り” (grape picking) is one word according to one dictionary, whereas “ぶどう” (grape) and “狩り” (picking) are separate words according to another. This study describes an attempt to compose the word embedding of a Japanese compound word from the embeddings of its constituent words. We used the “short unit” and “long unit,” the two units of terms in UniDic—a Japanese dictionary compiled by the National Institute for Japanese Language and Linguistics—for constituent and compound words, respectively. We composed the word embedding of a compound word from the word embeddings of its two constituent words using a neural network. The training data for the compound-word embeddings were created from a corpus generated by concatenating the corpora segmented into constituent words and into compound words. We propose using linguistic knowledge in composing word embeddings and demonstrate how it improves composition performance. To assess models with and without linguistic knowledge, we compared the cosine similarity between the composed and the correct word embeddings of compound words; we also evaluated our methods by ranking synonyms drawn from a thesaurus. We compared several frameworks and algorithms using three types of linguistic knowledge—semantic patterns, part-of-speech patterns, and compositionality scores—and investigated which of them improves composition performance. The experiments demonstrated that multitask models combining the classification of part-of-speech patterns with the estimation of compositionality scores achieved high performance.
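The core setup described above—feeding the embeddings of two constituent words through a neural network to approximate the embedding of the compound word, then scoring the result by cosine similarity—can be sketched minimally as follows. This is a hedged illustration, not the paper's actual architecture: the dimensionality, the one-hidden-layer concatenation network, and the random placeholder weights and vectors are all assumptions; in the study the weights are trained so that the output matches the long-unit (compound) embedding, and the inputs are real short-unit embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 200  # embedding dimensionality (illustrative choice)

# Hypothetical pretrained embeddings for the two constituents of
# "ぶどう狩り" (grape picking); real vectors would come from embeddings
# trained on a short-unit-segmented corpus.
v_budou = rng.normal(size=DIM)  # ぶどう (grape)
v_kari = rng.normal(size=DIM)   # 狩り (picking)

# One-hidden-layer composition network: concatenate the two constituent
# vectors and map them back into the embedding space. The weights here
# are random placeholders; in the study they would be learned.
W1 = rng.normal(scale=0.1, size=(DIM, 2 * DIM))
b1 = np.zeros(DIM)
W2 = rng.normal(scale=0.1, size=(DIM, DIM))
b2 = np.zeros(DIM)

def compose(u, v):
    """Compose a compound-word vector from two constituent vectors."""
    h = np.tanh(W1 @ np.concatenate([u, v]) + b1)
    return W2 @ h + b2

def cosine(a, b):
    """Cosine similarity, the evaluation measure used in the study."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

composed = compose(v_budou, v_kari)
# The study compares the composed vector against the "correct" compound
# embedding; a random stand-in target is used here for illustration.
target = rng.normal(size=DIM)
similarity = cosine(composed, target)
```

The multitask variants described in the abstract would add further output heads (part-of-speech pattern classification, compositionality-score regression) on top of the shared hidden layer `h`.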
Composing Word Embeddings for Compound Words Using Linguistic Knowledge