research-article

Enhanced Double-Carrier Word Embedding via Phonetics and Writing

Published: 15 January 2020

Abstract

Word embeddings, which map words into a unified vector space, capture rich semantic information. From a linguistic point of view, words have two carriers: speech and writing. Yet most recent word embedding models focus only on the writing carrier and ignore the role of the speech carrier in semantic expression. In the development of language, however, speech appears before writing and plays an important role in shaping it. For phonetic language systems, written forms are secondary symbols of spoken ones. Based on this idea, we proposed double-carrier word embedding (DCWE), which simulates the order in which speech and writing are generated: the written embedding is trained on the basis of the phonetic embedding, and the final word embedding fuses the written and phonetic embeddings. To illustrate that our model can be applied to most languages, we selected Chinese, English, and Spanish as examples and evaluated the models through word similarity and text classification experiments.
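
The fusion step and the word similarity evaluation mentioned in the abstract can be pictured with a small sketch. The abstract does not state how the two embeddings are fused, so the concatenation of L2-normalised vectors used here is an assumption, as are the function names and the toy vectors; none of this is taken from the paper.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse(phonetic_vec, written_vec):
    """Hypothetical fusion: concatenate the normalised phonetic and written
    vectors. The paper's actual fusion operator may differ."""
    return l2_normalize(phonetic_vec) + l2_normalize(written_vec)

def cosine(u, v):
    """Cosine similarity, the usual score in word similarity benchmarks."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy, hand-made vectors (not trained): "sea" and "see" sound alike,
# while "sea" and "ocean" are given the same written-side vector.
phonetic = {"sea": [1.0, 0.0], "see": [1.0, 0.0], "ocean": [0.0, 1.0]}
written = {"sea": [0.0, 1.0], "see": [1.0, 0.0], "ocean": [0.0, 1.0]}

fused = {word: fuse(phonetic[word], written[word]) for word in phonetic}

sim_sea_see = cosine(fused["sea"], fused["see"])
sim_sea_ocean = cosine(fused["sea"], fused["ocean"])
sim_see_ocean = cosine(fused["see"], fused["ocean"])
```

In this toy setup the fused vector for "sea" stays partly similar to both "see" (shared phonetic half) and "ocean" (shared written half), while "see" and "ocean" share neither half and score zero. This is only meant to show how a fused representation can carry signal from both carriers.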

    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 19, Issue 2
      March 2020
      301 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/3358605

      Copyright © 2020 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 15 January 2020
      • Accepted: 1 July 2019
      • Revised: 1 May 2019
      • Received: 1 August 2018

      Qualifiers

      • research-article
      • Research
      • Refereed
