research-article

Co-occurrence Weight Selection in Generation of Word Embeddings for Low Resource Languages

Published: 09 January 2019

Abstract

This study aims to improve the performance of word embeddings by proposing a new weighting scheme for co-occurrence counting. The idea behind this new family of weights is to compensate for the disadvantage suffered, during co-occurrence counting, by word pairs that appear far apart yet are semantically close. For high-resource languages this disadvantage may be negligible because co-occurrences are frequent; when resources are scarce, however, such pairs suffer from their distance. To favour them, a weighting scheme based on a polynomial fitting procedure is proposed that shifts the weights of distant words up while leaving the weights of nearby words almost unchanged. The parameter optimization for the new weights and the effects of the weighting scheme are analysed for English, Italian, and Turkish. For demonstration purposes, a small portion of the English resources and a quarter of the Italian resources are used, as if these were low-resource languages. A performance increase is observed in analogy tests when the proposed weighting scheme is applied to relatively small corpora (i.e., mimicking low-resource languages) of both English and Italian. To confirm that the benefit is specific to small corpora, it is also shown that on a large English corpus the proposed weighting scheme does not outperform the original weights. Since Turkish is a relatively low-resource language, it is demonstrated that the proposed scheme improves performance on both analogy and similarity tests when all Turkish Wikipedia pages are used as the corpus. The positive effect of the proposed scheme is also demonstrated on a standard sentiment analysis task for Turkish.



• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 18, Issue 3
  September 2019
  386 pages
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3305347

          Copyright © 2019 ACM

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 9 January 2019
          • Accepted: 1 September 2018
          • Revised: 1 July 2018
          • Received: 1 October 2017


          Qualifiers

          • research-article
          • Research
          • Refereed
