research-article

Topic-Based Unsupervised and Supervised Dictionary Induction

Published: 10 March 2023

Abstract

Word translation is a natural language processing task that provides translations between the words of a source and a target language. The task reduces to the induction of a bilingual dictionary, which is typically performed by aligning word embeddings of the source language to word embeddings of the target language. To date, existing approaches have focused on performing a single, global alignment of the two embedding spaces. However, semantic differences between languages, as well as differences in the content of the corpora used to train the word embeddings, can limit the effectiveness of such a global alignment. For this reason, in this article we propose aligning the source and target embedding spaces with multiple mappings, one per topic. Experimental results show that our approach achieves an average accuracy improvement of +3.30 percentage points over a state-of-the-art approach in unsupervised dictionary induction from languages as diverse as German, French, Italian, Spanish, Finnish, Turkish, and Chinese into English, and an average improvement of +3.95 points in supervised dictionary induction.
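The topic-level alignment idea can be illustrated with a minimal, self-contained sketch. Note this is an assumption-laden illustration, not the paper's implementation: the function and variable names are hypothetical, the topic assignments are given rather than learned, and a plain orthogonal Procrustes solution is used for each per-topic mapping. The synthetic check constructs two "topics" whose true source-to-target mappings are different rotations, a situation where any single global mapping cannot fit both.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal Procrustes: the orthogonal W minimizing ||X W - Y||_F."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def fit_topic_maps(X, Y, topic_of_word):
    """Fit one orthogonal map per topic from paired source/target embeddings."""
    return {t: procrustes(X[topic_of_word == t], Y[topic_of_word == t])
            for t in np.unique(topic_of_word)}

# Synthetic data: two topics, each with its own true rotation (hypothetical setup).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))                   # "source-language" embeddings
Q0 = np.linalg.qr(rng.normal(size=(4, 4)))[0]  # true mapping for topic 0
Q1 = np.linalg.qr(rng.normal(size=(4, 4)))[0]  # true mapping for topic 1
topics = np.repeat([0, 1], 20)                 # topic assignment of each word
Y = np.vstack([X[:20] @ Q0, X[20:] @ Q1])      # "target-language" embeddings
maps = fit_topic_maps(X, Y, topics)            # one mapping per topic
```

Applied to this data, each per-topic map recovers its topic's rotation exactly, while a single global Procrustes fit over all 40 pairs would have to compromise between the two rotations.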



• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 3 (March 2023), 570 pages
  ISSN: 2375-4699 | EISSN: 2375-4702 | DOI: 10.1145/3579816


        Publisher

        Association for Computing Machinery, New York, NY, United States

        Publication History

        • Published: 10 March 2023
        • Online AM: 29 September 2022
        • Accepted: 14 September 2022
        • Revised: 25 May 2022
        • Received: 12 October 2021
