Abstract
Word translation is a natural language processing task that maps words of a source language to their translations in a target language. The task reduces to the induction of a bilingual dictionary, which is typically performed by aligning word embeddings of the source language to word embeddings of the target language. To date, existing approaches have focused on performing a single, global alignment of the two embedding spaces. However, semantic differences between languages, as well as differences in the content of the corpora used to train the word embeddings, can limit the effectiveness of such a global alignment. For this reason, in this article we propose aligning the source and target embedding spaces through multiple mappings at the topic level. Experimental results show that our approach achieves an average accuracy improvement of +3.30 percentage points over a state-of-the-art approach in unsupervised dictionary induction from languages as diverse as German, French, Italian, Spanish, Finnish, Turkish, and Chinese into English, and an average improvement of +3.95 points in supervised dictionary induction.
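The standard building block behind such alignments is the orthogonal Procrustes solution (Schönemann, 1966): given matched source and target embedding matrices, the best orthogonal map between them has a closed form via an SVD. The sketch below is a minimal, hypothetical illustration of the idea described in the abstract — learning one Procrustes mapping per topic cluster instead of a single global one. It is not the authors' implementation; the seed-pairing of rows and the `topics` assignment vector are assumptions for the example.

```python
import numpy as np

def procrustes_align(X, Y):
    """Solve min_W ||X W - Y||_F subject to W orthogonal.

    Closed-form solution (Schönemann, 1966): if U S V^T = SVD(X^T Y),
    then W = U V^T. X and Y are (n_pairs, dim) matrices whose rows are
    matched source/target embeddings (e.g., from a seed dictionary).
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def topic_level_align(X, Y, topics, n_topics):
    """Learn one orthogonal mapping per topic instead of a global one.

    `topics` is an assumed per-word topic assignment (e.g., from a topic
    model over the training corpus). Each word is mapped by the Procrustes
    solution fitted on its own topic's word pairs.
    """
    mapped = np.zeros_like(X)
    for t in range(n_topics):
        idx = topics == t
        W_t = procrustes_align(X[idx], Y[idx])  # per-topic mapping
        mapped[idx] = X[idx] @ W_t
    return mapped

# Toy check: two topics, each related to the target space by its own rotation.
rng = np.random.default_rng(0)
dim, n = 5, 40
X = rng.normal(size=(n, dim))
Q0 = np.linalg.qr(rng.normal(size=(dim, dim)))[0]  # rotation for topic 0
Q1 = np.linalg.qr(rng.normal(size=(dim, dim)))[0]  # rotation for topic 1
topics = np.repeat([0, 1], n // 2)
Y = np.where((topics == 0)[:, None], X @ Q0, X @ Q1)

global_err = np.linalg.norm(X @ procrustes_align(X, Y) - Y)
topic_err = np.linalg.norm(topic_level_align(X, Y, topics, 2) - Y)
```

In this toy setting a single global mapping cannot fit two different per-topic rotations, while the topic-level mappings recover them exactly — the intuition behind replacing one global alignment with multiple topic-level ones.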
Topic-Based Unsupervised and Supervised Dictionary Induction