Abstract
This study aims to improve the performance of word embeddings by proposing a new weighting scheme for co-occurrence counting. The idea behind this new family of weights is to compensate for the disadvantage suffered by word pairs that appear far apart in text yet are semantically close, when such pairs are represented in the co-occurrence counts. For high-resource languages this disadvantage may be negligible, since such pairs still co-occur frequently. When resources are scarce, however, these pairs are penalized for being distant. To favour them, a weighting scheme based on a polynomial fitting procedure is proposed that shifts the weights of distant words upward while leaving the weights of nearby words almost unchanged. The parameter optimization of the new weights and the effects of the weighting scheme are analysed for English, Italian, and Turkish. A small portion of the English resources and a quarter of the Italian resources are used for demonstration purposes, treating these languages as if they were low-resource. A performance increase is observed in analogy tests when the proposed weighting scheme is applied to relatively small corpora (i.e., mimicking low-resource languages) of both English and Italian. To confirm that the benefit is specific to small corpora, it is also shown that on a large English corpus the proposed weighting scheme does not outperform the original weights. Since Turkish is a relatively low-resource language, it is demonstrated that the proposed scheme improves performance on both analogy and similarity tests when all Turkish Wikipedia pages are used as the corpus. The positive effect of the proposed scheme is also demonstrated on a standard sentiment analysis task for Turkish.
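The idea above can be sketched as follows. GloVe-style co-occurrence counting typically weights a pair of words separated by distance d within the context window by the harmonic factor 1/d. A minimal sketch of distance-weighted counting, together with a *hypothetical* polynomial boost for distant pairs (the function `boosted` and its coefficients `a` and `p` are illustrative assumptions, not the paper's fitted polynomial):

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=10, weight_fn=lambda d: 1.0 / d):
    """Accumulate distance-weighted co-occurrence counts within a window."""
    counts = defaultdict(float)
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            j = i + d
            if j >= len(tokens):
                break
            # Each pair contributes a weight that depends on its distance d.
            counts[(w, tokens[j])] += weight_fn(d)
    return counts

# Standard GloVe-style harmonic weighting: contribution falls off as 1/d.
def harmonic(d):
    return 1.0 / d

# Hypothetical polynomial adjustment (coefficients illustrative only):
# shifts weights up for distant pairs while leaving d = 1 unchanged.
def boosted(d, a=0.02, p=2):
    return 1.0 / d + a * (d - 1) ** p / d

tokens = "the cat sat on the mat".split()
c = cooccurrence_counts(tokens, window=3, weight_fn=harmonic)
```

With `boosted`, adjacent pairs (d = 1) keep their original weight of 1, while pairs at the far end of the window contribute more than 1/d, which is the qualitative behaviour the abstract describes for favouring distant but semantically related pairs in small corpora.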
Co-occurrence Weight Selection in Generation of Word Embeddings for Low Resource Languages