Cross-lingual Sentence Embedding for Low-resource Chinese-Vietnamese Based on Contrastive Learning

Abstract
Cross-lingual sentence embedding aims to map sentences with similar semantics but in different languages close together, and dissimilar sentences farther apart, in a shared representation space. It underpins many downstream tasks such as cross-lingual document matching and cross-lingual summary extraction. Existing work on cross-lingual sentence embedding focuses mainly on languages with large-scale corpora. Low-resource language pairs such as Chinese-Vietnamese, however, lack sentence-level parallel corpora and clear cross-lingual supervision signals, and existing methods perform poorly on them. We therefore propose a cross-lingual sentence embedding method based on contrastive learning that fine-tunes a powerful pretrained model by constructing sentence-level positive and negative samples, avoiding the catastrophic forgetting that arises when a pretrained model is fine-tuned only on a small set of aligned positive samples. First, we construct training examples by taking parallel Chinese-Vietnamese sentence pairs as positives and non-parallel pairs as negatives. Second, we feed these samples into a siamese network, compute a contrastive loss, and fine-tune our model with it. Experimental results show that our method effectively improves the semantic alignment accuracy of cross-lingual sentence embeddings in the Chinese-Vietnamese setting.
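To make the training step concrete, the following is a minimal sketch, not the authors' released code: it assumes an XLM-R backbone, mean pooling over token states, and an in-batch InfoNCE-style contrastive loss in which each parallel zh-vi pair is the positive and all other sentences in the batch act as negatives. The paper's exact architecture, pooling, and loss may differ.

```python
# Hedged sketch of contrastive fine-tuning for zh-vi sentence embeddings.
# Assumptions (not from the paper): xlm-roberta-base backbone, mean pooling,
# in-batch negatives, temperature 0.05 (a common SimCSE-style default).
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")  # shared (siamese) weights

def embed(sentences):
    """Mean-pool token hidden states into one vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (B, H)

def contrastive_loss(zh_sentences, vi_sentences, temperature=0.05):
    """In-batch contrastive loss: diagonal entries are parallel (positive) pairs."""
    zh = F.normalize(embed(zh_sentences), dim=-1)
    vi = F.normalize(embed(vi_sentences), dim=-1)
    sim = zh @ vi.T / temperature                         # (B, B) cosine similarities
    labels = torch.arange(sim.size(0))                    # i-th zh matches i-th vi
    return F.cross_entropy(sim, labels)
```

Because the two inputs pass through the same encoder, the network is siamese in the sense of reference-style weight sharing; minimizing the cross-entropy over the similarity matrix pulls parallel pairs together while pushing non-parallel in-batch pairs apart.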