research-article

Cross-lingual Sentence Embedding for Low-resource Chinese-Vietnamese Based on Contrastive Learning


Abstract

Cross-lingual sentence embedding aims to map sentences with similar semantics in different languages close together, and semantically dissimilar sentences farther apart, in a shared representation space. It underpins many downstream tasks such as cross-lingual document matching and cross-lingual summary extraction. Existing work on cross-lingual sentence embedding focuses mainly on languages with large-scale corpora. Low-resource language pairs such as Chinese-Vietnamese, however, lack sentence-level parallel corpora and clear cross-lingual supervision signals, so existing methods perform poorly on them. We therefore propose a cross-lingual sentence embedding method based on contrastive learning, which effectively fine-tunes a powerful pre-trained model by constructing sentence-level positive and negative samples, avoiding the catastrophic forgetting that arises when a pre-trained model is fine-tuned only on a small set of aligned positive samples. First, we construct training pairs by taking parallel Chinese-Vietnamese sentences as positive examples and non-parallel sentences as negative examples. Second, we build a Siamese network that takes the positive and negative samples as input, computes a contrastive loss, and fine-tunes our model. Experimental results show that our method effectively improves the semantic alignment accuracy of cross-lingual sentence embeddings for Chinese and Vietnamese.
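To make the training scheme concrete, the following is a minimal sketch of the kind of contrastive fine-tuning the abstract describes. It is not the authors' released code: the choice of XLM-R as the pre-trained encoder, mean pooling, the cosine similarity with a 0.05 temperature, and the use of in-batch negatives are illustrative assumptions, and the example sentences are hypothetical.

```python
# A minimal sketch of the contrastive fine-tuning scheme described above,
# NOT the authors' released code. Assumptions (not stated in the abstract):
# XLM-R as the pre-trained encoder, mean pooling, cosine similarity with a
# 0.05 temperature, and in-batch negatives (every non-parallel sentence in
# the batch serves as a negative example).
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")
encoder.train()  # enable training mode for fine-tuning

def embed(sentences):
    """Mean-pool the encoder's last hidden states over non-padding tokens."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (B, H)

def contrastive_loss(zh_emb, vi_emb, temperature=0.05):
    """In-batch InfoNCE: the i-th Chinese sentence and the i-th Vietnamese
    sentence form a parallel (positive) pair; all other pairings are negatives."""
    sim = F.cosine_similarity(zh_emb.unsqueeze(1), vi_emb.unsqueeze(0), dim=-1)
    sim = sim / temperature                     # (B, B) similarity matrix
    labels = torch.arange(sim.size(0))          # positives lie on the diagonal
    # Both languages pass through the SAME encoder, i.e., a weight-tied
    # (Siamese) network; the loss is symmetrized over both retrieval directions.
    return (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)) / 2

# Toy usage with one parallel Chinese-Vietnamese batch (hypothetical sentences):
zh = ["今天天气很好。", "我喜欢读书。"]
vi = ["Hôm nay thời tiết rất đẹp.", "Tôi thích đọc sách."]
loss = contrastive_loss(embed(zh), embed(vi))
loss.backward()  # gradients flow into the shared encoder; an optimizer step updates it
```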



• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 6 (June 2023), 635 pages
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3604597
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 16 June 2023
      • Online AM: 18 April 2023
      • Accepted: 13 March 2023
      • Revised: 1 February 2023
      • Received: 22 September 2022
Published in TALLIP Volume 22, Issue 6
