Abstract
Word alignment is the task of identifying translation equivalents between the words of a sentence pair. Although word alignment is no longer a prerequisite for neural machine translation (NMT), it remains useful in a wealth of applications, e.g., bilingual lexicon induction and constrained decoding. However, the best-known word aligners are still GIZA++ and fast_align, both of which implement the traditional IBM models. To keep pace with advances in NMT, there has been a surge of interest in replacing the IBM models with neural models. We follow this trend but aim to boost the performance of word alignment between Japanese and Chinese, two languages that share a large portion of Chinese characters. Our key idea is to leverage these common Chinese characters as an indicator for inferring alignment; that is, source and target words that share Chinese characters are most likely to be aligned. Following this idea, we propose three methods that leverage common Chinese characters to boost mBERT-based word alignment: a reward factor, representation alignment, and contrastive training. Furthermore, we annotate and release a gold-standard dataset for Japanese-Chinese word alignment. Experiments on this dataset show that our methods outperform several strong baselines in terms of AER and verify the effectiveness of exploiting common Chinese characters.
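The key idea above, treating shared Chinese characters as alignment evidence, can be sketched as a simple heuristic, together with the AER metric used for evaluation. This is a minimal illustration, not the paper's method: the function names and example tokens are hypothetical, and the sketch ignores simplified/traditional character variant mapping (e.g., 語 vs. 语), which a practical Japanese-Chinese system would need to handle.

```python
import unicodedata

def han_chars(token):
    """Return the set of CJK Unified Ideographs (Han characters) in a token."""
    return {ch for ch in token
            if "CJK UNIFIED IDEOGRAPH" in unicodedata.name(ch, "")}

def char_overlap_alignments(src_tokens, tgt_tokens):
    """Link every (i, j) pair whose tokens share at least one Han character."""
    links = set()
    for i, s in enumerate(src_tokens):
        for j, t in enumerate(tgt_tokens):
            if han_chars(s) & han_chars(t):
                links.add((i, j))
    return links

def aer(predicted, sure, possible=None):
    """Alignment Error Rate: 1 - (|A∩S| + |A∩P|) / (|A| + |S|).
    `possible` defaults to the sure links when no P annotations exist."""
    possible = sure if possible is None else possible | sure
    denom = len(predicted) + len(sure)
    if denom == 0:
        return 0.0
    return 1.0 - (len(predicted & sure) + len(predicted & possible)) / denom

# Hypothetical tokenized sentence pair: 日本語 and 日语 share the character 日.
ja = ["日本語", "の", "文"]
zh = ["日语", "句子"]
print(char_overlap_alignments(ja, zh))  # → {(0, 0)}
```

Such a heuristic only produces candidate links for character-sharing word pairs; the paper instead injects this signal into an mBERT-based aligner, where it can also guide alignment of words with no shared characters.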
Multilingual BERT-based Word Alignment By Incorporating Common Chinese Characters