Abstract
Ancient Chinese carries the wisdom and spiritual culture of the Chinese nation. Automatic translation from ancient Chinese to modern Chinese helps to inherit and carry forward the quintessence of the ancients. However, the lack of a large-scale parallel corpus limits research on ancient–modern Chinese machine translation. In this article, we propose an ancient–modern Chinese clause alignment approach based on the characteristics of these two languages. The method combines lexical and statistical information and achieves an F1-score of 94.2 on our manually annotated test set. We use this method to create a new large-scale ancient–modern Chinese parallel corpus that contains 1.24M bilingual pairs. To the best of our knowledge, this is the first large, high-quality ancient–modern Chinese dataset. Furthermore, we analyze and compare the performance of SMT and various NMT models on this dataset and provide a strong baseline for this task.
Ancient–Modern Chinese Translation with a New Large Training Dataset