Abstract
Parallel corpora are crucial for statistical machine translation (SMT), but they are scarce for most language pairs and domains. Because comparable corpora are far more widely available, many studies have sought to extract either parallel sentences or parallel fragments from them for SMT. In this article, we propose an integrated system that extracts both parallel sentences and fragments from comparable corpora. We first apply parallel sentence extraction to identify parallel sentences among comparable sentences. We then extract parallel fragments from the comparable sentences. Parallel sentence extraction is based on a filter for parallel sentence candidates and a classifier for parallel sentence identification; we improve it with a novel filtering strategy and three novel feature sets for classification. Previous studies have found it difficult to extract parallel fragments from comparable sentences accurately. We propose an accurate parallel fragment extraction method that uses an alignment model to locate parallel fragment candidates and an accurate lexicon-based filter to identify the truly parallel fragments. A case study on the Chinese–Japanese Wikipedia indicates that our proposed methods outperform previously proposed methods and that the parallel data extracted by our system significantly improves SMT performance.
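The two-stage flow described in the abstract can be illustrated with a minimal sketch. It is not the system proposed in the article: a simple lexicon-coverage threshold stands in for the candidate filter and trained classifier, and runs of lexicon-matched tokens stand in for the alignment model and lexicon-based fragment filter. All function names, thresholds, and the toy tokens (`s1`, `t1`, ...) are hypothetical, and the sketch assumes that fragments are sought only in sentence pairs not accepted as parallel.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical bilingual lexicon: source word -> set of target translations.
Lexicon = Dict[str, Set[str]]
Pair = Tuple[List[str], List[str]]


def coverage(src: List[str], tgt: List[str], lexicon: Lexicon) -> float:
    """Fraction of source tokens with a lexicon translation in the target."""
    if not src:
        return 0.0
    tgt_set = set(tgt)
    hits = sum(1 for w in src if lexicon.get(w, set()) & tgt_set)
    return hits / len(src)


def is_parallel_sentence(src: List[str], tgt: List[str], lexicon: Lexicon,
                         threshold: float = 0.8,
                         max_len_ratio: float = 2.0) -> bool:
    """Stand-in for filter + classifier (assumption, not the paper's model):
    a length-ratio check followed by a coverage threshold."""
    if not src or not tgt:
        return False
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    if ratio > max_len_ratio:
        return False
    return coverage(src, tgt, lexicon) >= threshold


def extract_fragments(src: List[str], tgt: List[str], lexicon: Lexicon,
                      min_len: int = 2) -> List[Pair]:
    """Stand-in for alignment + lexicon filter (assumption): keep maximal
    runs of source tokens that each have a translation in the target.
    The target side is just the matched words, not a contiguous span."""
    tgt_set = set(tgt)
    fragments: List[Pair] = []
    run_src: List[str] = []
    run_tgt: List[str] = []
    for w in list(src) + [None]:  # the None sentinel flushes the last run
        match = lexicon.get(w, set()) & tgt_set if w is not None else set()
        if match:
            run_src.append(w)
            run_tgt.append(sorted(match)[0])  # deterministic choice
        else:
            if len(run_src) >= min_len:
                fragments.append((run_src, run_tgt))
            run_src, run_tgt = [], []
    return fragments


def extract(pairs: List[Pair], lexicon: Lexicon) -> Tuple[List[Pair], List[Pair]]:
    """Stage 1: accept parallel sentences. Stage 2: mine the rest for fragments."""
    sentences: List[Pair] = []
    fragments: List[Pair] = []
    for src, tgt in pairs:
        if is_parallel_sentence(src, tgt, lexicon):
            sentences.append((src, tgt))
        else:
            fragments.extend(extract_fragments(src, tgt, lexicon))
    return sentences, fragments
```

Calling `extract(pairs, lexicon)` on a list of tokenized comparable sentence pairs returns the accepted sentence pairs and the fragment pairs separately, so either output can be added to SMT training data on its own.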
Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora: A Case Study on Chinese–Japanese Wikipedia