Abstract
Automatic evaluation is critical for assessing and developing machine translation systems. In this study, we propose a new model for the automatic evaluation of machine translation. The model combines standard n-gram precision features and sentence semantic mapping features with neural features, including neural language model probabilities and the embedding distances between translation outputs and their reference translations. We optimize the model against human ranking assessments with ListMLE, a representative list-wise learning-to-rank approach. Experimental results on the WMT’2015 Metrics task indicate that the proposed approach correlates significantly better with human assessments than several state-of-the-art baseline approaches. In particular, the results confirm that the list-wise learning-to-rank approach is useful and powerful for optimizing automatic evaluation metrics in terms of human ranking assessments. Further analysis also shows that optimizing automatic metrics with ListMLE is a reasonable method and that adding the neural features yields considerable improvements over the traditional features alone.
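To make the list-wise objective concrete, the following is a minimal sketch of the ListMLE loss as it is commonly defined: the negative log-likelihood of the human ordering under a Plackett-Luce model. It is not the authors' implementation. The function names `logsumexp` and `listmle_loss`, the convention that rank 1 is best, and the tie-breaking by input order are assumptions made for illustration, and the scores stand in for whatever the feature-based model would output.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def listmle_loss(scores, human_ranks):
    """Negative Plackett-Luce log-likelihood of the human ordering.

    scores:      model scores, one per candidate translation (higher = better)
    human_ranks: human rank per candidate (1 = best); ties are broken by
                 input order (an assumption of this sketch)
    """
    # Reorder the model scores from the best to the worst human-ranked candidate.
    order = sorted(range(len(scores)), key=lambda i: human_ranks[i])
    ordered = [scores[i] for i in order]

    loss = 0.0
    for i in range(len(ordered)):
        # Probability that candidate i is picked first among those remaining.
        loss += logsumexp(ordered[i:]) - ordered[i]
    return loss


# Hypothetical scores for three candidate translations of one source sentence.
agreeing = listmle_loss([3.0, 2.0, 1.0], [1, 2, 3])
reversed_order = listmle_loss([1.0, 2.0, 3.0], [1, 2, 3])
print(agreeing < reversed_order)
```

The loss is smaller when the model's scores agree with the human ranking, so minimizing it over many ranked lists pushes the metric toward the human ordering of whole candidate lists rather than of isolated pairs.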
Index Terms
Optimizing Automatic Evaluation of Machine Translation with the ListMLE Approach