Optimizing Automatic Evaluation of Machine Translation with the ListMLE Approach

Published: 12 November 2018

Abstract

Automatic evaluation of machine translation is critical for evaluating and developing machine translation systems. In this study, we propose a new model for automatic evaluation of machine translation. The model combines standard n-gram precision features and sentence semantic mapping features with neural features, including neural language model probabilities and the embedding distances between translation outputs and their reference translations. We optimize the model against human ranking assessments with ListMLE, a representative list-wise learning-to-rank approach. Experimental results on the WMT’2015 Metrics task indicate that the proposed approach yields significantly better correlations with human assessments than several state-of-the-art baseline approaches. In particular, the results confirm that the list-wise learning-to-rank approach is effective for optimizing automatic evaluation metrics against human ranking assessments. Further analysis shows that optimizing automatic metrics with ListMLE is a reasonable method and that adding the neural features yields considerable improvements over the traditional features.
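
The list-wise objective named in the abstract can be illustrated with a minimal sketch. ListMLE, as commonly defined, minimizes the negative log-likelihood of the human-judged ordering under a Plackett-Luce model of the metric scores. The function name `listmle_loss`, its arguments, and the toy scores below are assumptions made for illustration; this is not the authors' implementation and it does not reproduce their feature set.

```python
import math


def listmle_loss(scores, ranking):
    """Negative Plackett-Luce log-likelihood of `ranking` given `scores`.

    scores:  metric scores, one per candidate translation (hypothetical values)
    ranking: candidate indices ordered from best to worst by human judgment
    """
    # Reorder the scores so that position 0 holds the human-preferred candidate.
    ordered = [scores[i] for i in ranking]
    loss = 0.0
    for i in range(len(ordered)):
        tail = ordered[i:]
        # Numerically stable log-sum-exp over the candidates not yet placed.
        m = max(tail)
        log_sum = m + math.log(sum(math.exp(s - m) for s in tail))
        # -log P(candidate i is chosen first among the remaining ones).
        loss += log_sum - ordered[i]
    return loss


# Toy example: three translations of one source sentence.
toy_scores = [3.0, 2.0, 1.0]
agreeing = listmle_loss(toy_scores, [0, 1, 2])     # humans prefer 0 > 1 > 2
disagreeing = listmle_loss(toy_scores, [2, 1, 0])  # humans prefer 2 > 1 > 0
```

In this toy case the loss is lower when the scores agree with the human ranking than when they contradict it, which is the property a trainer would exploit when fitting the metric's feature weights.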

