An Approach to Construct a Named Entity Annotated English-Vietnamese Bilingual Corpus

Published: 14 October 2016

Abstract

Manually annotating named entities (NEs) in a bilingual corpus is a time-consuming, labor-intensive, and expensive process, but such corpora are necessary for natural language processing (NLP) tasks such as cross-lingual information retrieval, cross-lingual information extraction, and machine translation. In this article, we present an automatic approach to constructing an NE-annotated English-Vietnamese bilingual corpus from a bilingual parallel corpus by proposing an NE alignment method. Starting from a bilingual corpus in which the initial NEs are extracted separately for each language, the approach tries to correct unrecognized or incorrectly recognized NEs before aligning the NEs using a variety of bilingual constraints. The generated corpus not only improves the NE recognition results but also provides alignments between English NEs and Vietnamese NEs, which are necessary for training NE translation models. The experimental results show that the approach outperforms the baseline methods. In the English-Vietnamese NE alignment task, the F-measure increases from 68.58% to 79.77%. The approach also significantly improves NE recognition quality: the F-measure rises from 84.85% to 88.66% on the English side and from 75.71% to 85.55% on the Vietnamese side. Providing this additional semantic information to machine translation systems increases the BLEU score from 33.04% to 45.11%.
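
The abstract reports its alignment results as an F-measure. As a minimal sketch of how such a score is commonly computed, the snippet below treats NE alignment as a set of (English NE, Vietnamese NE) pairs and compares predicted pairs against gold pairs by exact match. The function name `alignment_prf`, the exact-match criterion, and the example pairs are assumptions made for illustration; the article's own evaluation protocol is not described in the abstract and may differ.

```python
def alignment_prf(predicted, gold):
    """Precision, recall, and F-measure over NE alignment pairs.

    Each pair is a tuple (english_ne, vietnamese_ne). A predicted pair
    counts as correct only if it matches a gold pair exactly (an
    assumption of this sketch, not a claim about the article).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)

    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    # Balanced F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up example pairs, not taken from the article's corpus.
gold = {
    ("Ho Chi Minh City", "Thành phố Hồ Chí Minh"),
    ("Vietnam", "Việt Nam"),
    ("World Bank", "Ngân hàng Thế giới"),
}
predicted = {
    ("Ho Chi Minh City", "Thành phố Hồ Chí Minh"),
    ("Vietnam", "Việt Nam"),
    ("Bank", "Ngân hàng"),  # partial span, so it does not match exactly
}

precision, recall, f_measure = alignment_prf(predicted, gold)
print(f"P={precision:.4f} R={recall:.4f} F={f_measure:.4f}")
```

In this toy case two of the three predicted pairs are correct and two of the three gold pairs are found, so precision, recall, and F-measure all equal 2/3. A partial-match criterion would score the third pair differently, which is why the matching rule matters when comparing reported figures.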
